Submit media inputs to generate text and speech responses