Advances in Multimodal Speech Interaction

The field of multimodal speech interaction is moving towards seamless, intelligent speech interaction built on adaptive, modality-specific approaches. Researchers are focusing on frameworks that integrate speech and text generation while preserving richer paralinguistic features such as emotion and prosody. Noteworthy papers include DeepTalk, which proposes a framework for adaptive modality expert learning based on a Mixture of Experts architecture, and WildSpeech-Bench, which benchmarks audio LLMs in natural, practical speech conversations. Other notable works include RELATE, a subjective evaluation dataset for automatically assessing the relevance between text and audio, and JoyTTS, an end-to-end spoken chatbot that combines a large language model with text-to-speech and supports voice cloning.
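
To make the Mixture-of-Experts idea behind modality-specific approaches like DeepTalk concrete, the sketch below shows a minimal MoE layer in which a learned router softly assigns each token to per-modality feed-forward experts. The expert count, dimensions, and softmax gating here are illustrative assumptions, not details taken from the DeepTalk paper.

```python
# Minimal sketch of a modality-specific Mixture-of-Experts layer.
# Illustrative only: expert roles (e.g., expert 0 ~ text, expert 1 ~ speech),
# sizes, and the soft gating scheme are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityMoE(nn.Module):
    def __init__(self, d_model: int = 256, n_experts: int = 2):
        super().__init__()
        # One feed-forward expert per modality.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])
        # Router produces a soft assignment of each token to the experts.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        gates = F.softmax(self.router(x), dim=-1)                        # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, d, E)
        # Per-token weighted combination of the expert outputs.
        return (expert_out * gates.unsqueeze(2)).sum(dim=-1)             # (B, T, d)


if __name__ == "__main__":
    layer = ModalityMoE()
    tokens = torch.randn(2, 10, 256)   # dummy mixed speech/text token embeddings
    print(layer(tokens).shape)         # torch.Size([2, 10, 256])
```

In practice, MoE variants often use sparse top-k routing rather than the dense softmax combination shown here; the dense form is used only to keep the sketch short and self-contained.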

Sources

DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio

A Dataset for Automatic Assessment of TTS Quality in Spanish

JoyTTS: LLM-based Spoken Chatbot With Voice Cloning
