Conversational Speech Synthesis and Interaction

The field of conversational speech synthesis and interaction is advancing rapidly, with a focus on making synthesized speech more natural and human-machine interaction more fluid. Recent work has centered on improving the prosody and expressiveness of synthesized speech and on enabling more seamless, efficient interaction between humans and machines. Notable advances include new datasets and models that capture the nuances of human conversation, such as fine-grained semantic and prosodic interaction modeling. The field is also becoming more open and accessible, with many researchers releasing open-source datasets and code to support further research. Particularly noteworthy papers include FireRedTTS-2, which presents a long-form streaming TTS system for multi-speaker dialogue generation, and FLM-Audio, which proposes a novel dual training paradigm for building full-duplex spoken dialog models. WenetSpeech-Yue is also notable for releasing a large-scale Cantonese speech corpus with multi-dimensional annotation, which can improve ASR and TTS performance for the language. Overall, the field is moving toward more sophisticated, human-like speech systems that interact with users in a more natural and intuitive way.
Sources
OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics
InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios