Conversational Speech Synthesis and Interaction

The field of conversational speech synthesis and interaction is advancing rapidly, with a focus on building more natural and interactive speech systems. Recent work centers on improving the prosody and expressiveness of synthesized speech and on enabling more seamless, efficient interaction between humans and machines. Notable advances include new datasets and models that capture the nuances of human conversation, such as fine-grained semantic and prosodic interaction modeling. There is also a growing trend toward openness, with many researchers releasing open-source datasets and code to facilitate further research.

Particularly noteworthy papers in this area include FireRedTTS-2, which presents a long-form streaming TTS system for multi-speaker dialogue generation; FLM-Audio, which proposes a dual training paradigm for building full-duplex spoken dialogue models; and WenetSpeech-Yue, which releases a large-scale Cantonese speech corpus with multi-dimensional annotation that can help improve ASR and TTS performance for that language. Overall, the field is moving toward more sophisticated, human-like speech systems that interact with users in a natural and intuitive way.

Sources

FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot

FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation

Open-Source Full-Duplex Conversational Datasets for Natural and Interactive Speech Synthesis

OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics

InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios

Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis

FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations

ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data

CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
