Advances in Text-to-Speech Synthesis and Speech Generation

The field of text-to-speech (TTS) synthesis and speech generation is evolving rapidly, with a focus on improving the naturalness, expressivity, and controllability of generated speech. Recent work has shifted towards models that capture more of the complexity of human speech, including prosody, tone, and style. One key trend is the use of large language models (LLMs) to improve the accuracy and coherence of generated speech. Another is the development of more efficient and effective methods for training and evaluating speech generation models, along with techniques such as masked audio parallel inference and streaming retrieval-augmented generation.

Noteworthy papers include Comprehend and Talk, which proposes a framework for robust, semantically grounded zero-shot synthesis, and VoxCPM, which introduces a tokenizer-free TTS model and reports state-of-the-art zero-shot TTS performance. HiStyle presents a hierarchical style embedding predictor for controllable speech synthesis, and BatonVoice proposes a paradigm for enhancing controllable speech synthesis with linguistic intelligence from LLMs. MOSS-Speech and Stream RAG present approaches to speech-to-speech models and spoken dialogue systems, respectively.

Sources

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning

HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
