Advances in Text-to-Speech Synthesis and Speech Generation

The field of text-to-speech (TTS) synthesis and speech generation is evolving rapidly, driven by efforts to improve the naturalness, expressivity, and controllability of generated speech. Recent work has shifted toward models that capture the complexities of human speech, including prosody, tone, and style. A key trend is the use of large language models (LLMs) to improve the accuracy and coherence of generated speech; another is the development of more efficient methods for training and evaluating speech generation models, such as masked audio parallel inference and streaming retrieval-augmented generation.

Noteworthy papers include Comprehend and Talk, which proposes a framework for robust, semantically grounded zero-shot synthesis, and VoxCPM, a tokenizer-free TTS model that achieves state-of-the-art zero-shot performance. HiStyle presents a hierarchical style embedding predictor for text-prompt-guided controllable speech synthesis, and BatonVoice proposes a paradigm that enhances controllable synthesis with linguistic intelligence from LLMs. MOSS-Speech and Stream RAG contribute new approaches to speech-to-speech modeling and spoken dialogue systems, respectively.
Sources
LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis