Advances in Speech Synthesis and Recognition

The field of speech synthesis and recognition is advancing rapidly, with innovative approaches to improving controllability, efficiency, and accuracy. Recent work targets limitations of existing models, such as the constrained control and uncontrollable variation of current text-to-speech systems. Novel fine-tuning regimes and active learning methods are being explored to improve model performance and adaptability, and there is a growing emphasis on building high-quality datasets for low-resource languages and on improving speech emotion recognition. These advances could benefit applications ranging from speech-to-speech translation and speech separation to automatic modulation recognition. Noteworthy papers include RepeaTTS, which proposes a novel fine-tuning regime that discovers latent features to improve model controllability, and Active Learning for Text-to-Speech Synthesis with Informative Sample Collection, which constructs data-efficient corpora by iteratively selecting the most informative samples (see the sketch below). The NonverbalTTS dataset is another significant contribution: a public English corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech synthesis.
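To make the active learning idea concrete, here is a minimal Python sketch of an informativeness-driven selection loop for corpus construction. It is illustrative only: the scoring function, batch sizes, and retraining step are assumptions for exposition, not the actual method or API of the paper above.

```python
import random

def score_informativeness(model, sample):
    # Placeholder scoring: a real system might use model uncertainty,
    # phoneme-coverage gain, or acoustic diversity (assumption).
    return random.random()

def select_batch(model, pool, batch_size=100):
    """Return the batch_size candidates the model finds most informative."""
    return sorted(pool,
                  key=lambda s: score_informativeness(model, s),
                  reverse=True)[:batch_size]

def build_corpus(model, pool, rounds=5, batch_size=100):
    """Grow a corpus over several rounds of informative sample collection."""
    corpus = []
    for _ in range(rounds):
        batch = select_batch(model, pool, batch_size)
        corpus.extend(batch)
        # Remove selected samples so they are not picked again.
        pool = [s for s in pool if s not in batch]
        # model = retrain_tts(model, corpus)  # hypothetical retraining step
    return corpus

# Usage: build_corpus(tts_model, candidate_sentences) grows a 500-sample
# corpus over 5 rounds of 100, re-ranking the remaining pool each round.
```

The design point is that labeling (or recording) effort is spent only on samples the current model ranks as informative, rather than on a randomly drawn corpus of the same size.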
Sources
DUSE: A Data Expansion Framework for Low-resource Automatic Modulation Recognition based on Active Learning
Best Practices and Considerations for Child Speech Corpus Collection and Curation in Educational, Clinical, and Forensic Scenarios