Advances in Speech Synthesis and Recognition

The field of speech synthesis and recognition is advancing rapidly, with innovative approaches to improving controllability, efficiency, and accuracy. Recent work targets limitations of existing models, such as the constrained control and uncontrollable variation of current text-to-speech systems. Novel fine-tuning regimes and active learning methods are being explored to improve model performance and adaptability, and there is a growing emphasis on building high-quality datasets for low-resource languages and on improving speech emotion recognition. These advances could benefit applications ranging from speech-to-speech translation and speech separation to automatic modulation recognition. Noteworthy papers include RepeaTTS, which proposes a novel fine-tuning regime that discovers latent features to improve model controllability, and Active Learning for Text-to-Speech Synthesis with Informative Sample Collection, which constructs data-efficient corpora by iteratively selecting the most informative samples (see the sketch below). The NonverbalTTS dataset is another significant contribution: a public English corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech synthesis.
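To make the active learning idea concrete, here is a minimal Python sketch of an informativeness-driven selection loop for corpus construction. It is illustrative only: the scoring function, batch sizes, and retraining step are assumptions for exposition, not the actual method or API of the paper above.

```python
import random

def score_informativeness(model, sample):
    # Placeholder scoring: a real system might use model uncertainty,
    # phoneme-coverage gain, or acoustic diversity (assumption).
    return random.random()

def select_batch(model, pool, batch_size=100):
    """Return the batch_size candidates the model finds most informative."""
    return sorted(pool,
                  key=lambda s: score_informativeness(model, s),
                  reverse=True)[:batch_size]

def build_corpus(model, pool, rounds=5, batch_size=100):
    """Grow a corpus over several rounds of informative sample collection."""
    corpus = []
    for _ in range(rounds):
        batch = select_batch(model, pool, batch_size)
        corpus.extend(batch)
        # Remove selected samples so they are not picked again.
        pool = [s for s in pool if s not in batch]
        # model = retrain_tts(model, corpus)  # hypothetical retraining step
    return corpus

# Usage: build_corpus(tts_model, candidate_sentences) grows a 500-sample
# corpus over 5 rounds of 100, re-ranking the remaining pool each round.
```

The design point is that labeling (or recording) effort is spent only on samples the current model ranks as informative, rather than on a randomly drawn corpus of the same size.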
Sources
DUSE: A Data Expansion Framework for Low-resource Automatic Modulation Recognition based on Active Learning
Best Practices and Considerations for Child Speech Corpus Collection and Curation in Educational, Clinical, and Forensic Scenarios