The field of speech recognition and synthesis is moving towards more integrated and end-to-end approaches, leveraging large language models and self-supervised learning to improve performance and equity. Recent work has focused on developing more robust and adaptable models that can handle diverse speaking styles, languages, and emotional expressions. Notable advancements include the development of proficiency-aware adaptation and data augmentation strategies for automatic speech recognition, as well as novel architectures for cross-lingual emotion text-to-speech synthesis and non-parallel voice conversion.
Noteworthy papers include:
- Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR: proposes adaptation and augmentation strategies that reduce disparities in ASR performance for non-native (L2) speakers.
- Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: achieves superior naturalness, emotion transferability, and timbre consistency across languages.
- StressTransfer: substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness.
- RLAIF-SPA: optimizes LLM-based emotional speech synthesis via reinforcement learning, achieving a 26.1% reduction in WER and an improvement of over 10% in human evaluation.
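As a rough illustration of the data-augmentation theme above, the sketch below shows two generic waveform perturbations (speed change and additive noise at a target SNR) commonly used to make ASR training more robust to varied speaking conditions. It is a hedged, minimal example and does not reproduce the specific proficiency-aware strategy of the cited paper.

```python
# Illustrative sketch of two common ASR data-augmentation operations:
# speed perturbation and additive noise at a target SNR. Generic example only,
# not the proficiency-aware method from the L2 ASR paper.
import numpy as np
from scipy.signal import resample


def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample the waveform so it plays back `factor` times faster."""
    new_len = int(round(len(waveform) / factor))
    return resample(waveform, new_len)


def add_noise(waveform: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Mix in white noise scaled to the requested signal-to-noise ratio (dB)."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(waveform))
    sig_power = np.mean(waveform ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return waveform + scale * noise


# Example: expand one utterance into several augmented training variants.
rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)  # 1 s of placeholder audio at 16 kHz
augmented = [speed_perturb(utterance, f) for f in (0.9, 1.0, 1.1)]
augmented += [add_noise(utterance, snr) for snr in (10.0, 20.0)]
```

In practice such perturbations are applied on the fly during training so that each epoch sees a different acoustic variant of the same utterance.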