Advances in Speech Recognition and Synthesis

Speech recognition and synthesis research is moving toward integrated, end-to-end approaches that use large language models and self-supervised learning to improve performance and equity. Recent work focuses on robust, adaptable models that handle diverse speaking styles, languages, and emotional expressions. Notable advances include proficiency-aware adaptation and data augmentation strategies for automatic speech recognition (ASR), as well as novel architectures for cross-lingual emotional text-to-speech (TTS) synthesis and non-parallel voice conversion.

Noteworthy papers include: Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR, which proposes strategies to reduce disparities in ASR performance for non-native (L2) speakers. Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS reports superior naturalness, emotion transferability, and timbre consistency across languages. StressTransfer substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. RLAIF-SPA optimizes LLM-based emotional speech synthesis via reinforcement learning from AI feedback (RLAIF), achieving a 26.1% reduction in word error rate (WER) and over 10% improvement in human evaluation.
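
For readers less familiar with the WER figure quoted above, the following is a minimal sketch of how word error rate is commonly computed (word-level edit distance divided by reference length) and how a relative reduction would be derived from two WER values. The function names and the example numbers are illustrative assumptions, not taken from the RLAIF-SPA paper, and the summary does not state whether the 26.1% figure is a relative or an absolute reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (hypothetical helper)."""
    return (baseline - improved) / baseline


# Illustrative values only: one substitution plus one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Made-up baseline and improved WERs chosen to show what a 26.1% relative
# reduction would look like; these are not results from the paper.
reduction = relative_reduction(0.10, 0.0739)
```

Under the relative reading, a system that starts at 10% WER would end up near 7.4%; under an absolute reading the same headline number would mean something quite different, so the original paper should be consulted before comparing systems.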

Sources

End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs

Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR

Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker

VCTR: A Transformer-Based Model for Non-parallel Voice Conversion

StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation

Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks

RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF
