Advances in Adaptive Speech Recognition and Synthesis

The field of speech recognition and synthesis is seeing significant developments focused on improving robustness and adaptability across diverse domains and languages. Researchers are combining test-time adaptation with language model rescoring, and leveraging self-refining frameworks and TTS-synthesized data to enhance ASR performance. Another notable trend is the use of asynchronous text-speech adaptation and zero-shot text-to-speech models to improve code-switched speech recognition and short-utterance speaker verification. Closed-loop corpus optimization frameworks are also being proposed to build multi-speaker text-to-speech systems from noisy, uncurated web-scale speech data.

Two papers stand out. SUTA-LM achieves robust results across a wide range of domains by effectively combining test-time adaptation with language model rescoring. A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data offers a compelling alternative to pseudo-labeling self-distillation approaches and a practical path toward better ASR in low-resource or domain-specific settings.
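To make the combination of test-time adaptation and language model rescoring concrete, the sketch below illustrates the two ingredients in their most generic form: an entropy measure of the kind SUTA-style methods minimize on unlabeled test audio, and a log-linear n-best rescoring step with an external LM. The function names, weights, and toy hypotheses are illustrative assumptions, not SUTA-LM's actual recipe.

```python
import math

def entropy(posterior):
    """Shannon entropy of one frame's posterior distribution.
    SUTA-style test-time adaptation minimizes this quantity over
    unlabeled test utterances (illustrative sketch, not the paper's code)."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def lm_rescore(nbest, lm_logprob, lm_weight=0.5):
    """Pick the hypothesis maximizing a log-linear mix of acoustic and
    LM scores. `nbest` is a list of (text, acoustic_logprob) pairs and
    `lm_logprob` maps text to an LM log-probability. The 0.5 weight is
    an arbitrary placeholder."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))

# Toy usage: a uniform LM leaves the acoustic ranking unchanged,
# while an LM that prefers the first string can flip the decision.
nbest = [("recognize speech", -4.0), ("wreck a nice beach", -3.5)]
uniform = lambda text: 0.0
prefers_first = lambda text: 0.0 if text == "recognize speech" else -2.0

best_acoustic = lm_rescore(nbest, uniform)        # -> ("wreck a nice beach", -3.5)
best_rescored = lm_rescore(nbest, prefers_first)  # -> ("recognize speech", -4.0)
```

The design point the digest highlights is that the two stages are complementary: adaptation sharpens the acoustic model's posteriors on the test distribution, while rescoring injects linguistic knowledge the acoustic model lacks.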

Sources

SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR

A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification

TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data
