Speech Processing Innovations

Speech processing research is advancing on two fronts: recognition and synthesis. Researchers are fine-tuning existing models for underrepresented languages and reporting substantial performance improvements. Open-science speech foundation models are also gaining traction, since they enable reproducibility and fair evaluation. In speech synthesis, models are being scaled up to handle multilingual and diverse datasets, which leads to more natural and consistent output.

Noteworthy papers include:

Swedish Whispers, which achieved a 47% reduction in word error rate (WER) for Swedish speech recognition.

CosyVoice 3, which introduced a novel speech tokenizer and a differentiable reward model for post-training, resulting in enhanced performance on multilingual benchmarks.

FAMA, the first large-scale open-science speech foundation model for English and Italian, which promotes openness in speech technology research.
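To make the WER figure concrete, the sketch below computes word error rate as the word-level edit distance divided by the number of reference words, and then a relative reduction between two WER values. The summary does not say whether the 47% is a relative or an absolute reduction; the sketch assumes relative. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical WER values chosen only to illustrate a 47% relative reduction.
example_reduction = relative_wer_reduction(0.20, 0.106)
```

Under this reading, a baseline WER of 20% falling to 10.6% would count as a 47% reduction, although the absolute change is only 9.4 percentage points.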

Sources

Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Preliminary Characterization of Bio-inspired Dog-Nose Sampler for Aerosol Detection

Voice Adaptation for Swiss German

FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

ZIPA: A family of efficient models for multilingual phone recognition
