Speech Processing Innovations

Speech processing research is advancing in both recognition and synthesis. On the recognition side, researchers are fine-tuning existing models for underrepresented languages and reporting substantial performance improvements. Open-science speech foundation models are also gaining traction, which enables reproducibility and fair evaluation. On the synthesis side, models are being scaled up to multilingual and more diverse datasets, producing more natural and consistent output.

Noteworthy papers include Swedish Whispers, which achieved a 47% reduction in word error rate (WER) for Swedish speech recognition; CosyVoice 3, which introduced a novel speech tokenizer and a differentiable reward model for post-training, resulting in enhanced performance on multilingual benchmarks; and FAMA, the first large-scale open-science speech foundation model for English and Italian, which promotes openness in speech technology research.
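To make the WER figure above concrete, here is a minimal Python sketch of how word error rate and a reduction between two systems are commonly computed. It assumes the reported reduction is relative to a baseline (the summary does not say whether it is relative or absolute), and the function names and sample numbers are illustrative only, not taken from any of the papers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the improved system lowers the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts and WER values, chosen only for illustration.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_reduction(0.20, 0.106))
```

Under this reading, a system that lowers a hypothetical baseline WER of 20% to 10.6% has achieved a 47% relative reduction, even though the absolute drop is only 9.4 percentage points.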

