Advancements in Speaker Diarization and Multilingual Speech Recognition

The field of speech recognition is moving toward more accurate speaker diarization and multilingual speech recognition. Recent studies have focused on improving speaker diarization in noisy environments, such as classrooms, and on multi-modal approaches that combine audio and video data. There is also growing interest in more efficient and cost-effective methods for multilingual speech recognition, including selective invocation and speech back-translation techniques. These advances have the potential to significantly improve the accuracy and accessibility of speech recognition systems. Noteworthy papers include the Multimodal Information Based Speech Processing (MISP) 2025 Challenge, which achieved significant improvements in audio-visual speaker diarization and recognition, and "From Tens of Hours to Tens of Thousands," which introduced a scalable pipeline for improving multilingual ASR models through speech back-translation.
Sources
The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages