Advancements in Speaker Diarization and Multilingual Speech Recognition

The field of speech recognition is moving toward more advanced and accurate methods of speaker diarization and multilingual speech recognition. Recent studies have focused on improving diarization accuracy in noisy environments, such as classrooms, and on multi-modal approaches that combine audio and video data. There is also growing interest in more efficient, cost-effective methods for multilingual speech recognition, including selective invocation and speech back-translation. Together, these advances stand to improve both the accuracy and the accessibility of speech recognition systems. Noteworthy papers include the Multimodal Information Based Speech Processing (MISP) 2025 Challenge, which reported significant improvements in audio-visual speaker diarization and recognition, and "From Tens of Hours to Tens of Thousands," which introduced a scalable pipeline for improving multilingual ASR models through speech back-translation.
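The speech back-translation idea mentioned above can be illustrated with a minimal sketch: text-only data is converted into synthetic (audio, transcript) pairs via a text-to-speech model, and those pairs augment ASR training data. The function and the toy TTS stand-in below are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch of speech back-translation for ASR training data.
# Abundant text-only data is "back-translated" into speech by a TTS model,
# yielding (audio, transcript) pairs whose transcripts are correct by
# construction.
def speech_back_translate(texts, tts_synthesize):
    """Return synthetic (audio, transcript) pairs from text-only data."""
    pairs = []
    for text in texts:
        audio = tts_synthesize(text)   # TTS acts as the "back" direction
        pairs.append((audio, text))    # transcript is known by construction
    return pairs

# Toy TTS stand-in (an assumption; a real system returns waveforms):
fake_tts = lambda t: f"<waveform for: {t}>"

pairs = speech_back_translate(["bonjour", "hola"], fake_tts)
print(len(pairs))  # 2
```

In practice the synthetic pairs would be mixed with real transcribed speech, which is how a corpus of tens of hours of audio can be scaled toward tens of thousands.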

Sources

Multi-Stage Speaker Diarization for Noisy Classrooms

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

An Effective Training Framework for Light-Weight Automatic Speech Recognition Models
