Advances in Speech Recognition and Diarization

Speech recognition and diarization are advancing rapidly, driven by the creation of large, diverse, and realistic datasets. These datasets are designed to mimic real-world conditions, including noisy environments, overlapping speech, and diverse accents, which lets researchers develop and evaluate more robust and generalizable models. A key trend is the integration of speech recognition, speaker diarization, and source separation to improve overall system performance. Notable papers in this area include:

  • Loquacious Set, which presents a 25,000-hour curated collection of commercially usable English speech;
  • UniTalk, which introduces a novel dataset for active speaker detection in real-world scenarios; and
  • AISHELL-5, which provides the first open-source in-car multi-channel multi-speaker speech dataset for automatic speech diarization and recognition.

Sources

Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use

UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge

AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition

Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM
