Advances in Speaker Diarization

Recent work in speaker diarization focuses on more efficient systems, with emphasis on better speaker representations, robustness to varying conditions, and overall system performance. Researchers have explored novel architectures, such as conformer decoders and transformer-updated attractors, as well as text-based methods and mixture-of-experts models. These advances have brought notable reductions in diarization error and point to the potential of multimodal and semantic-feature-based diarization. Noteworthy papers include:

  • One that proposes a diarization framework integrating conformer decoders and transformer-updated attractors, achieving low diarization error while remaining compact in parameter count.
  • Another that presents a novel text-based approach to speaker diarization, leveraging sentence-level speaker change detection within dialogues and demonstrating competitive performance against state-of-the-art audio-based systems.
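
For context, the "diarization error" mentioned above is usually reported as the diarization error rate (DER): missed speech, false-alarm speech, and speaker-confusion time, summed and divided by the total reference speech time. The following is a minimal sketch and is not taken from any of the papers listed here. The function name `frame_der` and the one-label-per-frame format are hypothetical. It assumes no overlapping speech and that hypothesis speaker labels have already been mapped to reference labels; real scoring tools search for the best mapping and often apply a forgiveness collar around segment boundaries.

```python
def frame_der(reference, hypothesis):
    """Frame-level diarization error rate (simplified, illustrative only).

    Each list holds one speaker label per frame; None marks non-speech.
    Assumption: hypothesis labels are already mapped to reference labels,
    and at most one speaker is active per frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1              # frame counts toward reference speech time
            if hyp is None:
                missed += 1          # speech in reference, silence in hypothesis
            elif hyp != ref:
                confusion += 1       # speech in both, but wrong speaker
        elif hyp is not None:
            false_alarm += 1         # silence in reference, speech in hypothesis

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example: 8 frames, two speakers "A" and "B".
reference = ["A", "A", "A", "B", "B", None, None, "B"]
hypothesis = ["A", "A", "B", "B", None, None, "A", "B"]
der = frame_der(reference, hypothesis)
print(f"DER = {der:.2f}")
```

In the toy example there is one confusion, one missed frame, and one false-alarm frame against six reference speech frames. Note that DER can exceed 1.0 when false alarms are frequent, since only reference speech time appears in the denominator.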

Sources

End-to-End Diarization utilizing Attractor Deep Clustering

Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models

Dissecting the Segmentation Model of End-to-End Diarization with Vector Clustering

Exploring Speaker Diarization with Mixture of Experts
