Research in speaker diarization is shifting toward more efficient approaches, with a focus on better speaker representations, robustness to varying conditions, and overall system performance. Recent work has explored novel architectures, such as conformer decoders and transformer-updated attractors, as well as text-based methods and mixture-of-experts models. These advances have reduced diarization error and point to the potential of multimodal and semantic-feature-based diarization. Noteworthy papers include:
- One that proposes a performant and compact diarization framework that integrates conformer decoders and transformer-updated attractors, achieving low diarization error while keeping the parameter count small.
- Another that presents a novel text-based approach to speaker diarization, leveraging sentence-level speaker change detection within dialogues and demonstrating competitive performance against state-of-the-art audio-based systems.
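
For readers unfamiliar with the diarization error figure these papers report, the following is a minimal sketch of a frame-level diarization error rate (DER): missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. It is a simplified illustration, not the scoring procedure of any paper above. It assumes one label per fixed-length frame, at most one speaker per frame (no overlapped speech), `None` for non-speech, and no forgiveness collar. The function name `frame_der` is hypothetical, and the brute-force speaker mapping is only practical for a handful of speakers.

```python
from collections import Counter
from itertools import permutations


def frame_der(ref, hyp):
    """Simplified frame-level DER (assumption: one speaker label or None per frame)."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")

    total_speech = sum(r is not None for r in ref)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    # Missed speech: reference has a speaker, hypothesis says non-speech.
    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))
    # False alarm: hypothesis has a speaker, reference says non-speech.
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))

    # Frames where both sides claim speech; count each (ref, hyp) label pair.
    pairs = Counter((r, h) for r, h in zip(ref, hyp)
                    if r is not None and h is not None)
    ref_spk = list({r for r, _ in pairs})
    hyp_spk = list({h for _, h in pairs})

    # Hypothesis labels are arbitrary, so search for the one-to-one mapping
    # between hypothesis and reference speakers that matches the most frames.
    best = 0
    if len(hyp_spk) <= len(ref_spk):
        for perm in permutations(ref_spk, len(hyp_spk)):
            best = max(best, sum(pairs[(r, h)] for r, h in zip(perm, hyp_spk)))
    else:
        for perm in permutations(hyp_spk, len(ref_spk)):
            best = max(best, sum(pairs[(r, h)] for r, h in zip(ref_spk, perm)))

    confusion = sum(pairs.values()) - best
    return (missed + false_alarm + confusion) / total_speech


# Toy example: 6 reference speech frames, 1 missed, 1 false alarm, 1 confused.
ref = ["A", "A", "A", "B", "B", None, None, "B"]
hyp = ["x", "x", "y", "y", "y", "y", None, None]
print(frame_der(ref, hyp))  # 0.5
```

Standard evaluations score time-stamped segments rather than fixed frames, and usually handle overlapped speech and a collar around segment boundaries, so numbers from this sketch are not directly comparable to published results.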