Advances in Speaker Recognition and Diarization

Research in speaker recognition and diarization is moving toward methods that remain accurate on difficult audio, including recordings with strong background noise, reverberation, and overlapping speech. Two directions stand out in the papers below: disentangling linguistic information from speaker information with prefix-tuned cross-attention, and handling overlapping speech in diarization through overlapping community detection with graph attention networks and label propagation. The papers report improvements in speaker recognition accuracy and reductions in diarization error rate. Multimodal input, particularly audio-visual signals, is also becoming increasingly important for achieving state-of-the-art results. Noteworthy papers include LASPA, which proposes a disentanglement learning strategy that integrates joint learning through prefix-tuned cross-attention, and CASA-Net, which introduces an embedding fusion method designed for end-to-end audio-visual speaker diarization systems.
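For readers unfamiliar with the diarization error rate mentioned above, the following is a minimal sketch of the usual frame-level definition: missed speech, false alarm, and speaker confusion, summed and divided by the total reference speech. It is not taken from any of the papers listed here. The function name and the toy labels are hypothetical, and the sketch assumes that hypothesis speaker labels have already been mapped to reference labels and that no forgiveness collar is applied. Each frame is represented as a set of active speakers so that overlapping speech is counted.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER. Each frame is a set of active speaker labels.

    Assumption: hypothesis labels are already mapped to reference labels
    (no optimal speaker mapping is computed here) and no collar is used.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)                    # speakers found in both
        missed += max(0, n_ref - n_hyp)               # reference speakers with no counterpart
        false_alarm += max(0, n_hyp - n_ref)          # extra hypothesized speakers
        confusion += min(n_ref, n_hyp) - n_correct    # detected speech, wrong speaker
        total += n_ref                                # total reference speaker-frames

    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Toy example (hypothetical labels); the third frame contains overlapping speech.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set(), {"B"}]
hypothesis = [{"A"}, {"B"}, {"A"}, {"B"}, {"A"}, set()]

der = diarization_error_rate(reference, hypothesis)
print(f"DER = {der:.3f}")
```

In this toy example there are two missed speaker-frames, one false alarm, and one confusion over six reference speaker-frames. Because the error is normalized by reference speech rather than by the number of frames, DER can exceed 1 when a system produces many false alarms.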

Sources

Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge

LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention

Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm

Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge
