Speaker recognition and diarization research is moving toward methods that remain accurate and robust on difficult audio, including recordings with strong background noise, reverberation, and overlapping speech. Researchers are exploring new ways to disentangle linguistic and speaker information, such as prefix-tuned cross-attention and graph attention networks. These approaches are reported to improve speaker recognition accuracy and reduce diarization error rates. Multimodal information, particularly audio-visual signals, is becoming increasingly important for achieving state-of-the-art results. Noteworthy papers include LASPA, which proposes a disentanglement learning strategy that integrates joint learning through prefix-tuned cross-attention, and CASA-Net, which introduces an embedding fusion method designed for end-to-end audio-visual speaker diarization systems. Both illustrate the potential of such methods to deliver significant performance gains.
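For readers less familiar with the diarization error rate (DER) mentioned above, the following is a minimal sketch of how the metric is commonly defined: missed speech, false alarm, and speaker confusion time divided by total reference speech time. It is a simplified illustration and not the evaluation protocol of LASPA or CASA-Net. The function name `frame_level_der` and the toy labels are hypothetical, and the sketch assumes fixed-length frames, at most one speaker per frame (no overlap), hypothesis labels already mapped to reference labels, and no forgiveness collar.

```python
from typing import Optional, Sequence


def frame_level_der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """Simplified frame-level diarization error rate.

    Each element is a speaker label for one frame, or None for non-speech.
    Assumes hypothesis labels are already mapped to reference labels and
    that no frame contains overlapping speakers (illustrative assumptions).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # speech in reference, silence in hypothesis
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # silence in reference, speech in hypothesis

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example: 4 reference speech frames, with one confusion,
# one missed frame, and one false alarm.
ref = ["A", "A", "B", "B", None, None]
hyp = ["A", "B", "B", None, "A", None]
der = frame_level_der(ref, hyp)
print(der)  # 0.75
```

Standard scoring tools additionally search for the best mapping between hypothesis and reference speakers and handle overlapping speech, which this sketch deliberately omits.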