Speech Processing Innovations

Speech processing is moving toward more integrated and robust approaches, with a focus on end-to-end models that jointly perform several tasks, such as speaker diarization, recognition, and separation. These models are designed to handle real-world scenarios with varying numbers of speakers, noise levels, and speaker registration conditions. Noteworthy papers include SpeakerLM, which introduces a unified multimodal large language model for speaker diarization and recognition, and Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling, which presents a new approach to training speech separation and diarization simultaneously by automatically identifying target speaker embeddings. In addition, Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation proposes a co-training-based multi-task learning framework for sound event detection and counting, and Advances in Speech Separation provides a comprehensive survey of DNN-based speech separation techniques.

Sources

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation

Multi-Target Backdoor Attacks Against Speaker Recognition

Advances in Speech Separation: Techniques, Challenges, and Future Trends
