Speech Processing Innovations

Speech processing is moving toward more integrated and robust approaches, with a focus on end-to-end models that jointly perform tasks such as speaker diarization, recognition, and separation. These models are designed to handle real-world conditions: varying numbers of speakers, varying noise levels, and differing speaker registration conditions. Noteworthy papers include SpeakerLM, which introduces a unified multimodal large language model for speaker diarization and recognition, and Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling, which presents an approach for training speech separation and diarization simultaneously by automatically identifying target speaker embeddings. In addition, Noise-Robust Sound Event Detection and Counting via Language-Queried Sound Separation proposes a co-training-based multi-task learning framework for sound event detection and counting, and Advances in Speech Separation provides a comprehensive survey of DNN-based speech separation techniques.
