Advances in Speech Recognition and Processing

Speech recognition and processing is advancing rapidly, driven by innovations in deep learning and large language models (LLMs). One key trend is the development of more efficient and effective architectures, including dynamic thinking mechanisms and end-to-end approaches. These models learn complex patterns in speech data and are reported to achieve state-of-the-art performance across tasks such as speech recognition, speech synthesis, and speech translation. A second line of research applies speech technology to real-world problems, including hearing assessment, speech rehabilitation, and emotion recognition. Noteworthy papers include SALMONN-omni, which introduces a standalone speech LLM for full-duplex conversation without codec injection; UniTTS, which proposes an end-to-end TTS system that does not decouple acoustic and semantic information; and Reverse-Speech-Finder, which introduces a neural network backtracking architecture for generating Alzheimer's disease speech samples and improving diagnosis performance.

Sources

SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Improving endpoint detection in end-to-end streaming ASR for conversational speech

An End-to-End Approach for Child Reading Assessment in the Xhosa Language

LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context

UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

Reverse-Speech-Finder: A Neural Network Backtracking Architecture for Generating Alzheimer's Disease Speech Samples and Improving Diagnosis Performance

Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

Toward Optimal ANC: Establishing Mutual Information Lower Bound

GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task

Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection

Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR

Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR

Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis

Effective Context in Neural Speech Models

Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency

Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation

Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition

The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence

Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection