Advances in Speech Translation and Audio Coding

Recent work in speech processing centers on three areas: speech translation, audio coding, and speech recognition. Researchers are exploring new architectures and training techniques to make these systems more efficient and accurate. In speech translation, there is growing interest in how models handle difficult linguistic phenomena, such as idioms, and in how they encode and translate speaker gender. In audio coding, high-fidelity neural audio codecs are improving the compression of both speech and music.

Noteworthy papers in this area include the following. HENT-SRT proposes a framework for joint speech recognition and translation built on a hierarchical efficient neural transducer with self-distillation. SwitchCodec introduces a high-fidelity neural audio codec with sparse quantization, reporting improved performance while maintaining low latency. MFLA presents a prefix-to-prefix training framework for streaming speech recognition that uses monotonic finite look-ahead attention.

Sources

BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation System

Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec

SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization

Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem

Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding

HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation

Different Speech Translation Models Encode and Translate Speaker Gender Differently

On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs

Towards a Japanese Full-duplex Spoken Dialogue System

It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems

Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments

MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition

Conformer-based Ultrasound-to-Speech Conversion

From Spikes to Speech: NeuroVoc -- A Biologically Plausible Vocoder Framework for Auditory Perception and Cochlear Implant Simulation