Advancements in Audio Processing and Speech Recognition

Research in audio processing and speech recognition is moving towards more efficient and robust models. Recent work focuses on restoring degraded audio recordings, improving speech recognition systems, and applying audio processing techniques to new settings. Notable advancements include compact single-stage models for vocal restoration, multi-scale alignment methods for non-autoregressive speech recognition, and neural audio codecs for spatial audio. These innovations could benefit downstream audio tasks such as speech enhancement, source separation, and audio generation.

Noteworthy papers in this area include:

Smule Renaissance Small presents a compact single-stage model for vocal restoration that outperforms strong baselines on the DNS 5 Challenge.

M-CIF proposes a multi-scale alignment method for non-autoregressive speech recognition that reduces word error rates on several datasets.

FOA Tokenizer introduces a neural audio codec for first-order ambisonics that preserves directional cues in reconstructed signals.

POWSM presents a unified framework for phonetic tasks such as automatic speech recognition and phone recognition.

Explainable Disentanglement proposes a method for disentangling semantic speech content from background noise in discrete speech representations.

Efficient Vocal Source Separation replaces full temporal self-attention with windowed sink attention to reduce computational costs.

UniTok-Audio proposes a unified audio generation framework based on generative modeling of discrete codec tokens.
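To make the cost argument behind windowed sink attention more concrete, here is a minimal sketch of one common way to build such an attention mask. It is not taken from the Efficient Vocal Source Separation paper. It assumes a symmetric (non-causal) local window plus a few always-visible "sink" positions at the start of the sequence. The function names and the parameters `window` and `num_sinks` are hypothetical and chosen for illustration only.

```python
def windowed_sink_mask(seq_len, window, num_sinks):
    """Build a boolean attention mask (illustrative assumption, not the paper's code).

    mask[i][j] is True when query position i may attend to key position j.
    A key is visible if it lies within `window` steps of the query, or if it
    is one of the first `num_sinks` positions (the assumed "sink" tokens).
    """
    if seq_len < 0 or window < 0 or num_sinks < 0:
        raise ValueError("seq_len, window and num_sinks must be non-negative")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_window = abs(i - j) <= window  # local neighbourhood
            is_sink = j < num_sinks           # globally visible position
            row.append(in_window or is_sink)
        mask.append(row)
    return mask


def count_allowed(mask):
    """Count the query-key pairs that the mask keeps."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = windowed_sink_mask(seq_len=8, window=1, num_sinks=1)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
    print(count_allowed(mask), "of", 8 * 8, "pairs kept")
```

Under these assumptions, the number of kept pairs grows roughly linearly with sequence length (about `seq_len * (2 * window + 1 + num_sinks)`), whereas full self-attention grows quadratically. That is the general source of the computational savings, though the paper's actual windowing and sink design may differ.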

Sources

Smule Renaissance Small: Efficient General-Purpose Vocal Restoration

M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR

FOA Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss

Low-Resource Audio Codec (LRAC): 2025 Challenge Description

Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

Efficient Vocal Source Separation Through Windowed Sink Attention

Modeling strategies for speech enhancement in the latent space of a neural audio codec

UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens
