The field of audio processing and speech recognition is moving toward more efficient and robust models. Recent work has focused on improving the quality of audio recordings, strengthening speech recognition systems, and opening new applications for audio processing techniques. Notable advances include compact single-stage models for vocal restoration, multi-scale alignment methods for non-autoregressive speech recognition, and neural audio codecs for spatial audio. These innovations stand to improve a range of downstream audio tasks, such as speech enhancement, source separation, and audio generation.
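To make the spatial-audio direction concrete: first-order ambisonics (FOA) represents a sound field with four channels (W, X, Y, Z), and a mono source can be panned into that format with standard trigonometric gains. The sketch below uses the classic B-format convention (W attenuated by 1/√2); it illustrates the representation such codecs operate on, not any particular paper's model.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Pan a mono signal into first-order ambisonics (B-format).

    Classic FOA panning gains (FuMa-style convention, used here for
    illustration only):
      W = s / sqrt(2)                    # omnidirectional component
      X = s * cos(azimuth) * cos(elevation)
      Y = s * sin(azimuth) * cos(elevation)
      Z = s * sin(elevation)
    Angles are in radians; returns an array of shape (4, n_samples).
    """
    s = np.asarray(mono, dtype=float)
    w = s / np.sqrt(2.0)
    x = s * np.cos(azimuth) * np.cos(elevation)
    y = s * np.sin(azimuth) * np.cos(elevation)
    z = s * np.sin(elevation)
    return np.stack([w, x, y, z])

# A source straight ahead (azimuth 0, elevation 0) lands entirely on X.
foa = encode_foa(np.ones(8), azimuth=0.0, elevation=0.0)
```

Preserving the relative levels and signs of these four channels through compression is exactly the "directional cues" challenge that a neural FOA codec must solve.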
Some noteworthy papers in this area include:

- Smule Renaissance Small presents a compact single-stage model for vocal restoration that outperforms strong baselines on the DNS 5 Challenge.
- M-CIF proposes a multi-scale alignment method for non-autoregressive speech recognition that reduces word error rates on several datasets.
- FOA Tokenizer introduces a neural audio codec for first-order ambisonics that preserves directional cues in reconstructed signals.
- POWSM presents a unified framework for phonetic tasks such as automatic speech recognition and phone recognition.
- Explainable Disentanglement proposes a method for disentangling semantic speech content from background noise in discrete speech representations.
- Efficient Vocal Source Separation replaces full temporal self-attention with windowed sink attention to reduce computational cost.
- UniTok-Audio proposes a unified audio generation framework via generative modeling on discrete codec tokens.
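The windowed-sink-attention idea mentioned above can be sketched generically: each query attends only to keys within a local temporal window, plus a small fixed set of "sink" positions at the start of the sequence that every query may see. The code below is a minimal NumPy illustration of that masking pattern, not the paper's implementation; it still materializes the full score matrix for clarity, whereas an efficient version would compute only the permitted blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_sink_attention(q, k, v, window=4, n_sink=2):
    """Self-attention restricted to a local window plus global sink tokens.

    q, k, v: arrays of shape (T, d).
    A query at position i may attend to key j if |i - j| <= window // 2,
    or if j < n_sink (the sink positions, visible to every query).
    This is a didactic sketch of the general technique; masking details
    in the actual model may differ.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T)
    idx = np.arange(T)
    local = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    sink = idx[None, :] < n_sink
    mask = local | sink
    scores = np.where(mask, scores, -np.inf)           # block disallowed pairs
    return softmax(scores, axis=-1) @ v                # (T, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = windowed_sink_attention(q, k, v)                 # shape (16, 8)
```

The payoff is asymptotic: full attention over T frames costs O(T²), while a fixed window plus a constant number of sinks brings the number of attended pairs down to O(T), which matters for long audio sequences.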