Audio processing and generation research is advancing rapidly, with recent work targeting speech enhancement, audio compression, and audio generation. Deep learning techniques, including transformer-based architectures and autoregressive models, continue to improve both the quality and the efficiency of these tasks. In particular, latent bridge models and multi-channel autoregression on spectrograms have shown promising results for speech restoration and audio generation, while dynamic and unified single-codebook neural audio codecs have raised the performance of audio compression and reconstruction. Together, these advances stand to benefit applications such as speech recognition, music processing, and audio post-production.
Noteworthy papers include:
- VoiceBridge: a latent bridge model for general speech restoration at scale, achieving superior performance across a range of tasks and datasets.
- MARS: a framework for audio generation via multi-channel autoregression on spectrograms, demonstrating efficient and scalable high-fidelity audio generation.
- FlexiCodec: a dynamic neural audio codec for low frame rates, improving semantic preservation and audio reconstruction quality.
- MelCap: a unified single-codebook neural codec for high-fidelity audio compression, achieving performance comparable to state-of-the-art multi-codebook codecs.