Advances in Audio Intelligence and Multimodal Interaction

The field of audio intelligence and multimodal interaction is advancing rapidly toward more capable, human-like models. Recent research has applied large audio-language models, discrete diffusion modeling, and dynamic parameter memory to improve speech emotion recognition, audio inpainting, and multimodal understanding, yielding significant gains in audio classification, speech recognition, and emotion recognition. Notably, new benchmarks such as ProactiveBench and MultiVox enable more comprehensive evaluation of multimodal models, while advances in audio coding and quantization allow more efficient compression and transmission of audio data. Overall, the field is moving toward integrated multimodal approaches that support more natural, human-like interaction between humans and machines.
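
To make the quantization trend concrete, below is a minimal sketch of residual vector quantization, the general idea behind residually quantized speech representations: each stage encodes only the residual left by the previous stage, so a frame is transmitted as a short list of codebook indices. The function name and the random toy codebooks are illustrative assumptions, not taken from any paper listed here.

    import numpy as np

    def residual_vector_quantize(frame, codebooks):
        # Each stage quantizes the residual left by the previous stage,
        # so later codebooks capture progressively finer detail.
        residual = frame.copy()
        codes = []
        for codebook in codebooks:                      # each codebook: (K, D)
            idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
            codes.append(idx)                           # transmit only this index
            residual = residual - codebook[idx]         # pass the residual onward
        return codes, frame - residual                  # indices + reconstruction

    # Toy usage: three stages of 256 codewords over a 16-dim feature frame.
    rng = np.random.default_rng(0)
    codebooks = [rng.standard_normal((256, 16)) for _ in range(3)]
    frame = rng.standard_normal(16)
    codes, recon = residual_vector_quantize(frame, codebooks)
    print(codes, float(np.linalg.norm(frame - recon)))  # error shrinks per stage

Stacking stages this way is what lets codecs trade bitrate for fidelity simply by keeping more or fewer codebooks.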

Noteworthy papers include:

Audio Flamingo 3, a fully open state-of-the-art large audio-language model that advances reasoning and understanding across speech, sound, and music.

Audio Inpainting using Discrete Diffusion Model, a novel inpainting method based on discrete diffusion modeling that operates over tokenized audio representations; a generic sketch of this idea follows the list.

ProactiveBench, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction.

Voxtral, two multimodal audio chat models that achieve state-of-the-art performance across a diverse range of audio benchmarks.
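
As a rough illustration of inpainting over tokenized audio, the sketch below runs a generic absorbing-state discrete diffusion loop: gap tokens start masked, and each reverse step proposes tokens for the masked positions and then re-masks a shrinking random subset, filling the gap coarse-to-fine while the surrounding context stays fixed. The predict_logits stand-in, the greedy sampling, and the linear re-masking schedule are assumptions for illustration; the paper's actual model and schedule may differ.

    import numpy as np

    MASK = -1  # sentinel id for a masked (absorbed) token

    def inpaint_tokens(tokens, gap, predict_logits, steps=8, rng=None):
        # Positions in `gap` start masked; each reverse step lets the model
        # propose tokens for every masked position, then re-masks a shrinking
        # random subset, so the gap is resolved over several passes.
        rng = rng or np.random.default_rng()
        x = tokens.copy()
        x[gap] = MASK
        for t in reversed(range(steps)):
            masked = np.where(x == MASK)[0]
            if masked.size == 0:
                break
            logits = predict_logits(x)                  # (T, vocab) stand-in model
            x[masked] = logits[masked].argmax(axis=1)   # greedy proposals
            remask = rng.choice(masked, size=masked.size * t // steps,
                                replace=False)          # keep fewer positions open
            x[remask] = MASK
        return x

    # Toy usage with a random "model"; a real one conditions on the context.
    rng = np.random.default_rng(0)
    tokens = rng.integers(0, 1024, size=200)
    filled = inpaint_tokens(tokens, np.arange(80, 120),
                            lambda x: rng.standard_normal((x.size, 1024)), rng=rng)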

Sources

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Audio Inpainting using Discrete Diffusion Model

Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation

ProactiveBench: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models

AudioMAE++: Learning Better Masked Audio Representations with SwiGLU FFNs

MultiVox: Benchmarking Voice Assistants for Multimodal Interactions

Improving Neural Pitch Estimation with SWIPE Kernels

Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine

Autoregressive Speech Enhancement via Acoustic Tokens

Multi-Class-Token Transformer for Multitask Self-supervised Music Information Retrieval

Voxtral
