Advances in Multimodal Audio Understanding

The field of audio research is moving toward a more comprehensive understanding of multimodal audio that integrates acoustic and linguistic information. Recent studies have focused on large-scale datasets and models for processing and generating spatial audio, improving performance on tasks such as audio spatialization and sound event localization. There is also growing interest in the relationship between language and audio, with joint language-audio embedding models being evaluated for how well they capture perceptual dimensions of timbre.

Noteworthy papers include MRSAudio, which introduces a large-scale multimodal spatial audio dataset, and Do Audio LLMs Really LISTEN, or Just Transcribe?, which presents a controlled benchmark to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Other notable works include LSZone, which proposes a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation, and Beyond Discrete Categories, which introduces a continuous Valence-Arousal model for pet vocalization analysis. Together, these approaches are enabling more accurate and robust audio understanding and generation.
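As a rough illustration of the kind of evaluation described above, the sketch below scores audio clips against a text anchor in a joint embedding space and checks rank correlation with human timbre ratings. The encoders, file names, and ratings are hypothetical placeholders rather than the method of any cited paper; a real study would use an actual joint language-audio model (e.g., a CLAP-style encoder) and collected perceptual ratings.

```python
# Minimal sketch (assumptions throughout): testing whether a joint language-audio
# embedding space tracks a perceptual timbre dimension such as "brightness".
import numpy as np
from scipy.stats import spearmanr


def encode_text(prompt: str) -> np.ndarray:
    """Placeholder text encoder; a real system would call a joint language-audio model."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)


def encode_audio(path: str) -> np.ndarray:
    """Placeholder audio encoder; a real system would embed the actual waveform."""
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    return rng.standard_normal(512)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical stimuli: clip names paired with human brightness ratings (1-7 scale).
clips = ["tone_01.wav", "tone_02.wav", "tone_03.wav", "tone_04.wav"]
human_brightness = [2.1, 3.4, 5.0, 6.2]

# Score each clip against a text anchor describing the timbre dimension.
anchor = encode_text("a bright, sharp, metallic sound")
model_scores = [cosine(anchor, encode_audio(c)) for c in clips]

# Rank correlation between model similarities and human ratings indicates how well
# the embedding space captures this perceptual dimension (meaningless here, since
# the placeholder embeddings are random).
rho, p = spearmanr(model_scores, human_brightness)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```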
Sources
Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation