Bridging Modality Gaps in Audio-Text Research

The field of audio-text research is moving toward closing the modality gap between audio and text embeddings, enabling tighter coupling between multimodal encoders and large language models. Recent approaches include diffusion-based modality bridging, dual alignment of audio and language, and cross-modal attention mechanisms, with state-of-the-art results reported in automatic audio captioning, audio-to-visual generation, and audio-visual navigation. A minimal sketch of one such bridging pattern follows below.

Notable papers in this area include Diffusion-Link, which reduces the audio-text modality gap and achieves state-of-the-art results in automatic audio captioning; SeeingSounds, a lightweight, modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision; and UALM, which unifies audio understanding, text-to-audio generation, and multimodal reasoning in a single model, demonstrating cross-modal generative reasoning.
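To make the bridging idea concrete, the sketch below illustrates one common pattern for coupling an audio encoder to a language model: a learned projection that maps audio embeddings into the text embedding space, followed by cross-modal attention in which text tokens attend to the projected audio features. This is an illustrative assumption-based sketch, not the method of any of the cited papers; module names, dimensions, and the overall layout are hypothetical.

```python
# Minimal, illustrative sketch of an audio-text bridge: audio encoder outputs are
# projected into the text/LLM embedding space and fused via cross-attention.
# All names and dimensions are illustrative assumptions, not from the cited papers.
import torch
import torch.nn as nn


class AudioTextBridge(nn.Module):
    def __init__(self, audio_dim: int = 768, text_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Project audio embeddings into the text embedding space to narrow the modality gap.
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        # Text tokens query the projected audio features (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:  (batch, text_len, text_dim)   e.g. LLM input embeddings
        # audio_emb: (batch, audio_len, audio_dim) e.g. frozen audio encoder outputs
        audio_ctx = self.audio_proj(audio_emb)
        fused, _ = self.cross_attn(query=text_emb, key=audio_ctx, value=audio_ctx)
        # Residual connection preserves the original text representation.
        return self.norm(text_emb + fused)


if __name__ == "__main__":
    bridge = AudioTextBridge()
    text = torch.randn(2, 16, 1024)   # dummy text-token embeddings
    audio = torch.randn(2, 50, 768)   # dummy audio-frame embeddings
    print(bridge(text, audio).shape)  # torch.Size([2, 16, 1024])
```

Diffusion-based bridging, as in Diffusion-Link, instead treats the mapping between embedding distributions generatively rather than as a fixed projection; the attention-based fusion above is only one of the strategies surveyed here.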

Sources

Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

SeeingSounds: Learning Audio-to-Visual Alignment via Text

Audio-Guided Visual Perception for Audio-Visual Navigation

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Multi Agent Switching Mode Controller for Sound Source Localization
