Bridging Modality Gaps in Audio-Text Research

The field of audio-text research is moving toward closing the modality gap between audio and text embeddings, enabling tighter coupling between multimodal encoders and large language models. Recent approaches include diffusion-based modality bridging, dual alignment of audio and language, and cross-modal attention mechanisms, with state-of-the-art results reported in automatic audio captioning, audio-to-visual generation, and audio-visual navigation. A minimal sketch of one such bridging pattern follows below.

Notable papers in this area include Diffusion-Link, which reduces the audio-text modality gap and achieves state-of-the-art results in automatic audio captioning; SeeingSounds, a lightweight, modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision; and UALM, which unifies audio understanding, text-to-audio generation, and multimodal reasoning in a single model, demonstrating cross-modal generative reasoning.
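To make the bridging idea concrete, the sketch below illustrates one common pattern for coupling an audio encoder to a language model: a learned projection that maps audio embeddings into the text embedding space, followed by cross-modal attention in which text tokens attend to the projected audio features. This is an illustrative assumption-based sketch, not the method of any of the cited papers; module names, dimensions, and the overall layout are hypothetical.

```python
# Minimal, illustrative sketch of an audio-text bridge: audio encoder outputs are
# projected into the text/LLM embedding space and fused via cross-attention.
# All names and dimensions are illustrative assumptions, not from the cited papers.
import torch
import torch.nn as nn


class AudioTextBridge(nn.Module):
    def __init__(self, audio_dim: int = 768, text_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Project audio embeddings into the text embedding space to narrow the modality gap.
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        # Text tokens query the projected audio features (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:  (batch, text_len, text_dim)   e.g. LLM input embeddings
        # audio_emb: (batch, audio_len, audio_dim) e.g. frozen audio encoder outputs
        audio_ctx = self.audio_proj(audio_emb)
        fused, _ = self.cross_attn(query=text_emb, key=audio_ctx, value=audio_ctx)
        # Residual connection preserves the original text representation.
        return self.norm(text_emb + fused)


if __name__ == "__main__":
    bridge = AudioTextBridge()
    text = torch.randn(2, 16, 1024)   # dummy text-token embeddings
    audio = torch.randn(2, 50, 768)   # dummy audio-frame embeddings
    print(bridge(text, audio).shape)  # torch.Size([2, 16, 1024])
```

Diffusion-based bridging, as in Diffusion-Link, instead treats the mapping between embedding distributions generatively rather than as a fixed projection; the attention-based fusion above is only one of the strategies surveyed here.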

Sources

Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

SeeingSounds: Learning Audio-to-Visual Alignment via Text

Audio-Guided Visual Perception for Audio-Visual Navigation

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Multi Agent Switching Mode Controller for Sound Source Localization
