Advances in Multimodal Learning and Audio-Visual Understanding

Research in multimodal learning and audio-visual understanding is moving quickly, with a focus on more effective and efficient methods for integrating multiple sources of information. Recent work has explored latent space broadening and mid-fusion approaches to improve vision-language models, particularly when audio data is available. There is also growing interest in leveraging audio representations for vibration-based crowd monitoring and in developing more robust methods for multi-talker automatic speech recognition (ASR). Multimodal fusion techniques, such as weighted reciprocal rank fusion, have also shown promise for improving video retrieval systems. Noteworthy papers in this area include:

  • Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion, which proposes improving vision-language models through latent space broadening and audio enhancement.
  • Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition, which achieves state-of-the-art results in multi-talker ASR by decoupling the training of the speaker separation frontend and the ASR backend.
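
The weighted reciprocal rank fusion mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which each ranked list contributes w / (k + rank) to a document's score, with a smoothing constant k = 60. The function name `weighted_rrf`, the modality names, and the weights are hypothetical and chosen only for illustration; this is not the implementation from any of the listed papers.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse several ranked lists into one using weighted reciprocal rank fusion.

    rankings: dict mapping a source name to a list of item ids, best first.
    weights:  dict mapping a source name to a non-negative weight
              (sources missing from `weights` default to 1.0).
    k:        smoothing constant; 60 is a commonly used default (assumption).
    """
    scores = defaultdict(float)
    for source, ranked_ids in rankings.items():
        w = weights.get(source, 1.0)
        # Ranks are 1-based: the top item of each list contributes w / (k + 1).
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += w / (k + rank)
    # Highest fused score first; ties are broken by item id for determinism.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical per-modality rankings for a single video query.
rankings = {
    "text": ["v1", "v2", "v3"],
    "audio": ["v3", "v1"],
    "visual": ["v2", "v3", "v1"],
}
weights = {"text": 0.5, "audio": 0.2, "visual": 0.3}

fused = weighted_rrf(rankings, weights)
print(fused)
```

Because only ranks enter the score, the per-modality retrievers do not need comparable score scales; the weights express how much each modality is trusted.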

Sources

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Leveraging Audio Representations for Vibration-Based Crowd Monitoring in Stadiums

Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition

Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

Counting How the Seconds Count: Understanding Algorithm-User Interplay in TikTok via ML-driven Analysis of Video Content

MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion

VideoMix: Aggregating How-To Videos for Task-Oriented Learning

Hierarchical Label Propagation: A Model-Size-Dependent Performance Booster for AudioSet Tagging
