The field of multimodal learning and audio-visual understanding is evolving rapidly, with a focus on more effective and efficient ways to integrate multiple sources of information. Recent work explores latent space broadening and mid-fusion approaches to strengthen vision-language models, particularly when audio data is available. There is also growing interest in leveraging audio representations for vibration-based crowd monitoring and in building more robust multi-talker automatic speech recognition (ASR). Multimodal fusion techniques such as weighted reciprocal rank fusion (sketched after the paper list below) have likewise shown promise for video retrieval. Noteworthy papers in this area include:
- Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion, which combines latent space broadening with audio enhancement to improve vision-language model performance and expand high-quality training data.
- Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition, which achieves state-of-the-art results in multi-talker ASR by decoupling the training of the speaker separation frontend and the ASR backend.
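
Weighted reciprocal rank fusion combines the ranked lists produced by different retrievers (for example visual, audio, and text matchers) by scoring each candidate as a weighted sum of reciprocal ranks. The sketch below is a generic illustration of that idea, not the exact formulation used in any of the cited papers; the function name, the example weights, and the video identifiers are hypothetical, and `k=60` is simply the constant conventionally paired with reciprocal rank fusion.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse several ranked result lists with weighted reciprocal rank fusion.

    ranked_lists: list of lists, each ordered best-first (e.g. results from
                  visual, audio, and text retrievers).
    weights:      one weight per ranked list (higher = more trusted modality).
    k:            smoothing constant; 60 is the value commonly used with RRF.
    """
    scores = defaultdict(float)
    for results, w in zip(ranked_lists, weights):
        for rank, item in enumerate(results, start=1):
            # Each list contributes w / (k + rank) for every item it ranks.
            scores[item] += w / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    visual = ["vid3", "vid1", "vid7"]
    audio = ["vid1", "vid3", "vid9"]
    text = ["vid1", "vid7", "vid3"]
    # Hypothetical per-modality weights; a real system would tune these.
    print(weighted_rrf([visual, audio, text], weights=[1.0, 0.5, 0.8]))
```

Because each modality contributes only through ranks rather than raw scores, this kind of fusion avoids calibrating heterogeneous similarity scales, which is one reason it is attractive for video retrieval pipelines that mix visual, audio, and text evidence.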