Advancements in Audio-Visual Learning and Multimodal Interaction

The field of audio-visual learning and multimodal interaction is advancing rapidly, with a focus on improving alignment and synchronization between modalities. Recent models capture fine-grained temporal correspondences between audio and visual frames, which improves representation quality and transferability across tasks. Notable directions include contrastive learning, generative modeling, and large language models applied to audio-visual understanding, speech processing, and human-computer interaction.
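As a concrete illustration of the fine-grained alignment objective mentioned above, the sketch below shows a frame-level audio-visual contrastive (InfoNCE) loss. It is a minimal sketch under stated assumptions: the function name, tensor shapes, and temperature are illustrative, not the implementation of any specific paper.

```python
# Minimal sketch of a frame-level audio-visual contrastive (InfoNCE) loss.
# Assumes temporally aligned per-frame embeddings from an audio encoder and
# a video encoder; all names and hyperparameters here are illustrative.
import torch
import torch.nn.functional as F

def frame_contrastive_loss(audio_emb: torch.Tensor,
                           video_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (num_frames, dim) embeddings of the same clip,
    where row i of each tensor corresponds to the same time step."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                      # pairwise frame similarities
    targets = torch.arange(a.size(0), device=a.device)    # matching frames lie on the diagonal
    # Symmetric InfoNCE: audio-to-video and video-to-audio directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: 16 aligned frames with 512-dim embeddings (random placeholders)
loss = frame_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
```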

A key direction is building models that learn jointly from audio, vision, and text to produce more robust, generalizable representations. This has driven notable improvements in audio-visual speech recognition, lip synchronization, and audio source separation.

Another important area of research is efficient models for real-time audio-visual processing, which enable applications such as live speech translation, voice-controlled interfaces, and immersive multimedia experiences.

Several papers in this area stand out. CAV-MAE Sync proposes a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning, achieving state-of-the-art performance on zero-shot retrieval, classification, and localization. OpenAVS introduces a training-free, language-based approach to open-vocabulary audio-visual segmentation and reports significant gains in mIoU and F-score on benchmark datasets. Voila presents a family of large voice-language foundation models that support full-duplex, low-latency conversation while preserving rich vocal nuances, reaching a response latency of just 195 milliseconds.
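For context on the zero-shot retrieval setting these models are evaluated in, ranking typically reduces to cosine similarity in a shared embedding space. The sketch below assumes placeholder embeddings and a hypothetical retrieve_topk helper; it is not the evaluation code of any paper listed here.

```python
# Hedged sketch of zero-shot cross-modal retrieval: rank video clips by cosine
# similarity to an audio query in a shared embedding space. The embeddings
# would come from a pretrained audio-visual model; here they are placeholders.
import torch
import torch.nn.functional as F

def retrieve_topk(query_audio_emb: torch.Tensor,
                  video_embs: torch.Tensor,
                  k: int = 5):
    """query_audio_emb: (dim,) audio query; video_embs: (num_clips, dim)."""
    q = F.normalize(query_audio_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ q                      # cosine similarity of each clip to the query
    topk = torch.topk(scores, k)
    return topk.indices.tolist(), topk.values.tolist()

# Example with random placeholder embeddings for 100 candidate clips
indices, scores = retrieve_topk(torch.randn(512), torch.randn(100, 512), k=5)
```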

Sources

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting

Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network

LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

BLAB: Brutally Long Audio Bench

CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation

The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond

FLAM: Frame-Wise Language-Audio Modeling

Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
