Multimodal Learning and Vision-Language Models

The field of multimodal learning is moving toward tighter integration of visual and linguistic information, with particular attention to non-manual cues, modal asymmetry, and joint optimization. Recent work shows that incorporating mouthing cues, modeling modal asymmetry, and applying hierarchical contrastive learning can substantially improve sign language translation and vision-language models. Novel architectures and training methods, such as Mixture-of-Experts designs and Decoupled Proxy Alignment, have been proposed to address remaining challenges, including language prior conflict and inter- and intra-task optimization conflicts.

Noteworthy papers include:

SignClip proposes a sign language translation framework that fuses manual and non-manual (mouthing) cues through multimodal contrastive fusion and achieves state-of-the-art results on benchmark datasets (a minimal sketch of such a fusion objective appears after this list).

AsyMoE introduces an architecture that explicitly models modal asymmetry in large vision-language models and reports significant accuracy gains over existing Mixture-of-Experts approaches (a generic routing sketch also follows this list).

MEJO proposes an MLLM-engaged framework for surgical triplet recognition that jointly optimizes across and within tasks and achieves superior results on benchmark datasets.

LLM-JEPA brings Joint Embedding Predictive Architectures to large language models and reports significant performance gains across numerous datasets and models.

Decoupled Proxy Alignment proposes a training method that mitigates language prior conflict during multimodal alignment and achieves superior alignment performance across diverse datasets and models.

EchoVLM proposes a dynamic Mixture-of-Experts vision-language model designed for ultrasound medical imaging and reports significant improvements in diagnostic accuracy.
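
As a rough illustration of the kind of contrastive fusion objective SignClip's title describes, the sketch below pairs a fused sign embedding (manual plus mouthing cues) with a text embedding under a symmetric InfoNCE loss. The module names, feature dimensions, placeholder linear encoders, and concatenation-based fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of CLIP-style multimodal contrastive fusion, loosely in the
# spirit of fusing manual and non-manual (mouthing) cues for sign language
# translation. All shapes and modules below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveFusion(nn.Module):
    def __init__(self, manual_dim=512, mouth_dim=256, text_dim=768, embed_dim=256):
        super().__init__()
        # Placeholder projections standing in for real manual-cue, mouthing-cue,
        # and text encoders (e.g. pose and lip-region backbones, a text model).
        self.manual_proj = nn.Linear(manual_dim, embed_dim)
        self.mouth_proj = nn.Linear(mouth_dim, embed_dim)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)  # fuse the two visual streams
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature

    def forward(self, manual_feat, mouth_feat, text_feat):
        # Fuse manual and mouthing cues into a single sign embedding.
        sign = self.fuse(torch.cat([self.manual_proj(manual_feat),
                                    self.mouth_proj(mouth_feat)], dim=-1))
        text = self.text_proj(text_feat)
        sign = F.normalize(sign, dim=-1)
        text = F.normalize(text, dim=-1)

        # Symmetric InfoNCE: matched sign/text pairs lie on the diagonal.
        logits = self.logit_scale.exp() * (sign @ text.t())
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = ContrastiveFusion()
    loss = model(torch.randn(8, 512),   # manual-cue features
                 torch.randn(8, 256),   # mouthing-cue features
                 torch.randn(8, 768))   # text embeddings of translations
    print(loss.item())
```

In practice the linear projections would be replaced by real visual and text encoders, and a hierarchical variant would apply this kind of loss at more than one granularity.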
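
Since two of the papers above build on Mixture-of-Experts layers, the second sketch shows the generic mechanism they start from: a learned gate routes each token to its top-k experts and mixes their outputs. The expert widths, gate, and top-2 routing here are generic placeholders and do not reflect AsyMoE's asymmetry modeling or EchoVLM's dynamic routing.

```python
# Minimal sketch of top-k expert routing, the basic mechanism behind
# Mixture-of-Experts layers. Shapes and routing choices are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # router over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (tokens, dim)
        scores = self.gate(x)                            # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)         # renormalize over selected experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, weighted by the gate.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE()
    tokens = torch.randn(16, 256)   # e.g. a batch of visual/text tokens
    print(layer(tokens).shape)      # torch.Size([16, 256])
```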

Sources

SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization

LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures

Decoupled Proxy Alignment: Mitigating Language Prior Conflict for Multimodal Alignment in MLLM

EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
