Advances in Multimodal Learning

The field of multimodal learning is moving towards a more unified understanding of how representations relate across modalities. Recent studies have examined the cross-modal consistency of any-to-any models and the difficulty of injecting evolving knowledge into large language and multimodal models. Any-to-any models do not consistently show greater cross-modal consistency than specialist models, although structured analyses of their intermediate latent spaces reveal weak but observable consistency. Injecting evolving knowledge into multimodal models also remains challenging: existing methods perform poorly and supervised fine-tuning causes catastrophic forgetting, whereas text knowledge augmentation and continual learning methods help mitigate these problems. Notable papers include Seeing What Tastes Good, which investigates how well large-scale models represent semantic feature norms of concrete object concepts, and Quantifying Cross-Modality Memorization in Vision-Language Models, which presents a systematic study of cross-modality memorization in vision-language models and proposes a baseline method to mitigate it.
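As a rough illustration of what a latent-space consistency analysis can look like, the sketch below checks whether paired items from two modalities retrieve each other as nearest neighbours in a shared embedding space. This is a minimal, generic example, not the protocol used in any of the cited papers, and the embeddings are random placeholders standing in for real model outputs.

```python
# Minimal sketch of a cross-modal consistency probe (illustrative only):
# check whether paired inputs from two modalities land close together in a
# shared latent space by measuring nearest-neighbour retrieval agreement.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 64
image_latents = rng.normal(size=(n, d))                        # hypothetical image embeddings
text_latents = image_latents + 0.5 * rng.normal(size=(n, d))   # noisy "paired" text embeddings


def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieval_consistency(a, b):
    """Fraction of items whose nearest neighbour in the other modality is their true pair."""
    sims = normalize(a) @ normalize(b).T  # cosine similarity matrix
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(a))))


print(f"image->text retrieval consistency: {retrieval_consistency(image_latents, text_latents):.2f}")
print(f"text->image retrieval consistency: {retrieval_consistency(text_latents, image_latents):.2f}")
```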

Sources

Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?

When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways

Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era

The mutual exclusivity bias of bilingual visually grounded speech models

Quantifying Cross-Modality Memorization in Vision-Language Models

Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation
