Multimodal Learning Advancements

The field of multimodal learning is moving toward better alignment and representation of different modalities, such as vision and language. Recent developments focus on making multimodal large language models (MLLMs) more robust and effective across tasks, including visual instruction understanding and aspect-based sentiment analysis. Notable advancements include the use of machine teaching to assess how difficult concepts are to teach to MLLMs, as well as new approaches that maximize cross-modal mutual information and prevent modality collapse. These methods have shown significant improvements in performance and efficiency, pointing to new directions for multimodal research. Some particularly noteworthy papers include:

  • VISTA, which enhances vision-text alignment in MLLMs via cross-modal mutual information maximization (a contrastive sketch of this idea follows the list).
  • Visual Instruction Bottleneck Tuning, which improves the robustness of MLLMs under distribution shifts.
  • RepBlend, a multimodal dataset distillation (MDD) framework that alleviates modality collapse through representation blending and symmetric projection trajectory matching (see the second sketch below).
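
As a rough illustration of cross-modal mutual information maximization in the spirit of VISTA, the sketch below computes a symmetric InfoNCE-style contrastive loss over paired vision and text embeddings; InfoNCE is a standard lower bound on mutual information, and the embedding dimension, batch size, and temperature here are placeholder assumptions rather than details taken from the paper.

    import torch
    import torch.nn.functional as F

    def cross_modal_infonce(vision_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of paired vision/text embeddings.

        Pulling matched pairs together and pushing mismatched pairs apart
        maximizes a lower bound on the mutual information between modalities.
        """
        # L2-normalize so dot products are cosine similarities.
        v = F.normalize(vision_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)

        # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / tau.
        logits = v @ t.T / temperature
        targets = torch.arange(v.size(0), device=v.device)

        # Cross-entropy in both directions (image-to-text and text-to-image).
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.T, targets)
        return 0.5 * (loss_i2t + loss_t2i)

    # Example usage with random features standing in for encoder outputs.
    vision_emb = torch.randn(8, 512)
    text_emb = torch.randn(8, 512)
    loss = cross_modal_infonce(vision_emb, text_emb)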

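Modality collapse in multimodal dataset distillation refers to the distilled data over-representing one modality at the expense of the other. As a loose, hypothetical illustration of the representation blending idea (not RepBlend's actual procedure), the sketch below mixes paired vision and text features so that each modality's representation retains information from the other.

    import torch

    def blend_representations(vision_feats, text_feats, alpha=0.5):
        """Convex blend of paired vision/text features.

        Injecting a fraction of each modality's features into the other keeps
        cross-modal information in both representations instead of letting the
        distilled data collapse onto a single modality.
        """
        blended_vision = alpha * vision_feats + (1.0 - alpha) * text_feats
        blended_text = alpha * text_feats + (1.0 - alpha) * vision_feats
        return blended_vision, blended_text
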
Sources

Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization

Visual Instruction Bottleneck Tuning

Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales

Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation

Contrastive Learning-Enhanced Trajectory Matching for Small-Scale Dataset Distillation
