The field of multimodal learning is moving toward better alignment and representation of different modalities, such as vision and language. Recent work focuses on improving the robustness and effectiveness of multimodal large language models (MLLMs) across tasks including visual instruction understanding and aspect-based sentiment analysis. Notable advances include the use of machine teaching to quantify how difficult MLLMs are to teach, as well as new approaches that maximize cross-modal mutual information and prevent modality collapse. These methods report gains in both performance and efficiency, pointing to new directions for multimodal research. Some particularly noteworthy papers include:
- VISTA, which enhances vision-text alignment in MLLMs via cross-modal mutual information maximization (a generic sketch of this kind of objective appears after this list).
- Visual Instruction Bottleneck Tuning, which improves the robustness of MLLMs under distribution shifts.
- RepBlend, a novel MDD framework that alleviates modality collapse through representation blending and symmetric projection trajectory matching.
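
As a point of reference, cross-modal mutual information maximization is commonly approximated with a contrastive (InfoNCE-style) objective over paired vision and text embeddings. The sketch below is a generic illustration of that idea under those assumptions, not the actual VISTA objective; the function name, tensor shapes, and temperature value are placeholders.

```python
# Minimal sketch of a cross-modal mutual-information objective (InfoNCE-style
# contrastive loss) for vision-text alignment. Not taken from any of the papers
# above; names, shapes, and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def cross_modal_infonce(vision_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss, a standard lower bound on cross-modal mutual
    information. vision_emb and text_emb are (batch, dim); matching rows are
    positive pairs."""
    # L2-normalize so the dot product is a cosine similarity.
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the vision-to-text and text-to-vision cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Random features standing in for encoder outputs.
    vision = torch.randn(8, 512)
    text = torch.randn(8, 512)
    print(cross_modal_infonce(vision, text).item())
```

Maximizing this bound pulls matched image-text pairs together and pushes mismatched pairs apart, which is one common way alignment objectives of this kind are instantiated in practice.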