Multimodal Learning Advancements

Research in multimodal learning centers on more effective methods for aligning and integrating modalities such as vision, language, and audio. Recent work emphasizes closing the modality gap and making multimodal models more robust to spurious correlations and group imbalance. Notable contributions include new loss formulations such as Generalized Contrastive Learning, and new evaluation metrics such as MAJORScore, which assesses the joint relevance of multiple modalities. Other approaches, including Hierarchical Representation Matching and the Semantic Modality Bridge (SeMoBridge), target tasks such as few-shot learning and cross-modal retrieval. Particularly noteworthy are Cross-Modal Retrieval with Cauchy-Schwarz Divergence, which introduces a hyperparameter-free divergence measure that improves training stability and retrieval performance, and VT-FSL, a few-shot learning framework that integrates vision and text through geometry-aware alignment. A minimal sketch of the underlying alignment setup is given below.
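To make the alignment theme concrete, the sketch below shows a standard symmetric contrastive (CLIP-style InfoNCE) loss for paired image and text embeddings, together with a simple modality-gap estimate computed as the distance between the two modality centroids. This is a minimal illustration of the general technique, assuming a generic setup; it is not the formulation of any paper listed below, and the tensor shapes, temperature value, and function names are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not any cited paper's method):
# symmetric CLIP-style InfoNCE loss + a simple modality-gap diagnostic.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) cosine-similarity logits
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)


def modality_gap(img_emb, txt_emb):
    """Distance between the centroids of the unit-normalized modalities."""
    img_center = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (img_center - txt_center).norm().item()


if __name__ == "__main__":
    B, D = 8, 512                                     # illustrative batch and embedding sizes
    img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
    print("contrastive loss:", clip_contrastive_loss(img_emb, txt_emb).item())
    print("modality gap:", modality_gap(img_emb, txt_emb))
```

The modality-gap statistic is a common diagnostic for how far apart the image and text embedding clouds sit on the unit sphere; several of the papers below, such as A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity, explore alignment objectives that go beyond the plain cosine-similarity logits used in this sketch.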

Sources

Cross-Modal Retrieval with Cauchy-Schwarz Divergence

MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation

Hierarchical Representation Matching for CLIP-based Class-Incremental Learning

GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning

Does Weak-to-strong Generalization Happen under Spurious Correlations?

Semantic Compression via Multimodal Representation Learning

A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Generalized Contrastive Learning for Universal Multimodal Retrieval

MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification

SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition

microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
