Multimodal Learning with Limited Data

The field of multimodal learning is shifting toward models that perform well with limited paired data. Researchers are exploring methods to align pretrained unimodal foundation models, reducing the need for large amounts of labeled data. One approach uses regularization that preserves the neighborhood geometry of the unimodal encoders' latent spaces during alignment. Another direction develops adaptive frameworks that combine cross-modal distillation with policy learning, enabling efficient inference across tasks. In addition, knowledge distillation frameworks are being proposed to address modality imbalance when optimizing multimodal models. Noteworthy papers in this area include:

  • A work that introduces STRUCTURE, a regularization technique that preserves the neighborhood geometry of the pretrained unimodal encoders' latent spaces, enabling high-quality alignment with limited paired data and yielding substantial gains in zero-shot classification and retrieval (a hedged sketch of such a regularizer follows this list).
  • EgoAdapt, a framework that adaptively combines cross-modal distillation with policy learning for efficient egocentric perception, significantly improving efficiency while maintaining performance (see the gating sketch below).
  • G2D, a knowledge distillation framework that optimizes multimodal models with a custom-built loss function and dynamic sequential modality prioritization, amplifying the significance of weak modalities and outperforming state-of-the-art methods (see the prioritization sketch below).
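
The idea of preserving neighborhood geometry during alignment can be illustrated with a regularizer that keeps pairwise similarities of projected embeddings close to those of the frozen unimodal embeddings, added to a CLIP-style contrastive term. The sketch below is a minimal illustration of that idea, not the STRUCTURE paper's exact formulation; the function names, the 0.07 temperature, and the 0.1 regularization weight are assumptions.

```python
import torch
import torch.nn.functional as F

def structure_reg(frozen_emb: torch.Tensor, projected_emb: torch.Tensor) -> torch.Tensor:
    """Penalize changes in pairwise cosine-similarity structure (neighborhood
    geometry) between a frozen unimodal embedding and its projected version.
    Hypothetical regularizer illustrating the idea, not the published loss."""
    z0 = F.normalize(frozen_emb, dim=-1)
    z1 = F.normalize(projected_emb, dim=-1)
    sim0 = z0 @ z0.t()  # neighborhood structure before projection
    sim1 = z1 @ z1.t()  # neighborhood structure after projection
    return F.mse_loss(sim1, sim0)

def alignment_loss(img_frozen, txt_frozen, img_proj, txt_proj,
                   temperature=0.07, reg_weight=0.1):
    """CLIP-style contrastive alignment on a small paired batch, plus the
    geometry-preserving regularizer applied to each modality."""
    zi = F.normalize(img_proj, dim=-1)
    zt = F.normalize(txt_proj, dim=-1)
    logits = zi @ zt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2
    reg = structure_reg(img_frozen, img_proj) + structure_reg(txt_frozen, txt_proj)
    return contrastive + reg_weight * reg
```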
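
EgoAdapt's combination of cross-modal distillation and policy learning can be pictured as a lightweight policy that decides, per input, which modality encoders to run, while a student trained against a full multisensory teacher maintains accuracy. The gating sketch below is purely illustrative; `ModalityPolicy`, the straight-through gate, and the loss weights are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityPolicy(nn.Module):
    """Hypothetical policy head: from a cheap context feature, emit a binary
    gate per modality deciding whether its (expensive) encoder is executed."""
    def __init__(self, context_dim: int, num_modalities: int):
        super().__init__()
        self.net = nn.Linear(context_dim, num_modalities)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.net(context))
        hard = (probs > 0.5).float()
        # Straight-through estimator keeps the discrete gate differentiable.
        return hard + probs - probs.detach()

def distill_step(student_logits, teacher_logits, gates,
                 efficiency_weight=0.01, temperature=2.0):
    """Match the multisensory teacher via KL distillation while penalizing the
    number of activated modality encoders. Weights are illustrative."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    cost = gates.mean()  # fraction of encoders the policy chose to run
    return kd + efficiency_weight * cost
```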
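
G2D's dynamic sequential modality prioritization can be read as ranking modalities by how under-optimized they currently are and giving the weakest one the largest share of the update. The sketch below uses per-modality loss values as the ranking signal; the actual method is gradient-guided, so the criterion, weights, and function names here are assumptions rather than the published algorithm.

```python
import torch

def prioritize_weak_modality(per_modality_losses: dict) -> dict:
    """Hypothetical prioritization: modalities with higher current loss are
    treated as weaker and receive proportionally larger weights, so the joint
    update is dominated by the lagging modality."""
    losses = {m: loss.detach() for m, loss in per_modality_losses.items()}
    total = sum(losses.values())
    return {m: float(loss / (total + 1e-8)) for m, loss in losses.items()}

def combined_loss(per_modality_losses: dict) -> torch.Tensor:
    """Weighted sum of per-modality losses with the weak modality amplified."""
    weights = prioritize_weak_modality(per_modality_losses)
    return sum(weights[m] * loss for m, loss in per_modality_losses.items())
```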

Sources

With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You

EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

G²D: Boosting Multimodal Learning with Gradient-Guided Distillation