Research in multimodal representation learning is moving toward more effective methods for aligning and fusing representations across modalities. Recent work addresses heterogeneous category sets, character frequency distribution shifts in recognition tasks, and catastrophic forgetting in multimodal class-incremental learning. Notable advances include novel loss functions, adaptive fusion mechanisms, and multimodal pre-trained models that improve the accuracy and robustness of multimodal systems. Noteworthy papers include:
- Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach, which integrates cross-modal embeddings with single-modality embeddings via their Kronecker product (a minimal sketch of the core idea follows this list).
- Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition, which introduces a loss function incorporating the Wasserstein distance between character frequency distributions (a hedged sketch appears after this list).
- Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion, which explores multimodal class-incremental learning across vision, audio, and text modalities and proposes an adaptive audio-visual fusion module (a generic gated-fusion sketch closes this section).
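
The Kronecker-product paper's exact architecture is not reproduced here; the sketch below only illustrates the core operation under the assumption that fusion is a batched Kronecker (outer) product of a cross-modal and a uni-modal embedding. The helper name `kronecker_fuse` is hypothetical.

```python
import torch

def kronecker_fuse(cross_emb: torch.Tensor, uni_emb: torch.Tensor) -> torch.Tensor:
    """Fuse a cross-modal and a uni-modal embedding via their Kronecker product.

    cross_emb: (batch, m) cross-modal embedding
    uni_emb:   (batch, n) single-modality embedding
    returns:   (batch, m * n) fused vector capturing every pairwise
               feature interaction between the two embeddings.
    """
    # Batched outer product: (batch, m, 1) * (batch, 1, n) -> (batch, m, n);
    # row-major flattening of the outer product equals kron(cross, uni) per sample.
    outer = cross_emb.unsqueeze(2) * uni_emb.unsqueeze(1)
    return outer.flatten(start_dim=1)

# Example: fuse a 4-dim cross-modal embedding with a 3-dim text embedding
cross = torch.randn(8, 4)
text = torch.randn(8, 3)
fused = kronecker_fuse(cross, text)  # shape (8, 12)
```

Since the fused dimension grows as m * n, such a layer is typically followed by a learned linear projection back to a manageable width.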
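For the character-frequency alignment loss, the paper's exact ground metric is not specified here; the sketch below assumes characters are mapped to integer ids and uses the 1-D Wasserstein distance (L1 distance between CDFs over those ids) between predicted and label frequency histograms. That ordering assumption, and the function name `char_frequency_wasserstein`, are illustrative rather than the paper's formulation.

```python
import torch

def char_frequency_wasserstein(pred_logits: torch.Tensor,
                               target_ids: torch.Tensor,
                               vocab_size: int) -> torch.Tensor:
    """W1 distance between predicted and target character frequency histograms.

    pred_logits: (num_chars, vocab_size) per-character logits from the recognizer
    target_ids:  (num_target_chars,) ground-truth character ids
    """
    # Predicted frequency distribution: average of per-character softmaxes
    # (differentiable, so it can serve as a training loss term).
    pred_freq = pred_logits.softmax(dim=-1).mean(dim=0)
    # Target frequency distribution: normalized histogram of label characters.
    target_freq = torch.bincount(target_ids, minlength=vocab_size).float()
    target_freq = target_freq / target_freq.sum()
    # W1 between two 1-D histograms = sum of absolute CDF differences.
    return (pred_freq.cumsum(0) - target_freq.cumsum(0)).abs().sum()
```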
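Finally, "adaptive fusion" is commonly realized as a gating network that weighs each modality per sample. The module below is a generic sketch of that pattern, not the paper's specific design; the class name `AdaptiveAVFusion` and the convex-gate layout are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveAVFusion(nn.Module):
    """Gated audio-visual fusion: learns per-sample modality weights.

    A small gating network inspects both projected embeddings and emits
    convex weights deciding how much each modality contributes.
    """

    def __init__(self, audio_dim: int, visual_dim: int, fused_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, 2),
            nn.Softmax(dim=-1),  # convex combination weights per sample
        )

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        a = self.audio_proj(audio)    # (batch, fused_dim)
        v = self.visual_proj(visual)  # (batch, fused_dim)
        w = self.gate(torch.cat([a, v], dim=-1))  # (batch, 2)
        # Weighted sum: modalities the gate trusts more dominate the output.
        return w[:, :1] * a + w[:, 1:] * v
```

In a class-incremental setting, such per-sample gating lets the model lean on whichever modality remains reliable for old classes, which is one plausible reading of why adaptive fusion helps against forgetting.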