Advances in Multimodal Learning

The field of multimodal learning is advancing rapidly, with a focus on improving the performance and efficiency of large language models and vision transformers. Recent work has underscored the importance of aligning visual and language representations, and several papers propose new methods for achieving this alignment. One key direction is the development of more effective tokenization strategies, such as Subpixel Placement of Tokens (SPoT), which allows tokens to be placed at continuous positions within an image rather than on a fixed patch grid. Another is the refinement of attention mechanisms, including attention ablation techniques that suppress detrimental attention heads. There is also growing interest in alternative pretraining objectives: Causal Language Modeling (CLM) has been shown to be more data-efficient and stable during training than traditional Masked Language Modeling (MLM).

Noteworthy papers include Grounding-Aware Token Pruning, which proposes a simple yet effective adjustment to position IDs that recovers the drastic performance drop visual grounding suffers after token pruning; VisionDrop, which introduces a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal attention; and LaCo, a novel framework for layer-wise compression of visual tokens that reduces computational cost while maintaining strong performance.
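
To make the token-pruning idea concrete, the snippet below is a minimal, self-contained sketch of training-free, attention-based selection of visual tokens in the spirit of VisionDrop. It is not the paper's implementation: the scoring rule (mean attention received, averaged over heads and queries), the keep ratio, and all tensor names and shapes are illustrative assumptions.

```python
# Minimal sketch (not the VisionDrop implementation): training-free pruning of
# visual tokens by the attention they receive inside the vision encoder.
# Scoring rule, keep ratio, and shapes are illustrative assumptions.
import torch

def prune_visual_tokens(tokens: torch.Tensor,
                        attn: torch.Tensor,
                        keep_ratio: float = 0.25):
    """
    tokens: (B, N, D)    patch embeddings from a vision encoder
    attn:   (B, H, N, N) self-attention weights from one encoder layer
    Returns the kept tokens and their original indices, so downstream steps
    (e.g. position-ID handling) can still refer to the pre-pruning layout.
    """
    B, N, _ = tokens.shape
    # Score each token by the attention it receives, averaged over heads
    # and query positions (a simple proxy for "informativeness").
    scores = attn.mean(dim=1).mean(dim=1)          # (B, N)
    k = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(k, dim=-1).indices      # (B, k)
    keep_idx, _ = keep_idx.sort(dim=-1)            # preserve raster order
    kept = torch.gather(
        tokens, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return kept, keep_idx

# Toy usage with random tensors standing in for real encoder outputs.
B, H, N, D = 2, 12, 576, 1024
tokens = torch.randn(B, N, D)
attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
kept, keep_idx = prune_visual_tokens(tokens, attn, keep_ratio=0.25)
print(kept.shape, keep_idx.shape)  # torch.Size([2, 144, 1024]) torch.Size([2, 144])
```

Returning the original indices alongside the kept tokens matters because later stages, such as position-ID assignment, may still need the pre-pruning spatial layout.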

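The position-ID adjustment highlighted by Grounding-Aware Token Pruning can be sketched in the same spirit. The snippet below contrasts naive renumbering of the surviving tokens with reusing their original position IDs; reusing the original IDs is one plausible reading of the paper's fix, and the exact scheme in the paper may differ.

```python
# Sketch of the position-ID idea behind grounding-aware pruning (illustrative,
# not the paper's exact scheme): after dropping visual tokens, reuse each kept
# token's original position ID instead of renumbering the shortened sequence.
import torch

def position_ids_after_pruning(keep_idx: torch.Tensor,
                               preserve: bool = True) -> torch.Tensor:
    """
    keep_idx: (B, k) original (pre-pruning) indices of the kept tokens, sorted.
    Returns position IDs of shape (B, k) for the kept tokens.
    """
    if preserve:
        # Grounding-friendly variant: keep the original IDs, leaving gaps
        # where tokens were dropped, so relative spatial offsets survive.
        return keep_idx.clone()
    # Naive variant: renumber 0..k-1, which collapses the 2D layout that
    # visual grounding implicitly relies on.
    B, k = keep_idx.shape
    return torch.arange(k).unsqueeze(0).expand(B, -1).clone()

keep_idx = torch.tensor([[0, 3, 4, 9],
                         [1, 2, 7, 8]])
print(position_ids_after_pruning(keep_idx, preserve=True))   # [[0, 3, 4, 9], [1, 2, 7, 8]]
print(position_ids_after_pruning(keep_idx, preserve=False))  # [[0, 1, 2, 3], [0, 1, 2, 3]]
```
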
Sources

Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning

Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment

MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching

Unified Multimodal Understanding via Byte-Pair Visual Encoding

ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs

Should We Still Pretrain Encoders with Masked Language Modeling?

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

MARVIS: Modality Adaptive Reasoning over VISualizations

SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement

SPoT: Subpixel Placement of Tokens in Vision Transformers

LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models
