Multimodal research is moving toward more efficient large models, particularly for vision-and-language tasks. Recent work reduces computational overhead and inference latency while preserving performance, using methods such as token pruning, sparse training schemes, and new architectural designs; a minimal sketch of the token-pruning idea follows the list. Noteworthy papers include:
- Navigation-Aware Pruning, which significantly outperforms prior work on Vision-and-Language Navigation tasks while saving over 50% of FLOPs.
- Pyramid Token Pruning, which cuts computational overhead and inference latency with minimal performance loss.
- Sparse Training Scheme, which makes training Multimodal Large Language Models both effective and efficient.
- Reading Images Like Texts, which offers a deeper understanding of Vision-Language Model internals and provides principles for designing more capable architectures.
- EmbeddingGemma, which achieves state-of-the-art results with a lightweight and open text embedding model.
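As a concrete illustration of the token-pruning theme above, the sketch below keeps only the visual tokens that receive the most attention from the [CLS] token before passing them to later layers. This is a generic minimal example, not the specific scoring rule of any paper listed here; the function name and the [CLS]-attention heuristic are illustrative assumptions.

```python
import torch

def prune_vision_tokens(tokens: torch.Tensor,
                        attn_to_cls: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the visual tokens that receive the most [CLS] attention.

    tokens:      (batch, num_tokens, dim) visual token embeddings
    attn_to_cls: (batch, num_tokens) attention weights from the [CLS] token
    keep_ratio:  fraction of tokens to retain
    """
    batch, num_tokens, dim = tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Indices of the k highest-scoring tokens in each example
    topk_idx = attn_to_cls.topk(k, dim=1).indices            # (batch, k)
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)  # (batch, k, dim)
    return tokens.gather(1, gather_idx)                      # (batch, k, dim)

# Toy usage: prune 196 ViT patch tokens down to 98
tokens = torch.randn(2, 196, 768)
scores = torch.rand(2, 196)       # stand-in for real attention weights
pruned = prune_vision_tokens(tokens, scores, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([2, 98, 768])
```

Because self-attention cost grows quadratically with sequence length, halving the token count cuts attention FLOPs by roughly four and per-token MLP cost by two in every subsequent layer, which is where savings of the kind reported above typically come from.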