Vision-Language Model Developments

Work on vision-language models is converging on two persistent weaknesses of current systems: multimodal processing and compositionality. Researchers are exploring new architectures and training techniques to address them, such as augmenting models with image context and image memory, and mining adversarial negatives to balance preferences across modalities. There is also growing attention to the role of visual processing itself, including how simple perturbations of the visual input affect multimodal reasoning. Notable papers in this area include: LLMs Can Compensate for Deficiencies in Visual Representations, which investigates how language backbones compensate for weak visual features; CoMemo: LVLMs Need Image Context with Image Memory, which proposes a dual-path architecture to alleviate neglect of visual information; and Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning, which shows that simple visual perturbations improve mathematical reasoning performance.
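To make the perturbation idea concrete, the sketch below applies generic, label-preserving image transforms (small rotation, mild blur, slight rescaling) to a problem image before it is given to a vision-language model. This is a minimal illustration under those assumptions: the specific perturbations studied in Vision Matters may differ, and the function and file names here are hypothetical.

```python
import random
from PIL import Image, ImageFilter


def perturb_image(img: Image.Image, seed: int | None = None) -> Image.Image:
    """Apply a lightweight, label-preserving perturbation to an input image.

    The transforms here (small rotation, mild blur, slight rescale) are generic
    examples, not the exact recipe from the paper.
    """
    rng = random.Random(seed)
    out = img.copy()

    # Small rotation keeps diagram content readable while changing pixel layout.
    out = out.rotate(rng.uniform(-5, 5), expand=True, fillcolor="white")

    # Mild Gaussian blur perturbs low-level texture without destroying structure.
    out = out.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.0, 1.0)))

    # Slight rescale changes the resolution the vision encoder sees.
    scale = rng.uniform(0.9, 1.1)
    out = out.resize((max(1, int(out.width * scale)),
                      max(1, int(out.height * scale))))
    return out


# Example usage (hypothetical file name): perturb a math-problem figure before
# querying a VLM, e.g. to augment training data or probe how sensitive the
# model's reasoning is to the visual input.
# perturbed = perturb_image(Image.open("geometry_problem.png"), seed=0)
```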

Sources

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

LLMs Can Compensate for Deficiencies in Visual Representations

CoMemo: LVLMs Need Image Context with Image Memory

Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining

A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

Adding simple structure at inference improves Vision-Language Compositionality

Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning
