The field of vision-language models is moving toward addressing the limitations of current models, particularly in multimodal processing and compositionality. Researchers are exploring new architectures and techniques to improve performance, such as incorporating image context and memory, and using adversarial negative mining to balance modality preferences. There is also growing attention to the role of visual processing and visual perturbation in multimodal reasoning. Notable papers in this area include:

- LLMs Can Compensate for Deficiencies in Visual Representations, which investigates how language backbones can compensate for weak visual features.
- CoMemo: LVLMs Need Image Context with Image Memory, which proposes a dual-path architecture to alleviate visual information neglect.
- Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning, which demonstrates that simple visual perturbations improve mathematical reasoning performance (a minimal illustration of such perturbations is sketched below).
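To make the last point concrete, the snippet below is a minimal sketch of what "simple visual perturbations" can look like in practice: lightweight image-level transforms applied to an input image (e.g., a math diagram) before it is passed to a vision-language model. The specific transforms (small rotation, mild blur, brightness jitter, slight crop) and their parameter ranges are illustrative assumptions, not the recipe used in the paper.

```python
# Illustrative sketch of simple visual perturbations applied to an input
# image before it is fed to a vision-language model. Transform choices and
# parameter ranges are assumptions for demonstration purposes only.
import random
from PIL import Image, ImageEnhance, ImageFilter


def perturb_image(image: Image.Image, seed: int | None = None) -> Image.Image:
    """Apply one randomly chosen lightweight perturbation to the image."""
    rng = random.Random(seed)
    choice = rng.choice(["rotate", "blur", "brightness", "crop"])

    if choice == "rotate":
        # Small rotation; the diagram content stays recognizable.
        return image.rotate(rng.uniform(-10, 10), expand=True, fillcolor="white")
    if choice == "blur":
        # Mild Gaussian blur.
        return image.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.5, 1.5)))
    if choice == "brightness":
        # Slight brightness jitter around the original exposure.
        return ImageEnhance.Brightness(image).enhance(rng.uniform(0.8, 1.2))
    # "crop": trim a small random margin, then resize back to the original size.
    w, h = image.size
    dx, dy = int(w * rng.uniform(0.0, 0.05)), int(h * rng.uniform(0.0, 0.05))
    return image.crop((dx, dy, w - dx, h - dy)).resize((w, h))


if __name__ == "__main__":
    # Demo on a blank canvas standing in for a math diagram.
    original = Image.new("RGB", (256, 256), "white")
    perturbed = perturb_image(original, seed=0)
    print(original.size, "->", perturbed.size)
```

In a training or evaluation pipeline, such perturbations would typically be applied per sample (with fresh random draws) so the model sees visually varied renderings of the same underlying problem.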