The field of vision-language models is moving toward tighter integration of visual and linguistic information, with particular attention to the visual processing bottleneck that limits performance on complex tasks. One line of work equips models with dynamic latent vision memories, improving retention of visual evidence and maintaining semantic consistency during generation. Another directly aligns the hidden states of the vision and language modalities through lightweight fusion modules, enabling more efficient cross-modal processing. A third direction looks beyond model internals, using smart glasses and gaze tracking to enhance human memory through active visual logging in real-world settings. Noteworthy papers include: VisMem, which proposes a cognitively-aligned framework for latent vision memory enhancement; Gaze Archive, which introduces a visual memory enhancement paradigm based on active logging on smart glasses; and BRIDGE, notable for a lightweight fusion module that improves the alignment of vision and language modalities.
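
To make the latent vision memory idea concrete, the following is a minimal PyTorch sketch of a memory bank of latent visual slots with a gated write from vision features and a cross-attention read from the language stream. The slot count, dimensions, and gating rule are illustrative assumptions, not VisMem's actual architecture.

```python
import torch
import torch.nn as nn

class LatentVisionMemory(nn.Module):
    """Sketch of a latent vision memory: a fixed-size bank of latent slots
    is updated from incoming vision features (write) and queried by the
    language stream (read). All hyperparameters here are assumptions."""

    def __init__(self, num_slots: int = 32, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Learned initial memory slots (hypothetical initialization scale).
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.write_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.read_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def write(self, memory: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # Each memory slot attends to the new vision features ...
        update, _ = self.write_attn(memory, vision_feats, vision_feats)
        # ... and a learned gate decides how much of each slot to overwrite,
        # so earlier visual evidence can be retained rather than replaced.
        g = torch.sigmoid(self.gate(torch.cat([memory, update], dim=-1)))
        return g * update + (1 - g) * memory

    def read(self, memory: torch.Tensor, lang_hidden: torch.Tensor) -> torch.Tensor:
        # Language tokens query the memory for retained visual evidence.
        retrieved, _ = self.read_attn(lang_hidden, memory, memory)
        return lang_hidden + retrieved

# Usage with dummy tensors: batch of 2, ViT-like patch features, LM hidden states.
mem_module = LatentVisionMemory()
memory = mem_module.slots.unsqueeze(0).expand(2, -1, -1)  # (B, slots, dim)
vision_feats = torch.randn(2, 196, 768)
lang_hidden = torch.randn(2, 50, 768)

memory = mem_module.write(memory, vision_feats)  # retain visual evidence
fused = mem_module.read(memory, lang_hidden)     # inject it into the language stream
```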
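
Similarly, a lightweight fusion module that aligns vision and language hidden states directly could be realized as a low-rank cross-attention adapter, as sketched below. The bottleneck design, dimensions, and residual wiring are assumptions for illustration and are not taken from BRIDGE.

```python
import torch
import torch.nn as nn

class HiddenStateFusion(nn.Module):
    """Sketch of a lightweight hidden-state fusion module: both modalities
    are projected into a shared low-dimensional space, fused with
    cross-attention, and mapped back into the language model's width.
    The low-rank bottleneck keeps the added parameter count small."""

    def __init__(self, vis_dim: int = 1024, lang_dim: int = 4096, bottleneck: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, bottleneck)
        self.lang_proj = nn.Linear(lang_dim, bottleneck)
        self.attn = nn.MultiheadAttention(bottleneck, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(bottleneck, lang_dim)

    def forward(self, lang_hidden: torch.Tensor, vis_hidden: torch.Tensor) -> torch.Tensor:
        q = self.lang_proj(lang_hidden)   # (B, T_lang, bottleneck)
        kv = self.vis_proj(vis_hidden)    # (B, T_vis, bottleneck)
        fused, _ = self.attn(q, kv, kv)   # language tokens query vision states
        # Residual connection lets the module start near identity, so the
        # pretrained backbones are perturbed only where fusion helps.
        return lang_hidden + self.out_proj(fused)

# Usage: fuse frozen vision-encoder states into LM hidden states.
fusion = HiddenStateFusion()
lang_hidden = torch.randn(2, 50, 4096)
vis_hidden = torch.randn(2, 196, 1024)
aligned = fusion(lang_hidden, vis_hidden)  # (2, 50, 4096)
```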