The field of vision-language modeling is advancing rapidly, with a focus on improving models' ability to comprehend and generate visual content. Recent work has produced more effective training pipelines for vision-language models, enabling state-of-the-art performance on tasks such as math problem solving and document visual question answering. Large-scale datasets and new evaluation frameworks have likewise supported the development of more robust and generalizable models. Notably, extending vision-language models to additional input modalities, such as sketches and diagrams, has opened new avenues for research and application. Noteworthy papers include VP-Bench, which introduces a comprehensive benchmark for assessing multimodal large language models' ability to perceive and use visual prompts, and O3SLM, which presents a new large-scale dataset along with a sketch-language model that achieves state-of-the-art performance in sketch comprehension and reasoning.