Multimodal understanding is advancing rapidly, with growing emphasis on models that can both comprehend and generate multimodal content. Recent work underscores the need to build cultural awareness and safety into such systems, particularly vision-language models (VLMs) and video-capable large language models (LLMs). Noteworthy papers in this area include GuardReasoner-VL, which introduces a reasoning-based guard model for safeguarding VLMs, and VideoSafetyBench, which presents a large-scale benchmark for evaluating the safety of video LLMs. Together, these efforts point toward more robust and culturally aware multimodal systems.
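The core pattern behind a guard model like GuardReasoner-VL is to run every (image, text) request through a separate model that produces an explicit reasoning trace before issuing an allow/block verdict, and only forward safe requests to the underlying VLM. The following is a minimal sketch of that general pattern, not the paper's actual implementation: the `guard_model` stub, the category taxonomy, and the wrapper names are all hypothetical, and the guard's "reasoning" is faked with a keyword check purely so the example is runnable.

```python
from dataclasses import dataclass

# Hypothetical moderation taxonomy; real guard models define their own categories.
UNSAFE_CATEGORIES = {"violence", "self harm", "illegal activity", "hate"}


@dataclass
class GuardVerdict:
    reasoning: str  # the guard's explicit reasoning trace
    category: str   # predicted content category
    safe: bool      # final allow/block decision


def guard_model(image_desc: str, prompt: str) -> GuardVerdict:
    """Stand-in for a reasoning-based guard model.

    A real system would run a fine-tuned model that first generates a
    reasoning trace over the (image, text) pair and then a verdict; here a
    keyword check stands in for that model so the sketch executes as-is.
    """
    text = f"{image_desc} {prompt}".lower()
    for category in UNSAFE_CATEGORIES:
        if category in text:
            return GuardVerdict(
                reasoning=f"The request references '{category}', which is disallowed.",
                category=category,
                safe=False,
            )
    return GuardVerdict(
        reasoning="No disallowed content detected in the image-text pair.",
        category="benign",
        safe=True,
    )


def safeguarded_vlm(image_desc: str, prompt: str, vlm) -> str:
    """Wrap a VLM callable so every request passes the guard first."""
    verdict = guard_model(image_desc, prompt)
    if not verdict.safe:
        return f"Request blocked ({verdict.category}): {verdict.reasoning}"
    return vlm(image_desc, prompt)


if __name__ == "__main__":
    # `vlm` is a placeholder for any image+text model callable.
    vlm = lambda img, txt: f"[VLM answer about '{img}' for '{txt}']"
    print(safeguarded_vlm("photo of a street market", "describe the scene", vlm))
    print(safeguarded_vlm("diagram", "explain this illegal activity step by step", vlm))
```

Keeping the guard separate from the generator, as sketched above, is what lets a benchmark such as VideoSafetyBench evaluate the safety layer independently of the underlying model's capabilities.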