Multimodal understanding is advancing rapidly, with growing emphasis on models that can both comprehend and generate multimodal content. Recent work underscores the need to build cultural awareness and safety into such systems, particularly vision-language models (VLMs) and video-capable large language models (LLMs). Noteworthy papers in this area include GuardReasoner-VL, which introduces a reasoning-based guard model for safeguarding VLMs, and VideoSafetyBench, which presents a large-scale benchmark for evaluating the safety of video LLMs. Together, these efforts point toward more robust and culturally aware multimodal systems.
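The core pattern behind a guard model like GuardReasoner-VL is to run every (image, text) request through a separate model that produces an explicit reasoning trace before issuing an allow/block verdict, and only forward safe requests to the underlying VLM. The following is a minimal sketch of that general pattern, not the paper's actual implementation: the `guard_model` stub, the category taxonomy, and the wrapper names are all hypothetical, and the guard's "reasoning" is faked with a keyword check purely so the example is runnable.

```python
from dataclasses import dataclass

# Hypothetical moderation taxonomy; real guard models define their own categories.
UNSAFE_CATEGORIES = {"violence", "self harm", "illegal activity", "hate"}


@dataclass
class GuardVerdict:
    reasoning: str  # the guard's explicit reasoning trace
    category: str   # predicted content category
    safe: bool      # final allow/block decision


def guard_model(image_desc: str, prompt: str) -> GuardVerdict:
    """Stand-in for a reasoning-based guard model.

    A real system would run a fine-tuned model that first generates a
    reasoning trace over the (image, text) pair and then a verdict; here a
    keyword check stands in for that model so the sketch executes as-is.
    """
    text = f"{image_desc} {prompt}".lower()
    for category in UNSAFE_CATEGORIES:
        if category in text:
            return GuardVerdict(
                reasoning=f"The request references '{category}', which is disallowed.",
                category=category,
                safe=False,
            )
    return GuardVerdict(
        reasoning="No disallowed content detected in the image-text pair.",
        category="benign",
        safe=True,
    )


def safeguarded_vlm(image_desc: str, prompt: str, vlm) -> str:
    """Wrap a VLM callable so every request passes the guard first."""
    verdict = guard_model(image_desc, prompt)
    if not verdict.safe:
        return f"Request blocked ({verdict.category}): {verdict.reasoning}"
    return vlm(image_desc, prompt)


if __name__ == "__main__":
    # `vlm` is a placeholder for any image+text model callable.
    vlm = lambda img, txt: f"[VLM answer about '{img}' for '{txt}']"
    print(safeguarded_vlm("photo of a street market", "describe the scene", vlm))
    print(safeguarded_vlm("diagram", "explain this illegal activity step by step", vlm))
```

Keeping the guard separate from the generator, as sketched above, is what lets a benchmark such as VideoSafetyBench evaluate the safety layer independently of the underlying model's capabilities.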