Multimodal Understanding and Honesty in AI Systems

The field of multimodal understanding is moving toward more complex and nuanced tasks, such as evaluating sustainability practices and promoting financial transparency. Researchers are developing new benchmarks and datasets to assess the performance of multimodal models, including their ability to handle visually grounded and cross-page tasks. There is also a growing focus on the honesty and trustworthiness of multimodal large language models, with studies investigating how these models behave when faced with unanswerable visual questions. Noteworthy papers in this area include MMESGBench, which introduces a first-of-its-kind benchmark dataset for multimodal understanding and complex reasoning over ESG documents, and MoHoBench, which presents a large-scale benchmark for assessing the honesty of multimodal large language models via unanswerable visual questions.

Sources

Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks

Analyzing the Sensitivity of Vision Language Models in Visual Question Answering

MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions

Large Language Models in the Travel Domain: An Industrial Experience
