The field of multimodal understanding is moving toward more complex and nuanced tasks, such as evaluating sustainability practices and promoting financial transparency. Researchers are developing new benchmarks and datasets to assess the performance of multimodal models, including their ability to handle visually grounded and cross-page tasks. There is also growing attention to the honesty and trustworthiness of multimodal large language models, with studies examining how they behave when faced with unanswerable visual questions. Noteworthy papers in this area include MMESGBench, which introduces a first-of-its-kind benchmark dataset for multimodal understanding and complex reasoning over ESG (environmental, social, and governance) documents, and MoHoBench, which presents a large-scale benchmark for assessing the honesty of multimodal large language models via unanswerable visual questions.