Multimodal Vision-Language Understanding

The field of multimodal vision-language understanding is moving toward the challenges of complex and diverse real-world scenarios. Researchers are focusing on models that effectively integrate visual and textual features to improve performance on tasks such as visual question answering. A key direction is the development of benchmarks and datasets for low-resource languages, which will facilitate the creation of more inclusive AI systems. Noteworthy papers in this area include MEENA, a dataset designed to evaluate Persian vision-language models that introduces a bilingual structure for assessing cross-linguistic performance, and PlantVillageVQA, a large-scale visual question answering dataset for plant science that provides a publicly available, expert-verified resource to improve diagnostic accuracy for plant disease identification.
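
To make the benchmarking theme concrete, the sketch below shows one common way such VQA datasets are consumed: generate an answer with an off-the-shelf vision-language model and score it by exact match against a reference answer. The model choice (Salesforce/blip-vqa-base via Hugging Face Transformers) and the toy example are illustrative assumptions, not the evaluation protocol of any paper listed here.

```python
# Hedged sketch: exact-match VQA scoring with an off-the-shelf model.
# The model and the placeholder example are assumptions for illustration.
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()

# Placeholder example; a real benchmark (e.g. a multilingual VQA dataset)
# would supply image files, questions, and reference answers.
examples = [
    {"image": Image.new("RGB", (384, 384), color="green"),
     "question": "What color is the image?",
     "answer": "green"},
]

correct = 0
for ex in examples:
    inputs = processor(ex["image"], ex["question"], return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=10)
    prediction = processor.decode(output_ids[0], skip_special_tokens=True)
    correct += int(prediction.strip().lower() == ex["answer"].lower())

print(f"Exact-match accuracy: {correct / len(examples):.2f}")
```

Datasets such as those listed under Sources would replace the placeholder examples, with answer normalization adapted to the target language and script.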

Sources

Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering

PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science

MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment

Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
