The field of information extraction and visual question answering is advancing rapidly as researchers develop new models and techniques for extracting relevant information from unstructured documents, such as financial reports and images with dense text, and for improving the accuracy of visual question answering systems. A key direction is the integration of spatial awareness and multimodal embeddings to deepen the understanding of complex documents and images. Noteworthy papers in this area include:
- Towards Efficient Quantity Retrieval from Text, which proposes a framework for quantity retrieval based on description parsing and weak supervision.
- Spatial ModernBERT, which introduces a transformer-based model for table and key-value extraction in financial documents.
- Describe Anything Model for Visual Question Answering on Text-rich Images, which investigates the use of region-aware vision-language models for visual question answering.
- Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering, which presents a fully training-free and model-agnostic pipeline for generating natural language rationales and grounding them to spatial sub-regions.
- FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering, which strengthens the reasoning ability of video question answering models by generating fundamental questions from video descriptions.
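To make the quantity-retrieval idea in the first paper concrete, the sketch below shows a toy description-parsing step: pairing numeric values with units and treating the surrounding sentence text as a crude description. The regex, unit vocabulary, and `Quantity` type are invented for illustration and do not reflect the paper's actual method, which relies on learned parsing with weak supervision.

```python
import re
from dataclasses import dataclass

@dataclass
class Quantity:
    value: float
    unit: str
    description: str

# Toy unit vocabulary; a real system would use a learned or curated lexicon.
UNITS = {"kg", "km", "usd", "million", "gb", "%"}

# Matches a number optionally followed by a unit token, e.g. "12.5 million".
QUANTITY_RE = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%]+)")

def parse_quantities(sentence: str) -> list[Quantity]:
    """Extract (value, unit) pairs from a sentence and attach the
    remaining text as a rough description of what the quantity measures."""
    results = []
    for m in QUANTITY_RE.finditer(sentence):
        unit = m.group("unit").lower()
        if unit not in UNITS:
            continue  # skip number-word pairs that are not known units
        description = (sentence[:m.start()] + sentence[m.end():]).strip(" .,")
        results.append(Quantity(float(m.group("value")), unit, description))
    return results
```

A weakly supervised system would replace the hand-written regex and unit set with patterns induced from distant labels, but the output structure, a value, a unit, and a textual description to retrieve against, is the same.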