Advances in Document Analysis and Understanding

The field of document analysis and understanding is moving toward more accurate and efficient methods for extracting information from historical and multilingual documents. To improve transcription accuracy on noisy historical documents, researchers are developing approaches such as ensemble frameworks and custom aligners. Interest is also growing in benchmarking vision-language models on ancient documents, with evaluations spanning OCR, translation, and knowledge reasoning. In addition, new datasets and benchmarks are being introduced to support models for minority languages and low-resource scenarios.

Notable papers in this area include:

Improving MLLM Historical Record Extraction with Test-Time Image, which presents a novel ensemble framework for stabilizing LLM-based text extraction from noisy historical documents.

VARCO-VISION-2.0 Technical Report, which introduces an open-weight bilingual vision-language model for Korean and English with improved capabilities compared to previous models.

PATIMT-Bench, which constructs a multi-scenario benchmark for position-aware text image machine translation in large vision-language models.

Sources

Improving MLLM Historical Record Extraction with Test-Time Image

Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning

CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

VARCO-VISION-2.0 Technical Report

PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models

ICDAR 2025 Competition on FEw-Shot Text line segmentation of ancient handwritten documents (FEST)

TexTAR: Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images

Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation
