Document analysis and understanding is moving toward more accurate and efficient methods for extracting information from historical and multilingual documents. To improve transcription accuracy on noisy historical documents, researchers are developing approaches such as ensemble frameworks and custom aligners. Interest is also growing in benchmarking vision-language models on ancient documents, with evaluation focused on tasks such as OCR, translation, and knowledge reasoning. In addition, new datasets and benchmarks are being introduced to support model development for minority languages and low-resource scenarios. Notable papers in this area include Improving MLLM Historical Record Extraction with Test-Time Image, which presents an ensemble framework for stabilizing LLM-based text extraction from noisy historical documents; VARCO-VISION-2.0 Technical Report, which introduces an open-weight bilingual vision-language model for Korean and English with improved capabilities compared to previous models; and PATIMT-Bench, which constructs a multi-scenario benchmark for position-aware text image machine translation in large vision-language models.