Multilingual Document Analysis

Document analysis research is becoming more multilingual and multimodal, with recent work targeting two gaps: the scarcity of resources for non-English languages and the structural complexity of official publications. New large-scale synthetic corpora and benchmark datasets for visual document retrieval make it possible to evaluate models on both text-only and multimodal retrieval tasks. These resources could improve the accuracy and reliability of document analysis systems in real-world applications such as financial information retrieval and historical document transcription. Noteworthy papers include:

  • Cross-Lingual SynthDocs, which provides a scalable and visually realistic resource for advancing research in multilingual document analysis.
  • SDS KoPub VDR, which establishes a challenging and reliable evaluation set for visual document retrieval in Korean public documents.
  • DKDS, which introduces a new benchmark dataset for detecting and binarizing degraded Kuzushiji documents with seals.
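
To make the retrieval evaluation mentioned above more concrete, here is a minimal sketch of how a ranked retrieval run might be scored against relevance judgments. It is an illustration only: Recall@k is a common retrieval metric, but this summary does not say which metrics the benchmarks report, and the function names, query IDs, and page IDs below are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant pages that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int
) -> float:
    """Average Recall@k over all queries that have relevance judgments."""
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical IDs): relevant pages per query, and one system's ranking.
qrels = {
    "q1": {"doc3_p2"},
    "q2": {"doc1_p5", "doc7_p1"},
}
runs = {
    "q1": ["doc3_p2", "doc9_p1", "doc4_p4"],
    "q2": ["doc2_p2", "doc1_p5", "doc8_p3"],
}

for k in (1, 3):
    print(f"Recall@{k}: {mean_recall_at_k(runs, qrels, k):.2f}")
```

The same scoring applies whether the ranking comes from a text-only retriever or a multimodal one, which is what lets a single benchmark compare both kinds of system.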

Sources

Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Query Generation Pipeline with Enhanced Answerability Assessment for Financial Information Retrieval

DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization
