Document Understanding in the Wild

The field of document understanding is moving toward more robust, comprehensive models that handle real-world scenarios. Recent research has focused on benchmarks and models that can accurately parse and understand documents captured in natural environments, with variable illumination and physical distortions. This includes multimodal large language models that combine textual and visual information to improve document understanding. Another focus is information extraction from visually rich documents, with approaches that organize documents into independent textual segments and perform more generalizable reasoning.

Noteworthy papers in this area include:

  • WildDoc, which introduces a new benchmark for assessing document understanding in natural environments and exposes the limitations of current models.
  • BLOCKIE, which proposes a novel LLM-based approach for information extraction from visually rich documents and achieves state-of-the-art performance on public benchmarks.
  • Dolphin, which presents a novel multimodal document image parsing model that achieves state-of-the-art performance across diverse page-level and element-level settings.
  • SCAN, which enhances both textual and visual Retrieval-Augmented Generation systems working with visually rich documents and improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%.
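The segment-based extraction idea above can be illustrated with a minimal sketch. This is a hypothetical toy, not BLOCKIE's actual method: it splits a document's text into independent segments on blank lines, runs a stand-in extraction function on each segment (where a real system would call an LLM), and merges the per-segment results. All function names and the sample document are invented for illustration.

```python
def split_into_segments(lines):
    """Group consecutive non-blank lines into independent segments."""
    segments, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments

def extract_fields(segment):
    """Stand-in for an LLM call: pull 'key: value' pairs from one segment."""
    fields = {}
    for part in segment.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

def extract_document(text):
    """Extract fields segment by segment, then merge the results."""
    merged = {}
    for segment in split_into_segments(text.splitlines()):
        merged.update(extract_fields(segment))
    return merged

doc = """Invoice No: 42; Date: 2024-05-01

Vendor: Acme Corp; Total: 99.50"""
print(extract_document(doc))
# → {'Invoice No': '42', 'Date': '2024-05-01', 'Vendor': 'Acme Corp', 'Total': '99.50'}
```

Processing segments independently is what makes this style of approach generalizable: each extraction call sees only a small, self-contained piece of text, so layout variation elsewhere in the document cannot confuse it.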

Sources

WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

Information Extraction from Visually Rich Documents using LLM-based Organization of Documents into Independent Textual Segments

Every Pixel Tells a Story: End-to-End Urdu Newspaper OCR

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation

Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs

Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation