The field of document understanding is moving toward more robust, comprehensive models that hold up in real-world scenarios. Recent research has focused on benchmarks and models that can accurately parse documents captured in natural environments, under variable illumination and physical distortion. This includes multimodal large language models that combine textual and visual information to improve document understanding. Another area of focus is information extraction from visually rich documents, using approaches that decompose a document into independent textual segments and reason over them for more generalizable extraction.
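To make the segment-wise extraction idea concrete, here is a minimal sketch in that spirit (not BLOCKIE's actual pipeline): a document's OCR'd text blocks are queried independently and the per-segment answers are merged. The `call_llm` function is a hypothetical stand-in for any text-completion endpoint.

```python
import json
from typing import Callable

def extract_fields(segments: list[str], fields: list[str],
                   call_llm: Callable[[str], str]) -> dict[str, str]:
    """Query the LLM once per segment, then merge non-empty answers."""
    merged: dict[str, str] = {}
    for seg in segments:
        prompt = (
            "Extract the following fields from this document segment. "
            f"Return JSON with keys {fields}, using an empty string for "
            f"any field not present.\n\nSegment:\n{seg}"
        )
        try:
            answers = json.loads(call_llm(prompt))
        except json.JSONDecodeError:
            continue  # skip segments where the model returned malformed JSON
        if not isinstance(answers, dict):
            continue
        for key, value in answers.items():
            # first non-empty answer for a field wins
            if key in fields and value and key not in merged:
                merged[key] = value
    return merged
```

Processing segments independently keeps each prompt short and layout-agnostic, which appears to be the intuition behind the more generalizable segment-based approaches.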
Noteworthy papers in this area include:
- WildDoc, which introduces a new benchmark for assessing document understanding in natural environments and exposes the limitations of current models.
- BLOCKIE, which proposes a novel LLM-based approach for information extraction from visually rich documents and achieves state-of-the-art performance on public benchmarks.
- Dolphin, which presents a multimodal document image parsing model that achieves state-of-the-art performance across diverse page-level and element-level settings.
- SCAN, which enhances both textual and visual retrieval-augmented generation (RAG) systems operating over visually rich documents, improving end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4% (a minimal retrieval sketch follows this list).
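For the RAG work above, the following is a minimal retrieval sketch, assuming a hypothetical `embed` text encoder and plain cosine similarity (not SCAN's actual components): candidate pages are ranked against the query, and the top matches would then be passed to a generator model.

```python
import math
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, pages: list[str],
             embed: Callable[[str], list[float]], k: int = 3) -> list[str]:
    """Return the k pages most similar to the query in embedding space."""
    q = embed(query)
    ranked = sorted(pages, key=lambda page: cosine(q, embed(page)), reverse=True)
    return ranked[:k]
```

In a visual RAG variant, the same ranking logic would apply to embeddings of rendered page images rather than OCR text; only the encoder changes.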