Advances in Information Extraction and Web Corpus Construction

Information extraction and web corpus construction are moving toward more sophisticated and scalable methods for pulling structured data out of complex documents and web pages. Researchers are exploring large language models (LLMs) and new extraction pipelines to improve the accuracy and robustness of extraction. One notable trend is the development of model-based approaches that use semantic understanding to extract structured elements such as tables, formulas, and code blocks. These advances could significantly affect downstream applications such as predictive modeling and language model training.

Noteworthy papers include:

Information Extraction From Fiscal Documents Using LLMs, which demonstrates the effectiveness of LLMs in extracting structured data from fiscal documents.

AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser, which introduces a model-based extraction pipeline and reports state-of-the-art results in HTML extraction and corpus construction.

Sources

Information Extraction From Fiscal Documents Using LLMs

Benchmarking Table Extraction from Heterogeneous Scientific Extraction Documents

From Patents to Dataset: Scraping for Oxide Glass Compositions and Properties

AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
