Natural Language to SQL Generation

Research on Natural Language to SQL (NL2SQL) generation is moving toward more accurate and robust models, with a focus on closing semantic gaps and addressing poor benchmark quality. Recent work introduces new frameworks and datasets that improve NL2SQL performance through guided generation, SQL2Text back-translation validation, and task decomposition. These advances have improved execution accuracy and highlighted the need for more rigorous dataset curation. Noteworthy papers include GBV-SQL, which proposes a multi-agent framework for semantic validation and achieves a 5.8% absolute improvement on the BIRD benchmark, and DeKeyNLU, which presents a new dataset for refining task decomposition and enhancing keyword extraction precision and reports significant improvements in SQL generation accuracy on the BIRD and Spider dev datasets.
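
Execution accuracy, the metric mentioned above, can be illustrated with a minimal sketch: a predicted query counts as correct when it returns the same rows as the reference query on the same database. The helper names (`execution_match`, `execution_accuracy`, `build_demo_db`), the toy `singer` table, and the multiset comparison rule are assumptions made for illustration; they are not taken from GBV-SQL, DeKeyNLU, or the official BIRD and Spider evaluation scripts, which differ in their details.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same multiset of rows.

    Assumption: row order is ignored and duplicates are counted.
    A predicted query that fails to execute counts as a miss.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    matches = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return matches / len(pairs)


def build_demo_db():
    """Create a small in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 30), ("Bob", 45), ("Cy", 52)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    pairs = [
        # Different SQL text, same result rows: counted as correct.
        ("SELECT name FROM singer WHERE age >= 45",
         "SELECT name FROM singer WHERE age > 40"),
        # Missing filter: returns extra rows, counted as wrong.
        ("SELECT name FROM singer",
         "SELECT name FROM singer WHERE age > 40"),
        # Invalid SQL: execution fails, counted as wrong.
        ("SELEC name FROM singer",
         "SELECT name FROM singer"),
    ]
    print(execution_accuracy(conn, pairs))
```

Comparing results rather than query strings credits semantically equivalent SQL, but it also means a wrong query can pass by coincidence on a small or unrepresentative database, which is one reason dataset curation matters for this metric.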

Sources

Agentic LLMs for Question Answering over Tabular Data

Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation

GBV-SQL: Guided Generation and SQL2Text Back-Translation Validation for Multi-Agent Text2SQL

DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction
