Advances in Synthetic Data Generation with Large Language Models

Synthetic data generation is growing quickly as large language models (LLMs) become more capable. Researchers are developing methods that improve the accuracy, diversity, and feasibility of synthetic data, supporting applications that range from requirements engineering to population synthesis. A key trend is the use of probability-driven prompting and fine-tuning, in which LLMs estimate conditional distributions and the autoregressive generation process is explicitly controlled. These advances point toward more accurate and scalable data synthesis, with potential benefits for downstream tasks such as simulation, modeling, and machine learning. Noteworthy papers include:

  • A Large Language Model for Feasible and Diverse Population Synthesis, which proposes a hybrid approach combining an LLM with a Bayesian network (LLM-BN) that achieves high feasibility and diversity in population synthesis.
  • Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models, which uses LLMs within a product line approach to systematically generate synthetic requirements engineering (RE) data and demonstrates its potential to address data scarcity in RE.
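
To make the idea of estimating conditional distributions and controlling autoregressive generation more concrete, here is a minimal Python sketch. It is not taken from any of the papers listed here. The function `toy_conditional` is a hypothetical stand-in for an LLM that returns a distribution over one attribute given the attributes already sampled; the attribute names, the column order, and all probabilities are invented for illustration.

```python
import random
from typing import Callable, Dict, List

Distribution = Dict[str, float]


def toy_conditional(column: str, context: Dict[str, str]) -> Distribution:
    """Hypothetical stand-in for an LLM estimating P(column | context).

    A real system might prompt or fine-tune a model for these estimates;
    the numbers below are made up for this example.
    """
    if column == "age_group":
        return {"child": 0.2, "adult": 0.6, "senior": 0.2}
    if column == "employment":
        if context["age_group"] == "child":
            # Infeasible combinations get zero mass by omission.
            return {"not_employed": 1.0}
        if context["age_group"] == "senior":
            return {"employed": 0.2, "not_employed": 0.8}
        return {"employed": 0.7, "not_employed": 0.3}
    raise KeyError(f"unknown column: {column}")


def sample_record(
    order: List[str],
    conditional: Callable[[str, Dict[str, str]], Distribution],
    rng: random.Random,
) -> Dict[str, str]:
    """Sample one record column by column, in the given fixed order."""
    record: Dict[str, str] = {}
    for column in order:
        dist = conditional(column, record)
        values = list(dist)
        weights = [dist[v] for v in values]
        record[column] = rng.choices(values, weights=weights, k=1)[0]
    return record


rng = random.Random(0)
ORDER = ["age_group", "employment"]  # assumed ordering: parents before children
population = [sample_record(ORDER, toy_conditional, rng) for _ in range(1000)]
```

Because each attribute is drawn only from the distribution conditioned on earlier attributes, a combination with zero conditional probability (here, an employed child) cannot appear in the output. This is one simple way that controlling the generation order can support feasibility, under the assumptions stated above.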

Sources

A Note on Statistically Accurate Tabular Data Generation Using Large Language Models

Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models

A Large Language Model for Feasible and Diverse Population Synthesis

A Design Space for the Critical Validation of LLM-Generated Tabular Data

AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection
