Advances in Synthetic Data Generation with Large Language Models

Synthetic data generation is growing rapidly as large language models (LLMs) become more capable. Researchers are exploring methods to improve the accuracy, diversity, and feasibility of synthetic data so that it can be used in applications ranging from requirements engineering to population synthesis. A key trend is the development of probability-driven prompting approaches and fine-tuning methods that use LLMs to estimate conditional distributions and to control the autoregressive generation process. These advances are making data synthesis more accurate and scalable, with potential benefits for downstream tasks such as simulation, modeling, and machine learning. Noteworthy papers include:

A Large Language Model for Feasible and Diverse Population Synthesis, which proposes a hybrid LLM and Bayesian network (LLM-BN) approach that achieves high feasibility and diversity in population synthesis.
Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models, which introduces a product line approach that uses LLMs to systematically generate synthetic requirements engineering (RE) data and demonstrates its potential to address data scarcity in RE.
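
To make the idea of estimating conditional distributions and controlling autoregressive generation more concrete, the following is a minimal sketch and not the method of either paper above. It assumes a hypothetical attribute order (ORDER) and a hypothetical conditional() function that stands in for probabilities an LLM might be prompted or fine-tuned to provide; the attribute names and numbers are purely illustrative. Each record is sampled one attribute at a time, conditioned on the attributes already chosen, so infeasible combinations can be given zero probability.

```python
import random

# Hypothetical attribute order: each attribute depends only on earlier ones.
ORDER = ["age_group", "employment", "has_license"]


def conditional(attr, partial):
    """Stand-in for an LLM-estimated distribution P(attr | partial record).

    Assumption: in a real pipeline these probabilities would come from a
    model; the values here are made up for illustration.
    """
    if attr == "age_group":
        return {"child": 0.2, "adult": 0.6, "senior": 0.2}
    if attr == "employment":
        if partial["age_group"] == "child":
            return {"not_employed": 1.0}  # infeasible values get zero mass
        if partial["age_group"] == "adult":
            return {"employed": 0.7, "not_employed": 0.3}
        return {"employed": 0.1, "not_employed": 0.9}
    if attr == "has_license":
        if partial["age_group"] == "child":
            return {"no": 1.0}
        return {"yes": 0.8, "no": 0.2}
    raise KeyError(attr)


def sample_record(rng):
    """Sample one record attribute by attribute, in the fixed order."""
    record = {}
    for attr in ORDER:
        dist = conditional(attr, record)
        values, weights = zip(*dist.items())
        record[attr] = rng.choices(values, weights=weights, k=1)[0]
    return record


def synthesize(n, seed=0):
    """Generate n synthetic records with a seeded random generator."""
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]
```

Calling synthesize(1000) would return a list of dictionaries, one per synthetic person. Because the order is fixed and each step conditions on earlier choices, feasibility constraints (here, that a child is never employed or licensed) are enforced during generation rather than filtered out afterwards.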
