The field of synthetic data generation is evolving rapidly, with a growing focus on balancing privacy preservation against data utility. Recent work spans unified frameworks for evaluating synthetic tabular data and reinforcement learning methods for creative writing. Large language models (LLMs) are becoming increasingly central, with applications in data reconstruction, synthetic rewriting, and privacy-preserving text generation.
A key direction in this field is the development of context-aware privacy measures, which can provide stronger privacy protection than traditional context-free definitions. Additionally, the use of semantic triples and local differential privacy guarantees is showing promise in private document generation.
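To make the local differential privacy angle concrete, below is a minimal sketch of k-ary randomized response, a standard LDP mechanism, applied to one slot of a semantic triple before document generation. The function name, the example domain, and the triple are illustrative assumptions, not drawn from any of the papers discussed here.

```python
import math
import random

def k_rr(value, domain, epsilon):
    """k-ary randomized response: an epsilon-LDP mechanism that keeps the
    true value with probability e^eps / (e^eps + k - 1) and otherwise
    reports a uniformly random other element of the domain."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_keep:
        return value
    # Report a uniformly random value other than the true one.
    return random.choice([v for v in domain if v != value])

# Hypothetical usage: privatize the object slot of a semantic triple
# before handing the triple to a generator for private document synthesis.
OBJECT_DOMAIN = ["diabetes", "hypertension", "asthma", "arthritis"]
subj, pred, obj = ("patient_123", "diagnosed_with", "diabetes")
private_triple = (subj, pred, k_rr(obj, OBJECT_DOMAIN, epsilon=2.0))
print(private_triple)
```

The ratio between the probability of reporting any given value under two different true inputs is bounded by e^epsilon, which is what gives the per-record (local) guarantee without trusting a central curator.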
Some noteworthy papers in this area include:

- FEST, a unified framework for evaluating synthetic tabular data that integrates diverse privacy metrics with machine learning utility metrics.
- RLMR, a reinforcement learning method for creative writing that uses a dynamically mixed reward system to balance subjective writing quality against objective constraint following.
- The Double-edged Sword of LLM-based Data Reconstruction, which explores how LLMs can exploit contextual vulnerability in differentially private text sanitization and offers recommendations for using LLM data reconstruction as a post-processing step.
- RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting, which proposes a reinforcement learning framework that fine-tunes an LLM with a composite reward function to jointly optimize explicit and implicit privacy, semantic fidelity, and output diversity.
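To illustrate the composite-reward idea behind the last paper, here is a hedged sketch of how a reward combining privacy, fidelity, and diversity terms might be scored during RL fine-tuning. The weights and the three proxy scoring functions (a PII-leakage check, lexical overlap, and a distinct-token ratio) are simplifying assumptions for illustration, not the paper's actual reward components.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    privacy: float = 0.4
    fidelity: float = 0.4
    diversity: float = 0.2

def privacy_score(rewrite: str, pii_terms: list[str]) -> float:
    """Explicit-privacy proxy: fraction of known sensitive terms absent
    from the rewrite. Implicit privacy would require a learned attacker
    model, omitted here."""
    if not pii_terms:
        return 1.0
    leaked = sum(term.lower() in rewrite.lower() for term in pii_terms)
    return 1.0 - leaked / len(pii_terms)

def fidelity_score(source: str, rewrite: str) -> float:
    """Semantic-fidelity proxy: token-level Jaccard overlap. A real system
    would use embedding similarity instead."""
    a, b = set(source.lower().split()), set(rewrite.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def diversity_score(rewrite: str) -> float:
    """Diversity proxy: distinct-1, the ratio of unique tokens."""
    tokens = rewrite.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def composite_reward(source, rewrite, pii_terms, w=RewardWeights()):
    """Weighted sum of the three objectives, used as the scalar RL reward."""
    return (w.privacy * privacy_score(rewrite, pii_terms)
            + w.fidelity * fidelity_score(source, rewrite)
            + w.diversity * diversity_score(rewrite))

# Hypothetical usage on a single source/rewrite pair.
src = "Patient John Doe was diagnosed with diabetes in 2019."
out = "An adult patient received a chronic-illness diagnosis a few years ago."
print(round(composite_reward(src, out, pii_terms=["John Doe", "2019"]), 3))
```

The weighted-sum structure is what lets the policy trade the objectives off against one another during fine-tuning; in a full system, the privacy term would typically come from an attacker model probing implicit leakage rather than a fixed term list.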