Synthetic Data Generation for Tabular and Textual Data

Synthetic data generation is moving toward more sophisticated methods for producing high-fidelity samples, particularly for tabular and textual data. Researchers are exploring techniques that address class imbalance, data scarcity, and noise. One notable direction combines conditional Generative Adversarial Networks (GANs) with probabilistic sampling strategies to generate samples that follow the original data distribution. Another is domain-specific generation, especially in healthcare and finance, where data privacy and scarcity are significant concerns. Noteworthy papers in this area include:

A Conditional GAN for Tabular Data Generation with Probabilistic Sampling of Latent Subspaces, which presents a novel approach to alleviating class imbalance in tabular datasets.
Synthetic medical data generation: state of the art and application to trauma mechanism classification, which proposes a methodology for generating high-quality synthetic medical records.
Generating Synthetic Invoices via Layout-Preserving Content Replacement, which introduces a pipeline for generating high-fidelity synthetic invoice documents and their corresponding structured data.
Categorising SME Bank Transactions with Machine Learning and Synthetic Data Generation, which uses synthetic data generation to augment existing transaction datasets and improve classification accuracy.
