Causal Inference and Synthetic Data in AI Research

Research in artificial intelligence is placing growing emphasis on causal inference and on synthetic data. Researchers are developing methods for generating high-quality synthetic data to train and evaluate machine learning models, particularly where labeled data is scarce; this includes new generalization bounds and optimization methods for synthetic data generation. There is also growing interest in using synthetic data to estimate the true error of machine learning models and to improve the robustness of large language models.

Noteworthy papers include "Using Synthetic Data to estimate the True Error is theoretically and practically doable", which proposes optimizing synthetic samples for model evaluation; "SynQuE: Estimating Synthetic Dataset Quality Without Annotations", which introduces a framework for ranking synthetic datasets by their expected real-world task performance; and "Towards Causal Market Simulators", which proposes a Time-series Neural Causal Model VAE for generating counterfactual financial time series.
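To make the idea of evaluating a model on synthetic data more concrete, the sketch below shows one generic approach: importance-weighted evaluation, where synthetic samples with known labels are reweighted by a domain classifier so that their weighted error approximates the error on real data. This is a minimal, hypothetical illustration, not the method of any paper cited here; the use of scikit-learn, the noise-based "synthetic" data, and the weighting scheme are all assumptions made for the example.

```python
# Hypothetical sketch: importance-weighted evaluation on synthetic data.
# NOT the method from the cited papers; a generic illustration of estimating
# a model's error from labeled synthetic samples plus unlabeled real samples.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# "Real" distribution; the held-out labeled test set is kept only to check
# how close the synthetic-data estimate gets to the true error.
X_real, y_real = make_classification(n_samples=4000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.5, random_state=0
)

# Model under evaluation, trained on scarce labeled data.
model = LogisticRegression(max_iter=1000).fit(X_train[:200], y_train[:200])

# Synthetic evaluation set, simulated here as a perturbed copy of the real
# features; in practice it would come from a generator with known labels.
X_syn = X_test + rng.normal(0.0, 0.3, size=X_test.shape)
y_syn = y_test  # labels assumed known for synthetic samples

# Domain classifier distinguishes real (1) from synthetic (0) features;
# the odds ratio approximates the density ratio p_real(x) / p_syn(x).
X_dom = np.vstack([X_test, X_syn])
d_dom = np.concatenate([np.ones(len(X_test)), np.zeros(len(X_syn))])
domain_clf = LogisticRegression(max_iter=1000).fit(X_dom, d_dom)
p_real = domain_clf.predict_proba(X_syn)[:, 1]
weights = p_real / np.clip(1.0 - p_real, 1e-6, None)
weights /= weights.mean()

# Weighted error on synthetic data vs. the (normally unavailable) true error.
err_syn = np.average(model.predict(X_syn) != y_syn, weights=weights)
err_true = np.mean(model.predict(X_test) != y_test)
print(f"synthetic-data estimate: {err_syn:.3f}  true error: {err_true:.3f}")
```

The same ingredients, with labels removed and the density-ratio step replaced by a learned proxy metric, are closer in spirit to ranking synthetic datasets without annotations, but the cited papers should be consulted for their actual formulations.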

Sources

A Technical Exploration of Causal Inference with Hybrid LLM Synthetic Data

Using Synthetic Data to estimate the True Error is theoretically and practically doable

Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

Metamorphic Testing of Large Language Models for Natural Language Processing

SynQuE: Estimating Synthetic Dataset Quality Without Annotations

Towards Causal Market Simulators
