Advances in Handling Missing Data

Research on missing data is advancing quickly, with growing attention to methods that impute missing values and generate high-quality synthetic data. Researchers are tackling complex missingness patterns with techniques such as multi-task learning, masked autoencoding, and synthetic data generation. These advances could improve the performance of machine learning models in applications such as healthcare, marketing, and biomedicine. Noteworthy papers in this area include:

  • CACTI, which leverages copy masking and contextual information to improve tabular data imputation, achieving state-of-the-art results.
  • An agentic framework for missing modality prediction, which dynamically formulates modality-aware mining strategies and introduces a self-refinement mechanism to enhance the generated modalities.
  • LSM-2 with Adaptive and Inherited Masking, a novel self-supervised learning approach that learns robust representations directly from incomplete wearable sensor data.
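
Several of the methods above train by hiding values that were actually observed and asking a model to reconstruct them. The sketch below is a loose illustration of that general idea, not the algorithm of CACTI or LSM-2: it hides observed entries in each row by borrowing the missingness pattern of another row, so the artificial gaps resemble the real ones. The function name `copy_mask`, the use of `None` for missing values, and the toy table are all assumptions made for this example.

```python
import random


def copy_mask(rows, seed=0):
    """Hide observed entries in each row using another row's missingness pattern.

    rows: list of equal-length lists, with None marking a missing value.
    Returns (masked_rows, targets); targets[i] maps a column index to the
    value that was hidden in row i and can serve as a reconstruction target.
    Hypothetical helper for illustration; assumes at least two rows.
    """
    rng = random.Random(seed)
    # True where a value is missing in the original table.
    patterns = [[value is None for value in row] for row in rows]
    masked_rows, targets = [], []
    for i, row in enumerate(rows):
        # Pick a donor row other than the current one.
        donor = rng.choice([j for j in range(len(rows)) if j != i])
        masked, hidden = list(row), {}
        for col, donor_missing in enumerate(patterns[donor]):
            # Hide only values that are observed here but missing in the donor.
            if donor_missing and row[col] is not None:
                hidden[col] = row[col]
                masked[col] = None
        masked_rows.append(masked)
        targets.append(hidden)
    return masked_rows, targets


# Toy table (made-up values): each row is missing one entry.
rows = [
    [1.0, None, 3.0],
    [None, 2.0, 6.0],
    [5.0, 4.0, None],
]
masked_rows, targets = copy_mask(rows)
for original, masked, hidden in zip(rows, masked_rows, targets):
    print(original, "->", masked, "targets:", hidden)
```

A model would then be trained or evaluated on how well it recovers the values in `targets` from `masked_rows`. Entries that were missing to begin with never become targets, since their true values are unknown.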

Sources

Multi-task Learning for Heterogeneous Multi-source Block-Wise Missing Data

CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

PandasBench: A Benchmark for the Pandas API

CART-based Synthetic Tabular Data Generation for Imbalanced Regression

How Far Are We from Predicting Missing Modalities with Foundation Models?

N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion

Does Prompt Design Impact Quality of Data Imputation by LLMs?

LSM-2: Learning from Incomplete Wearable Sensor Data
