Advances in Synthetic Data Generation and Molecular Property Prediction

Research on large language models and molecular property prediction is advancing rapidly, with a focus on better synthetic data generation and stronger out-of-distribution performance. Recent work shows that incorporating cross-document knowledge associations and graph-based methods can make synthetic data more diverse and coherent, which in turn improves generalization. Large language models and multi-modal fusion approaches have also improved molecular property prediction, supporting the discovery of diverse new molecules with strong protein binding affinity. Noteworthy papers include PoseX, which proposes an open-source benchmark for protein-ligand cross docking, and BOOM, which presents a benchmark study of out-of-distribution molecular property prediction. MatMMFuse is also notable: its multi-modal fusion approach to material property prediction shows improved performance over single-modality models.

Sources

Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models

PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking

BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning

34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery

MatMMFuse: Multi-Modal Fusion model for Material Property Prediction

Towards Artificial Intelligence Research Assistant for Expert-Involved Learning

Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions
