Research on large language models and molecular property prediction is advancing rapidly, with particular attention to synthetic data generation and out-of-distribution performance. Recent work shows that incorporating cross-document knowledge associations and graph-based methods makes synthetic data more diverse and coherent, which in turn improves generalization. In parallel, large language models and multi-modal fusion approaches have improved molecular property prediction, supporting the discovery of diverse new molecules with strong protein binding affinity. Noteworthy papers in this area include PoseX, which proposes an open-source benchmark for protein-ligand cross docking, and BOOM, which presents a benchmark study of out-of-distribution molecular property prediction. MatMMFuse is also notable for its multi-modal fusion approach to material property prediction, which shows improved performance over single-modality models.
Advances in Synthetic Data Generation and Molecular Property Prediction
Sources
Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models
Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning