The field of predictive modeling for virtual and physical interactions is moving toward more sophisticated and nuanced approaches. Researchers are applying machine learning models, including regression-based approaches and multimodal models, to improve the accuracy and interpretability of predictions across a range of tasks: predicting user grasp intentions in virtual reality, discovering physical laws from observational data, and generating future frames for video prediction. Noteworthy papers in this area include:

- Predicting User Grasp Intentions in Virtual Reality, which demonstrates the potential of regression-based approaches for predicting user intentions in VR (a minimal illustrative sketch follows this list).
- Mimicking the Physicist's Eye, which proposes a multimodal model for discovering physical laws from observational data and achieves state-of-the-art accuracy and interpretability.
- FlowVLA, which introduces a video-prediction pre-training framework for anticipating future frames and demonstrates improved sample efficiency.
- Ego-centric Predictive Model Conditioned on Hand Trajectories, which proposes a unified two-stage framework for jointly modeling future actions and visual observations in egocentric scenarios.
- SPGrasp, which achieves low-latency inference while maintaining promptability, enabling real-time interactive grasp synthesis.
- Learning Primitive Embodied World Models, which proposes a world-modeling paradigm that restricts video generation to fixed short horizons, enabling fine-grained alignment between linguistic concepts and visual representations of robotic actions.
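
To make the regression-based framing concrete, the sketch below fits a simple ridge regressor that maps per-frame hand and gaze features to a continuous grasp-intention score. This is an illustrative stand-in, not the pipeline from Predicting User Grasp Intentions in Virtual Reality: the feature set, the synthetic labels, and the choice of ridge regression are all assumptions made for the example.

```python
# Minimal sketch (assumed setup, not the paper's actual method): regress a
# grasp-intention score from per-frame VR interaction features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Hypothetical per-frame features: hand speed, hand-to-object distance,
# gaze-to-object angle, and finger aperture.
n_frames = 5000
X = rng.normal(size=(n_frames, 4))

# Synthetic target in [0, 1], standing in for labels derived from
# annotated grasp onsets in a real dataset.
y = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2])))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a ridge regressor and report held-out error.
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```

In practice the same structure applies with real tracked features and labels; the appeal of the regression formulation noted above is that the learned coefficients remain directly inspectable.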