The field of world modeling and video prediction is moving towards more scalable and effective approaches, leveraging pre-trained models and the inductive biases of large language models to improve performance. Researchers are exploring new architectures and training objectives to enable autoregressive generation and action controllability in world models (a minimal rollout sketch follows the paper list below). Additionally, there is growing interest in incorporating physical plausibility and consistency into video generation, with methods that adaptively steer diffusion models towards physics-relevant motion. Noteworthy papers include:
- Vid2World, which presents a general approach for repurposing pre-trained video diffusion models into interactive world models.
- ProgGen, which proposes a method for programmatic video prediction using large language models to synthesize plausible visual futures.
- MAGIC, which introduces a training-free framework for single-image physical property inference and dynamic generation.
- ForeDiff, which proposes a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising.
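To make the autoregressive, action-controllable setup mentioned above concrete, here is a minimal sketch of an action-conditioned rollout driven by a diffusion-style frame denoiser. All names (`TinyDenoiser`, `sample_next_frame`, `rollout`), the toy vectorized frames, and the simplified DDPM reverse loop are illustrative assumptions, not the architecture or API of any paper listed; a real system would run the same loop with a pre-trained video diffusion backbone conditioned on past frames and actions.

```python
# Illustrative sketch only: a toy denoiser and an autoregressive,
# action-conditioned rollout loop. Not taken from any of the cited papers.
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    """Toy stand-in for a pre-trained video diffusion backbone.

    Predicts the noise in a corrupted next frame, conditioned on the
    previous frame, a discrete action embedding, and the timestep.
    """

    def __init__(self, frame_dim=64, num_actions=4, hidden=128):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, hidden)
        self.net = nn.Sequential(
            nn.Linear(frame_dim * 2 + hidden + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, noisy_next, prev_frame, action, t):
        a = self.action_emb(action)
        t_feat = t.float().unsqueeze(-1) / 1000.0
        x = torch.cat([noisy_next, prev_frame, a, t_feat], dim=-1)
        return self.net(x)  # predicted noise


@torch.no_grad()
def sample_next_frame(model, prev_frame, action, steps=50):
    """Very simplified DDPM-style reverse loop for a single next frame."""
    x = torch.randn_like(prev_frame)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for i in reversed(range(steps)):
        t = torch.full((x.shape[0],), i, dtype=torch.long)
        eps = model(x, prev_frame, action, t)
        # Posterior mean; fresh noise is added on all but the final step.
        x = (x - (1 - alphas[i]) / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x


@torch.no_grad()
def rollout(model, first_frame, actions):
    """Autoregressive rollout: each generated frame conditions the next step."""
    frames = [first_frame]
    for a in actions:
        frames.append(sample_next_frame(model, frames[-1], a))
    return torch.stack(frames, dim=1)  # (batch, time, frame_dim)


if __name__ == "__main__":
    model = TinyDenoiser()
    frame0 = torch.randn(1, 64)
    actions = [torch.tensor([i % 4]) for i in range(3)]
    video = rollout(model, frame0, actions)
    print(video.shape)  # torch.Size([1, 4, 64])
```

The design point the sketch illustrates is that action controllability enters only through the conditioning of the denoiser, while the autoregressive loop simply feeds each generated frame back in as context for the next one.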