The field of video generation is moving toward physical awareness and realism in generated content. Recent work leverages large language models and foundation models to guide the generation process and improve physics understanding, yielding notable gains in physically plausible output such as realistic object interactions and motion patterns. Key techniques include relational alignment, token-level distillation, and multimodal large language models for enhancing physics awareness in video generation models. Noteworthy papers include:

- DiffPhy, a generic framework for physically correct and photo-realistic video generation that fine-tunes a pre-trained video diffusion model.
- PanoWan, a method that lifts pre-trained text-to-video models to the panoramic domain, achieving state-of-the-art performance in panoramic video generation.
- MOVi, a training-free approach to multi-object video generation that leverages the open-world knowledge of diffusion models and large language models.
- VideoREPA, a framework that distills physics understanding from video understanding foundation models into text-to-video models by aligning token-level relations.
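The token-relation alignment idea behind VideoREPA can be illustrated with a minimal NumPy sketch: build a pairwise-similarity matrix over each model's token features and penalize the distance between the teacher's and student's matrices. The function names, the choice of cosine similarity, and the MSE objective here are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def pairwise_cosine(tokens):
    # tokens: (N, D) array of token features.
    # Returns the (N, N) matrix of cosine similarities between tokens,
    # i.e. a "relation matrix" describing how tokens relate to each other.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    return unit @ unit.T

def relational_alignment_loss(student_tokens, teacher_tokens):
    # Illustrative distillation objective (assumed form, not the paper's):
    # mean squared error between the student's and teacher's relation
    # matrices, so the student mimics the teacher's token-level structure
    # rather than its raw features.
    rel_student = pairwise_cosine(student_tokens)
    rel_teacher = pairwise_cosine(teacher_tokens)
    return float(np.mean((rel_student - rel_teacher) ** 2))
```

Matching relation matrices rather than raw features lets the student keep its own feature space while inheriting the teacher's notion of which tokens belong together, which is what makes this a relational (rather than pointwise) distillation.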