Advances in Text-to-Video Generation: Physical Realism and Controllability

The field of text-to-video generation is advancing rapidly, with a focus on improving physical realism and controllability. Recent papers introduce new benchmarks and evaluation methods for assessing the physical realism of generated videos, such as PhyWorldBench and PhysVidBench, which test whether models can plausibly simulate physical phenomena, including object motion, energy conservation, and tool use. Other papers propose architectures and techniques for more controllable generation, such as neighborhood adaptive block-level attention and vectorized timestep adaptation. Noteworthy papers include PUSA V1.0, which surpasses Wan-I2V at a greatly reduced training cost, and MotionShot, which achieves high-fidelity motion transfer between objects with large appearance and structural disparities. Overall, the field is moving toward more realistic and controllable video generation, with potential applications in cinematic production, medical imaging, and interactive world generation.
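
To make the "vectorized timestep adaptation" idea concrete, the sketch below shows a per-frame timestep variant of the standard diffusion forward process: instead of one scalar noise level for the whole clip, each frame gets its own timestep, so a conditioning frame can stay nearly clean while later frames are noised. This is only a minimal illustration under a generic DDPM-style schedule; the function name, tensor shapes, and schedule are assumptions and are not taken from the PUSA paper.

```python
import torch

def add_noise_per_frame(latents, timesteps, alphas_cumprod):
    """Forward-diffuse video latents with one timestep per frame.

    latents:        [B, F, C, H, W] video latents
    timesteps:      [B, F] integer diffusion steps, one per frame
    alphas_cumprod: [T] cumulative alpha schedule
    """
    noise = torch.randn_like(latents)
    a = alphas_cumprod[timesteps]          # [B, F] per-frame noise levels
    a = a.view(*a.shape, 1, 1, 1)          # broadcast over C, H, W
    noisy = a.sqrt() * latents + (1.0 - a).sqrt() * noise
    return noisy, noise

# Example: image-to-video-style conditioning, where frame 0 is kept almost clean
B, F, C, H, W, T = 1, 8, 4, 32, 32, 1000
latents = torch.randn(B, F, C, H, W)
alphas_cumprod = torch.linspace(0.9999, 0.0001, T)   # toy schedule
t = torch.randint(0, T, (B, F))
t[:, 0] = 0                                           # conditioning frame stays near-clean
noisy_latents, target_noise = add_noise_per_frame(latents, t, alphas_cumprod)
```

A denoiser trained with such per-frame timesteps can, in principle, handle image-to-video, video extension, and interpolation within one model, since conditioning simply corresponds to frames held at low noise levels.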

Sources

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

∇NABLA: Neighborhood Adaptive Block-Level Attention

Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis

Generative AI-Driven High-Fidelity Human Motion Simulation

TokensGen: Harnessing Condensed Tokens for Long Video Generation

Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

Temporally-Constrained Video Reasoning Segmentation and Automated Benchmark Construction

Controllable Video Generation: A Survey

EndoGen: Conditional Autoregressive Endoscopic Video Generation

Yume: An Interactive World Generation Model

Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA

Enhancing Scene Transition Awareness in Video Generation via Post-Training

T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation

Captain Cinema: Towards Short Movie Generation
