Advances in Physically Plausible Perception and Video World Models

Computer vision is increasingly incorporating temporal cues and physically plausible perception into its models, most visibly through video world models that predict future frames and capture intuitive physics. Researchers are also leveraging multimodal cues and raw temporal video to improve spatiotemporal perception and intent inference, and pre-training on heterogeneous data is gaining traction as a route to adaptive, zero-shot human trajectory prediction.

Notable papers include: Video Self-Distillation for Single-Image Encoders, which uses video self-distillation as a lightweight route to geometry-aware perception in a single-image backbone (a minimal sketch follows below); Back to the Features: DINO as a Foundation for Video World Models, which builds a generalist video world model on DINO features and outperforms prior models across a range of benchmarks (see the second sketch below); Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues, which achieves state-of-the-art pedestrian intention prediction without extensive retraining; and OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction, which offers robust zero-shot transfer to unseen datasets with varying temporal dynamics.
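To make the self-distillation idea concrete, here is a minimal, hypothetical sketch: a single-image student encoder is trained to match features that a frozen, EMA-updated teacher extracts from a temporally offset frame of the same clip. The TinyEncoder backbone, the cosine loss, and the EMA rate are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a real single-image backbone (e.g. a ViT)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

student = TinyEncoder()
teacher = TinyEncoder()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(frame_t, frame_t_plus_k, ema=0.996):
    # The teacher sees a temporally offset frame of the same clip; the
    # student must predict its features from a single frame.
    with torch.no_grad():
        target = F.normalize(teacher(frame_t_plus_k), dim=-1)
    pred = F.normalize(student(frame_t), dim=-1)
    loss = (1.0 - (pred * target).sum(-1)).mean()  # cosine distance
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update keeps the teacher a slow-moving copy of the student.
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(ema).add_(ps, alpha=1.0 - ema)
    return loss.item()

# Toy usage: random tensors stand in for two frames of a real video.
frame_t = torch.randn(4, 3, 64, 64)
frame_tk = torch.randn(4, 3, 64, 64)
print(distill_step(frame_t, frame_tk))
```

The temporal offset between the two frames is what injects video signal into a purely single-image encoder at inference time.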

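The world-model direction is easiest to see in feature space: instead of synthesizing pixels, a predictor rolls frozen DINO features forward in time. The sketch below is a hedged illustration of that general recipe; LatentPredictor, its dimensions, and the MSE objective are assumptions, not the architecture from Back to the Features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 384   # assumed DINO ViT-S feature width
CONTEXT = 4      # number of past frames the predictor conditions on

class LatentPredictor(nn.Module):
    """Causal transformer that predicts the next frame's DINO features."""
    def __init__(self, dim=FEAT_DIM, heads=6, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, past):  # past: (batch, time, dim) frozen DINO features
        mask = nn.Transformer.generate_square_subsequent_mask(past.size(1))
        h = self.backbone(past, mask=mask)
        return self.head(h[:, -1])  # prediction for the step after `past`

model = LatentPredictor()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Toy batch: random tensors stand in for precomputed per-frame DINO features.
feats = torch.randn(8, CONTEXT + 1, FEAT_DIM)
pred = model(feats[:, :CONTEXT])            # roll the model forward one step
loss = F.mse_loss(pred, feats[:, CONTEXT])  # regress the held-out next step
loss.backward()
opt.step()
print(loss.item())
```

Predicting in a frozen, semantically structured feature space sidesteps pixel-level rendering, which is one plausible reason feature-space world models generalize well across benchmarks.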
Sources

Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception

Back to the Features: DINO as a Foundation for Video World Models

Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues

OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction
