Advances in Video Generation and Understanding

The field of video generation and understanding is advancing rapidly, with a focus on developing more efficient, effective, and controllable models. Recent papers have introduced novel frameworks such as PL-Stitch, ShowMe, and CtrlVDiff, which leverage self-supervised learning, diffusion models, and multimodal fusion to improve video representation learning, generation, and editing. These models have achieved state-of-the-art performance on various benchmarks, demonstrating their potential for real-world applications. Noteworthy papers on UltraViCo, Infinity-RoPE, and MoGAN have pushed the boundaries of video generation, enabling infinite-horizon, controllable, and cinematic video diffusion and improving motion quality through few-step motion adversarial post-training. Overall, the field is moving toward more sophisticated, flexible, and user-friendly video generation and understanding systems.
Sources
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis
One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer
Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks
Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis