Video Reasoning and Temporal Grounding

Video reasoning and temporal grounding are advancing rapidly, with a focus on models that can understand and reason about complex video content. A key direction is the integration of chain-of-thought reasoning with multimodal learning, enabling models to perform explicit object-centric spatiotemporal reasoning and stronger compositional video reasoning. Another trend in video question answering couples causal-aware query refinement with fine-grained visual grounding. There is also growing interest in self-supervised procedure learning, which discovers key steps and recovers their order from unlabeled procedural videos, and new benchmarks such as CausalStep are driving progress toward robust, interpretable video reasoning. Illustrative sketches of the temporal-overlap metric used in grounding evaluation and of cross-video step alignment follow the list below. Noteworthy papers in this area include:

  • CoTasks, which proposes a framework of chain-of-thought based video instruction tuning tasks.
  • LeAdQA, which closes gaps in current video question answering methods by coupling causal-aware query refinement with fine-grained visual grounding.
  • Talk2Event, which provides a large-scale benchmark for language-driven object grounding in event-based perception.
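
Temporal grounding methods, including those trained with reinforcement learning, are typically scored by temporal intersection-over-union between a predicted segment and the annotated one. The snippet below is a minimal illustration of that metric in plain Python; it is not code from any of the listed papers.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between a predicted and a ground-truth segment.

    pred, gt: (start, end) pairs in seconds, with start <= end.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: segments (12 s, 20 s) and (15 s, 23 s) overlap for 5 s
# out of an 11 s union, so the temporal IoU is 5/11, roughly 0.45.
print(temporal_iou((12.0, 20.0), (15.0, 23.0)))
```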
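
For self-supervised procedure learning, one common signal is to align frames across two videos of the same task by matching their internal self-similarity structure, which is what Gromov-Wasserstein optimal transport computes. The sketch below uses the open-source POT library; the frame-embedding inputs and the helper name align_videos_gw are illustrative assumptions, and the regularizers added in the listed paper are omitted.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def align_videos_gw(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    """Soft frame-to-frame alignment between two videos of the same procedure.

    feats_a: (Ta, d) frame embeddings of video A (from any visual encoder).
    feats_b: (Tb, d) frame embeddings of video B.
    Returns a (Ta, Tb) coupling; the argmax per row gives a frame match.
    """
    # Intra-video structure: pairwise distances between frames of one video.
    C_a = ot.dist(feats_a, feats_a)
    C_b = ot.dist(feats_b, feats_b)
    C_a /= C_a.max() + 1e-8
    C_b /= C_b.max() + 1e-8

    # Uniform mass over the frames of each video.
    p = ot.unif(feats_a.shape[0])
    q = ot.unif(feats_b.shape[0])

    # Plain (unregularized) Gromov-Wasserstein coupling.
    return ot.gromov.gromov_wasserstein(C_a, C_b, p, q, loss_fun="square_loss")
```

Key steps could then be read off as frames that align consistently across many video pairs, which is the intuition behind the procedure-learning line of work above.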

Sources

  • CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks
  • LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering
  • Grounding Degradations in Natural Language for All-In-One Video Restoration
  • Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
  • CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos
  • Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
  • Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning