Advancements in Video Understanding and Reasoning

The field of video understanding and reasoning is moving toward complex tasks that require multimodal understanding, temporal reasoning, and spatial awareness. Researchers are developing datasets and models that learn from video, capture temporal patterns, and recognize spatial relationships. A key direction is the construction of benchmarks that expose the limitations of current models and motivate further research. Notable papers in this area include:

VideoCAD, which introduces a large-scale synthetic dataset for learning UI interactions and 3D reasoning from CAD software, and proposes a state-of-the-art model for learning CAD interactions directly from video.

Time Blindness, which highlights the inability of vision-language models to capture purely temporal patterns and introduces a benchmark to catalyze research in temporal pattern recognition (a minimal illustration of such a pattern follows below).

Seeing the Arrow of Time, which tackles the deficiency of modern large multimodal models in perceiving and using temporal directionality in video, and introduces a reinforcement learning-based training strategy to instill Arrow of Time awareness.
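To make the notion of a "purely temporal pattern" concrete, here is a minimal sketch, not taken from the Time Blindness benchmark; all function names and parameters are illustrative assumptions. It builds a clip in which every frame is a spatially uniform gray square, so no single frame carries information and the message can only be recovered from how brightness varies across frames.

```python
# Illustrative sketch only: a toy "purely temporal" pattern, where the signal
# lives entirely in the frame ordering rather than in any individual frame.
import numpy as np

def make_temporal_clip(bits, frame_size=64, frames_per_bit=4):
    """Build a grayscale clip whose only signal is brightness over time.

    Each frame is spatially uniform; a bright frame encodes 1 and a dark
    frame encodes 0. Any single frame looks like a blank gray image, so a
    model that ignores temporal ordering cannot recover the message.
    """
    frames = []
    for bit in bits:
        level = 200 if bit else 55  # per-frame brightness encodes the bit
        frame = np.full((frame_size, frame_size), level, dtype=np.uint8)
        frames.extend([frame] * frames_per_bit)
    return np.stack(frames)  # shape: (T, H, W)

def decode_temporal_clip(clip, frames_per_bit=4, threshold=128):
    """Recover the bit sequence purely from per-frame mean brightness."""
    means = clip.reshape(clip.shape[0], -1).mean(axis=1)
    per_bit = means.reshape(-1, frames_per_bit).mean(axis=1)
    return [int(m > threshold) for m in per_bit]

if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1]
    clip = make_temporal_clip(message)
    assert decode_temporal_clip(clip) == message
    print("clip shape:", clip.shape)  # (28, 64, 64) for 7 bits x 4 frames
```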

Sources

VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software

Time Blindness: Why Video-Language Models Can't See What Humans Can?

Seeing the Arrow of Time in Large Multimodal Models

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
