Advancements in Video-Language Understanding

Video-language understanding is evolving rapidly, with current work focused on making models more accurate and efficient at interpreting video content and generating text grounded in it. Recent developments highlight the importance of mitigating hallucinations in multimodal large language models, with approaches ranging from visual contrastive decoding to inference-time intervention being proposed to address the issue. There is also growing emphasis on generalizable, adaptable models that can handle a wide range of video understanding tasks, including long-form video analysis and video reasoning. Notable papers include Affordance-First Decomposition, which achieves state-of-the-art results in continual learning for video-language understanding, and OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks. Also notable are Med-VCD, which proposes a sparse visual contrastive decoding method to mitigate hallucinations in medical large vision-language models, and TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens the temporal comprehension of multimodal large language models.
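To make the hallucination-mitigation theme more concrete, below is a minimal sketch of generic visual contrastive decoding at the logit level: next-token logits conditioned on the real visual input are contrasted against logits conditioned on a degraded (or absent) input, penalizing tokens that the language prior favors regardless of the visual evidence. This illustrates the general technique only, not Med-VCD's specific sparse variant; the function names, the `alpha` weight, and the toy logits are hypothetical.

```python
# Illustrative sketch of visual contrastive decoding (general idea, not Med-VCD's method).
# Two forward passes of the same vision-language model are assumed: one conditioned on
# the real image/video, one on a degraded or missing visual input. The adjustment
# down-weights tokens whose score comes mostly from the language prior.
import numpy as np

def contrastive_logits(logits_visual: np.ndarray,
                       logits_degraded: np.ndarray,
                       alpha: float = 1.0) -> np.ndarray:
    """Return adjusted next-token logits: (1 + alpha) * visual - alpha * degraded."""
    return (1.0 + alpha) * logits_visual - alpha * logits_degraded

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Softmax sampling over the adjusted logits."""
    z = logits / temperature
    z = z - z.max()                      # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy 5-token vocabulary: token 3 narrowly wins on raw logits but its score is mostly
# language prior (it stays high without the image); token 1 is visually grounded.
logits_with_image = np.array([0.1, 1.8, 0.3, 2.0, 0.2])
logits_without_image = np.array([0.1, 0.2, 0.3, 1.9, 0.2])
adjusted = contrastive_logits(logits_with_image, logits_without_image, alpha=1.0)
print(adjusted)                          # token 1 now dominates token 3
print(sample_next_token(adjusted))
```

The contrast weight `alpha` trades off how strongly prior-driven tokens are suppressed; setting it to zero recovers ordinary decoding on the visually conditioned logits.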
Sources
ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
COACH: Collaborative Agents for Contextual Highlighting - A Multi-Agent Framework for Sports Video Analysis
Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding
V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning