Advancements in Multimodal Video Understanding

The field of multimodal video understanding is evolving rapidly, with a focus on more accurate and efficient models for video comprehension. Recent research has emphasized evaluating and improving multimodal reward models, which play a crucial role in both the training and inference of Large Vision-Language Models (LVLMs). There is also growing interest in detecting misleading video thumbnails, grounding multimodal misinformation, and improving video anomaly detection.

Noteworthy papers in this area include VideoRewardBench, which introduces a comprehensive benchmark for evaluating multimodal reward models in the video domain, and the Kwai Keye-VL 1.5 Technical Report, which presents a Slow-Fast video encoding strategy and a progressive four-stage pre-training methodology for improved video understanding. ThumbnailTruth proposes a multimodal detection pipeline for identifying misleading YouTube thumbnails, while A New Dataset and Benchmark for Grounding Multimodal Misinformation introduces the task of Grounding Multimodal Misinformation and presents a VLM-based baseline for effective detection and grounding.
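The general intuition behind slow-fast encoding is to trade spatial resolution against temporal coverage under a fixed token budget: a "slow" pathway keeps a few frames at full resolution for spatial detail, while a "fast" pathway covers many frames at reduced resolution for motion. The sketch below illustrates this idea only at the frame-sampling level; the function name, sampling rates, and 4x downsampling factor are illustrative assumptions, not the actual Keye-VL 1.5 design.

```python
import numpy as np

def slow_fast_sample(frames: np.ndarray, n_slow: int = 8, n_fast: int = 32):
    """Split a (T, H, W, C) clip into slow and fast frame sets (hypothetical helper)."""
    t = frames.shape[0]
    slow_idx = np.linspace(0, t - 1, num=min(n_slow, t), dtype=int)
    fast_idx = np.linspace(0, t - 1, num=min(n_fast, t), dtype=int)
    slow = frames[slow_idx]            # few frames, full resolution
    fast = frames[fast_idx, ::4, ::4]  # many frames, 4x spatially downsampled (assumed factor)
    return slow, fast

video = np.zeros((120, 224, 224, 3), dtype=np.uint8)  # dummy 120-frame clip
slow, fast = slow_fast_sample(video)
print(slow.shape, fast.shape)  # (8, 224, 224, 3) (32, 56, 56, 3)
```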

Other notable papers include Video Parallel Scaling, which introduces an inference-time method for expanding a model's perceptual bandwidth without increasing its context window, and GTA-Crime, which presents a synthetic dataset and generation framework for fatal violence detection with adversarial snippet-level domain adaptation. MESH introduces a benchmark for measuring hallucinations in Large Video Models, and AdsQA proposes a challenging advertisement-video QA benchmark that evaluates whether LLMs can reason beyond the objective physical content of common visual domains. GeneVA introduces a large-scale dataset with rich human annotations of spatio-temporal artifacts in videos generated from natural text prompts.
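As its subtitle suggests, Video Parallel Scaling aggregates outputs over diverse frame subsets: each parallel stream sees a different subset of frames within the same context window, and the streams' predictions are combined, so the model effectively attends to more of the video without a longer context. The following is a minimal sketch of that aggregation pattern, assuming a `model(frames) -> logits` interface; the function names and the simple logit averaging are illustrative, not the paper's API.

```python
import numpy as np

def parallel_scaled_logits(model, video, n_streams=4, frames_per_stream=16, seed=0):
    """Average output logits over several random frame subsets.
    A real VideoLLM decodes token by token, aggregating at each step."""
    rng = np.random.default_rng(seed)
    logits = []
    for _ in range(n_streams):
        idx = np.sort(rng.choice(len(video), size=frames_per_stream, replace=False))
        logits.append(model(video[idx]))  # each stream sees a different frame subset
    return np.mean(logits, axis=0)

# Toy stand-in for a VideoLLM head: maps any frame subset to a logit vector.
dummy_model = lambda frames: np.zeros(32000)
video = np.zeros((256, 224, 224, 3), dtype=np.uint8)
print(parallel_scaled_logits(dummy_model, video).shape)  # (32000,)
```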

Sources

VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding

Kwai Keye-VL 1.5 Technical Report

ThumbnailTruth: A Multi-Modal LLM Approach for Detecting Misleading YouTube Thumbnails Across Diverse Cultural Settings

A New Dataset and Benchmark for Grounding Multimodal Misinformation

Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation

MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models

AdsQA: Towards Advertisement Video Understanding

GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
