The field of multimodal video understanding is rapidly evolving, with a focus on developing more accurate and efficient models for video comprehension. Recent research has emphasized the importance of evaluating and improving multimodal reward models, which play a crucial role in the training and inference of Large Vision-Language Models. There is also growing interest in detecting misleading video thumbnails, grounding multimodal misinformation, and improving video anomaly detection.
Noteworthy papers in this area include VideoRewardBench, which introduces a comprehensive benchmark for evaluating multimodal reward models in the video domain, and the Kwai Keye-VL 1.5 Technical Report, which presents a novel Slow-Fast video encoding strategy and a progressive four-stage pre-training methodology for improving video understanding. ThumbnailTruth proposes a multi-modal pipeline for detecting misleading YouTube thumbnails, while A New Dataset and Benchmark for Grounding Multimodal Misinformation introduces the grounding task itself and presents a VLM-based baseline for effective detection and grounding.
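To make the Slow-Fast idea concrete, the sketch below illustrates the general pattern of pairing a sparse, high-resolution frame stream with a dense, low-resolution one. This is a minimal illustration of the generic slow-fast sampling concept; the strides, downscale factor, and function names are hypothetical and are not taken from the Keye-VL 1.5 implementation.

```python
# Illustrative slow-fast frame sampling (hypothetical parameters,
# not the Keye-VL 1.5 encoder).
import numpy as np

def slow_fast_sample(frames: np.ndarray,
                     slow_stride: int = 16,
                     fast_stride: int = 2,
                     fast_downscale: int = 4) -> tuple[np.ndarray, np.ndarray]:
    """Split a decoded video (T, H, W, C) into two streams:
    a sparse full-resolution 'slow' stream for semantic content and
    a dense spatially-downsampled 'fast' stream for motion cues."""
    slow = frames[::slow_stride]                                  # few frames, full resolution
    fast = frames[::fast_stride, ::fast_downscale, ::fast_downscale]  # many frames, low resolution
    return slow, fast

if __name__ == "__main__":
    video = np.random.randint(0, 256, size=(128, 224, 224, 3), dtype=np.uint8)
    slow, fast = slow_fast_sample(video)
    print(slow.shape, fast.shape)  # (8, 224, 224, 3) (64, 56, 56, 3)
```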
Other notable papers include Video Parallel Scaling, which introduces an inference-time method for expanding a model's perceptual bandwidth without increasing its context window (see the sketch below), and GTA-Crime, which presents a synthetic dataset and generation framework for fatal violence detection with adversarial snippet-level domain adaptation. MESH introduces a benchmark for evaluating hallucinations in Large Video Models, and AdsQA proposes a challenging ad-video QA benchmark that evaluates the ability of LLMs to perceive beyond the objective physical content of common visual domains. GeneVA introduces a large-scale dataset with rich human annotations of spatio-temporal artifacts in videos generated from natural text prompts.
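As a rough intuition for inference-time scaling of perceptual bandwidth, the sketch below runs a video model over several interleaved frame subsets, each small enough to fit the same context window, and aggregates their outputs. This is a generic illustration rather than the Video Parallel Scaling algorithm itself; the `model` callable, the interleaved split, and the logit-averaging aggregation are all assumptions.

```python
# Generic illustration of parallel inference over frame subsets
# (not the Video Parallel Scaling method; names and aggregation are assumed).
from typing import Callable, List, Sequence
import numpy as np

def parallel_scale_predict(frames: List[np.ndarray],
                           model: Callable[[Sequence[np.ndarray]], np.ndarray],
                           num_streams: int = 4) -> np.ndarray:
    """Run `model` over `num_streams` interleaved frame subsets, each within
    the original context budget, then average the per-stream logits."""
    streams = [frames[i::num_streams] for i in range(num_streams)]
    logits = [model(stream) for stream in streams]  # streams could run concurrently
    return np.mean(np.stack(logits, axis=0), axis=0)
```

The key point is that each stream sees only a fraction of the frames, so the per-call context length stays fixed while the total number of frames the system perceives grows with the number of parallel streams.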