Advancements in Video Understanding and Quality Assessment

The field of video understanding and quality assessment is moving toward more comprehensive and realistic evaluation, with a focus on multimodal large language models (MLLMs) and their application to tasks such as video question answering, temporal grounding, and anomaly detection. Researchers are developing new benchmarks and datasets to address the limitations of existing ones, notably the scarcity of real-world user-generated content and the need for more diverse and challenging scenarios. Notable papers include CHUG, which introduces a large-scale subjective study of user-generated HDR video quality, and OmniVideoBench, a benchmark designed to assess synergistic audio-visual understanding. Other work, such as SeqBench and CausalVerse, targets sequential narrative coherence in text-to-video generation and causal representation learning, respectively. Overall, the field is advancing toward more accurate and generalizable models that can handle complex, dynamic video content.
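As an illustration of how quality-assessment datasets such as CHUG are commonly used, the minimal sketch below scores a no-reference quality model against human mean opinion scores (MOS) using the standard correlation metrics (SROCC, PLCC) plus RMSE. It reflects general evaluation practice, not the protocol of any cited paper; the function name and the placeholder scores are hypothetical.

```python
# Illustrative sketch: comparing predicted video-quality scores against
# subjective MOS labels, as is standard in VQA benchmarking.
# Assumptions: score arrays are aligned per video; names are hypothetical.
import numpy as np
from scipy.stats import spearmanr, pearsonr

def evaluate_vqa_model(predicted_scores, mos_scores):
    """Return SROCC, PLCC, and RMSE between predictions and MOS labels."""
    predicted = np.asarray(predicted_scores, dtype=float)
    mos = np.asarray(mos_scores, dtype=float)

    srocc, _ = spearmanr(predicted, mos)   # monotonic (rank) agreement
    plcc, _ = pearsonr(predicted, mos)     # linear agreement
    rmse = float(np.sqrt(np.mean((predicted - mos) ** 2)))
    return {"SROCC": srocc, "PLCC": plcc, "RMSE": rmse}

if __name__ == "__main__":
    # Placeholder scores for five videos (hypothetical data).
    preds = [3.1, 4.2, 2.5, 4.8, 3.7]
    mos   = [3.0, 4.5, 2.2, 4.9, 3.5]
    print(evaluate_vqa_model(preds, mos))
```

Higher SROCC/PLCC values indicate that the model's predictions track human judgments more closely, which is the typical headline result reported on such datasets.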

Sources

CHUG: Crowdsourced User-Generated HDR Video Quality Dataset

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis

SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

No-Reference Rendered Video Quality Assessment: Dataset and Metrics

CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection
