Advances in Video Understanding and Anomaly Detection

The field of video understanding and anomaly detection is evolving rapidly, with a focus on more accurate and efficient models for real-world applications. Recent research explores multimodal large language models (MLLMs) and large language models (LLMs) for video anomaly detection, so that anomalous behaviors or events can be both identified and temporally grounded in videos.

Notable advances include dual-branch architectures such as the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which combines hierarchical feature learning with complementary spatiotemporal information to detect anomalies efficiently. In related work, test-time training has been used to keep fake news detectors on short-video platforms up to date during emergencies, and difficulty-aware group relative policy optimization (GRPO) has improved MLLM performance on industrial anomaly detection.
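
DAMS's exact architecture is not spelled out in this summary, so the sketch below is only a minimal illustration of the dual-branch idea it builds on: one branch pools per-frame features at several spatial scales while a second branch models temporal context, and the two are fused into a clip-level anomaly score. All module names, dimensions, and the fusion scheme are assumptions made for this example, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchAnomalyScorer(nn.Module):
    """Illustrative dual-branch scorer (not the DAMS implementation).

    Input: clip tensor of shape (B, T, C, H, W).
    Branch A pools per-frame 2D conv features at several spatial scales;
    branch B models temporal context with a 1D convolution over frames.
    The fused representation is mapped to a per-clip anomaly score in [0, 1].
    """

    def __init__(self, in_channels=3, feat_dim=64, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.spatial_head = nn.Linear(feat_dim * sum(s * s for s in scales), feat_dim)
        self.temporal_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(2 * feat_dim, 1)

    def forward(self, clip):
        b, t, c, h, w = clip.shape
        frames = clip.reshape(b * t, c, h, w)
        fmap = self.frame_encoder(frames)                       # (B*T, D, H, W)

        # Spatial branch: multiscale average pooling, then concatenate scales.
        pooled = [F.adaptive_avg_pool2d(fmap, s).flatten(1) for s in self.scales]
        spatial = self.spatial_head(torch.cat(pooled, dim=1))   # (B*T, D)
        spatial = spatial.reshape(b, t, -1).mean(dim=1)         # clip-level spatial code

        # Temporal branch: 1D convolution over the frame axis.
        frame_feats = fmap.mean(dim=(2, 3)).reshape(b, t, -1)   # (B, T, D)
        temporal = self.temporal_conv(frame_feats.transpose(1, 2)).mean(dim=2)

        fused = torch.cat([spatial, temporal], dim=1)
        return torch.sigmoid(self.classifier(fused)).squeeze(1)  # (B,) anomaly scores


if __name__ == "__main__":
    model = DualBranchAnomalyScorer()
    scores = model(torch.randn(2, 16, 3, 64, 64))  # two 16-frame clips
    print(scores.shape)  # torch.Size([2])
```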

Other approaches include visual question answering (VQA) models for classroom activity monitoring, and explainable deep anomaly detection paired with sequential hypothesis testing for robotic sewer inspection.
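
The sewer-inspection work pairs a learned anomaly detector with sequential hypothesis testing; the sketch below shows the standard sequential probability ratio test (SPRT) applied to a stream of per-frame anomaly scores. The Gaussian score model under the "normal" and "defect" hypotheses, and all parameter values, are illustrative assumptions rather than the paper's settings.

```python
import math

def sprt_decision(scores, mu_normal=0.1, mu_defect=0.6, sigma=0.2,
                  alpha=0.01, beta=0.01):
    """Sequential probability ratio test over a stream of per-frame anomaly scores.

    H0: frames are normal (scores ~ N(mu_normal, sigma^2))
    H1: frames show a defect (scores ~ N(mu_defect, sigma^2))
    Returns ("normal" | "defect" | "undecided", frames_consumed).
    The Gaussian score model and parameter values are illustrative only.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 once the evidence exceeds this
    lower = math.log(beta / (1 - alpha))   # accept H0 once the evidence drops below this
    llr = 0.0
    for i, s in enumerate(scores, start=1):
        # Log-likelihood ratio of one Gaussian observation under H1 vs. H0.
        llr += ((s - mu_normal) ** 2 - (s - mu_defect) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "defect", i
        if llr <= lower:
            return "normal", i
    return "undecided", len(scores)


if __name__ == "__main__":
    # Scores from a segment where the detector consistently flags damage.
    stream = [0.55, 0.62, 0.58, 0.70, 0.66]
    print(sprt_decision(stream))  # -> ('defect', 2): evidence crosses the upper bound early
```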

Particularly noteworthy papers include AF-CLIP, which adapts CLIP's visual representations to focus on local defects for zero-shot anomaly detection, and VAGU, the first benchmark to integrate both video anomaly grounding and understanding tasks. The EMIT framework, which enhances MLLMs for industrial anomaly detection via difficulty-aware GRPO, also reports significant performance gains.
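
AF-CLIP's actual adaptation is not reproduced here; as a rough baseline for the underlying idea of scoring local regions against text prompts, the sketch below compares CLIP patch embeddings with "normal" and "defect" prompts using the Hugging Face CLIP implementation. The checkpoint, prompt wording, and the direct reuse of CLIP's visual projection on raw patch tokens are assumptions for illustration, not the paper's method.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One "normal" and one "anomalous" prompt; real systems ensemble many templates.
prompts = ["a photo of a flawless object", "a photo of an object with a defect"]
image = Image.open("sample.png").convert("RGB")   # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patch_tokens = vision_out.last_hidden_state[:, 1:, :]      # drop the CLS token
    # Vanilla CLIP only aligns the pooled CLS embedding with text; projecting raw
    # patch tokens like this is a crude stand-in for properly adapted local features.
    patch_emb = model.visual_projection(
        model.vision_model.post_layernorm(patch_tokens))

text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
patch_emb = patch_emb / patch_emb.norm(dim=-1, keepdim=True)

logits = 100.0 * patch_emb @ text_emb.T          # (1, num_patches, 2)
anomaly_map = logits.softmax(dim=-1)[..., 1]     # per-patch P("defect" prompt)
print("image-level score:", anomaly_map.max().item())  # worst patch as image score
```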

Sources

Object-centric Video Question Answering with Visual Grounding and Referring

LAVA: Language Driven Scalable and Versatile Traffic Video Analytics

AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation

T³SVFND: Towards an Evolving Fake News Detector for Emergencies with Test-time Training on Short Video Platforms

DAMS: Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

Dual Guidance Semi-Supervised Action Detection

VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding

EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO

The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM

CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

Exploring the Application of Visual Question Answering (VQA) for Classroom Activity Monitoring

Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking

Explainable Deep Anomaly Detection with Sequential Hypothesis Testing for Robotic Sewer Inspection

Anomalous Samples for Few-Shot Anomaly Detection
