Video understanding and anomaly detection are evolving rapidly, driven by the need for models that are both accurate and efficient enough for real-world deployment. Recent research applies multimodal large language models (MLLMs) and large language models (LLMs) to video anomaly detection, enabling models not only to flag anomalous behaviors or events but also to ground them temporally in the video (a prompt-based sketch of this idea follows below).
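To make the grounding idea concrete, here is a minimal sketch of prompting an MLLM to both identify and temporally localize an anomaly. The `query_mllm` callable is hypothetical, standing in for whatever multimodal model interface is available; no specific API or paper implementation is implied.

```python
import json

def ground_anomaly(video_frames, query_mllm):
    """Hedged sketch of prompt-based video anomaly grounding with an MLLM.

    `query_mllm` is a hypothetical callable wrapping some multimodal model;
    it is assumed to accept frames plus a text prompt and return a string.
    """
    prompt = (
        "Watch these frames. If an anomalous event occurs, reply with JSON: "
        '{"anomaly": "<short description>", "start_s": <float>, "end_s": <float>}. '
        'Otherwise reply {"anomaly": null}.'
    )
    reply = query_mllm(frames=video_frames, prompt=prompt)
    # The parsed answer carries both the detection (what happened) and the
    # grounding (when it happened) in one structured response.
    return json.loads(reply)
```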
Notable advances include dual-branch architectures such as the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which combines hierarchical feature learning with complementary information streams to detect anomalies efficiently (a minimal sketch of the dual-branch idea appears below). In parallel, test-time training and difficulty-aware group relative policy optimization (GRPO) have improved MLLM performance on industrial anomaly detection.
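The following PyTorch sketch illustrates the general dual-branch multiscale pattern: one branch captures fine-grained spatial detail, the other coarse spatiotemporal context, and their features are fused for anomaly scoring. All names, kernel sizes, and layer widths are illustrative assumptions, not the DAMS architecture itself.

```python
import torch
import torch.nn as nn

class DualBranchAnomalyNet(nn.Module):
    """Minimal dual-branch sketch: fine per-frame detail plus coarse
    spatiotemporal context, fused into a per-clip anomaly score.
    Layer sizes are illustrative, not taken from the DAMS paper."""

    def __init__(self, in_channels: int = 3, feat_dim: int = 64):
        super().__init__()
        # Fine branch: small receptive field, operates frame by frame.
        self.fine = nn.Sequential(
            nn.Conv3d(in_channels, feat_dim, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Coarse branch: larger spatiotemporal kernel for multiscale context.
        self.coarse = nn.Sequential(
            nn.Conv3d(in_channels, feat_dim, kernel_size=(3, 7, 7), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Fuse the complementary features into a single anomaly score.
        self.head = nn.Linear(2 * feat_dim, 1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width)
        f = self.fine(clip).flatten(1)
        c = self.coarse(clip).flatten(1)
        return torch.sigmoid(self.head(torch.cat([f, c], dim=1)))

# Usage: two 16-frame 64x64 RGB clips -> (2, 1) anomaly scores in [0, 1].
scores = DualBranchAnomalyNet()(torch.randn(2, 3, 16, 64, 64))
```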
Other approaches apply visual question answering (VQA) models to classroom activity monitoring, and pair explainable deep-learning anomaly detection with sequential hypothesis testing for robotic sewer inspection (see the sketch after this paragraph).
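Sequential hypothesis testing here most plausibly means a Wald-style sequential probability ratio test (SPRT) over per-frame detector outputs, which accumulates evidence until a confidence threshold is crossed. The sketch below shows this standard construction; the parameter values and the Bernoulli treatment of detector outputs are assumptions, not details from the sewer-inspection paper.

```python
import math

def sprt_decision(frame_probs, p0=0.1, p1=0.6, alpha=0.01, beta=0.01):
    """Wald's SPRT over a stream of per-frame anomaly probabilities.

    H0: frames come from normal operation (anomaly rate p0).
    H1: an anomaly is present (anomaly rate p1).
    alpha/beta bound the false-accept rates; all values are illustrative.
    Returns a decision and the index of the frame where it was reached.
    """
    upper = math.log((1 - beta) / alpha)   # cross upward -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross downward -> accept H0
    llr = 0.0
    for t, p in enumerate(frame_probs):
        # Threshold each detector output into a Bernoulli observation.
        x = 1 if p >= 0.5 else 0
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr >= upper:
            return "anomaly", t
        if llr <= lower:
            return "normal", t
    return "undecided", len(frame_probs) - 1
```

Accumulating evidence across frames this way trades a few frames of latency for far fewer spurious alerts than thresholding each frame independently.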
Particularly noteworthy papers include AF-CLIP, which enhances visual representations to focus on local defects, and VAGU, the first benchmark to integrate both video anomaly grounding and understanding tasks. The EMIT framework, which strengthens MLLMs for industrial anomaly detection via difficulty-aware GRPO, likewise reports substantial performance gains (a sketch of the group-relative advantage computation follows).
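For readers unfamiliar with GRPO, the core step is scoring each sampled response against the mean reward of its group rather than against a learned value function. The sketch below shows that normalization; the difficulty re-weighting is a hedged guess at what "difficulty-aware" could look like, and EMIT's actual scheme may differ.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, difficulty: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages in the style of GRPO.

    rewards:    (groups, samples_per_group) scalar reward per sampled response
    difficulty: (groups,) values in [0, 1], e.g. 1 - current group accuracy
                (the re-weighting is an assumption, not EMIT's method)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    adv = (rewards - mean) / std                  # standard GRPO normalization
    return adv * (1.0 + difficulty.unsqueeze(1))  # emphasize harder groups
```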