Efficient Video Understanding and Processing

The field of video understanding is advancing rapidly, with a focus on methods that analyze and interpret video data efficiently. Recent research centers on improving the performance of video large language models (VideoLLMs) and vision transformers (ViTs) while reducing computational cost and memory usage. Notable advances include token compression techniques, such as adaptive token merging and temporally causal video tokenization, which cut the number of visual tokens without compromising performance. Researchers have also proposed new architectures, including dual cross-attention mechanisms and context-aware large language models, to strengthen video understanding. Noteworthy papers in this area include StPR, which introduces a unified, exemplar-free framework for video class-incremental learning, and VidCom2, which proposes a plug-and-play inference acceleration framework for VideoLLMs. Other notable works include Flashback, which presents a zero-shot, real-time video anomaly detection paradigm, and LiveVLM, which enables efficient online video understanding via a streaming-oriented KV cache and retrieval.
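
Token merging, one of the compression techniques mentioned above, is easy to illustrate in code. The sketch below is a minimal, ToMe-style bipartite merging step in PyTorch; the function name `merge_tokens`, the alternating A/B token split, and the plain-mean merge rule are illustrative assumptions for exposition, not the exact algorithm of any paper cited here.

```python
import torch

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Shorten a visual token sequence by merging the r most similar
    cross-set pairs (a simplified, ToMe-style bipartite matching sketch).

    x: (batch, n_tokens, dim) token embeddings
    r: number of tokens to remove in this step (r <= n_tokens // 2)
    """
    b, n, d = x.shape
    # Split tokens into two alternating sets A and B.
    a, bt = x[:, ::2, :], x[:, 1::2, :]
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = bt / bt.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.transpose(-1, -2)          # cosine similarity, (b, |A|, |B|)

    # For each A-token, find its best match in B; keep the r strongest edges.
    best_val, best_idx = scores.max(dim=-1)       # (b, |A|)
    merge_order = best_val.argsort(dim=-1, descending=True)
    src_idx = merge_order[:, :r]                  # A-tokens merged away
    keep_idx = merge_order[:, r:]                 # A-tokens kept as-is
    dst_idx = best_idx.gather(-1, src_idx)        # their partners in B

    # Average each merged A-token into its partner B-token.
    merged_b = bt.clone()
    src = a.gather(1, src_idx.unsqueeze(-1).expand(-1, -1, d))
    merged_b.scatter_reduce_(1, dst_idx.unsqueeze(-1).expand(-1, -1, d),
                             src, reduce="mean", include_self=True)

    kept_a = a.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    # Note: token order is not preserved in this simplified version.
    return torch.cat([kept_a, merged_b], dim=1)   # (b, n - r, dim)
```

Applied once per transformer layer with a small `r`, this kind of step compounds into a large reduction in sequence length, which is the basic lever behind the token-compression methods surveyed above.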

Sources

StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning

Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Clapper: Compact Learning and Video Representation in VLMs

Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM

CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment

QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
