Advances in Video Understanding and Retrieval

Research in video understanding and retrieval is advancing rapidly, with an emphasis on more accurate and efficient methods for analyzing and retrieving video content. Recent work spans video denoising, sports video analysis, and referring video object segmentation (RVOS), among other areas. A key trend is the use of diffusion-based models and transformer-based architectures to improve the accuracy and robustness of video analysis and retrieval systems.

Notable papers in this area include:

Denoise-then-Retrieve Network, which introduces a denoise-then-retrieve paradigm for video moment retrieval.

TrajSV, a trajectory-based framework for sports video representations and applications, which achieves state-of-the-art performance in sports video retrieval.

SAMDWICH, a moment-aware RVOS framework that uses aligned text-to-clip pairs to guide training and improve referential understanding.

Generic Event Boundary Detection via Denoising Diffusion, which introduces a diffusion-based boundary detection model that approaches generic event boundary detection (GEBD) from a generative perspective.

Bridging the Gap, which transfers singles-trained models to doubles analysis in badminton.

Temporal-Conditional Referring Video Object Segmentation, which integrates existing segmentation methods to enhance boundary segmentation.

Beyond Simple Edits, which introduces a dataset and model for composed video retrieval with dense modifications.

Repeating Words for Video-Language Retrieval, which proposes a framework that learns fine-grained features for better alignment, along with an inference pipeline that improves performance without additional training.

Aligning Moments in Time using Video Queries, which introduces a transformer-based model designed to capture the semantic context and temporal details needed for precise moment localization.

Sources

Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval

TrajSV: A Trajectory-based Model for Sports Video Representations and Applications

SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation

Generic Event Boundary Detection via Denoising Diffusion

Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models

Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

Aligning Moments in Time using Video Queries
