Advances in Video Understanding and Object Detection

The field of video understanding and object detection is moving toward more realistic and challenging scenarios. Researchers are focusing on methods that can handle complex interactions, multiple moments, and fine-grained semantics. One key direction is multi-moment retrieval, which retrieves all of the moments in a video that are relevant to a single query, rather than assuming exactly one match. Another important area is open-vocabulary object detection, in which models localize objects described by arbitrary text queries; current work extends these models to specialized domains and fine-grained classes (the zero-shot scoring step at the core of this setting is sketched below). There is also growing interest in training-free methods, which adapt to new scenarios without requiring large amounts of annotated data.

Noteworthy papers in this area include:

When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions introduces a new dataset and evaluation metrics for multi-moment retrieval.

Closed-Loop Transfer for Weakly-supervised Affordance Grounding proposes a closed-loop framework for transferring affordance knowledge between exocentric and egocentric images.

On-the-Fly OVD Adaptation with FLAME presents a cascaded approach for adapting open-vocabulary object detection models to specialized domains, using few-shot localization via active exploration of marginal samples.

A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition leverages EfficientNet and CLIP for unsupervised segmentation and open-vocabulary object recognition.

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation refines MLLM attention maps for video reasoning segmentation via contrastive object-background fusion and complementary video-frame fusion.

Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning proposes a moment retrieval framework with no external dependencies that resolves ambiguous boundary information and semantic confusion.

DMC^3: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering builds an egocentric video QA baseline around a counterfactual sample construction module and counterfactual-sample-involved contrastive optimization.
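The open-vocabulary recognition step shared by several of these papers reduces to scoring an image, or a region crop produced by a class-agnostic detector or segmenter, against the text embeddings of a free-form label list. Below is a minimal, training-free sketch using the public CLIP checkpoint via Hugging Face transformers; the region crop and label set are hypothetical placeholders, and the papers above build more elaborate pipelines on top of this basic idea.

```python
# Minimal training-free open-vocabulary recognition sketch with CLIP.
# The crop file and label list are illustrative assumptions, not taken
# from any of the papers summarized above.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# The "open vocabulary" is just a list of free-form text prompts.
labels = ["a photo of a forklift", "a photo of a pallet", "a photo of a person"]
region = Image.open("region_crop.jpg")  # e.g. a crop from a class-agnostic proposal

inputs = processor(text=labels, images=region, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_labels)
probs = logits.softmax(dim=-1)

best = probs.argmax(dim=-1).item()
print(f"predicted: {labels[best]} (p={probs[0, best]:.2f})")
```

Because the label list is supplied at inference time, swapping in a specialized domain's vocabulary requires no retraining, which is what makes the training-free and on-the-fly adaptation directions above practical.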

Sources

When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

Closed-Loop Transfer for Weakly-supervised Affordance Grounding

On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration

A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning

DMC^3: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
