Open-World Visual Understanding

The field of visual understanding is moving toward open-world scenarios, where models must generalize to novel objects, relationships, and interactions without large amounts of annotated data. Recent work leverages large language models (LLMs) and vision-language models (VLMs) to enable zero-shot and few-shot capabilities, making visual understanding more efficient and scalable. Noteworthy papers include:

  • Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection, which introduces an iterative visual grounding framework that leverages LLMs as structured relational priors.
  • Open World Scene Graph Generation using Vision Language Models, which proposes a training-free framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning (see the sketch after this list).
  • ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations, which presents a self-refining framework for open-world object labeling using LLMs and visual embeddings.
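
To make the training-free idea concrete, the sketch below shows how a pretrained CLIP-style VLM can score candidate relationship predicates for a detected subject-object pair purely from its image-text alignment, with no additional training. This is an illustrative assumption, not the pipeline of any of the cited papers: the model checkpoint, prompt template, predicate list, and image path are all placeholders.

```python
# Minimal zero-shot relationship scoring sketch with a CLIP-style VLM
# (illustrative only; not the cited papers' actual methods).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Image crop assumed to contain one candidate (subject, object) pair,
# e.g. produced by an off-the-shelf open-vocabulary detector.
image = Image.open("pair_crop.jpg")  # placeholder path

subject, obj = "person", "horse"
predicates = ["riding", "feeding", "standing next to", "holding"]
prompts = [f"a photo of a {subject} {p} a {obj}" for p in predicates]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
probs = logits.softmax(dim=-1).squeeze(0)

for p, score in zip(predicates, probs.tolist()):
    print(f"({subject}, {p}, {obj}): {score:.3f}")
```

Running this over all detected object pairs and keeping the highest-scoring predicate per pair yields a rough scene graph with zero task-specific learning, which is the general flavor of the training-free approaches summarized above.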

Sources

Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection

Open World Scene Graph Generation using Vision Language Models

ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations

BakuFlow: A Streamlining Semi-Automatic Label Generation Tool
