Open-World Visual Understanding

The field of visual understanding is moving toward open-world scenarios, where models must generalize to novel objects, relationships, and interactions without large amounts of annotated data. Recent work leverages large language models (LLMs) and vision-language models (VLMs) to enable zero-shot and few-shot capabilities, making visual understanding more efficient and scalable. Noteworthy papers include:

  • Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection, which introduces an iterative visual grounding framework that leverages LLMs as structured relational priors.
  • Open World Scene Graph Generation using Vision Language Models, which proposes a training-free framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning (see the sketch after this list).
  • ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations, which presents a self-refining framework for open-world object labeling using LLMs and visual embeddings.
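
To make the training-free idea concrete, the sketch below shows how a pretrained CLIP-style VLM can score candidate relationship predicates for a detected subject-object pair purely from its image-text alignment, with no additional training. This is an illustrative assumption, not the pipeline of any of the cited papers: the model checkpoint, prompt template, predicate list, and image path are all placeholders.

```python
# Minimal zero-shot relationship scoring sketch with a CLIP-style VLM
# (illustrative only; not the cited papers' actual methods).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Image crop assumed to contain one candidate (subject, object) pair,
# e.g. produced by an off-the-shelf open-vocabulary detector.
image = Image.open("pair_crop.jpg")  # placeholder path

subject, obj = "person", "horse"
predicates = ["riding", "feeding", "standing next to", "holding"]
prompts = [f"a photo of a {subject} {p} a {obj}" for p in predicates]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
probs = logits.softmax(dim=-1).squeeze(0)

for p, score in zip(predicates, probs.tolist()):
    print(f"({subject}, {p}, {obj}): {score:.3f}")
```

Running this over all detected object pairs and keeping the highest-scoring predicate per pair yields a rough scene graph with zero task-specific learning, which is the general flavor of the training-free approaches summarized above.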

Sources

Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection

Open World Scene Graph Generation using Vision Language Models

ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations

BakuFlow: A Streamlining Semi-Automatic Label Generation Tool
