Human-Object Interaction Detection

The field of human-object interaction (HOI) detection is moving towards a more fine-grained understanding of interactions between humans and objects in videos. This includes developing new methods for detecting spatial-temporal human-object interactions and improving evaluation protocols for HOI detection. Researchers are also exploring the use of vision-language models (VLMs) for HOI detection and building benchmarks that can accommodate both VLMs and specialized HOI methods. In addition, there is growing interest in interactive systems that request human assistance when needed, a capability with safety implications for mobile agents. Noteworthy papers include:

  • Spatial-Temporal Human-Object Interaction Detection proposes a new instance-level HOI detection task on videos and constructs a dataset for evaluating it.
  • Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods introduces a benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, enabling direct comparison between VLMs and HOI-specific methods (see the sketch after this list).
  • DQEN: Dual Query Enhancement Network for DETR-based HOI Detection proposes enhancing both object and interaction queries in DETR-based HOI detectors.
  • InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning develops an interactive mobile agent that actively seeks human confirmation at critical decision points, trained with reinforcement fine-tuning.
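To make the multiple-answer multiple-choice reformulation concrete, the following is a minimal scoring sketch, not the benchmark's published protocol: the HOIQuestion structure, the candidate verb lists, and the choice of per-question F1 plus exact-match accuracy are illustrative assumptions. Each human-object pair becomes a question with candidate interaction labels, the model (a VLM or an HOI-specific detector) selects any subset, and the selection is scored against the ground-truth set.

```python
# Illustrative sketch only: scoring HOI predictions as a
# multiple-answer multiple-choice task (assumed conventions).
from dataclasses import dataclass


@dataclass
class HOIQuestion:
    # One "question": a human-object pair with candidate interaction choices.
    image_id: str
    object_label: str
    choices: list[str]   # candidate interaction verbs offered to the model
    answers: set[str]    # ground-truth subset of `choices`


def score_question(question: HOIQuestion, predicted: set[str]) -> tuple[float, bool]:
    """Return (F1, exact_match) for one multiple-answer question."""
    predicted = predicted & set(question.choices)  # ignore out-of-choice picks
    if not predicted and not question.answers:
        return 1.0, True
    tp = len(predicted & question.answers)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(question.answers) if question.answers else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, predicted == question.answers


def evaluate(questions, predictions):
    """Aggregate mean F1 and exact-match accuracy over all questions."""
    f1s, exacts = [], []
    for question, predicted in zip(questions, predictions):
        f1, exact = score_question(question, predicted)
        f1s.append(f1)
        exacts.append(exact)
    return sum(f1s) / len(f1s), sum(exacts) / len(exacts)


if __name__ == "__main__":
    q = HOIQuestion(
        image_id="img_001",
        object_label="bicycle",
        choices=["ride", "hold", "repair", "no_interaction"],
        answers={"ride", "hold"},
    )
    # Partial credit on F1, but not an exact match: (~0.67, 0.0)
    print(evaluate([q], [{"ride"}]))
```

Because both VLMs and HOI-specific detectors can be reduced to choosing a subset of candidate labels per human-object pair, this style of protocol lets the two families be compared under identical conditions.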

Sources

Spatial-Temporal Human-Object Interaction Detection

Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
