Human-Object Interaction Detection

The field of human-object interaction (HOI) detection is moving towards a more fine-grained understanding of interactions between humans and objects in videos. This includes developing new methods for detecting spatial-temporal human-object interactions and improving evaluation protocols for HOI detection. Researchers are also exploring the use of vision-language models (VLMs) for HOI detection and building benchmarks that can accommodate both VLMs and specialized HOI methods. In addition, there is growing interest in interactive systems that request human assistance when needed, a capability with safety implications for mobile agents. Noteworthy papers include:

  • Spatial-Temporal Human-Object Interaction Detection proposes a new instance-level HOI detection task on videos and constructs a dataset for evaluating it.
  • Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods introduces a benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, enabling direct comparison between VLMs and HOI-specific methods (see the sketch after this list).
  • DQEN: Dual Query Enhancement Network for DETR-based HOI Detection proposes enhancing both object and interaction queries in DETR-based HOI detectors.
  • InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning develops an interactive mobile agent that actively seeks human confirmation at critical decision points, trained with reinforcement fine-tuning.
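To make the multiple-answer multiple-choice reformulation concrete, the following is a minimal scoring sketch, not the benchmark's published protocol: the HOIQuestion structure, the candidate verb lists, and the choice of per-question F1 plus exact-match accuracy are illustrative assumptions. Each human-object pair becomes a question with candidate interaction labels, the model (a VLM or an HOI-specific detector) selects any subset, and the selection is scored against the ground-truth set.

```python
# Illustrative sketch only: scoring HOI predictions as a
# multiple-answer multiple-choice task (assumed conventions).
from dataclasses import dataclass


@dataclass
class HOIQuestion:
    # One "question": a human-object pair with candidate interaction choices.
    image_id: str
    object_label: str
    choices: list[str]   # candidate interaction verbs offered to the model
    answers: set[str]    # ground-truth subset of `choices`


def score_question(question: HOIQuestion, predicted: set[str]) -> tuple[float, bool]:
    """Return (F1, exact_match) for one multiple-answer question."""
    predicted = predicted & set(question.choices)  # ignore out-of-choice picks
    if not predicted and not question.answers:
        return 1.0, True
    tp = len(predicted & question.answers)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(question.answers) if question.answers else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, predicted == question.answers


def evaluate(questions, predictions):
    """Aggregate mean F1 and exact-match accuracy over all questions."""
    f1s, exacts = [], []
    for question, predicted in zip(questions, predictions):
        f1, exact = score_question(question, predicted)
        f1s.append(f1)
        exacts.append(exact)
    return sum(f1s) / len(f1s), sum(exacts) / len(exacts)


if __name__ == "__main__":
    q = HOIQuestion(
        image_id="img_001",
        object_label="bicycle",
        choices=["ride", "hold", "repair", "no_interaction"],
        answers={"ride", "hold"},
    )
    # Partial credit on F1, but not an exact match: (~0.67, 0.0)
    print(evaluate([q], [{"ride"}]))
```

Because both VLMs and HOI-specific detectors can be reduced to choosing a subset of candidate labels per human-object pair, this style of protocol lets the two families be compared under identical conditions.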

Sources

Spatial-Temporal Human-Object Interaction Detection

Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
