Advances in Human-Object Interaction Understanding and Robotics

The field of human-object interaction understanding and robotics is rapidly advancing, with a focus on developing more scalable and data-efficient methods for learning from human demonstrations. Recent work has highlighted the importance of leveraging large-scale human manipulation videos and monocular internet videos to improve robot learning and object pose estimation.

Notable papers in this area have introduced novel approaches for extracting manipulation trajectories, reconstructing 4D human-object interaction (HOI) data, and generating realistic hand-object interaction videos. These advances have the potential to enable more robust and generalizable robot learning, as well as a better understanding of human-object interactions.

Several papers are particularly noteworthy:

Learning from Watching proposes extracting dense trajectories of task-relevant keypoints during manipulation, enabling more comprehensive use of internet-scale human demonstration videos (a tracking sketch follows this list).

Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction introduces an optimization framework that constrains the ill-posed 4D HOI reconstruction problem, along with a new large-scale 4D HOI dataset covering a diverse catalog of object types and actions (a toy contact-constrained optimization appears after the list).

Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation proposes a structure- and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structural context without 3D annotations, letting the model learn fine-grained interaction physics and generalize to open-world scenarios.

SpriteHand presents an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns.

DF-Mamba proposes Deformable Mamba, an effective and efficient visual feature extractor for 3D hand pose estimation that captures global context cues beyond standard convolution through Mamba's selective state modeling and a proposed deformable state scanning (a minimal selective-scan recurrence is sketched below).

RoboWheel introduces a data engine that converts human hand-object interaction videos into training-ready supervision for cross-embodiment robotic learning, demonstrating that trajectories produced by the pipeline are as stable as those from teleoperation and yield comparable performance gains (a hypothetical retargeting step is sketched below).

Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people while taking self-contact into account, reporting an average 11.7% improvement in reconstruction accuracy.

Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation presents a dataset that couples what is seen with what is done and what is felt during real human interaction, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints.

Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints leverages two extra sources of information, generative models and contact information, to reduce the ambiguity of vision signals and improve object reconstruction under occlusion.
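To make the keypoint-trajectory idea from Learning from Watching concrete, here is a minimal sketch of tracking points through a manipulation clip. It uses classical Shi-Tomasi corners and Lucas-Kanade optical flow from OpenCV as a stand-in for the paper's learned, task-relevant keypoint extraction; the input file name is hypothetical.

```python
# Minimal sketch: tracking keypoints across a human manipulation video.
# Shi-Tomasi corners + LK optical flow stand in for the paper's learned,
# task-relevant keypoint extractor; "demo.mp4" is a hypothetical clip.
import cv2

cap = cv2.VideoCapture("demo.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Seed keypoints on the first frame (the paper would pick task-relevant ones).
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                              qualityLevel=0.01, minDistance=8)
trajectories = [[p] for p in pts.reshape(-1, 2)]

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    for traj, p, s in zip(trajectories, nxt.reshape(-1, 2), status.ravel()):
        if s:  # keep only points that were tracked successfully
            traj.append(p)
    pts, prev_gray = nxt, gray

cap.release()
print(f"extracted {len(trajectories)} keypoint trajectories")
```

Each trajectory is then a per-frame 2D path of one keypoint, the kind of dense supervision signal such pipelines distill from internet video.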
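The 4D HOI reconstruction and occlusion-robust reconstruction papers both regularize an ill-posed visual problem with contact constraints. The toy sketch below illustrates the general recipe under strong simplifications: a single object translation is optimized against a noisy visual estimate while a penalty keeps a known hand contact point on the object's (spherical) surface. The energy terms and weights are illustrative assumptions, not the papers' actual formulations.

```python
# Toy sketch of contact-constrained reconstruction: recover an object's 3D
# position from a noisy visual estimate while forcing a known hand contact
# point to lie on the object's surface (a sphere of known radius). The
# terms and weights are illustrative; the papers optimize full meshes and
# poses with learned priors, not a single translation.
import torch

radius = 0.05                                  # known object radius (m)
contact = torch.tensor([0.10, 0.00, 0.40])     # observed hand contact point
visual = torch.tensor([0.18, 0.01, 0.41])      # noisy visual center estimate

center = visual.clone().requires_grad_(True)   # variable being optimized
opt = torch.optim.Adam([center], lr=1e-2)

for step in range(300):
    opt.zero_grad()
    data_term = ((center - visual) ** 2).sum()                    # image evidence
    contact_term = (torch.norm(center - contact) - radius) ** 2   # contact on surface
    loss = data_term + 10.0 * contact_term                        # weighted energy
    loss.backward()
    opt.step()

print(center.detach(), torch.norm(center - contact).item())  # distance ≈ radius
```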
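DF-Mamba builds on selective state space modeling. As background, the sketch below implements the basic selective-scan recurrence in NumPy: the transition, input, and output parameters depend on the current input, so the state can gate what it remembers. The shapes and random projections are illustrative assumptions and do not reflect DF-Mamba's deformable scanning.

```python
# Minimal sketch of a selective state-space scan (Mamba-style): the step
# size, B_t, and C_t are functions of the input x_t. Random projections
# stand in for learned weights; this is background, not DF-Mamba itself.
import numpy as np

rng = np.random.default_rng(0)
T, D, N = 16, 8, 4          # sequence length, feature dim, state dim
x = rng.normal(size=(T, D))

W_dt, W_B, W_C = (rng.normal(size=(D, k)) * 0.1 for k in (1, N, N))
A = -np.exp(rng.normal(size=(D, N)))      # stable (negative) transition

h = np.zeros((D, N))
ys = []
for t in range(T):
    dt = np.log1p(np.exp(x[t] @ W_dt))    # softplus step size
    B, C = x[t] @ W_B, x[t] @ W_C         # input-dependent B_t, C_t
    A_bar = np.exp(dt[:, None] * A)       # discretized transition, (D, N)
    h = A_bar * h + dt[:, None] * np.outer(x[t], B)   # selective state update
    ys.append(h @ C)                      # readout y_t, shape (D,)

y = np.stack(ys)                          # (T, D) output sequence
print(y.shape)
```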
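A data engine like RoboWheel's must at some point retarget tracked hand motion into robot supervision. The sketch below shows one plausible such step, mapping per-frame wrist poses and thumb-index distances to end-effector waypoints and normalized gripper commands; the tool offset, aperture normalization, and function signature are assumptions, not RoboWheel's published pipeline.

```python
# Hypothetical sketch of the retargeting step in a video-to-robot data
# engine: per-frame wrist poses and thumb-index distances become
# end-effector waypoints and gripper commands. All constants are assumed.
import numpy as np

def retarget(wrist_poses, finger_gaps, max_gap=0.09):
    """wrist_poses: (T, 4, 4) wrist-in-world transforms from a hand tracker.
    finger_gaps: (T,) thumb-to-index distances in meters."""
    hand_to_ee = np.eye(4)
    hand_to_ee[:3, 3] = [0.0, 0.0, 0.10]      # assumed tool offset (m)
    waypoints, grips = [], []
    for T_w, gap in zip(wrist_poses, finger_gaps):
        waypoints.append(T_w @ hand_to_ee)    # end-effector target in world
        grips.append(np.clip(gap / max_gap, 0.0, 1.0))  # normalized aperture
    return np.stack(waypoints), np.asarray(grips)

# Dummy demo: a static wrist pose with a closing grasp.
poses = np.tile(np.eye(4), (5, 1, 1))
gaps = np.linspace(0.08, 0.01, 5)
wps, g = retarget(poses, gaps)
print(wps.shape, g.round(2))
```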

Sources

Learning from Watching: Scalable Extraction of Manipulation Trajectories from Human Videos

Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction

RoleMotion: A Large-Scale Dataset towards Robust Scene-Specific Role-Playing Motion Synthesis with Fine-grained Descriptions

Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

Is Image-based Object Pose Estimation Ready to Support Grasping?

SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation

DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions

RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning

Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints
