Egocentric Vision and Interaction

The field of egocentric vision and interaction is advancing rapidly, with a focus on more accurate and robust models for tracking, recognition, and understanding of human behavior and interactions. Recent work emphasizes the distinctive challenges of egocentric settings, such as high-intensity motion, dynamic occlusions, and low-textured areas, and proposes new datasets and benchmarks to address them. Notable contributions include the Monado SLAM dataset, a collection of real sequences captured on multiple virtual reality headsets for visual-inertial tracking, and EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. Other notable work includes EgoPrompt, a prompt-learning framework for egocentric action recognition, and a hierarchical event memory for accurate, low-latency online video temporal grounding (a hedged sketch of that memory idea follows below). Together, these papers mark significant progress in the field and provide valuable resources and insights for future research.
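
To make the "hierarchical event memory" idea for online temporal grounding concrete, the following is a minimal, hypothetical sketch (not the method from the cited paper): recent frame features are kept at full temporal resolution, older frames are pooled into coarse event summaries, and a text query retrieves the most similar entries. All class and function names here (HierarchicalEventMemory, add_frame, retrieve) are illustrative assumptions.

```python
import numpy as np
from collections import deque


class HierarchicalEventMemory:
    """Illustrative two-level memory for streaming video features.

    Recent frames are stored per-frame; once the fine buffer overflows,
    the oldest chunk is mean-pooled into a single coarse "event" entry.
    This is a hypothetical sketch, not the mechanism of the cited paper.
    """

    def __init__(self, fine_capacity=64, chunk_size=16):
        self.fine = deque()        # per-frame feature vectors (recent)
        self.coarse = []           # pooled summaries of older chunks
        self.fine_capacity = fine_capacity
        self.chunk_size = chunk_size

    def add_frame(self, feat):
        self.fine.append(np.asarray(feat, dtype=np.float32))
        if len(self.fine) > self.fine_capacity:
            # Compress the oldest chunk_size frames into one coarse entry.
            chunk = [self.fine.popleft() for _ in range(self.chunk_size)]
            self.coarse.append(np.mean(chunk, axis=0))

    def retrieve(self, query, top_k=5):
        # Rank all memory entries (coarse + fine) by cosine similarity to a
        # query embedding; a grounding head would then refine the temporal
        # boundaries around the top-scoring entries.
        entries = self.coarse + list(self.fine)
        q = np.asarray(query, dtype=np.float32)
        q = q / (np.linalg.norm(q) + 1e-8)
        scores = [float(e @ q / (np.linalg.norm(e) + 1e-8)) for e in entries]
        order = np.argsort(scores)[::-1][:top_k]
        return [(int(i), scores[int(i)]) for i in order]
```

In this sketch, the two granularities keep memory size roughly constant as the stream grows, which is what enables low-latency retrieval in an online setting; the actual papers' designs may differ substantially.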

Sources

The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking

Fine-grained Spatiotemporal Grounding on Egocentric Videos

Multi-human Interactive Talking Dataset

EgoPrompt: Prompt Pool Learning for Egocentric Action Recognition

Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired

Length Matters: Length-Aware Transformer for Temporal Sentence Grounding

Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
