Visual Social Inference and Scene Understanding

Research on visual social inference and scene understanding is moving toward a deeper account of how humans interpret social cues from visual information. Recent work highlights the importance of explicit 3D pose representations and structured visuospatial primitives in supporting human-like social scene understanding.

Noteworthy papers include Spot The Ball, which introduces a benchmark for evaluating visual social inference in vision-language models and reveals a persistent human-model gap in visual social reasoning, and Simple 3D Pose Features Support Human and Machine Social Scene Understanding, which provides strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.
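
To make "simple, structured visuospatial primitives" concrete, here is a minimal sketch of what such features could look like for two people in a scene. It assumes each person is reduced to a 3D position and a 3D facing direction. The function names, the feature names, and the choice of features are hypothetical illustrations and are not taken from either paper.

```python
import math

def facing_angle(pos_a, dir_a, pos_b):
    """Angle in degrees between A's facing direction and the vector from A to B.

    0 means A faces B directly; 180 means A faces directly away.
    Assumes dir_a is non-zero and the two positions differ.
    """
    to_b = [b - a for a, b in zip(pos_a, pos_b)]
    dot = sum(d * t for d, t in zip(dir_a, to_b))
    norm = math.hypot(*dir_a) * math.hypot(*to_b)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def pose_primitives(pos_a, dir_a, pos_b, dir_b):
    """Hypothetical feature set: inter-person distance plus mutual facing angles."""
    return {
        "distance": math.dist(pos_a, pos_b),
        "a_faces_b_deg": facing_angle(pos_a, dir_a, pos_b),
        "b_faces_a_deg": facing_angle(pos_b, dir_b, pos_a),
    }

# Two people two units apart, looking straight at each other.
features = pose_primitives((0, 0, 0), (1, 0, 0), (2, 0, 0), (-1, 0, 0))
```

A small distance combined with small facing angles in both directions is the kind of compact signal that might separate an interacting pair from two people who merely share a frame.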

Sources

Spot The Ball: A Benchmark for Visual Social Inference

Context informs pragmatic interpretation in vision-language models

Simple 3D Pose Features Support Human and Machine Social Scene Understanding

Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)
