Research on visual social inference and scene understanding is moving toward a deeper account of how humans interpret social cues from visual information. Recent work highlights the importance of explicit 3D pose representations and structured visuospatial primitives in supporting human-like social scene understanding.
Two papers are noteworthy. Spot The Ball introduces a benchmark for evaluating visual social inference in vision-language models and reveals a persistent human-model gap in visual social reasoning. Simple 3D Pose Features Support Human and Machine Social Scene Understanding provides strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.