The field of embodied AI is increasingly focused on spatial reasoning, in particular on building robust spatial representations from sequential visual input. Recent research has highlighted the challenges that multimodal large language models (MLLMs) face in this regard, including limitations in object permanence, spatial relationships, and numerical tracking. To address these challenges, new benchmarks and evaluation metrics have been proposed, such as REM and ReMindView-Bench, which provide a systematic way to assess the spatial reasoning capabilities of MLLMs. In parallel, new approaches have been introduced, including the use of 3D spatial memory and geometric information to augment MLLM-based spatial understanding (a minimal sketch of this idea follows the paper list below). Noteworthy papers in this area include:

- REM, which introduces a benchmark for evaluating the embodied spatial reasoning capabilities of MLLMs.
- ReMindView-Bench, which presents a cognitively grounded benchmark for evaluating multi-view visual spatial reasoning in VLMs.
- 3DSPMR, which proposes a 3D spatial memory reasoning approach for sequential embodied tasks.
- ReasonX, which uses a multimodal large language model as a perceptual judge for intrinsic image decomposition.
- CroPond, which achieves state-of-the-art performance on the CrossPoint-Bench dataset for cross-view point correspondence.
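
To make the 3D spatial memory idea concrete, here is a minimal sketch of how observations from a sequence of egocentric frames can be accumulated in a shared world frame and later queried, which is the kind of persistent representation these approaches add on top of an MLLM. The class and method names (SpatialMemory, add_observation, query_near) are illustrative assumptions, not the actual API or method of 3DSPMR or any paper cited above.

```python
# Illustrative sketch only: accumulate object sightings across frames into a
# world-frame memory, then query it after the camera has moved. Names and the
# overall design are assumptions for exposition, not a cited paper's method.

import numpy as np


class SpatialMemory:
    """Accumulates object observations in a shared world frame across frames."""

    def __init__(self):
        self.positions = []  # world-frame XYZ of each remembered object
        self.labels = []     # object label for each remembered position

    def add_observation(self, label, xyz_cam, cam_to_world):
        """Store an object seen at camera-frame position xyz_cam.

        cam_to_world is a 4x4 homogeneous transform (the camera pose for this
        frame), so sightings from different viewpoints land in one consistent frame.
        """
        xyz_h = np.append(np.asarray(xyz_cam, dtype=float), 1.0)
        xyz_world = (cam_to_world @ xyz_h)[:3]
        self.positions.append(xyz_world)
        self.labels.append(label)

    def query_near(self, xyz_world, radius=1.0):
        """Return labels of remembered objects within `radius` of a world point."""
        if not self.positions:
            return []
        dists = np.linalg.norm(np.stack(self.positions) - np.asarray(xyz_world), axis=1)
        return [lbl for lbl, d in zip(self.labels, dists) if d <= radius]


# Example: an object observed in frame 1 stays queryable after the camera moves,
# the kind of object permanence the benchmarks above probe.
memory = SpatialMemory()
pose_frame1 = np.eye(4)                       # camera at the world origin
memory.add_observation("mug", [0.5, 0.0, 2.0], pose_frame1)

pose_frame2 = np.eye(4)
pose_frame2[:3, 3] = [1.0, 0.0, 0.0]          # camera translated 1 m along x
print(memory.query_near([0.5, 0.0, 2.0]))     # -> ['mug']
```

In a real system, the stored entries would typically be visual features or point clouds lifted from depth and camera pose rather than bare labels, and the query results would be fed back to the MLLM as additional context; the sketch only shows the world-frame accumulation and retrieval step.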