The field of 3D vision-language understanding is advancing rapidly, with work concentrated on tasks such as 3D medical image understanding, spatial reasoning, and cross-view object geo-localization. Researchers are exploring new architectures and training methods to improve vision-language model performance in these areas. One key trend is multi-modal learning, which fuses visual and language inputs rather than treating either modality in isolation. Another is the design of richer positional encoding schemes that capture both an object's spatial coordinates and its shape. Together, these advances are enabling applications such as automated report generation, text-conditioned 3D image synthesis, and robust cross-view object geo-localization.

Notable papers in this area include REALM and BTB3D. REALM introduces an MLLM-agent framework for open-world reasoning-based segmentation, showing strong performance in interpreting both explicit and implicit instructions. BTB3D presents a causal convolutional encoder-decoder that unifies 2D and 3D training and inference, producing compact, frequency-aware volumetric tokens and setting a new state of the art on report generation and text-to-CT synthesis.
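To make the positional-encoding trend mentioned above concrete, the sketch below shows one simple way an encoding could combine spatial coordinates with a coarse shape cue. It is not taken from any of the cited papers; the function names (`fourier_features`, `object_positional_encoding`) are hypothetical, and it assumes a plain Fourier-feature scheme over an object's normalized center coordinates and bounding-box extents.

```python
import numpy as np

def fourier_features(x, num_bands=8):
    """Map each input dimension to sin/cos features at multiple frequencies.

    x: array of shape (..., d) with values roughly in [0, 1].
    Returns an array of shape (..., d * 2 * num_bands).
    """
    freqs = 2.0 ** np.arange(num_bands)        # geometric frequency ladder: 1, 2, 4, ...
    angles = x[..., None] * freqs * np.pi      # (..., d, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)

def object_positional_encoding(center, box_size, num_bands=8):
    """Hypothetical encoding of both position and shape for one 3D object.

    center:   (3,) normalized xyz coordinates of the object center.
    box_size: (3,) normalized width/height/depth of the object's bounding box,
              used here as a simple proxy for object shape.
    """
    pos_feats = fourier_features(np.asarray(center, dtype=float), num_bands)
    shape_feats = fourier_features(np.asarray(box_size, dtype=float), num_bands)
    return np.concatenate([pos_feats, shape_feats])

# Example: an object centered at (0.2, 0.5, 0.8) with box extents (0.1, 0.3, 0.2).
enc = object_positional_encoding([0.2, 0.5, 0.8], [0.1, 0.3, 0.2])
print(enc.shape)  # (96,) = 2 groups * 3 dims * 2 (sin, cos) * 8 bands
```

The resulting vector can be added to or concatenated with an object's visual token before it enters a transformer, so that attention can distinguish objects by where they are and roughly how large they are; published schemes differ in the details.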