The field of multimodal reasoning and video understanding is moving toward more robust and generalizable models. Recent work has focused on improving the contextual understanding and temporal modeling of video-language models so that they can better capture complex real-world scenarios. One key direction is the integration of multiple heterogeneous models, combining their complementary strengths to offset the limitations of any single model. Another is the development of more effective captioning systems that produce compact yet informative representations of video content; such systems promise more efficient and accurate video understanding and support applications such as video question answering and sign language recognition.

Notable papers in this area include:

- Team of One, which proposes a novel framework for open-ended video question answering that enhances reasoning depth and robustness.
- MMS Player, which presents open-source software for parametric, data-driven animation of Sign Language avatars.
- Controllable Hybrid Captioner, which explores ways to improve the quality of text-based memories for video understanding.