The field of vision-language models is moving toward stronger compositional understanding and temporal alignment. Researchers are developing new benchmarks and frameworks to evaluate and enhance fine-grained, temporally coherent video-text alignment. Noteworthy papers in this area include VideoComp, which introduces a benchmark and learning framework for advancing video-text compositionality understanding, and SCRAMBLe, which preference-tunes open-weight MLLMs on synthetic preference data to improve compositional reasoning. SVLTA and "Human-like compositional learning of visually-grounded concepts using synthetic environments" are also notable: the former evaluates models' ability to achieve alignment from a temporal perspective, while the latter demonstrates that human-like learning strategies improve the learning efficiency of artificial systems.
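To make the preference-tuning idea concrete, here is a minimal sketch of a DPO-style objective over synthetic (preferred, dispreferred) response pairs. This illustrates preference tuning in general rather than SCRAMBLe's published method; the function name, arguments, and values are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO-style loss over (chosen, rejected) response pairs.

    Each argument is a tensor of per-sample sequence log-probabilities
    (summed over tokens) under the trainable policy or a frozen reference
    model. This is an illustrative sketch, not SCRAMBLe's exact objective.
    """
    # Log-ratio of policy vs. reference for preferred and dispreferred responses
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a larger margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 pairs
    torch.manual_seed(0)
    pol_c, pol_r = torch.randn(4), torch.randn(4)
    ref_c, ref_r = torch.randn(4), torch.randn(4)
    print(dpo_preference_loss(pol_c, pol_r, ref_c, ref_r).item())
```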