The field of video understanding and reasoning is moving toward more fine-grained, temporally aware approaches. Researchers are enhancing temporal understanding by decomposing videos into non-overlapping events and modeling the causal dependencies between them. Multimodal optimization frameworks that combine text enhancement with multi-hop temporal graph modeling are improving audio-visual video parsing, while retrieval mechanisms and compositional reasoning over graphs are gaining attention as a way to handle long videos and complex queries. Noteworthy papers include TEMPURA, a two-stage training framework for video temporal understanding, and TeMTG, a multimodal optimization framework for audio-visual video parsing. Also notable are RAVU, a retrieval-augmented video understanding framework, and DyGEnc, a method for encoding sequences of textual scene graphs to reason over and answer questions about dynamic scenes.
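To make the shared idea behind these works concrete, the sketch below shows one minimal way to decompose a video into non-overlapping, time-stamped events and chain multi-hop queries over their temporal links. It is an illustration only, not the interface of any of the papers above; the `Event` and `TemporalGraph` types and the `events_after` query are hypothetical names, and the captions stand in for the output of some upstream captioning model.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """A non-overlapping video segment with a textual description."""
    start: float   # segment start time (seconds)
    end: float     # segment end time (seconds)
    caption: str   # assumed to come from an upstream captioning model

@dataclass
class TemporalGraph:
    """Events as nodes; directed edges encode 'happens before' relations."""
    events: list[Event] = field(default_factory=list)
    edges: dict[int, list[int]] = field(default_factory=dict)

    def add_event(self, event: Event) -> int:
        # Enforce the non-overlapping decomposition: each new event
        # must start at or after the previous event's end time.
        if self.events and event.start < self.events[-1].end:
            raise ValueError("events must not overlap")
        idx = len(self.events)
        self.events.append(event)
        if idx > 0:
            # Link consecutive events so queries can hop along the timeline.
            self.edges.setdefault(idx - 1, []).append(idx)
        return idx

    def events_after(self, idx: int, hops: int) -> list[Event]:
        """Follow 'before' edges for a fixed number of hops (multi-hop query)."""
        result, frontier = [], [idx]
        for _ in range(hops):
            nxt = [j for i in frontier for j in self.edges.get(i, [])]
            result.extend(self.events[j] for j in nxt)
            frontier = nxt
        return result

# Usage: decompose a clip into three ordered events, then ask a
# two-hop temporal question ("what happens after the first event?").
graph = TemporalGraph()
first = graph.add_event(Event(0.0, 4.5, "a person enters the kitchen"))
graph.add_event(Event(4.5, 9.0, "they open the fridge"))
graph.add_event(Event(9.0, 12.0, "they pour a glass of milk"))
for ev in graph.events_after(first, hops=2):
    print(f"[{ev.start:.1f}-{ev.end:.1f}] {ev.caption}")
```

The retrieval- and graph-based systems mentioned above operate on far richer structures (scene graphs, causal edges, learned embeddings), but the same pattern applies: segment, index, then answer queries by traversing relations rather than re-reading the whole video.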