The field of video understanding and quality assessment is moving toward more comprehensive and realistic evaluations, with a focus on multimodal large language models (MLLMs) and their application to tasks such as video question answering, temporal grounding, and anomaly detection. Researchers are developing new benchmarks and datasets to address the limitations of existing ones, notably the absence of real-world user-generated content and the shortage of diverse, challenging scenarios. Notable papers in this area include CHUG, which presents a large-scale subjective study of user-generated HDR video quality, and OmniVideoBench, which provides a rigorously designed benchmark for assessing synergistic audio-visual understanding. Other papers, such as SeqBench and CausalVerse, focus on evaluating sequential narrative coherence in text-to-video generation and causal representation learning, respectively. Overall, the field is advancing toward more accurate and generalizable models capable of handling complex, dynamic video content.