The field of AI-generated video evaluation and generation is advancing rapidly, with a focus on more robust and interpretable evaluation frameworks. Recent work has prioritized datasets and models that provide multi-aspect feedback and align with human preferences, supporting more accurate, human-aligned video generation in applications such as video captioning and text-to-video synthesis. Notably, integrating vision-language models and new loss functions has improved the performance of video evaluation models, while large-scale datasets of coherent multi-clip video sequences have enabled the generation of story-driven content with smooth visual transitions. Noteworthy papers include:
- AIGVE-MACS, which introduced a unified model for AI-generated video evaluation that outputs both numerical scores and multi-aspect natural-language comments, achieving state-of-the-art scoring correlation and comment quality (a sketch of this kind of scoring-correlation evaluation follows the list).
- AVC-DPO, which proposed a post-training framework that enhances the captioning capabilities of video multimodal large language models through preference alignment, achieving exceptional performance in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge (a sketch of the standard DPO objective underlying such frameworks follows the list).
- CI-VID, which introduced a dataset that enables models to produce coherent, multi-scene video sequences, demonstrating significant gains in accuracy and content consistency for multi-clip generation.
- SynTVA, which introduced a new dataset and benchmark to evaluate the utility of synthetic videos for building retrieval models, and showed that synthetic videos can be a valuable asset for dataset augmentation and improving text-to-video retrieval outcomes.
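For context on how "scoring correlation" is typically measured for an evaluation model such as AIGVE-MACS, the following is a minimal sketch that correlates model-predicted per-aspect scores with human ratings using rank correlation. The aspect names and scores are hypothetical, and this is not the authors' implementation.

```python
# Minimal sketch of per-aspect scoring-correlation evaluation for an
# AI-generated-video judge. Aspect names and data are illustrative only.
from scipy.stats import spearmanr, kendalltau

def score_correlation(model_scores, human_scores):
    """Correlate model-predicted scores with human ratings per aspect.

    model_scores / human_scores: dict mapping aspect name -> list of
    per-video scores, aligned by index.
    """
    results = {}
    for aspect, preds in model_scores.items():
        refs = human_scores[aspect]
        rho, _ = spearmanr(preds, refs)    # rank correlation
        tau, _ = kendalltau(preds, refs)   # pairwise ordering agreement
        results[aspect] = {"spearman": rho, "kendall": tau}
    return results

# Toy usage with hypothetical aspects and ratings.
model_scores = {
    "visual_quality": [4.1, 2.5, 3.8, 1.9],
    "text_alignment": [3.7, 4.4, 2.1, 3.0],
}
human_scores = {
    "visual_quality": [4.0, 3.0, 4.0, 2.0],
    "text_alignment": [4.0, 4.5, 2.0, 3.5],
}
print(score_correlation(model_scores, human_scores))
```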
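AVC-DPO builds on preference alignment; below is a minimal sketch of the standard Direct Preference Optimization (DPO) loss applied to (chosen, rejected) caption pairs. Tensor names are illustrative, log-probabilities are assumed to be summed over caption tokens, and this is not the paper's exact training code.

```python
# Minimal sketch of the standard DPO objective that preference-alignment
# frameworks such as AVC-DPO build on. Variable names are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over (chosen, rejected) caption pairs.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of the chosen or rejected caption under the policy
    (trainable) or reference (frozen) model.
    """
    # Log-ratio of policy vs. reference model for each caption.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled reward margin: push the policy to prefer the chosen caption.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```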