Advances in AI-Generated Video Evaluation and Generation

AI-generated video evaluation and generation is advancing rapidly, with growing emphasis on robust, interpretable evaluation frameworks. Recent work has prioritized datasets and models that provide multi-aspect feedback and align with human preferences. This shift has supported more accurate, human-like video generation, with applications such as video captioning and text-to-video synthesis. Integrating vision-language models with novel loss functions has improved the performance of video evaluation models, and large-scale datasets that model coherent multi-clip video sequences have facilitated story-driven content with smooth visual transitions. Noteworthy papers include:

  • AIGVE-MACS, which introduced a unified model for AI-generated video evaluation that provides numerical scores and multi-aspect language comment feedback, achieving state-of-the-art performance in scoring correlation and comment quality.
  • AVC-DPO, which proposed a post-training framework to enhance captioning capabilities in video multimodal large language models through preference alignment, achieving exceptional performance in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge.
  • CI-VID, which introduced a dataset that enables models to produce coherent, multi-scene video sequences, and demonstrated significant improvements in accuracy and content consistency when generating video sequences.
  • SynTVA, which introduced a new dataset and benchmark to evaluate the utility of synthetic videos for building retrieval models, and showed that synthetic videos can be a valuable asset for dataset augmentation and improving text-to-video retrieval outcomes.
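
For readers unfamiliar with the preference alignment mentioned for AVC-DPO, the following is a minimal sketch of the generic Direct Preference Optimization loss for a single preference pair. It is the standard published form of the objective, not necessarily the exact formulation used in the AVC-DPO paper. The function name, argument names, and the default beta value are illustrative assumptions. The inputs are summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") caption under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative sketch).

    All names and the default beta are assumptions made for this example.
    """
    # How much more (or less) likely each caption is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins.
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)),
    # computed in a numerically stable way for either sign.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree on both captions, the loss equals log 2. It falls below that value as the policy raises the chosen caption's likelihood relative to the rejected one, measured against the reference.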

Sources

AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation

AVC-DPO: Aligned Video Captioning via Direct Preference Optimization

CI-VID: A Coherent Interleaved Text-Video Dataset

Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
