The field of video generation and understanding is advancing rapidly, with a focus on developing more controllable and customizable models. Current research concentrates on generating high-quality videos with precise control over camera trajectories and object motion. There is also growing interest in unified models that perform both video understanding and generation, bridging the gap between image and video processing and enabling more efficient and effective video editing and generation.

Noteworthy papers in this area include LiON-LoRA, which proposes a LoRA-based framework for controllable spatial and temporal generation, and Tora2, which introduces a motion- and appearance-customized diffusion transformer for multi-entity video generation. Omni-Video presents a unified framework for video understanding, generation, and instruction-based editing, while FIFA proposes a unified faithfulness evaluation framework for text-to-video and video-to-text generation. PromptTea introduces a prompt-complexity-aware caching method that speeds up inference in video generation.
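For context on the LoRA-style adaptation that controllability work such as LiON-LoRA builds on, the sketch below shows a generic low-rank adapter wrapped around a frozen linear layer, with a scalar strength that can be varied at inference time to modulate the injected control signal. This is a minimal illustration of the general LoRA mechanism under stated assumptions, not the specific LiON-LoRA architecture; the class, dimensions, and the `scale` knob are invented for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (generic LoRA sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank            # strength knob; adjustable at inference

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction.
        return self.base(x) + self.scale * self.up(self.down(x))

# Usage: wrap a projection layer and dial the adapter strength up or down.
proj = nn.Linear(320, 320)
lora_proj = LoRALinear(proj, rank=8, alpha=8.0)
x = torch.randn(2, 16, 320)
y = lora_proj(x)        # same shape as the base layer's output
lora_proj.scale = 0.5   # hypothetical: weaken the injected control at inference
```

Exposing the adapter's scaling factor this way is one common route to continuous control over how strongly a learned behavior (e.g., a camera trajectory) is applied, which is the general idea the controllability papers above refine.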