The field of video and 3D generation is advancing rapidly, with a focus on improving visual quality, spatial accuracy, and controllability. Recent work has produced more realistic and detailed models, particularly for autonomous driving and image editing. Researchers are exploring new approaches to fine-tuning video generation models, balancing visual fidelity against dynamic accuracy, and building more efficient and flexible frameworks for image editing. Notable papers in this area include PosBridge, which proposes a novel framework for inserting custom objects into target scenes, and ObjFiller-3D, which introduces a method for consistent multi-view 3D inpainting via video diffusion models. Other noteworthy papers are ROSE, a framework for removing objects and their side effects in videos, and VoxHammer, a training-free approach to precise and coherent editing in native 3D space. Overall, these advances have the potential to significantly impact applications including autonomous driving, video editing, and 3D modeling.