The field of image and scene editing is advancing rapidly, with a focus on controllable and efficient methods for editing and generating high-quality images and scenes. Recent research explores diffusion models, latent-space editing, and multimodal vision-language models toward this goal.

Several notable papers illustrate the trend. FLUX.1 Kontext presents a generative flow matching model that unifies image generation and editing in latent space, while FOCUS is a unified vision-language model that integrates segmentation-aware perception with controllable object-centric generation. Inverse-and-Edit and EditP23 introduce new image-editing methods based on cycle consistency models and on propagating image prompts to multi-view representations, respectively. SceneCrafter and Generative Blocks World demonstrate controllable editing for driving scenes and 3D scene manipulation, and PrITTI shows promising results in generating controllable, editable 3D semantic scenes with a latent diffusion-based framework. MADrive introduces a memory-augmented reconstruction framework for driving-scene modeling that enables photorealistic synthesis of substantially altered or novel driving scenarios.

Together, these advances point toward more efficient, controllable, and high-quality image and scene editing, with applications across computer vision, robotics, and autonomous vehicles.
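To make the flow-matching idea behind models such as FLUX.1 Kontext concrete, the following is a minimal sketch of conditional flow matching (rectified-flow style) trained on toy 2-D data. It is an illustrative assumption of the general technique, not the implementation of any paper above: the network size, data distribution, and hyperparameters are arbitrary choices for demonstration.

```python
# Minimal sketch of conditional flow matching on toy 2-D data (illustrative only;
# not the FLUX.1 Kontext implementation). A small MLP regresses the velocity field
# along linear noise-to-data paths, then samples by integrating the learned ODE.
import torch
import torch.nn as nn


class VelocityNet(nn.Module):
    """Small MLP predicting the velocity v_theta(x_t, t)."""

    def __init__(self, dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t], dim=-1))


def sample_data(n: int) -> torch.Tensor:
    # Toy target distribution: two Gaussian blobs.
    centers = torch.tensor([[2.0, 0.0], [-2.0, 0.0]])
    idx = torch.randint(0, 2, (n,))
    return centers[idx] + 0.3 * torch.randn(n, 2)


model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x1 = sample_data(256)              # data samples
    x0 = torch.randn_like(x1)          # noise samples
    t = torch.rand(x1.size(0), 1)      # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1         # point on the linear interpolation path
    target_v = x1 - x0                 # ground-truth velocity along that path
    loss = ((model(xt, t) - target_v) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data) with Euler steps.
with torch.no_grad():
    x = torch.randn(512, 2)
    n_steps = 50
    for i in range(n_steps):
        t = torch.full((x.size(0), 1), i / n_steps)
        x = x + model(x, t) / n_steps
```

In latent-space editing settings, the same objective is applied to the latents of an autoencoder rather than raw 2-D points, and the velocity network is additionally conditioned on the source image and editing instruction.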