Text-Guided Image Editing

Text-guided image editing is advancing rapidly, with a focus on methods that edit images precisely while preserving their original content. Recent work has produced frameworks and techniques for fast, training-free, and mask-free editing. These methods build on diffusion models, transformers, and attention mechanisms to achieve high-quality results, including large-scale shape transformations, precise color control, and suppression of unwanted content.

Some noteworthy papers in this area include:

InstantEdit proposes a fast, few-step text-guided image editing method built on a piecewise rectified-flow framework.

CannyEdit introduces a training-free framework that combines selective Canny control with dual-prompt guidance to balance text adherence, context fidelity, and editing seamlessness.

Exploring Multimodal Diffusion Transformers systematically analyzes the attention mechanism of multimodal diffusion transformers and builds a robust prompt-based image editing method on that analysis.

Follow-Your-Shape proposes a training-free, mask-free framework for precise and controllable editing of object shapes via trajectory-guided region control.

Training-Free Text-Guided Color Editing leverages the attention mechanisms of modern multi-modal diffusion transformers to achieve accurate and consistent color edits.

Dual Recursive Feedback proposes a training-free system that recursively refines generation and appearance latents so that control conditions are faithfully reflected in controllable text-to-image diffusion.

Translation of Text Embedding suppresses strongly entangled content by translating the prompt embedding along a delta vector in the text embedding space of diffusion models; a toy sketch of this idea appears after this list.

NanoControl proposes a lightweight framework for precise and efficient control in diffusion transformers.

TweezeEdit proposes a tuning- and inversion-free framework that regularizes the editing path for consistent and efficient image editing.

CountCluster guides an object token's cross-attention map to form as many clusters as the object count given in the prompt, improving quantity control without training; the second sketch after this list illustrates the clustering step.
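
The delta-vector idea in Translation of Text Embedding lends itself to a compact illustration. Below is a minimal sketch, assuming a CLIP-style text encoder from Hugging Face transformers. The delta used here (the shift the embedding undergoes when the unwanted concept is appended to the prompt) is an illustrative simplification rather than the paper's construction, and the names suppress and strength are hypothetical.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       return_tensors="pt")
    # Per-token conditioning, shape (1, 77, 768) for this encoder.
    return encoder(**tokens).last_hidden_state

@torch.no_grad()
def suppress(prompt: str, unwanted: str, strength: float = 1.0) -> torch.Tensor:
    """Shift the prompt embedding away from an entangled concept."""
    base = embed(prompt)
    # Crude delta vector: the direction the embedding moves when the
    # unwanted concept is appended to the prompt.
    delta = embed(f"{prompt}, {unwanted}") - base
    # Translate the conditioning against that direction before it reaches
    # the diffusion model's cross-attention layers.
    return base - strength * delta

cond = suppress("a photo of a glass of water", "reflections")
```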

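The clustering step behind CountCluster can likewise be sketched in a few lines. This toy rendition assumes we already hold one object token's cross-attention map at some denoising step; the mass-balancing loss is a stand-in for the paper's objective, and count_cluster_loss and top_frac are hypothetical names. In practice such a loss would be differentiated with respect to the latent to steer sampling toward the requested object count.

```python
import torch
from sklearn.cluster import KMeans

def count_cluster_loss(attn_map: torch.Tensor, count: int,
                       top_frac: float = 0.2) -> torch.Tensor:
    """attn_map: (H, W) cross-attention map for one object token."""
    _, w = attn_map.shape
    flat = attn_map.flatten()
    k = max(count, int(top_frac * flat.numel()))
    top_idx = flat.topk(k).indices
    # Cluster the spatial coordinates of the strongest responses into
    # `count` groups, one per desired object instance.
    coords = torch.stack((top_idx // w, top_idx % w), dim=1).float()
    labels = KMeans(n_clusters=count, n_init=10).fit_predict(coords.cpu().numpy())
    labels = torch.from_numpy(labels).to(attn_map.device)
    total = flat[top_idx].sum()
    loss = attn_map.new_zeros(())
    for c in range(count):
        mass = flat[top_idx[labels == c]].sum()
        # Penalize uneven attention mass across clusters so the map splits
        # into `count` comparably strong object regions.
        loss = loss + (mass - total / count) ** 2
    return loss

# Example: a 32x32 attention map that should contain three objects.
loss = count_cluster_loss(torch.rand(32, 32, requires_grad=True), count=3)
```
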
Sources

InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow

CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion

Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models

NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

TweezeEdit: Consistent and Efficient Image Editing with Path Regularization

CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation
