Emerging Trends in Visual Generation and Editing

The field of visual generation and editing is advancing rapidly, with a focus on improving efficiency, consistency, and precision. Recent work introduces frameworks and models for high-quality image generation, editing, and segmentation, with diffusion models, autoregressive models, and multimodal large language models becoming increasingly prevalent. Several of these approaches report state-of-the-art results across all three tasks.

A key trend is the development of more efficient and scalable models, such as DiffusionX and Generation then Reconstruction, which speed up image generation and editing while maintaining quality. In parallel, cache methods for diffusion models have been surveyed as a way to cut computational overhead, and several works study how to scale inference-time compute for autoregressive and flow-matching generators.
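
The caching line of work rests on a simple observation: features produced by the heavy part of the network change slowly across adjacent denoising steps, so they can be reused rather than recomputed. The toy sketch below illustrates that pattern only; the function names and update rules are illustrative assumptions, not any specific paper's implementation.

```python
# Minimal sketch of feature caching in a diffusion-style sampling loop: the
# expensive sub-network's output is assumed to change slowly across adjacent
# timesteps, so it is recomputed only every `cache_interval` steps and reused
# in between. All names and update rules here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def expensive_block(x, t):
    """Stand-in for a costly backbone block (e.g. deep transformer layers)."""
    return np.tanh(0.9 * x + 0.01 * t)

def cheap_head(x, feat):
    """Stand-in for the light layers that always run on every step."""
    return x - 0.05 * (x - feat)

def sample_with_cache(x, num_steps=50, cache_interval=5):
    cached_feat = None
    for step, t in enumerate(np.linspace(1.0, 0.0, num_steps)):
        if cached_feat is None or step % cache_interval == 0:
            cached_feat = expensive_block(x, t)   # full compute, refresh cache
        x = cheap_head(x, cached_feat)            # reuse cached features otherwise
    return x

x0 = rng.standard_normal((8, 8))
print(sample_with_cache(x0).shape)  # (8, 8); expensive_block ran ~num_steps/cache_interval times
```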

Another significant line of research targets more precise and consistent image editing. Methods such as ConsistEdit and EditInfinity support fine-grained edits while preserving consistency with the source image, and new datasets such as Pico-Banana-400K provide large-scale, high-quality data for training and benchmarking text-guided editing models.
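
Training-free editors of this kind typically work in the model's own latent space rather than retraining anything. The toy sketch below illustrates one common recipe under stated assumptions, not ConsistEdit's or EditInfinity's actual algorithm: reconstruct the source trajectory, denoise a second trajectory under the edit prompt, and blend the two so that regions outside the edit stay consistent with the source.

```python
# Minimal, hedged sketch of a common training-free editing pattern (inspired by,
# but not reproducing, methods like ConsistEdit or EditInfinity): follow a
# source-reconstruction trajectory and an edit trajectory in parallel, and blend
# the source back in wherever the image should stay untouched.
import numpy as np

def toy_denoise_step(latent, prompt_vec, t):
    """Stand-in for one denoising step conditioned on a prompt embedding."""
    return latent + 0.1 * (prompt_vec - latent) * t

def edit(source_latent, source_prompt, target_prompt, edit_mask, steps=20):
    src, tgt = source_latent.copy(), source_latent.copy()
    for t in np.linspace(1.0, 0.0, steps):
        src = toy_denoise_step(src, source_prompt, t)   # reconstruct the source
        tgt = toy_denoise_step(tgt, target_prompt, t)   # follow the edit prompt
        # Keep unmasked regions locked to the source trajectory for consistency.
        tgt = edit_mask * tgt + (1.0 - edit_mask) * src
    return tgt

rng = np.random.default_rng(1)
lat = rng.standard_normal((16, 16))
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0   # edit only the centre patch
edited = edit(lat, source_prompt=np.zeros((16, 16)),
              target_prompt=np.ones((16, 16)), edit_mask=mask)
print(edited.shape)
```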

Some noteworthy papers in this area include NANO3D, a training-free approach to efficient, mask-free 3D editing, and BLIP3o-NEXT, which advances the state of the art in native image generation. TokenAR is notable for its simple yet effective token-level enhancement mechanism for multiple-subject generation. DiffPlace introduces a conditional diffusion framework for simultaneous VLSI placement, and LENS proposes a plug-and-play way to equip multimodal large language models with pixel-level segmentation abilities.
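
For the plug-and-play segmentation idea, the general pattern is to leave the multimodal LLM frozen and train only a small decoder on top of its hidden states. The sketch below is an assumed, simplified illustration of that pattern; the module names, shapes, and decoder design are hypothetical and do not reproduce the LENS architecture.

```python
# Minimal sketch (assumed architecture, not the LENS paper's exact design) of the
# plug-and-play idea: keep the multimodal LLM frozen and train only a lightweight
# decoder that maps its per-patch hidden states to pixel-level mask logits.
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for a frozen multimodal LLM returning per-patch hidden states."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_tokens):
        return self.proj(image_tokens)          # (batch, patches, dim)

class SegDecoder(nn.Module):
    """Lightweight trainable head: hidden states -> coarse mask logits."""
    def __init__(self, dim=64, grid=14):
        super().__init__()
        self.grid = grid
        self.head = nn.Linear(dim, 1)

    def forward(self, hidden):
        logits = self.head(hidden)              # (batch, patches, 1)
        return logits.view(hidden.shape[0], 1, self.grid, self.grid)

backbone, decoder = FrozenBackbone(), SegDecoder()
for p in backbone.parameters():
    p.requires_grad = False                     # the MLLM stays frozen

tokens = torch.randn(2, 196, 64)                # 196 = 14 x 14 image patches
mask_logits = decoder(backbone(tokens))
print(mask_logits.shape)                        # torch.Size([2, 1, 14, 14])
```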

Sources

NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks

BLIP3o-NEXT: Next Frontier of Native Image Generation

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

DiffPlace: A Conditional Diffusion Framework for Simultaneous VLSI Placement Beyond Sequential Paradigms

TokenAR: Multiple Subject Generation via Autoregressive Token-level Enhancement

DiffusionX: Efficient Edge-Cloud Collaborative Image Generation with Multi-Round Prompt Evolution

Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs

Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling

Inference-Time Compute Scaling For Flow Matching

ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Learning and Simulating Building Evacuation Patterns for Enhanced Safety Design Using Generative Models

A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing

EditInfinity: Image Editing with Binary-Quantized Generative Models

ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
