Advances in Text-to-Image Synthesis and Multimodal Control

The field of text-to-image synthesis is moving toward more controllable and customizable image generation. Recent developments focus on balancing subject fidelity against text alignment and on enabling more precise spatial control over generated images, through ideas such as negative attention and unified multimodal frameworks that embed layout coordinates directly into the language prompt. Another line of work addresses conflicts between input sources, such as text prompts and conditioning images, to improve the overall quality and coherence of the results.

Noteworthy papers include: MINDiff, which proposes a mask-integrated negative attention mechanism to mitigate overfitting in text-to-image personalization; ConsistCompose, which presents a unified multimodal framework for layout-controlled multi-instance image generation; BideDPO, which introduces a bidirectionally decoupled Direct Preference Optimization framework to resolve conflicts between text and condition signals; MultiID, which proposes a training-free approach for multi-ID customization via attention adjustment and spatial control; and Canvas-to-Image, which consolidates heterogeneous controls into a single canvas interface for compositional image generation with multimodal controls.
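
To make the layout-in-prompt idea concrete, the sketch below serializes per-instance phrases and quantized bounding boxes into a single text prompt. The tag format, bin count, and coordinate convention are illustrative assumptions, not the encoding used by ConsistCompose or any other paper listed here.

```python
# Minimal sketch of embedding layout coordinates in a language prompt.
# The <box> tag, 1000-bin quantization, and (x0, y0, x1, y1) order are
# assumptions for illustration only.

def quantize(value: float, num_bins: int = 1000) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    return min(num_bins - 1, max(0, round(value * num_bins)))

def layout_to_prompt(instances: list[dict], num_bins: int = 1000) -> str:
    """Serialize instance phrases and boxes into one text prompt.

    Each instance is {"phrase": str, "box": (x0, y0, x1, y1)} with
    coordinates normalized to [0, 1].
    """
    parts = []
    for inst in instances:
        x0, y0, x1, y1 = (quantize(v, num_bins) for v in inst["box"])
        parts.append(f'{inst["phrase"]} <box>{x0} {y0} {x1} {y1}</box>')
    return "A photo containing " + "; ".join(parts) + "."

if __name__ == "__main__":
    prompt = layout_to_prompt([
        {"phrase": "a corgi", "box": (0.05, 0.40, 0.45, 0.95)},
        {"phrase": "a red ball", "box": (0.55, 0.60, 0.80, 0.90)},
    ])
    print(prompt)
    # A photo containing a corgi <box>50 400 450 950</box>;
    # a red ball <box>550 600 800 900</box>.
```

The resulting string can be fed to a text encoder like any other prompt, which is what lets a unified multimodal framework treat layout as just another part of the language input.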

Sources

MINDiff: Mask-Integrated Negative Attention for Controlling Overfitting in Text-to-Image Personalization

ConsistCompose: Unified Multimodal Layout Control for Image Composition

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control

Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Canvas-to-Image: Compositional Image Generation with Multimodal Controls
