Advances in Text-to-Image Synthesis and Multimodal Control

The field of text-to-image synthesis is moving toward more controllable and customizable image generation. Recent developments focus on balancing subject fidelity against text alignment and on enabling more precise spatial control over generated images, through ideas such as negative attention and unified multimodal frameworks that embed layout coordinates directly into the language prompt. Another line of work addresses conflicts between input sources, such as text prompts and conditioning images, to improve the overall quality and coherence of the results.

Noteworthy papers include: MINDiff, which proposes a mask-integrated negative attention mechanism to mitigate overfitting in text-to-image personalization; ConsistCompose, which presents a unified multimodal framework for layout-controlled multi-instance image generation; BideDPO, which introduces a bidirectionally decoupled Direct Preference Optimization framework to resolve conflicts between text and condition signals; MultiID, which proposes a training-free approach for multi-ID customization via attention adjustment and spatial control; and Canvas-to-Image, which consolidates heterogeneous controls into a single canvas interface for compositional image generation with multimodal controls.
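
To make the layout-in-prompt idea concrete, the sketch below serializes per-instance phrases and quantized bounding boxes into a single text prompt. The tag format, bin count, and coordinate convention are illustrative assumptions, not the encoding used by ConsistCompose or any other paper listed here.

```python
# Minimal sketch of embedding layout coordinates in a language prompt.
# The <box> tag, 1000-bin quantization, and (x0, y0, x1, y1) order are
# assumptions for illustration only.

def quantize(value: float, num_bins: int = 1000) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    return min(num_bins - 1, max(0, round(value * num_bins)))

def layout_to_prompt(instances: list[dict], num_bins: int = 1000) -> str:
    """Serialize instance phrases and boxes into one text prompt.

    Each instance is {"phrase": str, "box": (x0, y0, x1, y1)} with
    coordinates normalized to [0, 1].
    """
    parts = []
    for inst in instances:
        x0, y0, x1, y1 = (quantize(v, num_bins) for v in inst["box"])
        parts.append(f'{inst["phrase"]} <box>{x0} {y0} {x1} {y1}</box>')
    return "A photo containing " + "; ".join(parts) + "."

if __name__ == "__main__":
    prompt = layout_to_prompt([
        {"phrase": "a corgi", "box": (0.05, 0.40, 0.45, 0.95)},
        {"phrase": "a red ball", "box": (0.55, 0.60, 0.80, 0.90)},
    ])
    print(prompt)
    # A photo containing a corgi <box>50 400 450 950</box>;
    # a red ball <box>550 600 800 900</box>.
```

The resulting string can be fed to a text encoder like any other prompt, which is what lets a unified multimodal framework treat layout as just another part of the language input.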

Sources

MINDiff: Mask-Integrated Negative Attention for Controlling Overfitting in Text-to-Image Personalization

ConsistCompose: Unified Multimodal Layout Control for Image Composition

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control

Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Canvas-to-Image: Compositional Image Generation with Multimodal Controls
