The field of text-to-image synthesis is advancing rapidly, with a focus on improving control and coherence in generated images. Recent work has explored multimodal approaches, incorporating attention mechanisms and large language models to enhance quality and consistency. Notably, researchers have made significant progress on challenges posed by complex prompts, multiple objects, and style specifications.
Key innovations include local prompt adaptation, cross-attention mechanisms, and semantic evolution modules, which improve layout control, stylistic consistency, and contextual coherence. There have also been advances in cross-domain image composition, enabling seamless, natural stylization without relying on text prompts.
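To make the cross-attention-based layout control concrete, below is a minimal, generic sketch of region-masked cross-attention: tokens belonging to a localized prompt are prevented from attending outside their designated image region. The function name, tensor shapes, and masking scheme are illustrative assumptions for exposition, not the exact mechanism of any paper discussed here.

```python
import torch

def region_masked_cross_attention(q, k, v, region_mask, local_token_ids):
    """
    q: (B, N_img, d)  image-latent queries (N_img = H*W patches)
    k, v: (B, N_txt, d)  text-token keys/values
    region_mask: (B, N_img) bool, True where the localized prompt should apply
    local_token_ids: indices of text tokens belonging to the localized prompt
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (B, N_img, N_txt)

    # Add a large negative bias so localized-prompt tokens are ignored by
    # image patches outside their target region.
    bias = torch.zeros_like(scores)
    outside = (~region_mask).float().unsqueeze(-1)       # (B, N_img, 1)
    bias[:, :, local_token_ids] = outside * -1e4

    attn = torch.softmax(scores + bias, dim=-1)
    return attn @ v

# Toy usage: the local prompt (tokens 5-7) only influences the first 16 patches.
B, N_img, N_txt, d = 1, 64, 8, 32
q, k, v = torch.randn(B, N_img, d), torch.randn(B, N_txt, d), torch.randn(B, N_txt, d)
region = torch.zeros(B, N_img, dtype=torch.bool)
region[:, :16] = True
out = region_masked_cross_attention(q, k, v, region, local_token_ids=[5, 6, 7])
```

In practice such a mask would be applied inside the cross-attention layers of a diffusion U-Net or transformer backbone; the standalone function above just isolates the masking logic.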
Particularly noteworthy papers include:
- LLMControl, which introduces a framework for grounded control of text-to-image diffusion synthesis using multimodal large language models, achieving competitive synthesis quality while allowing precise control over generated images.
- AIComposer, which presents a method for cross-domain image composition that does not require text prompts, preserving the diffusion prior and enabling stable stylization without a pre-stylization network.
- Chain-of-Cooking, which proposes a cooking process visualization model that generates coherent and semantically consistent images of cooking steps using a dynamic patch selection module and bidirectional chain-of-thought guidance.
- LOcalized Text and Sketch for fashion image generation (LOTS), which conditions generation on a global description paired with localized sketch-plus-text inputs and introduces a step-based merging strategy for diffusion adaptation, achieving state-of-the-art image generation performance on both global and localized metrics (a schematic sketch of step-wise conditioning merging follows this list).
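To illustrate what a step-based merging strategy can look like in general, the sketch below blends two guidance signals (e.g., localized vs. global conditioning) with a weight that depends on the denoising step. The linear schedule, function name, and choice of merging in noise-prediction space are assumptions made for illustration; they are not the specific LOTS algorithm.

```python
import torch

def merge_noise_predictions(eps_local: torch.Tensor,
                            eps_global: torch.Tensor,
                            step: int,
                            total_steps: int) -> torch.Tensor:
    """Weight localized conditioning more heavily early in denoising
    (when spatial layout is being decided) and global conditioning later
    (when overall style and coherence dominate)."""
    w_local = 1.0 - step / max(total_steps - 1, 1)   # decays from 1 to 0
    return w_local * eps_local + (1.0 - w_local) * eps_global

# Dummy usage at step 10 of a 50-step sampler, with latent-sized predictions.
eps_l, eps_g = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
merged = merge_noise_predictions(eps_l, eps_g, step=10, total_steps=50)
```

A scheduler like this slots in wherever the sampler combines conditional predictions, so the localized and global signals can be traded off without retraining the diffusion backbone.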