Reasoning-Augmented Text-to-Image Generation

The field of text-to-image generation is moving toward incorporating explicit reasoning and multimodal large language models (MLLMs) to improve the fidelity and compositional generalization of generated images. Recent work has introduced reasoning into prompt enhancement, generative image editing, and interpretable evaluation, enabling end-to-end training without human-annotated data and achieving state-of-the-art results on several benchmarks. Notable papers include RePrompt, a reprompting framework that introduces explicit reasoning into the prompt-enhancement process, and R-Genie, which combines the generative power of diffusion models with the reasoning capabilities of MLLMs. Also noteworthy are T2I-Eval-R1, a reinforcement learning framework for training open-source MLLMs as interpretable text-to-image evaluators, and Text2Grad, a reinforcement learning paradigm that turns free-form textual feedback into span-level gradients.
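The reasoning-augmented reprompting idea described above can be sketched as a two-stage pipeline: reason about the user's intent, then rewrite the prompt conditioned on that reasoning before it reaches the image generator. This is a minimal illustrative sketch, not RePrompt's actual implementation; in the paper both stages are performed by a model optimized with reinforcement learning, whereas here they are injected as plain callables, and the `toy_reasoner`/`toy_refiner` stubs are hypothetical stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RepromptResult:
    reasoning: str       # intermediate chain of reasoning about the prompt
    refined_prompt: str  # enhanced prompt passed to the image generator


def reprompt(user_prompt: str,
             reasoner: Callable[[str], str],
             refiner: Callable[[str, str], str]) -> RepromptResult:
    """Reasoning-augmented reprompting: first produce explicit reasoning
    about the user's intent, then rewrite the prompt conditioned on it."""
    reasoning = reasoner(user_prompt)
    refined = refiner(user_prompt, reasoning)
    return RepromptResult(reasoning=reasoning, refined_prompt=refined)


# Hypothetical stubs standing in for LLM calls, so the sketch runs as-is.
def toy_reasoner(prompt: str) -> str:
    return f"Key entities and spatial relations implied by: {prompt}"


def toy_refiner(prompt: str, reasoning: str) -> str:
    return f"{prompt}. Composition notes: {reasoning}"


result = reprompt("a red cube on a blue sphere", toy_reasoner, toy_refiner)
print(result.refined_prompt)
```

In a real system the refined prompt would be fed to a diffusion model, and the reasoner/refiner would be trained end-to-end against image-quality rewards rather than hand-written.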

Sources

RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

R-Genie: Reasoning-Guided Generative Image Editing

T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

Text2Grad: Reinforcement Learning from Natural Language Feedback

Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation
