Reasoning-Augmented Text-to-Image Generation

The field of text-to-image generation is moving toward incorporating explicit reasoning and multimodal large language models (MLLMs) to improve the fidelity and compositional generalization of generated images. Recent work has introduced reasoning into prompt enhancement, generative image editing, and interpretable evaluation, enabling end-to-end training without human-annotated data and achieving state-of-the-art results on several benchmarks. Notable papers include RePrompt, a reprompting framework that introduces explicit reasoning into the prompt-enhancement process, and R-Genie, which combines the generative power of diffusion models with the reasoning capabilities of MLLMs. Also noteworthy are T2I-Eval-R1, a reinforcement learning framework for training open-source MLLMs as interpretable text-to-image evaluators, and Text2Grad, a reinforcement learning paradigm that turns free-form textual feedback into span-level gradients.
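The reasoning-augmented reprompting idea described above can be sketched as a two-stage pipeline: reason about the user's intent, then rewrite the prompt conditioned on that reasoning before it reaches the image generator. This is a minimal illustrative sketch, not RePrompt's actual implementation; in the paper both stages are performed by a model optimized with reinforcement learning, whereas here they are injected as plain callables, and the `toy_reasoner`/`toy_refiner` stubs are hypothetical stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RepromptResult:
    reasoning: str       # intermediate chain of reasoning about the prompt
    refined_prompt: str  # enhanced prompt passed to the image generator


def reprompt(user_prompt: str,
             reasoner: Callable[[str], str],
             refiner: Callable[[str, str], str]) -> RepromptResult:
    """Reasoning-augmented reprompting: first produce explicit reasoning
    about the user's intent, then rewrite the prompt conditioned on it."""
    reasoning = reasoner(user_prompt)
    refined = refiner(user_prompt, reasoning)
    return RepromptResult(reasoning=reasoning, refined_prompt=refined)


# Hypothetical stubs standing in for LLM calls, so the sketch runs as-is.
def toy_reasoner(prompt: str) -> str:
    return f"Key entities and spatial relations implied by: {prompt}"


def toy_refiner(prompt: str, reasoning: str) -> str:
    return f"{prompt}. Composition notes: {reasoning}"


result = reprompt("a red cube on a blue sphere", toy_reasoner, toy_refiner)
print(result.refined_prompt)
```

In a real system the refined prompt would be fed to a diffusion model, and the reasoner/refiner would be trained end-to-end against image-quality rewards rather than hand-written.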

Sources

RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

R-Genie: Reasoning-Guided Generative Image Editing

T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

Text2Grad: Reinforcement Learning from Natural Language Feedback

Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation
