The field of text-to-image generation is moving toward more compositional and hierarchical approaches. Recent work focuses on improving the accuracy and coherence of generated images, particularly for complex scenes with multiple objects. Reinforcement learning, diffusion models, and optimization-based methods have all shown promise for strengthening compositional generation. In particular, researchers are exploring curriculum learning frameworks, proximal diffusion models, and hierarchical generative frameworks to address these challenges. Noteworthy papers include:
- One paper proposes a compositional curriculum reinforcement learning framework that leverages scene graphs to establish a difficulty criterion for compositional ability (a sketch of such a criterion follows this list).
- Another paper develops a text-to-image diffusion model based on backward discretizations and conditional proximal operators, achieving state-of-the-art results with less compute and a smaller model (the proximal update is illustrated after this list).
- A third paper introduces a hierarchical compositional generative framework that decomposes complex prompts into minimal semantic units and synthesizes them iteratively, so that each textual concept is faithfully built into the final scene (a decompose-and-compose sketch closes this section).
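
The difficulty criterion of the curriculum framework is only named above. As a minimal sketch, assume difficulty grows with the number of objects, attributes, and relations in a prompt's scene graph; the `SceneGraph` class, the weights, and `curriculum_order` below are hypothetical illustrations, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Hypothetical scene-graph container parsed from a text prompt."""
    objects: list[str]                                                # e.g. ["cat", "sofa"]
    attributes: dict[str, list[str]] = field(default_factory=dict)    # object -> attribute list
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)

def compositional_difficulty(g: SceneGraph) -> float:
    """Score a prompt's compositional difficulty from its scene graph.

    Weights are illustrative: relations are assumed harder to satisfy
    than attributes, which are harder than bare objects.
    """
    return (1.0 * len(g.objects)
            + 1.5 * sum(len(a) for a in g.attributes.values())
            + 2.0 * len(g.relations))

def curriculum_order(graphs: list[SceneGraph]) -> list[SceneGraph]:
    """Order training prompts easy-to-hard, as a curriculum would."""
    return sorted(graphs, key=compositional_difficulty)

# Example: a bare object scores lower than a multi-object, related scene.
easy = SceneGraph(objects=["cat"])
hard = SceneGraph(objects=["cat", "dog", "sofa"],
                  attributes={"cat": ["black"], "sofa": ["red"]},
                  relations=[("cat", "on", "sofa"), ("dog", "beside", "sofa")])
assert compositional_difficulty(easy) < compositional_difficulty(hard)
```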
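
The proximal machinery of the second paper can be made concrete with a standard identity: a backward (implicit) Euler discretization of a gradient flow is exactly a proximal step. With an assumed step size $\lambda$ and a conditional potential $f(\cdot \mid c)$ for prompt $c$ (notation ours, not the paper's):

$$
x_{k+1} \;=\; \operatorname{prox}_{\lambda f(\,\cdot\, \mid c)}(x_k) \;=\; \arg\min_{z}\; f(z \mid c) + \frac{1}{2\lambda}\,\lVert z - x_k \rVert^2 .
$$

Its first-order optimality condition, $x_{k+1} = x_k - \lambda\,\nabla f(x_{k+1} \mid c)$, evaluates the gradient at the new point, in contrast to the explicit step $x_{k+1} = x_k - \lambda\,\nabla f(x_k \mid c)$, which is what makes the update a backward discretization.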
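
The hierarchical pipeline of the third paper suggests a decompose-then-compose loop. A minimal sketch follows, where `decompose`, `model.generate`, and `model.refine` are hypothetical stand-ins for the paper's actual modules:

```python
def decompose(prompt: str) -> list[str]:
    """Split a complex prompt into minimal semantic units.

    Naive stand-in: a real system would use a parser or LLM;
    splitting on ' and ' is purely illustrative.
    """
    return [unit.strip() for unit in prompt.split(" and ") if unit.strip()]

def hierarchical_generate(prompt: str, model):
    """Iteratively synthesize a scene one semantic unit at a time.

    `model.generate(unit)` and `model.refine(canvas, unit)` are assumed
    interfaces, not a real library API.
    """
    units = decompose(prompt)
    canvas = model.generate(units[0])        # seed the scene with the first unit
    for unit in units[1:]:
        canvas = model.refine(canvas, unit)  # fold each remaining concept into the scene
    return canvas
```

The design point the sketch captures is that each unit is injected against the current canvas rather than re-rendered from scratch, which is what lets the framework keep earlier concepts intact as the scene grows.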