Advances in Text-to-Image Diffusion Models

The field of text-to-image diffusion models is evolving rapidly, with a focus on improving the alignment between generated images and input prompts. Recent work centers on mitigating visual hallucinations and on alignment techniques such as multimodal preference optimization and inference-time alignment. Notably, researchers have proposed methods to mitigate visual hallucinations, including Semantic Curriculum Preference Optimization and Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization. There have also been notable advances in plug-and-play prompt refinement, latent feedback, and listwise preference optimization. Together, these developments stand to improve the overall performance and controllability of text-to-image diffusion models.
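
For context, most of the preference-based methods above build on the pairwise direct preference optimization (DPO) objective; the following is a standard, generic form of that loss and is not taken from any single paper listed here:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $x$ is the prompt, $y_w$ and $y_l$ are the preferred and dispreferred images, $\pi_{\mathrm{ref}}$ is a frozen reference model, and $\beta$ controls the strength of the KL-style regularization. Multi-negative and listwise variants replace the single pair $(y_w, y_l)$ with a ranked set of candidate images.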

Noteworthy papers in this area include SemanticControl, which proposes a training-free approach for handling loosely aligned visual conditions in ControlNet; REFINE-CONTROL, which introduces a semi-supervised distillation framework for conditional image generation; MISP-DPO, which incorporates multiple, semantically diverse negative images into multimodal DPO via the Plackett-Luce model; CO3, which improves multi-concept prompt fidelity in text-to-image diffusion models through a corrective sampling strategy; IMG, which proposes a re-generation-based multimodal alignment framework requiring no extra data or editing operations; and MIRA, which introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory and prevents reward hacking.
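
As a concrete illustration of the multi-negative, Plackett-Luce-style preference objective mentioned for MISP-DPO, here is a minimal PyTorch sketch. The function name, tensor shapes, and the assumption that candidates arrive already sorted by preference are illustrative choices, not details taken from the paper:

```python
import torch


def plackett_luce_dpo_loss(logp_theta, logp_ref, beta=0.1):
    """Listwise DPO-style loss with one preferred image and several negatives,
    scored under a Plackett-Luce model (illustrative sketch, not the paper's code).

    logp_theta, logp_ref: tensors of shape (batch, k) holding log-likelihoods of
    the k candidate images under the current and reference models; index 0 is
    the preferred image, indices 1..k-1 are the negatives, assumed to be given
    in the annotated preference order.
    """
    # Implicit rewards, as in DPO: scaled log-ratio against the reference model.
    scores = beta * (logp_theta - logp_ref)  # (batch, k)

    # Plackett-Luce log-likelihood of the ranking 0 > 1 > ... > k-1:
    # sum over positions i of  s_i - logsumexp(s_i, ..., s_{k-1}).
    k = scores.size(1)
    loss = 0.0
    for i in range(k - 1):  # the last position contributes zero
        loss = loss - (scores[:, i] - torch.logsumexp(scores[:, i:], dim=1))
    return loss.mean()


# Toy usage with random log-probabilities for 1 positive + 3 negatives.
logp_theta = torch.randn(8, 4)
logp_ref = torch.randn(8, 4)
print(plackett_luce_dpo_loss(logp_theta, logp_ref))
```

In diffusion models the per-image log-likelihoods are typically approximated, for example via the denoising objective along the sampling trajectory, and how that approximation and the negative set are constructed is one place where the listed methods differ.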

Sources

SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet

REFINE-CONTROL: A Semi-supervised Distillation Method For Conditional Image Generation

Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs

Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization

Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs

CO3: Contrasting Concepts Compose Better

IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment

Towards Better Optimization For Listwise Preference in Diffusion Models

MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models
