Research on text-to-image diffusion models is moving quickly, with much of the effort aimed at tightening the alignment between generated images and input prompts. Recent work clusters around three themes: mitigating visual hallucinations, multimodal preference optimization, and inference-time alignment. Proposed remedies for visual hallucinations include Semantic Curriculum Preference Optimization and Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization, while parallel progress has been made on plug-and-play prompt refinement, latent feedback, and listwise preference optimization. Together, these methods aim at more controllable and more prompt-faithful generation.
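Many of the preference-based methods above extend a DPO-style objective adapted to diffusion models. As a point of reference, here is a minimal sketch of that pairwise objective in the generic Diffusion-DPO form, not the exact loss of any paper covered here; the function name and the use of per-sample denoising errors as implicit log-likelihoods are illustrative assumptions.

```python
import torch.nn.functional as F

def diffusion_dpo_loss(err_w_theta, err_w_ref, err_l_theta, err_l_ref, beta=0.1):
    """Pairwise DPO-style loss adapted to diffusion models (a generic sketch).

    Each `err_*` tensor holds the per-sample denoising MSE of the trainable
    policy (theta) or the frozen reference model on the preferred (w) or
    dispreferred (l) image at a shared random timestep; lower denoising error
    stands in for higher log-likelihood.
    """
    # Implicit log-likelihood ratios: smaller denoising error => larger ratio.
    logratio_w = -(err_w_theta - err_w_ref)
    logratio_l = -(err_l_theta - err_l_ref)
    # Push the policy to fit the preferred image better than the dispreferred
    # one, relative to the reference model; beta controls the KL-like penalty.
    return -F.logsigmoid(beta * (logratio_w - logratio_l)).mean()
```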
Noteworthy papers in this area include:

- SemanticControl: a training-free approach for handling loosely aligned visual conditions in ControlNet.
- REFINE-CONTROL: a semi-supervised distillation framework for conditional image generation.
- MISP-DPO: incorporates multiple semantically diverse negative images into multimodal DPO via the Plackett-Luce model (see the sketch after this list).
- CO3: improves multi-concept prompt fidelity in text-to-image diffusion models through a corrective sampling strategy.
- IMG: a re-generation-based multimodal alignment framework that requires no extra data or editing operations.
- MIRA: introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory and prevents reward hacking.
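To make the Plackett-Luce ingredient of MISP-DPO concrete, below is a minimal sketch of the listwise negative log-likelihood over one preferred image and several negatives. The helper name and the score layout are assumptions, and the paper's importance sampling of negatives is not shown; only the standard PL factorization is.

```python
import torch

def plackett_luce_nll(scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` is a (batch, k) tensor whose columns are ordered by preference:
    column 0 is the preferred (positive) image's score, columns 1..k-1 are
    progressively less-preferred negatives. The PL likelihood factorizes as a
    product of softmax terms over the candidates remaining at each rank:
        P(ranking) = prod_k exp(s_k) / sum_{j >= k} exp(s_j)
    """
    # logcumsumexp over the reversed sequence yields log sum_{j>=k} exp(s_j).
    denom = torch.logcumsumexp(scores.flip(dims=(-1,)), dim=-1).flip(dims=(-1,))
    # -log P(ranking) = -sum_k [ s_k - log sum_{j>=k} exp(s_j) ]
    return -(scores - denom).sum(dim=-1)
```

In a DPO-style setup, each score would plausibly be a beta-weighted log-ratio of policy to reference-model likelihoods for the corresponding image, so minimizing this NLL ranks the positive above every negative jointly rather than pairwise.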