Text-to-image generation is evolving rapidly, with current work focused on improving the quality and controllability of generated images. Recent developments center on personalizing text-to-image diffusion models so that they produce more diverse and accurate images. In parallel, there is a push toward more effective methods for detecting and preventing the generation of Not Safe For Work (NSFW) content.
Noteworthy papers in this area include the following. LAMIC introduces a layout-aware multi-image composition framework and reports state-of-the-art performance in controllable image synthesis. Wukong is a transformer-based NSFW detection framework that uses intermediate outputs from early denoising steps, so unsafe content can be flagged without waiting for the full image to be generated. YOLO-Count is a differentiable open-vocabulary object counting model that enables precise quantity control in text-to-image generation. UNCAGE is a training-free method that improves compositional fidelity by using attention maps to prioritize the unmasking of tokens that clearly represent individual objects.