Advances in Multimodal Image Generation and Editing

The field of multimodal image generation and editing is evolving rapidly, with an emphasis on more sophisticated and controllable models. Recent work draws on diffusion models, large language models, and vision-language models to improve the quality, diversity, and controllability of generated images. Noteworthy papers include FlexMUSE, which proposes a multimodal unification and semantics enhancement framework for creative writing, and JCo-MVTON, which introduces a jointly controllable multi-modal diffusion transformer for mask-free virtual try-on. The Instant Preference Alignment framework enables preference-aligned text-to-image generation in a real-time, training-free manner, and the All-in-One Slider module provides fine-grained attribute manipulation in diffusion models. Together, these advances could substantially improve applications such as virtual try-on, image editing, and content creation.
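To make the idea of slider-style attribute control concrete, the sketch below interpolates between two text-prompt embeddings in an off-the-shelf Stable Diffusion pipeline and exposes the interpolation weight as a scalar "slider". This is only a minimal illustration of continuous attribute control, not the All-in-One Slider or Instant Preference Alignment method from the cited papers; the model ID, prompts, and slider values are assumptions chosen for the example.

```python
# Hedged sketch: continuous attribute control via text-embedding interpolation.
# Not the All-in-One Slider method; it only illustrates the general principle of
# steering a diffusion model with a single scalar control over the conditioning.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed model ID for illustration
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

def encode(prompt: str) -> torch.Tensor:
    """Encode a prompt into CLIP text embeddings using the pipeline's own encoder."""
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(device)
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]

# Two prompts that differ only in the attribute we want to "slide" over.
emb_a = encode("portrait photo of a young person, studio lighting")
emb_b = encode("portrait photo of an elderly person, studio lighting")

# Slider in [0, 1]: 0.0 uses the first embedding, 1.0 the second.
for slider in (0.0, 0.5, 1.0):
    prompt_embeds = (1.0 - slider) * emb_a + slider * emb_b
    image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
    image.save(f"slider_{slider:.1f}.png")
```

In practice, published slider-style approaches generally learn the attribute direction (for example as a learned offset or low-rank weight update) rather than relying on a hand-written prompt pair, but the interpolation above shows why a single scalar over the conditioning is enough to act as a continuous control on generation.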
Sources
FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing
Bias Amplification in Stable Diffusion's Representation of Stigma Through Skin Tones and Their Homogeneity
Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models