Advancements in Diffusion Models for Text and Image Generation

The field of diffusion models is advancing rapidly, with a strong focus on improving the quality and efficiency of text and image generation. Researchers are exploring new architectures and techniques, including latent variable modeling for discrete diffusion and sparse diffusion transformers, which have produced state-of-the-art results in tasks such as text-to-image synthesis and multimodal generation. Unified models that handle multiple tasks and modalities are also gaining traction, with promising results in both quality and efficiency. A generic sketch of the parallel-decoding idea behind several of these models follows below.

Noteworthy papers in this area include VADD, which introduces a variational autoencoding framework for discrete diffusion with latent variable modeling; One-Way Ticket, which proposes a time-independent unified encoder for distilling text-to-image diffusion models; Muddit, a unified discrete diffusion transformer that enables fast, parallel generation across text and image modalities; and OpenUni, a simple, lightweight baseline for unified multimodal understanding and generation.
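
To make the parallel-generation idea concrete, here is a minimal sketch of confidence-based unmasking in a masked discrete diffusion model. It is a generic illustration rather than the sampling procedure of Muddit, VADD, or any other paper listed below; the `model` interface, `mask_id`, step count, and linear unmasking schedule are all illustrative assumptions.

```python
import torch

def parallel_masked_decode(model, seq_len, vocab_size, mask_id, num_steps=8, device="cpu"):
    """Illustrative parallel decoding loop for a masked discrete diffusion model.

    `model(tokens)` is assumed to return per-position logits of shape
    (seq_len, vocab_size); the confidence-based unmasking rule and linear
    schedule are generic choices, not those of any specific paper.
    """
    # Start from a fully masked sequence and reveal tokens over num_steps passes.
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        masked = tokens == mask_id
        if not masked.any():
            break

        logits = model(tokens)                      # (seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        confidence, candidates = probs.max(dim=-1)  # best token and its probability per position

        # Only still-masked positions are eligible for unmasking this step.
        confidence = confidence.masked_fill(~masked, -float("inf"))

        # Unmask a growing fraction of the remaining masked positions (linear schedule).
        num_to_unmask = max(1, int(masked.sum().item() / (num_steps - step)))
        unmask_idx = confidence.topk(num_to_unmask).indices
        tokens[unmask_idx] = candidates[unmask_idx]
    return tokens
```

Because all positions are predicted in a single forward pass per step, the number of model calls scales with the step count rather than the sequence length, which is the source of the speedups reported for parallel discrete diffusion decoders.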

Sources

Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes

PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation