Advances in Multimodal Generation and Understanding

The field of multimodal generation and understanding is moving toward unified frameworks that handle multiple tasks and modalities within a single model. Recent work explores diffusion models, autoregressive models, and discrete generative models as backbones for such frameworks, targeting both high-quality synthesis of images and text and faithful representation of complex visual concepts. Noteworthy papers in this area include D2C, which proposes a two-stage method for continuous autoregressive image generation built on discrete tokens, and CoSimGen, which presents a controllable diffusion model for simultaneous image and mask generation. Others, such as MMGen and Unified Multimodal Discrete Diffusion, introduce unified frameworks for multimodal generation and understanding and report strong performance across a range of tasks and datasets.
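
To make the distinction between the two token-based generation styles mentioned above concrete, here is a minimal toy sketch (not taken from any of the cited papers; the model, constants, and unmasking schedule are hypothetical stand-ins): autoregressive decoding fills discrete image tokens one position at a time, while a masked discrete-diffusion-style sampler starts fully masked and unmasks groups of tokens over a fixed number of steps.

```python
# Toy contrast between autoregressive decoding over discrete image tokens and a
# masked discrete-diffusion-style sampler. All models and constants are stand-ins.
import random

VOCAB_SIZE = 16   # size of the discrete visual codebook (toy value)
NUM_TOKENS = 64   # tokens per image, e.g. an 8x8 latent grid (toy value)
MASK = -1         # sentinel id for a masked token


def toy_logits(tokens, position):
    """Stand-in for a learned network: returns uniform logits over the codebook."""
    return [0.0] * VOCAB_SIZE


def sample_from_logits(logits):
    """Sample a token id; with uniform logits this reduces to uniform sampling."""
    return random.randrange(len(logits))


def autoregressive_decode():
    """Generate tokens one at a time, each conditioned on the prefix so far."""
    tokens = []
    for pos in range(NUM_TOKENS):
        tokens.append(sample_from_logits(toy_logits(tokens, pos)))
    return tokens


def masked_diffusion_decode(num_steps=8):
    """Start fully masked; at each step, fill in a share of the remaining masks in parallel."""
    tokens = [MASK] * NUM_TOKENS
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Unmask an equal share of the remaining positions at each step.
        to_fill = random.sample(masked, max(1, len(masked) // (num_steps - step)))
        for pos in to_fill:
            tokens[pos] = sample_from_logits(toy_logits(tokens, pos))
    return tokens


if __name__ == "__main__":
    print("autoregressive:  ", autoregressive_decode()[:8], "...")
    print("masked diffusion:", masked_diffusion_decode()[:8], "...")
```

The trade-off this illustrates is sequential, prefix-conditioned sampling versus parallel, iterative refinement; the unified frameworks surveyed here combine or interpolate between such schemes rather than following this exact toy procedure.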

Sources

D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens

Enhancing Graphical Lasso: A Robust Scheme for Non-Stationary Mean Data

CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation

Latent Beam Diffusion Models for Decoding Image Sequences

MMGen: Unified Multi-modal Image Generation and Understanding in One Go

Unified Multimodal Discrete Diffusion

UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning

Critical Iterative Denoising: A Discrete Generative Model Applied to Graphs

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
