The field of multimodal generation and understanding is advancing rapidly, with a focus on unified frameworks that handle multiple tasks and modalities within a single model. Recent work explores diffusion models, autoregressive models, and discrete generative models toward this goal, showing promise both in generating high-quality images, text, and other modalities and in understanding and representing complex visual concepts.

Noteworthy papers include D2C, which proposes a two-stage method for continuous autoregressive image generation, and CoSimGen, which presents a controllable diffusion model for simultaneous image and mask generation. Others, such as MMGen and Unified Multimodal Discrete Diffusion, introduce unified frameworks for multimodal generation and understanding and report strong performance across a range of tasks and datasets; the discrete variants, in particular, train a single model to denoise a masked joint sequence of modality tokens, as sketched below.
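
To make the discrete-diffusion idea concrete, the following is a minimal PyTorch sketch of masked discrete diffusion over a joint sequence of text and image tokens: each position is masked with a probability set by the timestep, and a denoiser is trained to recover the original tokens at masked positions. The mask token id, the shared vocabulary size, the uniform masking schedule, and all function names are illustrative assumptions, not the implementation of any of the papers mentioned above.

```python
# Toy sketch of masked discrete diffusion over a joint multimodal token
# sequence. Token ids, vocabulary size, and schedule are assumptions made
# for illustration only.
import torch
import torch.nn.functional as F

MASK_ID = 0          # hypothetical shared [MASK] token id
VOCAB_SIZE = 16384   # hypothetical joint text+image codebook size


def forward_mask(tokens, t):
    """Corrupt token sequences by masking each position independently
    with probability t, the diffusion timestep in [0, 1]."""
    mask = torch.rand_like(tokens, dtype=torch.float) < t.unsqueeze(-1)
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    return corrupted, mask


def loss_step(denoiser, text_tokens, image_tokens):
    """One training step: concatenate modalities into a single sequence,
    mask at a random timestep, and predict the original masked tokens."""
    joint = torch.cat([text_tokens, image_tokens], dim=1)   # (B, L)
    t = torch.rand(joint.shape[0], device=joint.device)     # per-sample timestep
    corrupted, mask = forward_mask(joint, t)
    logits = denoiser(corrupted)                             # (B, L, VOCAB_SIZE)
    # Cross-entropy is computed only at masked positions.
    return F.cross_entropy(logits[mask], joint[mask])
```

Sampling then runs the reverse process: start from a fully masked sequence and iteratively unmask positions using the denoiser's predictions, which is what allows one model to generate text, images, or both depending on which positions are conditioned on.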