The field of multimodal understanding and generation is advancing rapidly, with a focus on unified models that jointly understand and generate diverse content. Recent models handle tasks such as image understanding, object grounding, image editing, and high-resolution text-to-image synthesis, achieving state-of-the-art results across a wide range of benchmarks and exhibiting emergent capabilities such as zero-shot learning and visual reasoning. Notable papers in this area include Diff-Feat, which introduces a simple but powerful framework for extracting intermediate features from pre-trained diffusion-Transformer models, and Lavida-O, which proposes a unified Masked Diffusion Model for multimodal understanding and generation. EditVerse is another noteworthy model: it unifies image and video editing and generation within a single model, achieves state-of-the-art performance, and exhibits emergent editing and generation abilities across modalities.
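The core idea behind Diff-Feat, tapping intermediate activations from a pre-trained diffusion Transformer, can be illustrated with ordinary forward hooks. The sketch below is a minimal illustration rather than the paper's actual pipeline: it uses a toy Transformer stand-in instead of a real diffusion checkpoint, and the tapped layer index and mean-pooling step are assumptions chosen for demonstration.

```python
# Illustrative sketch only: capturing intermediate activations from a
# Transformer backbone with forward hooks, in the spirit of Diff-Feat.
# The toy model, the tapped layer index, and the mean pooling are
# assumptions for demonstration, not the paper's configuration.
import torch
import torch.nn as nn

# Stand-in for a pre-trained diffusion-Transformer backbone.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=12)

captured = {}

def save_hidden(name):
    def hook(module, inputs, output):
        # Store the block's output tokens for later use as features.
        captured[name] = output.detach()
    return hook

# Tap a middle block; which layer is most useful is an empirical choice.
tap_index = 6
backbone.layers[tap_index].register_forward_hook(save_hidden(f"block_{tap_index}"))

# Fake "noisy latent tokens" standing in for one diffusion timestep's input.
tokens = torch.randn(2, 64, 256)  # (batch, sequence, channels)
with torch.no_grad():
    backbone(tokens)

# Pool the captured token features into one vector per example.
features = captured[f"block_{tap_index}"].mean(dim=1)
print(features.shape)  # torch.Size([2, 256])
```

In practice such pooled features would feed a lightweight downstream head (e.g. a linear probe); the hook-based extraction shown here is only one common way to expose a frozen backbone's intermediate representations.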