Multimodal Understanding and Generation

Research on multimodal understanding and generation is advancing rapidly, with a focus on unified models that jointly understand and generate diverse content. Recent work spans tasks such as image understanding, object grounding, image editing, and high-resolution text-to-image synthesis, with models reaching state-of-the-art results on a wide range of benchmarks and showing emergent capabilities such as zero-shot learning and visual reasoning. Notable papers in this area include Diff-Feat, a simple but powerful framework for extracting intermediate features from pre-trained diffusion Transformers; Lavida-O, a unified masked diffusion model for multimodal understanding and generation; and EditVerse, which unifies image and video editing and generation within a single model, achieving state-of-the-art performance and exhibiting emergent editing and generation abilities across modalities.
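
To make the feature-extraction idea concrete, the sketch below shows one common way to harvest intermediate activations from a Transformer backbone with forward hooks and pool them into per-image vectors for a downstream classifier. This is a minimal illustration under assumptions, not the Diff-Feat implementation: the small `nn.TransformerEncoder` stands in for a pre-trained diffusion Transformer so the example stays self-contained, and the choice of which block to read from (`block_3` here) is arbitrary.

```python
import torch
import torch.nn as nn

# Stand-in backbone: in practice this would be a pre-trained diffusion
# Transformer; a small TransformerEncoder keeps the sketch runnable.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Cache the block's output tokens: (batch, tokens, dim).
        features[name] = output.detach()
    return hook

# Register a hook on every Transformer block so each layer's activations
# are captured during a single forward pass.
for i, block in enumerate(backbone.layers):
    block.register_forward_hook(make_hook(f"block_{i}"))

tokens = torch.randn(2, 196, 256)  # e.g. 14x14 patch tokens per image
with torch.no_grad():
    backbone(tokens)

# Pool tokens from a chosen intermediate block into one vector per image;
# which block gives the best downstream features is an empirical question.
pooled = features["block_3"].mean(dim=1)  # (batch, dim)
print(pooled.shape)                       # torch.Size([2, 256])
```

The pooled vectors can then feed any standard multi-label classification head; the point of the sketch is only the hook-based access to intermediate diffusion-Transformer features.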

Sources

Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

The Describe-Then-Generate Bottleneck: How VLM Descriptions Alter Image Generation Outcomes

Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation

AGSwap: Overcoming Category Boundaries in Object Fusion via Adaptive Group Swapping

Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation

OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps

Video models are zero-shot learners and reasoners

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
