Advances in Multimodal Modeling and Protein Generation

The field of multimodal modeling is advancing rapidly, with a focus on more efficient and capable unified models for image generation, understanding, and manipulation. Recent work explores novel architectures, such as two-end-separated, middle-shared designs that mitigate modality conflict in unified multimodal models. There is also growing interest in incorporating physical and biological priors into generative models, particularly for protein structure prediction and generation.

Notable papers include SpecMER, which introduces a k-mer guided speculative decoding framework for fast protein generation, and UniAlignment, which proposes a semantic-alignment framework unifying image generation, understanding, manipulation, and perception. Also noteworthy are HieraTok, a multi-scale visual tokenizer that improves image reconstruction and generation, and MarS-FM, a generative model of molecular dynamics built on Markov state models. Overall, the field is moving toward more integrated, physically informed models that capture the complexities of multimodal data.
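Since the summary highlights speculative decoding as the mechanism behind SpecMER's speed-up, a brief sketch may help: a cheap draft model proposes a block of residues, a k-mer plausibility check prunes unlikely drafts, and the expensive target model verifies the survivors, keeping the longest accepted prefix. The code below is a minimal, illustrative sketch only; the draft/target models, the acceptance test, the function names, and the k-mer table are all assumptions and do not reflect SpecMER's actual algorithm.

```python
# Illustrative sketch of k-mer guided speculative decoding for protein
# sequence generation. All models and scores here are toy placeholders,
# not SpecMER's actual components.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def draft_block(prefix, n_draft):
    """Cheap draft model: proposes a block of candidate residues (placeholder)."""
    return [random.choice(AMINO_ACIDS) for _ in range(n_draft)]

def target_accepts(prefix, residue):
    """Expensive target-model verification (placeholder acceptance test)."""
    return random.random() < 0.7

def target_sample(prefix):
    """Fallback: sample one residue directly from the target model (placeholder)."""
    return random.choice(AMINO_ACIDS)

def kmer_plausible(prefix, residue, kmer_table, k=3):
    """Guidance signal: keep drafts whose trailing k-mer appears in a
    reference k-mer set (illustrative stand-in for k-mer guidance)."""
    kmer = (prefix + residue)[-k:]
    return len(kmer) < k or kmer in kmer_table

def speculative_generate(kmer_table, max_len=30, n_draft=4, seed=0):
    random.seed(seed)
    seq = ""
    while len(seq) < max_len:
        block = draft_block(seq, n_draft)      # 1. draft several residues cheaply
        accepted = []
        for residue in block:                  # 2. verify drafts left-to-right
            cand = seq + "".join(accepted)
            if not kmer_plausible(cand, residue, kmer_table):
                break                          # prune implausible drafts early
            if not target_accepts(cand, residue):
                break                          # target rejects the rest of the block
            accepted.append(residue)
        seq += "".join(accepted)
        if not accepted:                       # 3. on full rejection, fall back to
            seq += target_sample(seq)          #    one residue from the target model
        seq = seq[:max_len]
    return seq

if __name__ == "__main__":
    toy_kmers = {"MKT", "KTA", "TAL", "ALS"}   # toy reference k-mer set
    print(speculative_generate(toy_kmers))
```

The speed-up in real systems comes from verifying the drafted block with far fewer calls to the expensive target model than token-by-token decoding would require; the k-mer guidance is an additional, biologically motivated filter on the drafts.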

Sources

SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding

HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

UI-UG: A Unified MLLM for UI Understanding and Generation

Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Let Physics Guide Your Protein Flows: Topology-aware Unfolding and Generation

STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models

Flow Autoencoders are Effective Protein Tokenizers

JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

Accelerating Long-Term Molecular Dynamics with Physics-Informed Time-Series Forecasting

Growing Visual Generative Capacity for Pre-Trained MLLMs
