The field of multimodal learning and generation is advancing rapidly, with a focus on models that integrate and process multiple forms of data such as text, images, and video. Key directions include multimodal entity linking, text-to-image synthesis, and video generation, with potential applications ranging from image and video generation to natural language processing and human-computer interaction. Notable papers in this area include PGMEL, which proposes a policy gradient-based generative adversarial network for multimodal entity linking, and TIT-Score, which introduces a zero-shot metric for evaluating long-prompt text-to-image generation. Other noteworthy work includes Med-K2N, a flexible K-to-N modality translation framework for medical image synthesis, and MonSTeR, a unified model for motion, scene, and text retrieval.
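To make the policy-gradient adversarial idea behind approaches like PGMEL more concrete, the sketch below shows a generic REINFORCE-style generator/discriminator loop for selecting an entity candidate given a fused multimodal mention embedding. The module names, dimensions, toy inputs, and the use of the discriminator score as reward are illustrative assumptions, not details taken from the PGMEL paper.

```python
# Minimal sketch of policy-gradient (REINFORCE-style) adversarial training for
# candidate selection, in the spirit of a policy-gradient GAN. All names,
# dimensions, and the reward definition are illustrative assumptions, not the
# actual PGMEL architecture.
import torch
import torch.nn as nn

class CandidateGenerator(nn.Module):
    """Policy network: scores candidate entities for a fused mention embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, mention, candidates):
        # mention: (dim,) fused text+image features; candidates: (num_candidates, dim)
        pairs = torch.cat([mention.unsqueeze(0).expand_as(candidates), candidates], dim=-1)
        return self.scorer(pairs).squeeze(-1)  # unnormalized selection logits

class Discriminator(nn.Module):
    """Judges whether a (mention, entity) pair looks like a true link."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, mention, entity):
        return torch.sigmoid(self.net(torch.cat([mention, entity], dim=-1)))

dim, num_candidates = 256, 8
gen, disc = CandidateGenerator(dim), Discriminator(dim)
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

# Toy inputs: in practice these would come from text/image encoders and a
# candidate retrieval step; here they are random placeholders.
mention = torch.randn(dim)
candidates = torch.randn(num_candidates, dim)
gold_idx = 0  # index of the ground-truth entity among the candidates

# Generator step: sample a candidate from the policy and use the
# discriminator's score as a (non-differentiable) reward.
dist = torch.distributions.Categorical(logits=gen(mention, candidates))
action = dist.sample()
reward = disc(mention, candidates[action]).detach()
g_loss = -dist.log_prob(action) * reward  # REINFORCE: maximize expected reward
g_opt.zero_grad()
g_loss.backward()
g_opt.step()

# Discriminator step: real (gold) pair vs. the generator-sampled pair.
real = disc(mention, candidates[gold_idx])
fake = disc(mention, candidates[action])
d_loss = -(torch.log(real + 1e-8) + torch.log(1.0 - fake + 1e-8))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()
```

Because the candidate choice is a discrete sample, gradients cannot flow from the discriminator into the generator directly; the REINFORCE estimator (log-probability weighted by reward) is the standard workaround in this setting.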