Multimodal Models for Front-end Engineering and Beyond

The field of multimodal models is rapidly advancing, with active work on front-end engineering, recommendation systems, and unified perception and generation. Recent developments include comprehensive benchmarks such as DesignBench, a multi-framework, multi-task evaluation suite that assesses multimodal large language models (MLLMs) on front-end tasks including code generation, editing, and repair. Ming-Omni is a unified multimodal model that processes images, text, audio, and video, demonstrating strong proficiency in both speech and image generation. Pisces is an auto-regressive multimodal foundation model that tackles the challenge of unifying image understanding and generation through a decoupled visual encoding architecture and tailored training techniques.
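To make the benchmark idea concrete, the sketch below shows what a DesignBench-style evaluation loop over front-end tasks could look like. It is a minimal illustration under stated assumptions: the `FrontEndTask` schema, the framework names, and the `score_fn` hook are hypothetical and do not reflect the benchmark's actual data format or metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical task record for a DesignBench-style front-end benchmark.
# Field names and framework choices are illustrative assumptions, not the
# benchmark's actual schema.
@dataclass
class FrontEndTask:
    task_type: str      # "generation", "editing", or "repair"
    framework: str      # e.g. "react" or "vue" (illustrative)
    design_image: str   # path to the reference UI screenshot
    source_code: str    # starting code ("" for pure generation tasks)
    instruction: str    # natural-language edit/repair request


def evaluate(model: Callable[[FrontEndTask], str],
             tasks: List[FrontEndTask],
             score_fn: Callable[[FrontEndTask, str], float]) -> Dict[str, float]:
    """Run an MLLM over the tasks and average scores per task type."""
    totals: Dict[str, List[float]] = {}
    for task in tasks:
        generated_code = model(task)            # MLLM produces front-end code
        score = score_fn(task, generated_code)  # e.g. visual/structural similarity
        totals.setdefault(task.task_type, []).append(score)
    return {t: sum(s) / len(s) for t, s in totals.items()}
```

A real harness would additionally render the generated code per framework and compare it against the reference design; the per-task-type averages returned here simply mirror the generation/editing/repair split described above.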

Sources

DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Serendipitous Recommendation with Multimodal LLM

FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

Ming-Omni: A Unified Multimodal Model for Perception and Generation

MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed

MLLM-Based UI2Code Automation Guided by UI Layout Information

Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
