Advances in Multimodal Understanding and Generation

The field of multimodal understanding and generation is advancing rapidly, with a focus on models that can process and generate multiple forms of data, such as text, images, vector graphics, and 3D geometry. Recent research shows that large language models (LLMs) can serve as a foundation for multimodal understanding and generation, enabling applications such as 3D asset creation, vector graphics generation, and semantic editing of CAD objects. Notable papers in this area include DoorDet, which presents a semi-automated pipeline for constructing a multi-class door detection dataset, and UniSVG, which proposes a comprehensive dataset for unified SVG generation and understanding. LL3M and 3DFroMLLM demonstrate LLM-driven 3D asset creation and prototype generation, SVGen targets interpretable vector graphics generation, and B-repLer applies LLMs to semantic editing of B-rep CAD models. Beyond generation, SkeySpot automates service key detection in digital electrical layout plans, and STream3R scales sequential 3D reconstruction with a causal transformer. Together, these advances stand to enable new applications and improve existing ones, such as building compliance checking, indoor scene understanding, and digital electrical layout planning.
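A thread connecting SVGen, UniSVG, LL3M, and B-repLer is treating the LLM's text output as a structured, machine-interpretable artifact (SVG markup, a modeling script, CAD edits) that can be parsed, validated, and rendered. The sketch below illustrates that generate-then-validate loop for SVG; it is not taken from any of the papers, and the model name, prompt wording, and retry policy are illustrative assumptions.

```python
# Minimal sketch of a generate-then-validate loop for LLM-produced SVG,
# in the spirit of SVGen/UniSVG-style pipelines. Assumes an
# OpenAI-compatible chat API; the model name and prompt are placeholders.
import xml.etree.ElementTree as ET
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_svg(description: str, max_attempts: int = 3) -> str:
    """Ask an LLM for SVG markup and retry until it parses as well-formed XML."""
    prompt = (
        "Return only a complete <svg> document (no prose, no code fences) "
        f"depicting: {description}"
    )
    for _ in range(max_attempts):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any capable chat model works
            messages=[{"role": "user", "content": prompt}],
        )
        svg = response.choices[0].message.content.strip()
        try:
            root = ET.fromstring(svg)  # well-formedness check
            if root.tag.endswith("svg"):  # tolerate the SVG XML namespace
                return svg
        except ET.ParseError:
            continue  # malformed output; ask again
    raise RuntimeError("no well-formed SVG after retries")

print(generate_svg("a red circle above a blue square"))
```

The same pattern generalizes to the 3D setting: LL3M-style systems emit modeling scripts rather than SVG, so the validation step becomes executing the script in a sandboxed modeling environment instead of parsing XML.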

Sources

DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models

UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models

LL3M: Large Language 3D Modelers

3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs

SVGen: Interpretable Vector Graphics Generation with Large Language Models

B-repLer: Semantic B-rep Latent Editor using Large Language Models

SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry

STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
