Text-to-3D Generation Advances

The field of text-to-3D generation is moving toward more robust and accurate methods for producing high-quality 3D outputs, particularly for out-of-domain or rare concepts. Recent work has focused on leveraging pretrained 2D diffusion priors, multiview consistency, and commonsense priors from videos to improve 3D consistency, photorealism, and text adherence. Notable advances include retrieval-augmented diffusion models, structural energy-guided sampling, and large-scale video datasets with multi-view-level annotations. These innovations have the potential to significantly improve the state of the art in text-to-3D generation and to enable more realistic and plausible 3D content creation. Noteworthy papers include MV-RAG, which proposes a text-to-3D pipeline that retrieves relevant 2D images and conditions a multiview diffusion model on them, and Droplet3D, which introduces a large-scale video dataset with multi-view-level annotations and trains a generative model supporting both image and dense text input.
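
To make the retrieval-augmented idea behind MV-RAG concrete, below is a minimal, self-contained sketch of the general pattern described above: embed the text prompt, retrieve the most similar 2D images from an image bank, and pass them as additional conditioning to a multiview diffusion model. All names here (embed_text, retrieve_images, MultiviewDiffusionStub) are hypothetical placeholders for illustration only and do not reflect the paper's actual architecture or interfaces.

```python
import numpy as np

def embed_text(text: str, dim: int = 512) -> np.ndarray:
    # Placeholder encoder (a CLIP-style text tower in practice); a deterministic
    # seeded random vector stands in for a real embedding here.
    rng = np.random.default_rng(sum(map(ord, text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve_images(query: np.ndarray, image_embs: np.ndarray, k: int = 4) -> np.ndarray:
    # Cosine-similarity nearest neighbours over a bank of precomputed 2D image embeddings.
    sims = image_embs @ query
    return np.argsort(-sims)[:k]

class MultiviewDiffusionStub:
    # Stand-in for a multiview diffusion model that accepts retrieved-image
    # conditioning; a real model would jointly denoise latents for several views.
    def generate(self, prompt_emb: np.ndarray, retrieved_embs: np.ndarray, n_views: int = 4):
        cond = np.concatenate([prompt_emb[None, :], retrieved_embs], axis=0).mean(axis=0)
        # Return dummy "views" derived from the fused conditioning signal.
        return [cond + 0.01 * np.random.default_rng(i).standard_normal(cond.shape)
                for i in range(n_views)]

if __name__ == "__main__":
    prompt = "a bronze statue of a narwhal riding a unicycle"  # rare, out-of-domain concept
    image_bank = np.stack([embed_text(f"image_{i}") for i in range(1000)])  # mock 2D image DB

    q = embed_text(prompt)
    idx = retrieve_images(q, image_bank, k=4)
    views = MultiviewDiffusionStub().generate(q, image_bank[idx])
    print(f"retrieved indices: {idx.tolist()}, generated {len(views)} views")
```

The design intent this sketch illustrates is that retrieval supplies visual evidence for concepts underrepresented in the diffusion model's training data, so the multiview generator can stay consistent with real appearances rather than hallucinating them.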

Sources

MV-RAG: Retrieval Augmented Multiview Diffusion

Structural Energy-Guided Sampling for View-Consistent Text-to-3D

MonoRelief V2: Leveraging Real Data for High-Fidelity Monocular Relief Recovery

Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
