The field of text-to-3D generation is moving toward more robust and accurate methods for producing high-quality 3D outputs, particularly for out-of-domain or rare concepts. Recent developments leverage pretrained 2D diffusion priors, multiview consistency, and commonsense priors learned from videos to improve 3D consistency, photorealism, and text adherence. Notable advances include retrieval-augmented diffusion models, structural energy-guided sampling, and large-scale video datasets with multiview-level annotations. Together, these innovations stand to push the state of the art in text-to-3D generation and enable more realistic and plausible 3D content creation. Noteworthy papers include MV-RAG, which proposes a text-to-3D pipeline that retrieves relevant 2D images and conditions a multiview diffusion model on them, and Droplet3D, which introduces a large-scale video dataset with multiview-level annotations and trains a generative model that accepts both image and dense text input.
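To make the retrieval-augmented idea concrete, the sketch below shows the general shape of such a pipeline: embed the text prompt, retrieve the most similar 2D reference images from an image bank, and pass both the prompt and the references to a multiview generator. This is a minimal illustration under stated assumptions, not MV-RAG's actual implementation; the embedding dimensions, the image bank, and the functions `embed_text`, `retrieve_reference_images`, and `multiview_diffusion_sample` are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical retrieval bank: in a real system these would be images from a
# large 2D corpus paired with embeddings from a CLIP-style encoder; here they
# are random placeholders so the sketch is self-contained and runnable.
rng = np.random.default_rng(0)
IMAGE_BANK = [rng.standard_normal((64, 64, 3)) for _ in range(1000)]
BANK_EMBEDDINGS = rng.standard_normal((1000, 512))
BANK_EMBEDDINGS /= np.linalg.norm(BANK_EMBEDDINGS, axis=1, keepdims=True)


def embed_text(prompt: str) -> np.ndarray:
    """Stand-in for a text encoder; returns a unit-norm embedding."""
    seed = abs(hash(prompt)) % (2**32)
    v = np.random.default_rng(seed).standard_normal(512)
    return v / np.linalg.norm(v)


def retrieve_reference_images(prompt: str, k: int = 4) -> list[np.ndarray]:
    """Return the k bank images whose embeddings best match the prompt."""
    query = embed_text(prompt)
    scores = BANK_EMBEDDINGS @ query          # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [IMAGE_BANK[i] for i in top]


def multiview_diffusion_sample(prompt: str,
                               references: list[np.ndarray],
                               n_views: int = 4) -> np.ndarray:
    """Placeholder for a multiview diffusion model conditioned on both the
    prompt and the retrieved reference images; returns n_views images."""
    cond = float(np.mean([r.mean() for r in references]))  # toy conditioning signal
    return np.stack([np.full((64, 64, 3), cond) for _ in range(n_views)])


if __name__ == "__main__":
    prompt = "a vintage brass diving helmet"  # an out-of-domain concept
    refs = retrieve_reference_images(prompt, k=4)
    views = multiview_diffusion_sample(prompt, refs, n_views=4)
    print("retrieved references:", len(refs), "generated views:", views.shape)
```

The design point the sketch highlights is that retrieval happens before sampling: for rare concepts the retrieved 2D references supply visual evidence the diffusion prior may lack, and the generated views can then be lifted to 3D by whatever reconstruction stage the pipeline uses.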