Computer vision and graphics research continues to advance text-to-image and scene synthesis, with recent models generating increasingly high-quality images and 3D scenes from text descriptions. These capabilities stand to benefit applications such as urban design, architecture, and digital content creation. Two trends stand out: integrating large language models with multimodal diffusion models makes the design process more adaptive and controllable, and explicit spatial reasoning over the relative composition of objects improves the accuracy and flexibility of scene synthesis (a schematic sketch of this text-to-layout-to-generation pattern follows the paper list below). Noteworthy papers include:
- ComposeAnything, which introduces a framework for improving compositional image generation through chain-of-thought reasoning and spatial-controlled denoising.
- ReSpace, which presents a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models and a compact structured scene representation.
- FreeScene, which enables both convenient and effective control over indoor scene synthesis via a Mixed Graph Diffusion Transformer.
- PartComposer, which learns and composes part-level concepts from single-image examples, enabling text-to-image diffusion models to create novel objects from meaningful components.
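The common pattern behind several of these systems is a two-stage pipeline: a language model first plans a coarse object layout from the text prompt, and that layout then spatially conditions a generative model. The sketch below illustrates this pattern only at a schematic level; the function names (`plan_layout`, `layout_to_mask`), the hard-coded example layout, and the mask-based conditioning are illustrative assumptions, not the actual method of ComposeAnything, ReSpace, FreeScene, or PartComposer.

```python
# Schematic sketch of a text -> layout -> spatially conditioned generation pipeline.
# All names and the example layout are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ObjectBox:
    label: str
    x: float  # normalized [0, 1] box coordinates
    y: float
    w: float
    h: float

def plan_layout(prompt: str) -> List[ObjectBox]:
    """Stand-in for the LLM planning step: map a text prompt to a coarse
    object layout. A real system would query a language model here."""
    # Hard-coded example layout for illustration only.
    return [
        ObjectBox("sofa", x=0.10, y=0.55, w=0.45, h=0.30),
        ObjectBox("lamp", x=0.70, y=0.30, w=0.12, h=0.45),
    ]

def layout_to_mask(layout: List[ObjectBox], size: int = 64) -> np.ndarray:
    """Rasterize the planned boxes into per-object occupancy masks that a
    spatially controlled generator could consume as conditioning."""
    mask = np.zeros((len(layout), size, size), dtype=np.float32)
    for i, box in enumerate(layout):
        x0, y0 = int(box.x * size), int(box.y * size)
        x1, y1 = int((box.x + box.w) * size), int((box.y + box.h) * size)
        mask[i, y0:y1, x0:x1] = 1.0
    return mask

if __name__ == "__main__":
    layout = plan_layout("a living room with a sofa and a floor lamp")
    cond = layout_to_mask(layout)
    # `cond` would be passed to the generator's denoising loop as spatial guidance.
    print(cond.shape, cond.sum(axis=(1, 2)))
```

The design choice worth noting is the separation of concerns: the language model handles semantic and spatial reasoning in a compact, editable representation, while the diffusion or autoregressive generator handles appearance, which is what makes the resulting pipelines both controllable and editable.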