The field of visual content generation is evolving rapidly, with a focus on improving the quality and coherence of generated images and 3D models. Recent developments center on multimodal large language models (MLLMs) and novel evaluation metrics for assessing the semantic coherence and structural fidelity of generated content. Notably, researchers have proposed new approaches to text-to-3D generation, emotional image content generation, and scene composition structure evaluation. These advances could significantly impact applications such as virtual reality, computer-aided design, and generative art.
Some noteworthy papers in this area include Sel3DCraft, which introduces a visual prompt engineering system for text-to-3D generation that supports designers' creativity; CoEmoGen, which proposes a pipeline for emotional image content generation that leverages MLLMs and achieves superior emotional faithfulness and semantic coherence; and SCSSIM, which presents an image similarity metric for scene composition structure that quantifies structural fidelity and preserves non-object-based relationships.