The field of text-to-image generation is advancing rapidly, with much of the current work aimed at improving how models are evaluated and how fairly they behave. Researchers are exploring new evaluation methods, including multi-modal language models used as automatic judges and benchmarks that probe world-knowledge grounding and implicit inferential capabilities. There is also growing concern about the cultural biases these models encode, driving efforts to build more inclusive and diverse datasets.
Noteworthy papers in this area include:
- Multi-Modal Language Models as Text-to-Image Model Evaluators, which presents an evaluation framework that uses multi-modal language models to assess prompt-generation consistency and image aesthetics (a simplified sketch of this style of automatic evaluation appears after this list).
- Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models, which benchmarks a metric for evaluating the fidelity of image generation across cultural contexts and offers insights into architectural and data-centric interventions for improving cultural inclusivity.
- WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation, which introduces a benchmark designed to systematically evaluate text-to-image models' world knowledge grounding and implicit inferential capabilities.
- Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition, which proposes a novel method that leverages retrieval-augmented generation and domain-specific large language models to produce precise multipart descriptions for sign language recognition.
- CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts, which introduces a novel benchmark designed to evaluate the robustness of large language models on code generation from code-mixed prompts.
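To make the evaluation theme concrete, the sketch below shows one common way a multi-modal language model can be used to score prompt-generation consistency: decompose the prompt into checkable statements and ask the model a yes/no question per statement. This is an illustrative approximation, not the cited framework's actual method; its prompting, scoring, and aesthetics assessment may differ, and `ask_mllm` here is a hypothetical stand-in for whichever model an evaluator would query.

```python
"""Minimal sketch of VQA-style prompt-image consistency scoring.

Illustrative only: `ask_mllm` is a hypothetical callable standing in for
an actual multi-modal language model; the cited paper's evaluator may work
differently.
"""
from typing import Callable, Sequence


def consistency_score(
    image_path: str,
    prompt_elements: Sequence[str],
    ask_mllm: Callable[[str, str], str],
) -> float:
    """Fraction of prompt elements the model judges to be present in the image.

    `prompt_elements` are short factual statements decomposed from the prompt,
    e.g. "a red bicycle is present", "the scene is set at night".
    """
    if not prompt_elements:
        return 0.0
    hits = 0
    for element in prompt_elements:
        question = f"Answer yes or no: in this image, {element}?"
        answer = ask_mllm(image_path, question).strip().lower()
        hits += answer.startswith("yes")  # count affirmative judgments
    return hits / len(prompt_elements)


if __name__ == "__main__":
    # Stubbed model for demonstration; a real evaluator would call an MLLM.
    fake_mllm = lambda image, question: "yes" if "bicycle" in question else "no"
    score = consistency_score(
        "generated.png",
        ["a red bicycle is present", "the scene is set at night"],
        fake_mllm,
    )
    print(f"consistency: {score:.2f}")  # -> consistency: 0.50
```

A real evaluator would swap the stub for a call to an actual vision-language model and could average such per-image scores across a prompt set, alongside a separate judgment of image aesthetics.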