The field of large language models (LLMs) is moving towards a more nuanced understanding of creativity and its evaluation. Recent research highlights the importance of considering multiple dimensions of creativity, including quality, novelty, and diversity, and there is growing recognition that evaluation frameworks need to assess LLM performance more holistically rather than along a single axis. New diagnostic tools and benchmarks are helping researchers better understand the strengths and limitations of current LLMs. Noteworthy papers in this area include Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation, which proposes a three-step evaluation framework for assessing LLM-generated classical Chinese poetry; HypoSpace, which introduces a diagnostic suite that evaluates LLMs as set-valued hypothesis generators under underdetermination; and CreativityPrism, which proposes a holistic benchmark for evaluating LLM creativity across diverse scenarios.
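To make the novelty/diversity distinction concrete, the sketch below shows one common way such dimensions can be operationalized with embedding similarity. This is an illustrative assumption on my part, not the metric used by any of the papers above; the model name and function names are arbitrary choices.

```python
# Illustrative sketch: embedding-based proxies for two creativity dimensions.
# Assumes sentence-transformers is installed; the model choice is arbitrary.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def diversity_score(outputs: list[str]) -> float:
    """Diversity = 1 - mean pairwise cosine similarity among a model's outputs."""
    emb = model.encode(outputs, convert_to_numpy=True, normalize_embeddings=True)
    sims = emb @ emb.T
    n = len(outputs)
    # Average over off-diagonal pairs only (the diagonal is self-similarity = 1).
    pairwise_mean = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - float(pairwise_mean)

def novelty_score(output: str, reference_corpus: list[str]) -> float:
    """Novelty = 1 - similarity to the closest item in a reference corpus."""
    emb = model.encode([output] + reference_corpus, convert_to_numpy=True,
                       normalize_embeddings=True)
    sims = emb[0] @ emb[1:].T
    return 1.0 - float(sims.max())

# Example: three candidate generations and a small reference set.
candidates = ["Moonlight on the river", "Moonlight over the river", "A crow calls at dusk"]
references = ["Moonlight on the still river", "Autumn wind through the pines"]
print(diversity_score(candidates))          # higher = more varied outputs
print(novelty_score(candidates[0], references))  # higher = farther from references
```

Quality, by contrast, typically still requires human or LLM-as-judge ratings, which is part of why the holistic, multi-dimensional benchmarks summarized above are needed.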