The field of speech synthesis and evaluation is evolving rapidly, driven by efforts to improve the naturalness and intelligibility of generated speech. Researchers are exploring new architectures and techniques, such as conditional diffusion models and consistency Schrödinger bridges, to raise the quality of singing voice synthesis and text-to-speech systems. There is also growing emphasis on developing more robust evaluation metrics that accurately assess the intelligibility and quality of synthesized speech. Noteworthy papers include VS-Singer, which generates stereo singing voices with room reverberation conditioned on scene images, and SmoothSinger, which synthesizes high-quality singing voices with a conditional diffusion model. The TTSDS2 metric has also been proposed as a more robust way to evaluate human-quality text-to-speech systems.
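To make the conditional-diffusion idea behind models like SmoothSinger concrete, the sketch below shows a generic DDPM-style reverse (denoising) step that takes a conditioning embedding, such as a score or lyrics representation, as an extra input. This is a minimal illustration under standard DDPM assumptions, not the actual SmoothSinger implementation; `denoise_fn`, `cond`, and the array shapes are hypothetical placeholders.

```python
# Illustrative conditional DDPM reverse step (NumPy only).
# All names here are placeholders, not any paper's API.
import numpy as np

def ddpm_reverse_step(x_t, t, denoise_fn, cond, betas):
    """One reverse-diffusion step: the model predicts the noise given the
    conditioning, then x_{t-1} is sampled from the standard DDPM posterior."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    eps_hat = denoise_fn(x_t, t, cond)            # predicted noise, conditioned
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])

    if t > 0:
        noise = np.random.randn(*x_t.shape)
        return mean + np.sqrt(betas[t]) * noise   # stochastic intermediate step
    return mean                                   # final step is deterministic

# Toy usage: walk a mel-spectrogram-shaped array from pure noise back to a
# sample, conditioned on a dummy embedding (a trained network would replace
# the zero-returning stand-in below).
T = 50
betas = np.linspace(1e-4, 0.02, T)
denoise_fn = lambda x, t, c: np.zeros_like(x)     # stand-in for the trained model
x = np.random.randn(80, 128)                      # (mel bins, frames)
cond = np.random.randn(256)                       # score/lyrics embedding
for t in reversed(range(T)):
    x = ddpm_reverse_step(x, t, denoise_fn, cond, betas)
```

In a real singing voice synthesizer the conditioning would carry musical score, phoneme, and pitch information, and the denoising network would be a neural model rather than the trivial stand-in used here.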