Evaluating Gesture and Speech Generation

The field of gesture and speech generation is moving toward more rigorous evaluation practices, with a focus on standardized protocols and metrics for assessing the quality and realism of generated outputs. Recent work highlights the importance of human-centered benchmarking and of more nuanced evaluation strategies that account for multiple aspects of gesture and speech, such as motion realism, speech-gesture alignment, and paralinguistic cues. Noteworthy papers in this area include:

Gesture Generation (Still) Needs Improved Human Evaluation Practices, which introduces a detailed human evaluation protocol for gesture-generation models and presents strong evidence that motion quality and multimodal alignment should be assessed separately rather than collapsed into a single score.

Speech-DRAME, a unified framework for human-aligned benchmarks in speech role-play, which substantially outperforms zero-shot and few-shot audio large language models and provides a comprehensive foundation for assessing spoken role-play.

THEval, an evaluation framework for talking head video generation comprising 8 metrics covering quality, naturalness, and synchronization, which serves as a benchmark for tracking improvements in generative methods.
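To make the idea of disentangled assessment concrete, here is a minimal illustrative sketch, not taken from any of the cited papers: hypothetical per-aspect human ratings (e.g., Likert scores for motion quality and for speech-gesture alignment) are aggregated and reported separately, so strong motion realism cannot mask weak alignment.

```python
import statistics

# Hypothetical ratings from two separate questionnaires (1-5 Likert scale):
# one judging motion quality in isolation, one judging speech-gesture alignment.
ratings = {
    "motion_quality": [4, 5, 3, 4, 4, 5, 3],
    "speech_gesture_alignment": [2, 3, 3, 2, 4, 3, 2],
}

# Report each aspect on its own instead of averaging them into one number,
# which is the core of a disentangled evaluation protocol.
for aspect, scores in ratings.items():
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    print(f"{aspect}: mean={mean:.2f}, sd={sd:.2f}, n={len(scores)}")
```

In this toy example the model would score well on motion quality but poorly on alignment, a distinction that a single combined score would hide.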

Sources

Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms

THEval. Evaluation Framework for Talking Head Video Generation
