Evaluating Gesture and Speech Generation

The field of gesture and speech generation is moving toward more rigorous evaluation practices, with a focus on standardized protocols and metrics for assessing the quality and realism of generated outputs. Recent work highlights the importance of human-centered benchmarking and of more nuanced evaluation strategies that account for multiple aspects of gesture and speech, such as motion realism, speech-gesture alignment, and paralinguistic cues. Noteworthy papers in this area include:

Gesture Generation (Still) Needs Improved Human Evaluation Practices, which introduces a detailed human evaluation protocol for gesture-generation models and presents strong evidence that motion quality and multimodal alignment should be assessed separately rather than collapsed into a single score.

Speech-DRAME, a unified framework for human-aligned benchmarks in speech role-play, which substantially outperforms zero-shot and few-shot audio large language models and provides a comprehensive foundation for assessing spoken role-play.

THEval, an evaluation framework for talking head video generation comprising 8 metrics covering quality, naturalness, and synchronization, which serves as a benchmark for tracking improvements in generative methods.
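To make the idea of disentangled assessment concrete, here is a minimal illustrative sketch, not taken from any of the cited papers: hypothetical per-aspect human ratings (e.g., Likert scores for motion quality and for speech-gesture alignment) are aggregated and reported separately, so strong motion realism cannot mask weak alignment.

```python
import statistics

# Hypothetical ratings from two separate questionnaires (1-5 Likert scale):
# one judging motion quality in isolation, one judging speech-gesture alignment.
ratings = {
    "motion_quality": [4, 5, 3, 4, 4, 5, 3],
    "speech_gesture_alignment": [2, 3, 3, 2, 4, 3, 2],
}

# Report each aspect on its own instead of averaging them into one number,
# which is the core of a disentangled evaluation protocol.
for aspect, scores in ratings.items():
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    print(f"{aspect}: mean={mean:.2f}, sd={sd:.2f}, n={len(scores)}")
```

In this toy example the model would score well on motion quality but poorly on alignment, a distinction that a single combined score would hide.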

Sources

Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms

THEval. Evaluation Framework for Talking Head Video Generation
