Co-Speech Gesture Generation Advances

The field of co-speech gesture generation is moving toward semantic, context-aware approaches: the goal is no longer gestures that are merely rhythmic, but gestures that are semantically coherent and relevant to the accompanying speech. This shift is evident in novel architectures that integrate semantic information at both fine-grained and global levels, enabling the synthesis of gestures that preserve example-specific characteristics while remaining congruent with the speech. Noteworthy papers in this direction include SemGes, which learns semantic coherence and relevance objectives for semantics-aware gesture generation; MECo, which leverages large language models for motion-example-controlled generation; GestureHYDRA, which introduces a hybrid-modality diffusion transformer with cascaded-synchronized retrieval-augmented generation for semantic gesture synthesis; and Real-time Generation of Various Types of Nodding, which predicts both the timing and type of listener nods in real time.
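
To make the shared conditioning pattern concrete, the sketch below shows a minimal diffusion-style denoiser for gesture sequences that fuses frame-level audio features (fine-grained conditioning) with a single utterance-level semantic embedding (global conditioning). This is an illustrative assumption, not the implementation of SemGes, MECo, or GestureHYDRA; all module names, dimensions, and the additive fusion scheme are invented for the example.

```python
# Minimal sketch of two-scale speech conditioning for gesture diffusion.
# Everything here (names, dims, fusion) is a hypothetical illustration.
import torch
import torch.nn as nn

class GestureDenoiser(nn.Module):
    def __init__(self, pose_dim=165, audio_dim=128, sem_dim=512,
                 d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.pose_in = nn.Linear(pose_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)  # fine-grained, per-frame speech cue
        self.sem_in = nn.Linear(sem_dim, d_model)      # global, utterance-level semantics
        self.time_in = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                     nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.pose_out = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_pose, audio_feats, sem_embed, t):
        # noisy_pose: (B, T, pose_dim), audio_feats: (B, T, audio_dim)
        # sem_embed: (B, sem_dim), t: (B,) diffusion timestep in [0, 1]
        h = self.pose_in(noisy_pose) + self.audio_in(audio_feats)  # frame-level fusion
        cond = self.sem_in(sem_embed) + self.time_in(t[:, None])   # global conditioning
        h = h + cond[:, None, :]                                   # broadcast over time
        return self.pose_out(self.backbone(h))                     # predicted noise

# Toy usage: one denoising step on random inputs.
model = GestureDenoiser()
B, T = 2, 60
eps_hat = model(torch.randn(B, T, 165), torch.randn(B, T, 128),
                torch.randn(B, 512), torch.rand(B))
print(eps_hat.shape)  # torch.Size([2, 60, 165])
```

In a full system this denoiser would be trained with a standard diffusion objective and sampled by iterative denoising; the point of the sketch is only the separation of per-frame and utterance-level conditioning signals.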

Sources

SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Real-time Generation of Various Types of Nodding for Avatar Attentive Listening System
