The field of digital human modeling and animation is evolving rapidly, with a focus on creating more realistic and interactive digital humans. Recent work centers on improving the accuracy and control of facial animation and on enabling real-time interaction. Notable advances include the use of multimodal inputs, such as audio and text, to drive these interactions, and the development of novel architectures and loss functions that improve the fidelity and controllability of generated animations.
Some notable papers in this area include:
- X-Streamer, which introduces a unified framework for multimodal human world modeling, enabling real-time video calls driven by streaming multimodal inputs.
- StableDub, which proposes a novel framework for visual dubbing that integrates lip-habit-aware modeling with occlusion-robust synthesis, achieving superior performance in lip-habit resemblance and occlusion robustness.
- SIE3D, which generates expressive 3D avatars from a single image and descriptive text, enabling detailed control over expressions via text.
- 3DiFACE, which synthesizes and edits holistic 3D facial animation, allowing for editing via keyframing and interpolation.
- Audio Driven Real-Time Facial Animation for Social Telepresence, which presents an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality.
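To make the shared "audio-driven animation" pattern concrete, here is a minimal sketch of a model that regresses per-frame facial animation parameters from audio features. It is illustrative only and does not reproduce the architecture of any paper above; the class name, the mel-spectrogram input, and the 52-dimensional blendshape output are all assumptions chosen for the example.

```python
# Illustrative sketch: audio frames -> per-frame blendshape weights.
# Not the method of any cited paper; sizes and names are assumptions.
import torch
import torch.nn as nn

NUM_MEL_BINS = 80      # assumed mel-spectrogram feature size per audio frame
NUM_BLENDSHAPES = 52   # assumed ARKit-style blendshape rig

class AudioToBlendshapes(nn.Module):
    """Maps a sequence of audio frames to per-frame blendshape weights."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        # Temporal encoder over the audio feature sequence.
        self.encoder = nn.GRU(NUM_MEL_BINS, hidden, batch_first=True)
        # Per-frame regression head; Sigmoid keeps weights in [0, 1].
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_BLENDSHAPES),
            nn.Sigmoid(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, NUM_MEL_BINS)
        # returns: (batch, frames, NUM_BLENDSHAPES)
        features, _ = self.encoder(mel)
        return self.head(features)

if __name__ == "__main__":
    model = AudioToBlendshapes()
    dummy_audio = torch.randn(1, 100, NUM_MEL_BINS)  # 100 audio frames
    weights = model(dummy_audio)
    print(weights.shape)  # torch.Size([1, 100, 52])
```

In a real-time telepresence setting, a model of this kind would typically be run in a streaming fashion over short audio windows, with the predicted weights driving a photorealistic avatar rig each frame; the papers above differ mainly in how they encode the audio, what animation representation they predict, and how they handle latency, occlusion, and controllability.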