The field of humanoid robotics and virtual avatars is advancing rapidly, with a focus on creating more realistic and expressive digital humans. Researchers are developing new datasets and frameworks that enable more nuanced facial expressions, body movements, and gesture synthesis, with the potential to improve human-robot interaction, virtual reality, and remote communication. Notable directions include enforcing choreographic consistency in music-to-dance generation, applying asynchronous latent consistency models to whole-body audio-driven avatars, and reconstructing expressive virtual avatars from multi-view videos.
Several papers are particularly noteworthy. X2C introduces a high-quality dataset for realistic humanoid facial expression imitation, and EVA presents an actor-specific, fully controllable, and expressive human avatar framework. AsynFusion and MatchDance propose novel frameworks for whole-body audio-driven avatar pose and expression generation and for music-to-dance generation, respectively. "Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On" and "Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction" also report state-of-the-art performance on virtual try-on tasks.