The field of human animation and interaction is advancing rapidly, with a focus on creating more realistic and interactive experiences. A key trend is the use of diffusion-based models, which have produced strong results in generating high-quality video and animating human characters; they are being applied to more realistic facial expressions, body movements, and interactions between humans and objects.
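To make the diffusion-based generation idea concrete, the sketch below shows a minimal DDPM-style reverse (denoising) loop over a video latent. The `denoiser` callable, its conditioning input, and the noise-schedule values are illustrative assumptions and do not correspond to any specific paper listed here.

```python
import torch

def sample_video_diffusion(denoiser, cond, num_frames=16, height=64, width=64,
                           channels=4, steps=50, device="cpu"):
    """Minimal DDPM-style reverse process over a video latent.

    `denoiser` is assumed to predict the noise added to the latent,
    conditioned on `cond` (e.g. audio or text embeddings) and a timestep.
    """
    # Linear beta schedule (illustrative values, not tuned).
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise in latent space.
    x = torch.randn(1, num_frames, channels, height, width, device=device)

    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch, cond)  # predicted noise at step t
        alpha, alpha_bar = alphas[t], alpha_bars[t]

        # DDPM posterior mean; add fresh noise except at the final step.
        x = (x - (1 - alpha) / torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # denoised video latent; a separate decoder would map it to frames
```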
Another line of research develops frameworks that generate video involving multiple concepts, such as humans and objects, under conditioning signals like audio. These frameworks enable more complex and interactive videos, with applications in e-commerce, education, and entertainment.
Attention mechanisms and multimodal fusion are also increasingly used, allowing more precise control over video generation and enabling the incorporation of multiple information sources, such as text, audio, and images.
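As a rough illustration of how such fusion is often wired, the sketch below uses cross-attention so that video tokens attend to concatenated text and audio condition tokens. The module, dimensions, and shapes are hypothetical and only show the general pattern, not the architecture of any particular work.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Toy cross-attention block: video tokens (queries) attend to
    concatenated text and audio tokens (keys/values), a common
    multimodal-fusion pattern in conditional video generators."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens, audio_tokens):
        # Conditions from different modalities are concatenated along the
        # sequence axis; the fused result is added back residually.
        cond = torch.cat([text_tokens, audio_tokens], dim=1)
        fused, _ = self.attn(query=video_tokens, key=cond, value=cond)
        return self.norm(video_tokens + fused)

# Example shapes: batch of 2, 128 video tokens, 16 text + 32 audio tokens.
block = CrossModalAttention(dim=256, num_heads=4)
out = block(torch.randn(2, 128, 256), torch.randn(2, 16, 256), torch.randn(2, 32, 256))
print(out.shape)  # torch.Size([2, 128, 256])
```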
Notable papers in this area include:
- LLIA, which presents a novel audio-driven portrait video generation framework that achieves low-latency and high-fidelity output.
- ChronoTailor, which introduces a diffusion-based framework for fine-grained video virtual try-on that generates temporally consistent videos while preserving garment details.
- HunyuanVideo-HOMA, which proposes a weakly conditioned, multimodal-driven framework for generic human-object interaction in human animation.
- HopaDIFF, which pioneers textual reference-guided human action segmentation in multi-person settings and achieves state-of-the-art results on the RHAS133 dataset.
- InterActHuman, which introduces a framework for multi-concept human animation with layout-aligned audio conditions, enabling controllable generation of multi-concept human-centric videos.
- Controllable Expressive 3D Facial Animation, which presents a diffusion-based framework for controllable, expressive 3D facial animation in a unified multimodal space.
- DreamActor-H1, which proposes a motion-designed Diffusion Transformer framework for high-fidelity human-product demonstration video generation.