Human Motion Generation and Video Synthesis

The field of human motion generation and video synthesis is advancing rapidly, with a focus on models that faithfully capture motion dynamics and produce high-quality video. Recent work explores contrastive learning, diffusion processes, and masked autoregressive modeling to improve motion fidelity and control. These approaches yield realistic human motions and videos, with applications in animation, robotics, and virtual reality. New attention mechanisms and disentanglement modules enable more precise control over video content and improve generation quality. Overall, the field is moving toward more robust and controllable models that can synthesize complex human motion and video.

Noteworthy papers include: MoCLIP, which fine-tunes CLIP with an additional motion encoding head to improve text-to-motion alignment; Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion, which combines masked modeling with diffusion processes to generate motion from frame-level continuous representations; and LMP, which harnesses the generative capabilities of pre-trained diffusion transformers so that motion in generated videos can reference user-provided motion videos.
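
To make the contrastive text-to-motion alignment idea concrete, the sketch below shows a CLIP-style setup in which a small motion encoding head projects motion sequences into the text embedding space and is trained with a symmetric InfoNCE loss. This is a minimal illustration of the general technique only: the module names (MotionEncodingHead, contrastive_alignment_loss), the temporal average pooling, the dimensions, and the use of precomputed text features are all assumptions for the example, not MoCLIP's actual architecture or training recipe.

```python
# Minimal sketch (PyTorch) of CLIP-style contrastive alignment between text and
# motion embeddings. All names, dimensions, and the pooling scheme are
# illustrative assumptions, not the MoCLIP paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionEncodingHead(nn.Module):
    """Hypothetical head mapping a motion sequence (B, T, D) into the text embedding space."""

    def __init__(self, motion_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(motion_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        pooled = motion.mean(dim=1)                    # temporal average pooling (assumption)
        return F.normalize(self.proj(pooled), dim=-1)  # unit-norm motion embedding


def contrastive_alignment_loss(motion_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over matched (motion, text) pairs within a batch."""
    text_emb = F.normalize(text_emb, dim=-1)
    logits = motion_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, T, motion_dim, embed_dim = 8, 60, 66, 512       # e.g. 22 joints x 3 coordinates
    head = MotionEncodingHead(motion_dim, embed_dim)
    motions = torch.randn(B, T, motion_dim)
    text_embeddings = torch.randn(B, embed_dim)        # stand-in for CLIP text features
    loss = contrastive_alignment_loss(head(motions), text_embeddings)
    print(loss.item())
```

In a setup like this, only the motion head needs to be learned from scratch, while the text side can reuse (and, as MoCLIP's title suggests, fine-tune or distill) a pretrained CLIP text encoder so that motion embeddings land in an already well-structured language space.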

Sources

MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation

Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion

MARRS: Masked Autoregressive Unit-based Reaction Synthesis

Large-Scale Multi-Character Interaction Synthesis

LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

Interspatial Attention for Efficient 4D Human Video Generation

M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion
