Skeleton-based Human Action Recognition and Synthesis

Skeleton-based human action recognition is moving toward transformer models and masked pretraining frameworks as the main routes to better representation learning. Recent work focuses on architectures and frameworks that learn generalizable skeleton representations efficiently and achieve state-of-the-art performance on a range of downstream tasks. There is also growing interest in multi-scale representations, cross-sequence variations, and hierarchical graph attention mechanisms for improving human action segmentation and synthesis.

Some noteworthy papers include:

CascadeFormer proposes a two-stage cascading transformer framework for skeleton-based human action recognition.

Towards Efficient General Feature Prediction in Masked Skeleton Modeling introduces a General Feature Prediction framework for efficient masked skeleton modeling.

DuoCLR proposes a contrastive representation learning framework for human action segmentation, pre-trained on trimmed skeleton sequences.

Cortex-Synth presents an end-to-end differentiable framework for joint synthesis of 3D skeleton geometry and topology from single 2D images.

Sources

CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition

Towards Efficient General Feature Prediction in Masked Skeleton Modeling

DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation

Cortex-Synth: Differentiable Topology-Aware 3D Skeleton Synthesis with Hierarchical Graph Attention
