The field of skeleton-based human action recognition is moving toward transformer models and masked pretraining frameworks to improve representation learning. Recent work focuses on architectures and frameworks that learn generalizable skeleton representations efficiently and achieve state-of-the-art performance on a range of downstream tasks. Notably, there is growing interest in multi-scale representations, cross-sequence variations, and hierarchical graph attention mechanisms to enhance human action segmentation and synthesis.
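As a concrete illustration of the masked-pretraining setup these methods build on, the sketch below randomly masks whole frames of a skeleton sequence before a model would be asked to reconstruct them or predict their features. The tensor layout (T frames, J joints, C coordinates), the 40% mask ratio, and the `mask_skeleton_sequence` helper are illustrative assumptions, not the procedure of any specific paper cited above.

```python
import numpy as np

def mask_skeleton_sequence(seq, mask_ratio=0.4, seed=0):
    """Randomly mask whole frames of a skeleton sequence.

    seq: array of shape (T, J, C) -- T frames, J joints, C coordinates.
    Returns the masked sequence and a boolean mask (True = masked frame).
    Hypothetical helper sketching the masked-pretraining input pipeline.
    """
    rng = np.random.default_rng(seed)
    T = seq.shape[0]
    n_masked = int(round(T * mask_ratio))
    masked_idx = rng.choice(T, size=n_masked, replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[masked_idx] = True
    out = seq.copy()
    out[mask] = 0.0  # replace masked frames with a zero "mask token"
    return out, mask

# Example: a 50-frame sequence with 17 joints in 3D
seq = np.random.default_rng(1).standard_normal((50, 17, 3))
masked_seq, mask = mask_skeleton_sequence(seq)
```

During pretraining, an encoder would see `masked_seq` and be trained to predict the original content (or derived features) at the positions where `mask` is True.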
Some noteworthy papers include: CascadeFormer, which proposes a two-stage cascading transformer framework for skeleton-based human action recognition; Towards Efficient General Feature Prediction in Masked Skeleton Modeling, which introduces a General Feature Prediction framework for efficient masked skeleton modeling; DuoCLR, which proposes a contrastive representation learning framework for human action segmentation, pretrained on trimmed skeleton sequences; and Cortex-Synth, which presents an end-to-end differentiable framework for joint 3D skeleton geometry and topology synthesis from single 2D images.
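Contrastive pretraining frameworks of the kind DuoCLR belongs to commonly optimize an InfoNCE-style objective that pulls embeddings of two augmented views of the same sequence together while pushing apart views of different sequences. The sketch below is a generic version of that loss; the function name, temperature value, and embedding shapes are assumptions for illustration, not DuoCLR's exact formulation.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Minimal InfoNCE loss between two views' embeddings, each (N, D).

    Row i of z1 and row i of z2 are embeddings of two augmented views of
    the same sequence (a positive pair); all other rows act as negatives.
    Generic sketch of contrastive skeleton pretraining, not a specific
    paper's objective.
    """
    # L2-normalize so the dot product is cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # log-softmax over each row; diagonal entries are the positive pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Perfectly aligned, mutually distinct views give a near-zero loss
loss = info_nce(np.eye(4), np.eye(4))
```

In practice the embeddings would come from a shared encoder applied to two augmentations (e.g. cropping or viewpoint perturbation) of each skeleton sequence in a batch.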