The field of action detection and recognition is moving toward more open-vocabulary and view-invariant approaches. Researchers are exploring ways to reduce reliance on parameter-heavy architectures and large-scale datasets, focusing instead on more efficient and adaptable models. A key challenge in this area is handling extreme viewpoint differences and occlusions, and several solutions have been proposed, notably curriculum learning procedures paired with knowledge distillation objectives for learning view-invariant representations. There is also growing interest in weakly-supervised and few-shot learning methods, which aim to reduce the need for extensive labeled data. Overall, the field is shifting toward more flexible and generalizable models that can handle diverse and complex video data. Noteworthy papers include:
- Scaling Open-Vocabulary Action Detection, which introduces an encoder-only multimodal model and a new benchmark for open-vocabulary action detection.
- Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation, which learns view-invariant representations through a curriculum-based knowledge distillation procedure.
- Temporal Alignment-Free Video Matching for Few-shot Action Recognition, which introduces a novel approach for few-shot action recognition that eliminates the need for temporal units in action representation.
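To make the distillation idea above concrete, here is a minimal sketch of a generic knowledge-distillation objective: a student network seeing a hard (e.g. extreme-viewpoint) input is trained to match the temperature-softened class distribution of a teacher that sees an easier view. This is a standard KD loss for illustration only; the curriculum schedule and exact objective in the cited paper differ, and all names below are hypothetical.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) between temperature-softened distributions.

    A generic knowledge-distillation objective, not the exact loss used in
    the papers summarized above.
    """
    p = softmax(teacher_logits, temperature)            # soft teacher targets
    log_q = np.log(softmax(student_logits, temperature))
    log_p = np.log(p)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

# Toy example: a "teacher" that saw an easy (near-frontal) view guides a
# "student" that saw a hard (extreme) view of the same action clip.
teacher_logits = np.array([[4.0, 1.0, 0.5]])   # confident about class 0
student_logits = np.array([[2.0, 1.5, 1.0]])   # less certain student
loss = distillation_loss(student_logits, teacher_logits)
```

In a curriculum variant, training would start from view pairs with small viewpoint gaps and gradually introduce more extreme ones, keeping the distillation target informative at every stage.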