The field of video representation learning is moving toward self-supervised approaches, which aim to learn effective video representations without manual annotation. This direction is driven by the need to process and analyze large volumes of video data efficiently. Recent work improves both the efficiency and the effectiveness of self-supervised methods through techniques such as masked-embedding autoencoders, adversarial frame sampling, and multimodal correlations. Notable advances include the ability to process longer videos, stronger retrieval performance, and better generalization across downstream tasks.

Noteworthy papers include LV-MAE, which introduces a self-supervised learning framework for long-video representation and achieves state-of-the-art results on several benchmarks, and AutoSSVH, which proposes a self-supervised video hashing framework employing adversarial frame sampling and hash-based contrastive learning to strengthen encoding capability.
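The masked-embedding autoencoder idea mentioned above can be sketched in a few lines: precomputed segment embeddings stand in for short-clip features, a fraction of them are hidden behind a mask token, and a model is trained to reconstruct the hidden ones. The NumPy sketch below is a hypothetical illustration of only the masking and loss bookkeeping (with an identity "decoder" where a real model would use a transformer); it is not LV-MAE's actual architecture, and all names and shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_embeddings(embeddings, mask_ratio, mask_token, rng):
    """Randomly replace a fraction of segment embeddings with a mask token.

    Returns the corrupted sequence and a boolean mask marking which
    positions were hidden (the reconstruction targets).
    """
    num_segments = embeddings.shape[0]
    num_masked = int(round(mask_ratio * num_segments))
    masked_idx = rng.choice(num_segments, size=num_masked, replace=False)
    mask = np.zeros(num_segments, dtype=bool)
    mask[masked_idx] = True
    corrupted = embeddings.copy()
    corrupted[mask] = mask_token
    return corrupted, mask

def reconstruction_loss(predicted, targets, mask):
    """Mean squared error, computed only over the masked positions."""
    diff = predicted[mask] - targets[mask]
    return float(np.mean(diff ** 2))

# 16 segment embeddings of dimension 8 (stand-ins for features a
# pretrained short-clip encoder might produce for a long video).
segments = rng.normal(size=(16, 8))
mask_token = np.zeros(8)

corrupted, mask = mask_embeddings(segments, mask_ratio=0.5,
                                  mask_token=mask_token, rng=rng)

# A real model would run `corrupted` through a transformer and predict
# the hidden embeddings; the identity pass-through here just shows how
# the loss is restricted to masked positions.
loss = reconstruction_loss(corrupted, segments, mask)
print(mask.sum())   # prints 8: half of the 16 positions were masked
print(loss > 0)     # prints True: masked positions no longer match targets
```

Training on embeddings rather than raw pixels is what makes the approach attractive for long videos: the sequence length seen by the autoencoder is the number of segments, not the number of frames.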
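Self-supervised video hashing of the kind AutoSSVH targets serves retrieval: each video is encoded as a short binary code so candidates can be ranked by Hamming distance instead of comparing dense embeddings. The snippet below is a minimal, hypothetical illustration of that retrieval step (sign-based binarization over random stand-in embeddings); it does not implement AutoSSVH's adversarial frame sampling or its contrastive training, and all names are assumptions.

```python
import numpy as np

def binarize(embeddings):
    """Map real-valued embeddings to {0, 1} hash codes via the sign."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)

def hamming_distances(query_code, database_codes):
    """Count differing bits between the query code and each database code."""
    return np.count_nonzero(database_codes != query_code, axis=1)

rng = np.random.default_rng(1)
database = rng.normal(size=(100, 32))  # stand-in embeddings for 100 videos
db_codes = binarize(database)          # 32-bit hash code per video

# A query whose embedding equals database item 42, so its code matches too.
query_code = binarize(database[42])

distances = hamming_distances(query_code, db_codes)
ranking = np.argsort(distances, kind="stable")  # nearest codes first
print(int(distances[42]))  # prints 0: identical codes differ in no bits
```

Because the codes are binary, the whole ranking reduces to XOR-and-popcount operations, which is why hashing scales to large video collections.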