Advances in Self-Supervised Text Embeddings and Large Language Model Fine-Tuning

The field of natural language processing is witnessing a significant shift towards self-supervised learning and innovative fine-tuning techniques for large language models. Recent developments show that self-supervised training based on data augmentations can produce high-quality text embeddings that rival those obtained through extensive supervised fine-tuning. Moreover, new mechanisms such as token categorization and forgetting have been proposed to improve fine-tuning, allowing models to retain more precise information while mitigating the impact of low-quality data. Another area of advancement is corpus-aware training, which enables models to learn the differences between training corpora and adapt their inference behavior accordingly. Noteworthy papers in this area include:

  • A study that found cropping to outperform dropout as an augmentation strategy for training self-supervised text embeddings, yielding high-quality embeddings after only brief fine-tuning.
  • A proposal for Optimal Corpus Aware Training, which fine-tunes a pre-trained model by adjusting only corpus-related parameters, improving accuracy and resilience to overfitting.
  • A reinforcement learning perspective on supervised fine-tuning, which introduces Dynamic Fine-Tuning to stabilize gradient updates and improve generalization.
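The cropping strategy in the first paper can be illustrated with a minimal sketch: two random contiguous crops of the same tokenized text serve as a positive pair for contrastive embedding training. The function names, crop ratio, and pairing scheme below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def random_crop(tokens, crop_ratio=0.6, seed=None):
    """Return one contiguous crop covering crop_ratio of the sequence.

    Illustrative sketch: the actual paper may sample crop lengths and
    positions differently.
    """
    rng = random.Random(seed)
    n = len(tokens)
    crop_len = max(1, int(n * crop_ratio))
    start = rng.randint(0, n - crop_len)
    return tokens[start:start + crop_len]

def make_positive_pair(tokens, crop_ratio=0.6, seed=None):
    """Produce two independent crops of one text as a contrastive positive pair."""
    rng = random.Random(seed)
    view_a = random_crop(tokens, crop_ratio, rng.random())
    view_b = random_crop(tokens, crop_ratio, rng.random())
    return view_a, view_b

# Example: two overlapping views of the same token sequence.
tokens = "self supervised text embeddings from augmented views".split()
view_a, view_b = make_positive_pair(tokens, crop_ratio=0.6, seed=0)
```

In a contrastive setup (e.g., an InfoNCE-style loss), the two views would be encoded separately and pulled together in embedding space, while crops of other texts in the batch act as negatives.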
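The Dynamic Fine-Tuning idea in the last bullet can be sketched as a reweighted cross-entropy. Assuming (as a reading of the proposal, not a verified implementation) that DFT rescales each target token's log-likelihood by the model's own stop-gradient probability of that token, a minimal NumPy version of the per-batch loss is:

```python
import numpy as np

def dft_token_loss(logits, target_ids):
    """Sketch of a Dynamic Fine-Tuning style loss.

    Standard per-token cross-entropy, with each token's term rescaled by the
    model's probability of the target token. Assumption: the probability
    weight is treated as stop-gradient (NumPy has no autograd, so this is
    only indicated in the comment below).
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Probability the model assigns to each target token.
    p_target = probs[np.arange(len(target_ids)), target_ids]
    # Weight each token's negative log-likelihood by p_target
    # (the weight would be detached from the gradient in a real trainer).
    return -(p_target * np.log(p_target)).mean()
```

Intuitively, confidently-predicted tokens keep near-standard gradients while low-probability (often noisy) tokens are down-weighted, which is one way the reweighting could stabilize updates.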

Sources

Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings

Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

Optimal Corpus Aware Training for Neural Machine Translation

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
