Advances in Self-Supervised Text Embeddings and Large Language Model Fine-Tuning

The field of natural language processing is witnessing a significant shift towards self-supervised learning and innovative fine-tuning techniques for large language models. Recent developments show that self-supervised training based on data augmentations can produce high-quality text embeddings that rival those obtained through extensive supervised fine-tuning. Moreover, new mechanisms such as token categorization and forgetting have been proposed to improve fine-tuning, allowing models to retain more precise information while mitigating the impact of low-quality data. Another area of advancement is corpus-aware training, which enables models to learn the differences between training corpora and adapt their inference behavior accordingly. Noteworthy papers in this area include:

  • A study that found cropping to outperform dropout as an augmentation strategy for training self-supervised text embeddings, yielding high-quality embeddings after only brief fine-tuning.
  • A proposal for Optimal Corpus Aware Training, which fine-tunes a pre-trained model by adjusting only corpus-related parameters, improving accuracy and resilience to overfitting.
  • A reinforcement learning perspective on supervised fine-tuning, which introduces Dynamic Fine-Tuning to stabilize gradient updates and improve generalization.
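The cropping strategy in the first paper can be illustrated with a minimal sketch: two random contiguous crops of the same tokenized text serve as a positive pair for contrastive embedding training. The function names, crop ratio, and pairing scheme below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def random_crop(tokens, crop_ratio=0.6, seed=None):
    """Return one contiguous crop covering crop_ratio of the sequence.

    Illustrative sketch: the actual paper may sample crop lengths and
    positions differently.
    """
    rng = random.Random(seed)
    n = len(tokens)
    crop_len = max(1, int(n * crop_ratio))
    start = rng.randint(0, n - crop_len)
    return tokens[start:start + crop_len]

def make_positive_pair(tokens, crop_ratio=0.6, seed=None):
    """Produce two independent crops of one text as a contrastive positive pair."""
    rng = random.Random(seed)
    view_a = random_crop(tokens, crop_ratio, rng.random())
    view_b = random_crop(tokens, crop_ratio, rng.random())
    return view_a, view_b

# Example: two overlapping views of the same token sequence.
tokens = "self supervised text embeddings from augmented views".split()
view_a, view_b = make_positive_pair(tokens, crop_ratio=0.6, seed=0)
```

In a contrastive setup (e.g., an InfoNCE-style loss), the two views would be encoded separately and pulled together in embedding space, while crops of other texts in the batch act as negatives.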
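The Dynamic Fine-Tuning idea in the last bullet can be sketched as a reweighted cross-entropy. Assuming (as a reading of the proposal, not a verified implementation) that DFT rescales each target token's log-likelihood by the model's own stop-gradient probability of that token, a minimal NumPy version of the per-batch loss is:

```python
import numpy as np

def dft_token_loss(logits, target_ids):
    """Sketch of a Dynamic Fine-Tuning style loss.

    Standard per-token cross-entropy, with each token's term rescaled by the
    model's probability of the target token. Assumption: the probability
    weight is treated as stop-gradient (NumPy has no autograd, so this is
    only indicated in the comment below).
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Probability the model assigns to each target token.
    p_target = probs[np.arange(len(target_ids)), target_ids]
    # Weight each token's negative log-likelihood by p_target
    # (the weight would be detached from the gradient in a real trainer).
    return -(p_target * np.log(p_target)).mean()
```

Intuitively, confidently-predicted tokens keep near-standard gradients while low-probability (often noisy) tokens are down-weighted, which is one way the reweighting could stabilize updates.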

Sources

Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings

Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

Optimal Corpus Aware Training for Neural Machine Translation

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
