Knowledge distillation research continues to develop new ways of transferring capabilities from larger models to smaller ones. Recent work focuses on improving the training stability and convergence speed of distillation, and on exploring new transfer objectives. Notably, there is a trend towards incorporating geometric and structural information, such as the Procrustes distance and the feature Gram matrix, to better capture the feature structure of the teacher model. Techniques such as progressive weight loading and circuit distillation are also being proposed, to accelerate initial inference and to transfer algorithmic capabilities, respectively.
Noteworthy papers include:
- Enriching Knowledge Distillation with Intra-Class Contrastive Learning, which incorporates an intra-class contrastive loss during teacher training so that the teacher's soft labels carry richer intra-class information (an illustrative loss is sketched after this list).
- Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments, which enables fast initial inference by first deploying a lightweight student model and then incrementally replacing its layers with those of a pre-trained teacher model (see the layer-swapping sketch below).
- Circuit Distillation, which introduces an objective that aligns internal representations between analogous circuit components in teacher and student models, allowing algorithmic capabilities to be transferred (a possible alignment loss is sketched below).
- Knowledge distillation through geometry-aware representational alignment, which shows theoretically that existing feature distillation methods cannot capture the teacher's feature structure and proposes the Procrustes distance and the Frobenius norm of the feature Gram matrix as distillation losses (see the geometry-aware sketch below).
- Distillation of Large Language Models via Concrete Score Matching, which proposes a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set (one possible instantiation is sketched below).
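
For the intra-class contrastive paper, a minimal sketch of how such a term might be added alongside the teacher's cross-entropy loss, in the style of supervised contrastive learning; the temperature, masking scheme, and weighting `lam` are assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def intra_class_contrastive_loss(features, labels, temperature=0.1):
    """Supervised-contrastive-style loss that pulls same-class embeddings together.

    features: (B, D) embeddings, e.g. from the teacher's penultimate layer.
    labels:   (B,) integer class labels.
    Temperature and masking are illustrative assumptions, not the paper's choices.
    """
    z = F.normalize(features, dim=1)                      # work in cosine-similarity space
    sim = z @ z.t() / temperature                         # (B, B) pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))       # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)       # avoid -inf * 0 below

    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    n_pos = pos.sum(dim=1)
    valid = n_pos > 0                                     # skip anchors with no positive
    loss = -(log_prob * pos).sum(dim=1)[valid] / n_pos[valid]
    return loss.mean()

# Illustrative teacher objective: cross-entropy plus the contrastive term.
# total_loss = F.cross_entropy(logits, labels) + lam * intra_class_contrastive_loss(feats, labels)
```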
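For progressive weight loading, a sketch of the layer-swapping idea: serve the student immediately, then replace blocks one at a time as teacher weights finish loading. It assumes the student and teacher expose equal-length block lists with compatible shapes and omits any fine-tuning between swaps, so it is only a schematic of the deployment pattern.

```python
import copy
import torch.nn as nn

class ProgressiveModel(nn.Module):
    """Starts as the lightweight student; teacher blocks are swapped in as they load.

    Assumes both models expose their blocks as equal-length ModuleLists with
    compatible input/output shapes (an assumption of this sketch, not the paper).
    """
    def __init__(self, student_blocks: nn.ModuleList, head: nn.Module):
        super().__init__()
        self.blocks = student_blocks    # deployed first, so initial inference is cheap
        self.head = head

    def swap_in(self, index: int, teacher_block: nn.Module):
        # Call whenever the next teacher block finishes loading from disk or network;
        # quality gradually approaches the teacher's as more blocks are replaced.
        self.blocks[index] = copy.deepcopy(teacher_block)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)
```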
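For circuit distillation, one plausible reading of "aligning internal representations between analogous circuit components" is a feature-alignment loss over hand-picked component pairs; the pairing, the linear projections, and the MSE objective below are assumptions.

```python
import torch.nn.functional as F

def circuit_alignment_loss(teacher_acts, student_acts, pairs, projections):
    """Align activations of matched circuit components.

    teacher_acts / student_acts: dicts mapping component names (e.g. a particular
    attention head or MLP) to activation tensors of shape (B, T, D).
    pairs: list of (teacher_name, student_name) tuples for analogous components.
    projections: dict of student_name -> nn.Linear mapping student width to teacher width.
    The component pairing and the MSE objective are illustrative assumptions.
    """
    loss = 0.0
    for t_name, s_name in pairs:
        s = projections[s_name](student_acts[s_name])          # match feature widths
        loss = loss + F.mse_loss(s, teacher_acts[t_name].detach())
    return loss / len(pairs)
```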
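For the geometry-aware paper, a sketch of the two proposed losses as commonly defined: the orthogonal Procrustes distance between batch feature matrices (via its closed form) and the Frobenius norm of the difference between their Gram matrices. Interpreting the Gram-matrix loss as this difference, and omitting any centering or scaling, are assumptions of the sketch.

```python
import torch

def procrustes_distance(fs: torch.Tensor, ft: torch.Tensor) -> torch.Tensor:
    """Orthogonal Procrustes distance between student and teacher batch features.

    fs, ft: (B, D) feature matrices, assumed already projected to a common width.
    min_R ||fs @ R - ft||_F over orthogonal R has a closed form in which the cross
    term is the nuclear norm (sum of singular values) of fs^T ft.
    """
    m = fs.t() @ ft                               # (D, D) cross-covariance
    nuclear = torch.linalg.svdvals(m).sum()
    sq = fs.pow(2).sum() + ft.pow(2).sum() - 2.0 * nuclear
    return sq.clamp_min(0.0).sqrt()

def gram_matrix_loss(fs: torch.Tensor, ft: torch.Tensor) -> torch.Tensor:
    """Frobenius norm of the difference between sample-wise Gram matrices.

    Reading the 'feature Gram matrix' loss as ||fs fs^T - ft ft^T||_F is an
    assumption of this sketch.
    """
    return torch.linalg.norm(fs @ fs.t() - ft @ ft.t(), ord="fro")

# Illustrative combined objective:
# loss = task_loss + alpha * procrustes_distance(student_feats, teacher_feats.detach()) \
#                  + beta * gram_matrix_loss(student_feats, teacher_feats.detach())
```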
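For concrete score matching, the concrete score of a discrete distribution is usually defined through probability ratios to neighboring outcomes, and for a softmax over a vocabulary such ratios reduce to exponentiated logit differences, so the normalizer cancels. The sketch below matches these ratios at the observed token with a squared error; treating the whole vocabulary as the neighborhood and using this particular error are assumptions, not necessarily the paper's objective.

```python
import torch
import torch.nn.functional as F

def concrete_score_matching_loss(student_logits, teacher_logits, targets):
    """Match probability ratios p(v)/p(y) between student and teacher.

    student_logits, teacher_logits: (B, V) next-token logits.
    targets: (B,) indices of the observed tokens y.
    Since p(v)/p(y) = exp(z_v - z_y), the softmax normalizer cancels, which is one
    way to sidestep softmax-induced smoothing. In practice the exponentiated
    differences may need clamping for numerical stability; that detail is omitted.
    """
    zy_s = student_logits.gather(1, targets.unsqueeze(1))            # (B, 1)
    zy_t = teacher_logits.gather(1, targets.unsqueeze(1))
    ratio_s = torch.exp(student_logits - zy_s)                       # (B, V)
    ratio_t = torch.exp(teacher_logits.detach() - zy_t.detach())
    return F.mse_loss(ratio_s, ratio_t)
```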