Knowledge Distillation Advances

Research in knowledge distillation is moving toward a deeper understanding of the internal mechanisms at work when a student model learns from a teacher. Recent work improves the efficiency and effectiveness of distillation through mechanistic interpretability analyses, adaptive denoising, and novel training frameworks, with particular emphasis on the generalization and fidelity of distilled models. Notable papers in this area include:

  • Distilled Circuits, which applies mechanistic interpretability to analyze how knowledge distillation restructures the student model's internal circuits,
  • ToDi, which proposes a token-wise distillation method that adaptively combines forward and reverse KL divergence (a minimal sketch follows this list),
  • DeepKD, which integrates dual-level decoupling with adaptive denoising to improve knowledge transfer,
  • On the Generalization vs Fidelity Paradox in Knowledge Distillation, which presents a large-scale empirical and statistical analysis of knowledge distillation across models of varying sizes.

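ToDi's central mechanism, blending forward and reverse KL divergence token by token, can be illustrated with a short PyTorch sketch. The function name, the default gate (favoring forward KL on tokens where the teacher is more confident than the student), and the example shapes are illustrative assumptions, not ToDi's published formulation.

```python
import torch
import torch.nn.functional as F

def token_wise_fkl_rkl_loss(student_logits, teacher_logits, gate=None):
    """Blend forward and reverse KL divergence per token.

    A minimal sketch, assuming logits of shape [batch, seq_len, vocab].
    The default gate below is a placeholder, not ToDi's published rule.
    """
    teacher_logits = teacher_logits.detach()          # no gradient to teacher
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_p, t_p = s_logp.exp(), t_logp.exp()

    # Forward KL per token, KL(teacher || student): mass-covering.
    fkl = (t_p * (t_logp - s_logp)).sum(dim=-1)       # [batch, seq_len]
    # Reverse KL per token, KL(student || teacher): mode-seeking.
    rkl = (s_p * (s_logp - t_logp)).sum(dim=-1)       # [batch, seq_len]

    if gate is None:
        # Placeholder gate (assumption): weight forward KL more heavily on
        # tokens where the teacher is more confident than the student.
        gate = torch.sigmoid(t_logp.max(dim=-1).values - s_logp.max(dim=-1).values)

    return (gate * fkl + (1.0 - gate) * rkl).mean()


# Example usage with random logits (batch 2, sequence length 8, vocab 32):
student_logits = torch.randn(2, 8, 32, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32)
loss = token_wise_fkl_rkl_loss(student_logits, teacher_logits)
loss.backward()
```
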
Sources

Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation

Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory

DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer

On the Generalization vs Fidelity Paradox in Knowledge Distillation

Small-to-Large Generalization: Data Influences Models Consistently Across Scale

ToDi: Token-wise Distillation via Fine-Grained Divergence Control
