The field of knowledge distillation is increasingly focused on heterogeneous architecture distillation, where traditional methods struggle to transfer knowledge from a complex teacher to a compact student because the two architectures produce fundamentally different spatial feature representations. Recent work addresses this with simple yet effective frameworks that integrate complementary teacher and student features, for example by decomposing and constraining shared logits to enable more diverse knowledge transfer. A second line of advancement introduces dynamic temperature scheduling, which adapts the distillation temperature to the divergence between the teacher and student distributions, allowing knowledge to be transferred more effectively as training progresses. A third trend rethinks feature distillation for vision transformers: rather than taking the failure of standard feature distillation as given, it analyzes why these methods break down and proposes minimal, mismatch-driven strategies that reactivate simple feature-map distillation.

Noteworthy papers include:

- Heterogeneous Complementary Distillation: proposes a framework that integrates complementary teacher and student features to align representations in shared logits.
- Dynamic Temperature Scheduler for Knowledge Distillation: introduces a temperature scheduling method that adapts based on the cross-entropy loss gap between teacher and student.
- Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation: proposes a feature KD framework that relies exclusively on feature-based losses.
- From Low-Rank Features to Encoding Mismatch: conducts a two-view representation analysis of vision transformers and proposes minimal, mismatch-driven strategies to reactivate simple feature-map distillation.
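To make the dynamic-temperature idea concrete, the sketch below shows a minimal PyTorch distillation loss whose temperature is driven by the cross-entropy gap between teacher and student, in the spirit of the Dynamic Temperature Scheduler paper. The specific mapping from gap to temperature (a tanh squashed into a `[t_min, t_max]` range), the bounds, and the `alpha` weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a distillation loss with a dynamic temperature.
# Assumption: the temperature grows with the teacher-student cross-entropy gap,
# so a weaker student receives softer targets; the exact schedule is hypothetical.
import torch
import torch.nn.functional as F


def dynamic_temperature(student_logits, teacher_logits, labels,
                        t_min=1.0, t_max=8.0):
    """Map the teacher-student cross-entropy gap to a temperature in [t_min, t_max]."""
    with torch.no_grad():
        ce_student = F.cross_entropy(student_logits, labels)
        ce_teacher = F.cross_entropy(teacher_logits, labels)
        gap = (ce_student - ce_teacher).clamp(min=0.0)
        # Larger gap -> higher temperature; tanh squashes the gap into (0, 1).
        scale = torch.tanh(gap)
    return t_min + (t_max - t_min) * scale


def kd_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Combine hard-label cross-entropy with temperature-scaled KL distillation."""
    temp = dynamic_temperature(student_logits, teacher_logits, labels)
    soft_teacher = F.softmax(teacher_logits / temp, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temp, dim=-1)
    # Standard KD scaling by temp**2 keeps gradient magnitudes comparable.
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temp ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```

In this sketch the temperature is recomputed per batch from the current logits, so the schedule tracks how far the student lags the teacher rather than following a fixed decay curve.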