TY - GEN
T1 - Visualizing the Embedding Space to Explain the Effect of Knowledge Distillation
AU - Lee, Hyun Seung
AU - Wallraven, Christian
N1 - Funding Information:
Acknowledgments. This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019-0-00079), Department of Artificial Intelligence, Korea University
Publisher Copyright:
© 2022, Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - Recent research has found that knowledge distillation can be effective in reducing the size of a network and in increasing generalization. For example, a large, pre-trained teacher network has been shown to bootstrap a student model that eventually outperforms the teacher in a limited-label environment. Despite these advances, it is still relatively unclear why this method works, that is, what the resulting student model does ‘better’. To address this issue, we use two non-linear, low-dimensional embedding methods (t-SNE and IVIS) to visualize the representation spaces of different layers in a network. We perform a set of extensive experiments with different architecture parameters and distillation methods. The resulting visualizations and metrics clearly show that, compared to its non-distilled version, distillation guides the network to a more compact representation space that yields higher accuracy already in earlier layers.
KW - Computer vision
KW - Knowledge distillation
KW - Limited data learning
KW - Transfer learning
KW - Visualization
UR - http://www.scopus.com/inward/record.url?scp=85130263349&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-02444-3_35
DO - 10.1007/978-3-031-02444-3_35
M3 - Conference contribution
AN - SCOPUS:85130263349
SN - 9783031024436
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 462
EP - 475
BT - Pattern Recognition - 6th Asian Conference, ACPR 2021, Revised Selected Papers
A2 - Wallraven, Christian
A2 - Liu, Qingshan
A2 - Nagahara, Hajime
PB - Springer Science and Business Media Deutschland GmbH
T2 - 6th Asian Conference on Pattern Recognition, ACPR 2021
Y2 - 9 November 2021 through 12 November 2021
ER -