Tutoring Helps Students Learn Better: Improving Knowledge Distillation for BERT with Tutor Network

Junho Kim, Jun Hyung Park, Mingyu Lee, Wing Lam Mok, Joon Young Choi, Sang Keun Lee

Research output: Contribution to conference › Paper › peer-review

1 Citation (Scopus)

Abstract

Pre-trained language models have achieved remarkable success in natural language processing tasks, but at the cost of ever-increasing model size. To address this issue, knowledge distillation (KD) has been widely applied to compress language models. However, typical KD approaches for language models overlook the difficulty of training examples and consequently suffer from incorrect teacher prediction transfer and inefficient training. In this paper, we propose a novel KD framework, TutorKD, which improves distillation effectiveness by controlling the difficulty of training examples during pre-training. We introduce a tutor network that generates samples that are easy for the teacher but difficult for the student, and train it with a carefully designed policy gradient method. Experimental results show that TutorKD significantly and consistently outperforms state-of-the-art KD methods with variously sized student models on the GLUE benchmark, demonstrating that the tutor can effectively generate training examples for the student.
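The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical PyTorch sketch of the core idea, not the authors' implementation: a tutor samples a replacement token at a masked position, receives a reward when the generated example is easy for the teacher but hard for the student, and is updated with a REINFORCE-style policy gradient. All model classes, shapes, and the exact reward definition are assumptions made for illustration.

```python
# Hypothetical sketch of a tutor trained with a REINFORCE-style policy gradient.
# Reward: low teacher loss (easy for the teacher) and high student loss (hard
# for the student) on the generated example. Illustration only, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 30522, 128

class TinyLM(nn.Module):
    """Toy stand-in for the tutor / teacher / student language models."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, ids):                       # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.embed(ids))

tutor, teacher, student = TinyLM(), TinyLM(), TinyLM()
opt = torch.optim.Adam(tutor.parameters(), lr=1e-4)

ids = torch.randint(0, vocab_size, (8, 16))       # toy token ids
mask_pos = 5                                      # position to corrupt
gold = ids[:, mask_pos]                           # original tokens as targets

# Tutor proposes a replacement token at the masked position.
tutor_logits = tutor(ids)[:, mask_pos]            # (batch, vocab)
dist = torch.distributions.Categorical(logits=tutor_logits)
sampled = dist.sample()                           # generated tokens
gen_ids = ids.clone()
gen_ids[:, mask_pos] = sampled

with torch.no_grad():
    t_loss = F.cross_entropy(teacher(gen_ids)[:, mask_pos], gold, reduction="none")
    s_loss = F.cross_entropy(student(gen_ids)[:, mask_pos], gold, reduction="none")
    # Easy for the teacher (low t_loss) but difficult for the student (high s_loss).
    reward = s_loss - t_loss

# REINFORCE update: raise the log-probability of samples with high reward.
policy_loss = -(reward * dist.log_prob(sampled)).mean()
opt.zero_grad()
policy_loss.backward()
opt.step()
```

In practice the teacher and student would be frozen and trainable BERT models respectively, and the reward would likely be normalized or baselined for variance reduction; those details are not specified in this record.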

Original language: English
Pages: 7371-7382
Number of pages: 12
Publication status: Published - 2022
Event: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 2022 Dec 7 - 2022 Dec 11

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 22/12/7 - 22/12/11

Bibliographical note

Funding Information:
We thank the anonymous reviewers for their helpful comments. This work was supported by the Basic Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2021R1A2C3010430), by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2020R1A4A1018309), and by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University)).

Publisher Copyright:
© 2022 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
