Multi-pretraining for large-scale text classification

Kang Min Kim, Bumsu Hyeon, Yeachan Kim, Jun Hyung Park, Sang Keun Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)


Deep neural network-based pretraining methods have achieved impressive results in many natural language processing tasks, including text classification. However, their applicability to large-scale text classification with numerous categories (e.g., several thousands) has yet to be well studied, as the training data in this setting is insufficient and skewed across categories. In addition, existing pretraining methods usually involve excessive computation and memory overheads. In this paper, we develop a novel multi-pretraining framework for large-scale text classification. This multi-pretraining framework includes both self-supervised pretraining and weakly supervised pretraining. We newly introduce an out-of-context words detection task on unlabeled data as the self-supervised pretraining. It captures the topic consistency of words used in sentences, which proves useful for text classification. In addition, we propose a weakly supervised pretraining, where labels for text classification are obtained automatically from an existing approach. Experimental results clearly show that both pretraining approaches are effective for the large-scale text classification task. The proposed scheme exhibits significant improvements of up to 3.8% in macro-averaged F1-score over strong pretraining methods, while being computationally efficient.
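The out-of-context words detection task described in the abstract can be illustrated with a minimal data-construction sketch: corrupt a sentence by swapping some tokens with random vocabulary words, then label each position as original or out-of-context so a model can be trained to detect the corrupted words. The function name, parameters, and corruption rate below are illustrative assumptions, not details from the paper.

```python
import random

def make_ooc_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Build one training example for out-of-context word detection.

    Each token is replaced with a random word from `vocab` with
    probability `replace_prob`; the labels mark which positions were
    corrupted (1 = out-of-context, 0 = original). A detector trained
    on such pairs must model the topic consistency of the sentence.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))  # topic-inconsistent word
            labels.append(1)
        else:
            corrupted.append(tok)                # original word kept
            labels.append(0)
    return corrupted, labels

# Hypothetical usage: the replacement vocabulary is deliberately
# off-topic so corrupted positions break topic consistency.
sentence = "the market rallied after the earnings report".split()
vocab = ["giraffe", "volcano", "sonata", "helmet", "orbit"]
corrupted, labels = make_ooc_example(sentence, vocab, seed=42)
```

In practice the replacement words would be sampled from the full corpus vocabulary rather than a hand-picked list; this sketch only shows the shape of the supervision signal (token sequence plus per-token binary labels).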

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
Subtitle of host publication: EMNLP 2020
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 10
ISBN (Electronic): 9781952148903
Publication status: Published - 2020
Event: Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020 - Virtual, Online
Duration: 2020 Nov 16 - 2020 Nov 20

Publication series

Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2020


Conference: Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020
City: Virtual, Online

Bibliographical note

Funding Information:
We would like to thank the anonymous reviewers for their valuable comments. This research was supported by the Basic Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) (No. 2020R1A4A1018309), the NRF grant funded by the Korea government (MSIT) (No. 2018R1A2A1A05078380), and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University)).

Publisher Copyright:
©2020 Association for Computational Linguistics

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

