Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

Donghwa Kim, Deokseong Seo, Suhyoun Cho, Pilsung Kang

Research output: Contribution to journal > Article > peer-review

288 Citations (Scopus)

Abstract

The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
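The co-training idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' MCT algorithm: the nearest-centroid classifier, the margin-based confidence score, the one-document-per-view selection rule, and all function names are assumptions made for illustration. The three toy feature matrices merely stand in for the TF–IDF, LDA, and Doc2Vec representations, which in practice would be produced by the corresponding models.

```python
from math import dist

def centroid_fit(X, y):
    """Return {label: centroid} for feature vectors X with labels y."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def centroid_predict(centroids, x):
    """Return (label, margin): the nearest centroid's label and how much
    closer it is than the runner-up (used here as a confidence score)."""
    ranked = sorted((dist(x, c), label) for label, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin

def multi_view_co_training(views, labels, rounds=10):
    """Hypothetical multi-view co-training loop.

    views  : list of feature matrices, one per representation, all in the
             same document order.
    labels : {doc_index: label} for the initially labeled documents.
    Returns a copy of labels extended with pseudo-labels.
    """
    labels = dict(labels)
    n_docs = len(views[0])
    for _ in range(rounds):
        unlabeled = [i for i in range(n_docs) if i not in labels]
        if not unlabeled:
            break
        idx = sorted(labels)
        new_labels = {}
        for X in views:
            # Train one classifier per view on the shared labeled pool.
            model = centroid_fit([X[i] for i in idx], [labels[i] for i in idx])
            # Each view nominates the unlabeled document it is most sure of.
            best = max(unlabeled, key=lambda i: centroid_predict(model, X[i])[1])
            # If two views nominate the same document, the first view's label is kept.
            new_labels.setdefault(best, centroid_predict(model, X[best])[0])
        labels.update(new_labels)
    return labels

# Toy stand-ins for three representations of six documents (assumed data).
view_a = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
view_b = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
view_c = [[2.0, 0.0], [1.8, 0.1], [1.5, 0.4], [0.0, 2.0], [0.2, 1.9], [0.3, 1.6]]

# Only documents 0 and 3 start with labels; the rest are pseudo-labeled.
result = multi_view_co_training([view_a, view_b, view_c], {0: "sports", 3: "tech"})
```

Because every view's pseudo-labels go into one shared pool, a document that is easy to classify in one representation becomes training data for the classifiers built on the other two, which is the motivation for using several feature sets for the same example.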

Original language: English
Pages (from-to): 15-29
Number of pages: 15
Journal: Information Sciences
Volume: 477
Publication status: Published - 2019 Mar

Bibliographical note

Funding Information:
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A1B03930729) and Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korean Government (MSIP) (No. 2017-0-00349), Development of Media Streaming system with Machine Learning using QoE (Quality of Experience). This work was also supported by Korea Electric Power Corporation (Grant number: R18XA05).

Publisher Copyright:
© 2018 Elsevier Inc.

Keywords

  • Co-training
  • Doc2Vec
  • Document classification
  • LDA
  • Semi-supervised learning
  • TF–IDF

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Control and Systems Engineering
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence
