Abstract
The Open Directory Project (ODP) is a large-scale, high-quality, publicly available web directory used in many studies and real-world applications. In this paper, we explore training data expansion techniques for text classification as one possible way to deal with the sparsity of the ODP dataset. We propose a dozen classification methods, which differ in (1) the categories from which training data is expanded, and (2) how the expanded training data is merged to generate centroid vectors. Evaluation results show that training data expansion significantly improves classification performance over representative classifiers. We also find that (1) child and descendant categories are more valuable sources for expanding training data than parent and ancestor categories, and (2) distance-based weighting is superior to simple averaging for merging the expanded training data.
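The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the toy taxonomy `PARENT`, the helper names, the sparse-dict document representation, and in particular the weight `1 / (1 + distance)`, which is only one plausible form of distance-based weighting. The sketch expands a category's training data from its descendants and merges the per-category centroids either by simple averaging or by distance-based weighting.

```python
from collections import defaultdict
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {"Top": None, "Top/Sports": "Top", "Top/Sports/Soccer": "Top/Sports"}

def descendants_with_distance(category, parent):
    """Yield (descendant, distance) pairs for every category below `category`."""
    for cat in parent:
        dist, node = 0, cat
        # Walk up from `cat` until we reach `category` or fall off the root.
        while node is not None and node != category:
            node = parent[node]
            dist += 1
        if node == category and dist > 0:
            yield cat, dist

def mean_vector(docs):
    """Average a list of sparse term-weight dicts into one centroid."""
    total = defaultdict(float)
    for doc in docs:
        for term, w in doc.items():
            total[term] += w
    return {t: w / len(docs) for t, w in total.items()} if docs else {}

def expanded_centroid(category, docs_by_cat, parent, weighted=True):
    """Merge a category's own centroid with its descendants' centroids.

    weighted=True applies the assumed weight 1 / (1 + distance);
    weighted=False gives every contributing category equal weight.
    """
    sources = [(category, 0)] + list(descendants_with_distance(category, parent))
    merged, norm = defaultdict(float), 0.0
    for cat, dist in sources:
        centroid = mean_vector(docs_by_cat.get(cat, []))
        if not centroid:
            continue  # categories without training documents contribute nothing
        weight = 1.0 / (1.0 + dist) if weighted else 1.0
        norm += weight
        for term, w in centroid.items():
            merged[term] += weight * w
    return {t: w / norm for t, w in merged.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity between two sparse vectors, for scoring a document."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A document would then be assigned to the category whose expanded centroid has the highest `cosine` score. Expanding from parents and ancestors instead would only require swapping the source-collection helper, which is why the two design axes in the abstract (source categories and merge strategy) combine into many method variants.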
Original language | English |
---|---|
Title of host publication | DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics |
Editors | George Karypis, Longbing Cao, Wei Wang, Irwin King |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 607-612 |
Number of pages | 6 |
ISBN (Electronic) | 9781479969913 |
Publication status | Published - 2014 Mar 10 |
Event | 2014 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2014 - Shanghai, China Duration: 2014 Oct 30 → 2014 Nov 1 |
Publication series
Name | DSAA 2014 - Proceedings of the 2014 IEEE International Conference on Data Science and Advanced Analytics |
---|---|
Other
Other | 2014 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2014 |
---|---|
Country/Territory | China |
City | Shanghai |
Period | 14/10/30 → 14/11/1 |
Bibliographical note
Publisher Copyright: © 2014 IEEE.
ASJC Scopus subject areas
- Artificial Intelligence
- Information Systems
- Information Systems and Management