TY - GEN
T1 - Combining Dual Word Embeddings with Open Directory Project Based Text Classification
AU - Aliyeva, Dinara
AU - Kim, Kang Min
AU - Choi, Byung Ju
AU - Lee, Sang-Geun
N1 - Funding Information:
This research was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (number 2015R1A2A1A10052665). This research was also supported in part by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00464) supervised by the IITP (Institute for Information & communications Technology Promotion).
Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/4
Y1 - 2018/10/4
N2 - Traditional Open Directory Project (ODP)-based text classification methods effectively capture topics of texts by utilizing the hierarchical structure of an explicitly human-built knowledge base. However, they only consider term weighting approaches, ignoring the important semantic similarity between words. In this paper, we consider the semantics of words by incorporating an implicit text representation, such as word2vec word embeddings, into ODP-based text classification. In contrast to the common usage of word2vec, we utilize both the input and output vectors. This allows us to calculate a combined typical and topical similarity between words of category and document, which is more effective at text classification. To this end, we first incorporate the dual word embeddings of word2vec into the ODP-based text classification to obtain semantically richer category and document representations. Subsequently, we use the combination of the input and output vectors to improve the semantic similarity between category and document. Our evaluation results using a real-world dataset show the efficacy of our proposed approach, exhibiting significant improvements of 9% and 37% in terms of F1-score and precision at k, respectively, over the state-of-the-art techniques.
AB - Traditional Open Directory Project (ODP)-based text classification methods effectively capture topics of texts by utilizing the hierarchical structure of an explicitly human-built knowledge base. However, they only consider term weighting approaches, ignoring the important semantic similarity between words. In this paper, we consider the semantics of words by incorporating an implicit text representation, such as word2vec word embeddings, into ODP-based text classification. In contrast to the common usage of word2vec, we utilize both the input and output vectors. This allows us to calculate a combined typical and topical similarity between words of category and document, which is more effective at text classification. To this end, we first incorporate the dual word embeddings of word2vec into the ODP-based text classification to obtain semantically richer category and document representations. Subsequently, we use the combination of the input and output vectors to improve the semantic similarity between category and document. Our evaluation results using a real-world dataset show the efficacy of our proposed approach, exhibiting significant improvements of 9% and 37% in terms of F1-score and precision at k, respectively, over the state-of-the-art techniques.
KW - Machine Learning
KW - Text Classification
KW - Word embeddings
UR - http://www.scopus.com/inward/record.url?scp=85056464881&partnerID=8YFLogxK
U2 - 10.1109/ICCI-CC.2018.8482044
DO - 10.1109/ICCI-CC.2018.8482044
M3 - Conference contribution
AN - SCOPUS:85056464881
T3 - Proceedings of 2018 IEEE 17th International Conference on Cognitive Informatics and Cognitive Computing, ICCI*CC 2018
SP - 179
EP - 186
BT - Proceedings of 2018 IEEE 17th International Conference on Cognitive Informatics and Cognitive Computing, ICCI*CC 2018
A2 - Howard, Newton
A2 - Kwong, Sam
A2 - Wang, Yingxu
A2 - Feldman, Jerome
A2 - Widrow, Bernard
A2 - Sheu, Phillip
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Conference on Cognitive Informatics and Cognitive Computing, ICCI*CC 2018
Y2 - 16 July 2018 through 18 July 2018
ER -