Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations

Tian Shi, Kyeongpil Kang, Jaegul Choo, Chandan K. Reddy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

129 Citations (Scopus)

Abstract

Short texts are a prevalent form of social communication on the Internet, and billions of them are generated every day. Discovering knowledge from them has attracted considerable interest from both industry and academia. Short texts carry limited contextual information and are sparse, noisy, and ambiguous; hence, automatically learning topics from them remains an important challenge. To tackle this problem, we propose a semantics-assisted non-negative matrix factorization (SeaNMF) model to discover topics for short texts. It incorporates word-context semantic correlations into the model, where the semantic relationships between words and their contexts are learned from the skip-gram view of the corpus. The SeaNMF model is solved using a block coordinate descent algorithm. We also develop a sparse variant of the SeaNMF model that achieves better model interpretability. Extensive quantitative evaluations on various real-world short text datasets demonstrate the superior performance of the proposed models over several other state-of-the-art methods in terms of topic coherence and classification accuracy. The qualitative semantic analysis demonstrates the interpretability of our models by discovering meaningful and consistent topics. With its simple formulation and superior performance, SeaNMF can be an effective standard topic model for short texts.
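The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several labeled assumptions: the objective is taken to be ||A - W Hᵀ||² + α ||S - W Wcᵀ||² with all factors non-negative; each short document is treated as a single context window; the word-context matrix S is a shifted positive PMI matrix; and plain multiplicative updates stand in for the block coordinate descent solver the paper uses. The names `ppmi`, `objective`, `joint_nmf`, `alpha`, and `shift` are hypothetical, and the sparse variant is not covered.

```python
import numpy as np

def ppmi(cooc, shift=1.0):
    """Shifted positive PMI of a word-word co-occurrence count matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts carry no evidence
    return np.maximum(pmi - np.log(shift), 0.0)

def objective(A, S, W, Wc, H, alpha):
    """Assumed loss: document reconstruction plus weighted context term."""
    return (np.linalg.norm(A - W @ H.T) ** 2
            + alpha * np.linalg.norm(S - W @ Wc.T) ** 2)

def joint_nmf(A, S, k, alpha=1.0, iters=200, seed=0, eps=1e-10):
    """Factorize A (words x docs) and S (words x contexts) with a shared W.

    Uses multiplicative updates (an assumption made for brevity), which
    keep every factor non-negative. Returns W, Wc, H and the loss history.
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = A.shape
    W = rng.random((n_words, k))      # word-topic factor
    Wc = rng.random((S.shape[1], k))  # context-topic factor
    H = rng.random((n_docs, k))       # document-topic factor
    history = [objective(A, S, W, Wc, H, alpha)]
    for _ in range(iters):
        W *= (A @ H + alpha * (S @ Wc)) / (
            W @ (H.T @ H + alpha * (Wc.T @ Wc)) + eps)
        H *= (A.T @ W) / (H @ (W.T @ W) + eps)
        Wc *= (S.T @ W) / (Wc @ (W.T @ W) + eps)
        history.append(objective(A, S, W, Wc, H, alpha))
    return W, Wc, H, history

# Toy usage: rows of A are words, columns are short documents (counts).
vocab = ["chicken", "rice", "soup", "goal", "match", "team"]
A = np.array([[1, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 1]], dtype=float)
C = A @ A.T                 # document-level co-occurrence counts
np.fill_diagonal(C, 0.0)    # ignore a word co-occurring with itself
S = ppmi(C)
W, Wc, H, history = joint_nmf(A, S, k=2)
for t in range(2):
    print([vocab[i] for i in np.argsort(-W[:, t])[:3]])
```

Sharing `W` between the two terms is what lets the word-context correlations shape the topics: with `alpha=0` the sketch reduces to standard NMF on `A`, while larger values pull words with correlated contexts toward the same topic.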

Original language: English
Title of host publication: The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018
Publisher: Association for Computing Machinery, Inc
Pages: 1105-1114
Number of pages: 10
ISBN (Electronic): 9781450356398
DOIs
Publication status: Published - 2018 Apr 10
Event: 27th International World Wide Web, WWW 2018 - Lyon, France
Duration: 2018 Apr 23 → 2018 Apr 27

Publication series

Name: The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018

Conference

Conference: 27th International World Wide Web, WWW 2018
Country/Territory: France
City: Lyon
Period: 18/4/23 → 18/4/27

Bibliographical note

Funding Information:
As we can see from Fig. 2, all the graphs for the standard NMF model are very sparse. Some keywords with higher frequency in the corpus have lower degree which means that they are less correlated with the other words. For example, the frequency of ‘chicken’ is high, however, its most correlated words do not contain the other keywords and it is not in the most correlated word lists of the other keywords. In the standard topic modeling, these keywords might be viewed as noise. In Table 6, the keywords with degree less than two are colored in red. We can see that the topics obtained from the standard NMF model are noisy. On the other hand, we conduct the same experiments on our SeaNMF model. From Table 6 and Fig. 2, we can see that topics discovered by our SeaNMF model have less noisy words and the top keywords are more correlated. Therefore, these semantic analysis results demonstrate that the SeaNMF model can discover meaningful and consistent topics for short texts.

5 CONCLUSION

In this paper, we introduce a semantics-assisted NMF (SeaNMF) model to discover topics for the short texts. The proposed model leverages the word-context semantic correlations in the training, which potentially overcomes the problem of lacking context that arises due to the data sparsity. The semantic correlations between the words and their contexts are learned from the skip-gram view of the corpus, which was demonstrated to be effective for revealing word semantic relationships. We use a block coordinate descent algorithm to solve our SeaNMF model. To achieve a better model interpretability, a sparse SeaNMF model is also developed. We compared the performance of our models with several other state-of-the-art methods on four real-world short text datasets. The quantitative evaluations demonstrate that our models outperform other methods with respect to widely used metrics such as the topic coherence and document classification accuracy. The parameter sensitivity results demonstrate the stability and consistency of the performance of our SeaNMF model. The qualitative results show that the topics discovered by SeaNMF are meaningful and their top keywords are more semantically correlated. Hence, we conclude that the proposed SeaNMF is an effective topic model for short texts.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation grants IIS-1619028, IIS-1707498 and IIS-1646881, and the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. NRF-2016R1C1B2015924).

Funding Information:
This work was supported in part by the National Science Foundation grants IIS-1619028, IIS-1707498 and IIS-1646881, and the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. NRF-2016R1C1B2015924).

Publisher Copyright:
© 2018 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License.

Keywords

  • Non-negative matrix factorization
  • Short texts
  • Topic modeling
  • Word embedding

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
