Abstract
Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and mix several textual types within natural-language text. These characteristics degrade the performance of topic modeling. Among the natural language preprocessing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches exist for generating stop word lists, the resulting lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used for training in each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. We measure topic coherence among the words in each topic using Pointwise Mutual Information (PMI) and, in every topic modeling loop, add the words with a low PMI score to our stop word list. Our experiment shows that our stop word list yields a higher-performing topic model than lists from other approaches.
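One iteration of the PMI-based filtering described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify how co-occurrence is counted, how a word's score is aggregated, or what cutoff is used, so document-level co-occurrence, the mean PMI against the other topic words, the `eps` smoothing term, and the `threshold` parameter are all assumptions. The function names are hypothetical, and the topic model itself is left out: the top words of a topic are taken as input.

```python
import math
from itertools import combinations


def doc_frequencies(docs):
    """Count the documents containing each word and each unordered word pair."""
    word_df, pair_df = {}, {}
    for doc in docs:
        vocab = sorted(set(doc))
        for word in vocab:
            word_df[word] = word_df.get(word, 0) + 1
        for a, b in combinations(vocab, 2):
            pair_df[(a, b)] = pair_df.get((a, b), 0) + 1
    return word_df, pair_df


def pmi(a, b, word_df, pair_df, n_docs, eps=1e-12):
    """PMI(a, b) = log(p(a, b) / (p(a) * p(b))); eps smoothing is an assumption."""
    key = (a, b) if a < b else (b, a)
    p_ab = pair_df.get(key, 0) / n_docs
    p_a = word_df.get(a, 0) / n_docs
    p_b = word_df.get(b, 0) / n_docs
    return math.log((p_ab + eps) / (p_a * p_b + eps))


def low_pmi_words(topic_words, docs, threshold=0.0):
    """Return topic words whose mean PMI with the other topic words is below threshold."""
    word_df, pair_df = doc_frequencies(docs)
    n_docs = len(docs)
    flagged = []
    for word in topic_words:
        others = [o for o in topic_words if o != word]
        if not others:
            continue
        mean_pmi = sum(pmi(word, o, word_df, pair_df, n_docs) for o in others) / len(others)
        if mean_pmi < threshold:
            flagged.append(word)
    return flagged
```

In a full loop, the flagged words would be appended to the stop word list, removed from the corpus, and the topic model retrained, repeating until some stopping condition is met; that condition is likewise not stated in the abstract.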
Original language | English |
---|---|
Pages (from-to) | 1761-1772 |
Number of pages | 12 |
Journal | IEICE Transactions on Information and Systems |
Volume | E102D |
Issue number | 9 |
DOIs | 10.1587/transinf.2018EDP7390 |
Publication status | Published - 2019 |
Bibliographical note
Funding Information: Manuscript received November 14, 2018. Manuscript revised March 18, 2019. Manuscript publicized May 27, 2019. †The authors are with Korea University, South Korea. ††The author is with Sungshin University, South Korea. ∗This research was funded by research grants of Korea University and the Okawa Foundation. a) E-mail: [email protected] b) E-mail: [email protected] c) E-mail: hoh [email protected] (Corresponding author) DOI: 10.1587/transinf.2018EDP7390
Publisher Copyright:
© 2019 The Institute of Electronics, Information and Communication Engineers.
Keywords
- Pointwise Mutual Information (PMI)
- Software artifact
- Stop words
- Text mining
- Topic modeling
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Vision and Pattern Recognition
- Electrical and Electronic Engineering
- Artificial Intelligence