Probabilistic topic modeling for comparative analysis of document collections

Ting Hua, Chang Tien Lu, Jaegul Choo, Chandan K. Reddy

    Research output: Contribution to journal › Article › peer-review

    22 Citations (Scopus)

    Abstract

    Probabilistic topic models, which can discover hidden patterns in documents, have been extensively studied. However, rather than learning from a single document collection, numerous real-world applications demand a comprehensive understanding of the relationships among various document sets. To address such needs, this article proposes a new model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method is a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, we demonstrate the effectiveness of our method compared with state-of-the-art models. The proposed model can be useful for "comparative thinking" analysis in real-world document collections.
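
    To make the idea of common and distinctive topics concrete, the following is a minimal generative sketch, not the model from the article. It assumes a simplified LDA-style process with symmetric Dirichlet priors; the function name `generate_corpus`, all parameter names, and the hyperparameter values (0.1 and 0.5) are hypothetical choices made for illustration. The abstract does not specify the article's priors or inference procedure, so none of that is reproduced here.

```python
import numpy as np


def generate_corpus(n_collections=2, docs_per_collection=3, doc_len=50,
                    vocab_size=30, n_common=2, n_distinct=2, seed=0):
    """Sample a toy corpus in which each document mixes common topics
    (shared by all collections) with topics private to its own collection.

    Illustrative assumption only: symmetric Dirichlet priors and an
    LDA-style word-level topic assignment.
    """
    rng = np.random.default_rng(seed)
    # Common topics: word distributions shared by every collection.
    common = rng.dirichlet(np.full(vocab_size, 0.1), size=n_common)
    corpus = []
    for c in range(n_collections):
        # Distinctive topics: word distributions available only to collection c.
        distinct = rng.dirichlet(np.full(vocab_size, 0.1), size=n_distinct)
        # Rows 0..n_common-1 are common, the remaining rows are distinctive.
        topics = np.vstack([common, distinct])
        for _ in range(docs_per_collection):
            # Per-document mixture over the common + distinctive topics.
            theta = rng.dirichlet(np.full(len(topics), 0.5))
            # Topic assignment for each word position, then the word itself.
            z = rng.choice(len(topics), size=doc_len, p=theta)
            words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
            corpus.append({"collection": c, "topics": z, "words": words})
    return corpus
```

    In this sketch, a topic index below `n_common` marks a word drawn from a common topic, and any higher index marks a word drawn from a topic exclusive to that document's collection. Recovering such a split from observed text is the inference problem the article addresses; the sketch covers only the forward (generative) direction.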

    Original language: English
    Article number: A24
    Journal: ACM Transactions on Knowledge Discovery from Data
    Volume: 14
    Issue number: 2
    Publication status: Published - 2020 Mar 4

    Bibliographical note

    Funding Information:
    This work was supported in part by the U.S. National Science Foundation grants IIS-1619028, IIS-1707498, and IIS-1838730, and by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. NRF2019R1A2C4070420).

    Publisher Copyright:
    © 2020 Association for Computing Machinery.

    Keywords

    • Probabilistic topic modeling
    • text mining

    ASJC Scopus subject areas

    • General Computer Science
