The estimation of probability distribution for factor variables with many categorical values

Minhyeok Lee, Yeong Seon Kang, Junhee Seok

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)


With recent developments in biomedical data technology, factor data such as diagnosis codes and genomic features, which can take tens to hundreds of discrete, unordered categorical values, have emerged. Although it is a fundamental problem in statistical analysis, the estimation of the probability distribution of such factor variables has not been studied much, because previous studies have mainly focused on continuous variables and on discrete factor variables with only a few categories, such as sex and race. In this work, we propose a nonparametric Bayesian procedure to estimate the probability distribution of factors with many categories. The proposed method was demonstrated through simulation studies under various conditions and showed significant improvements in estimation error over conventional methods. In addition, the method was applied to diagnosis data from intensive care unit patients and generated interesting medical hypotheses. Overall, the results indicate that the proposed method will be useful in the analysis of biomedical factor data.
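To make the estimation problem concrete, the following is a minimal sketch and not the procedure proposed in the article. It assumes a simple conjugate baseline, a symmetric Dirichlet prior with a hypothetical concentration parameter `alpha`, and compares its posterior mean with the plain relative-frequency estimate on a simulated factor with many categories. All function names, the sample size, and the number of categories are illustrative assumptions.

```python
import numpy as np


def mle_estimate(counts):
    """Relative-frequency (maximum likelihood) estimate of category probabilities."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet(alpha) prior.

    This is an assumed, generic Bayesian baseline, not the article's method.
    """
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)


# Simulate a factor with many categories and relatively few observations.
rng = np.random.default_rng(0)
n_categories = 100                       # illustrative choice
true_p = rng.dirichlet(np.ones(n_categories))
sample = rng.choice(n_categories, size=200, p=true_p)
counts = np.bincount(sample, minlength=n_categories)

p_mle = mle_estimate(counts)
p_bayes = dirichlet_posterior_mean(counts, alpha=1.0)

# Categories never observed get probability 0 under the MLE,
# while the Dirichlet posterior mean keeps every category strictly positive.
print("unobserved categories:", int((counts == 0).sum()))
print("smallest MLE probability:", p_mle.min())
print("smallest Bayes probability:", p_bayes.min())
```

With many categories and a limited sample, some categories are typically never observed, so the relative-frequency estimate assigns them zero probability. The smoothed estimate avoids this, which is one reason Bayesian approaches are a natural fit for this setting.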

Original language: English
Article number: e0202547
Journal: PLoS ONE
Issue number: 8
Publication status: Published - 2018 Aug

Bibliographical note

Funding Information:
This research was supported by grants from the National Research Foundation of Korea funded by the Korea government (NRF-2016R1D1A1B03931077 to J. S.), an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00053, A Technology Development of Artificial Intelligence Doctors for Cardiovascular Disease), and the Korea Evaluation Institute of Industrial Technology (10073166). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Publisher Copyright:
© 2018 Lee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

ASJC Scopus subject areas

  • General Biochemistry, Genetics and Molecular Biology
  • General Agricultural and Biological Sciences
  • General

