Variable selection is one of the most crucial tasks in supervised learning, such as regression and classification. Best subset selection is straightforward and optimal, but it is not practically applicable unless the number of predictors is small. In this article, we propose solving the best subset selection problem directly via the genetic algorithm (GA), a popular stochastic optimization algorithm based on the principles of Darwinian evolution. To further improve variable selection performance, we propose running multiple GAs on the best subset problem and synthesizing their results, a procedure we call the ensemble GA (EGA). The EGA significantly improves variable selection performance. Moreover, because the proposed method is essentially best subset selection, it applies to a variety of models with different selection criteria. We compare the proposed EGA to existing variable selection methods under various models, including linear regression, Poisson regression, and Cox regression for survival data.
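The idea of running several GAs over binary subset indicators and combining their selections can be sketched as follows. This is a minimal illustration assuming linear regression with BIC as the selection criterion; the function names (`ga_subset`, `ega_subset`), the majority-vote synthesis rule, and all GA settings (population size, generations, mutation rate) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def bic(X, y, mask):
    """BIC of an OLS fit using only the columns selected by the boolean mask."""
    n = len(y)
    if mask.sum() == 0:
        resid = y - y.mean()
    else:
        Xs = X[:, mask]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
    rss = float(resid @ resid)
    return n * np.log(rss / n) + mask.sum() * np.log(n)

def ga_subset(X, y, pop=40, gens=60, pmut=0.1, rng=None):
    """One GA run: each chromosome is a boolean vector encoding a subset."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    popu = rng.random((pop, p)) < 0.5          # random initial population
    for _ in range(gens):
        fit = np.array([bic(X, y, m) for m in popu])
        order = np.argsort(fit)                # lower BIC = fitter
        parents = popu[order[: pop // 2]]      # truncation selection
        kids = []
        while len(kids) < pop - len(parents):
            i, j = rng.integers(len(parents), size=2)
            cut = int(rng.integers(1, p))      # one-point crossover
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])
            child ^= rng.random(p) < pmut      # bit-flip mutation
            kids.append(child)
        popu = np.vstack([parents] + kids)
    fit = np.array([bic(X, y, m) for m in popu])
    return popu[np.argmin(fit)]

def ega_subset(X, y, runs=10, rng=0):
    """Ensemble GA sketch: run the GA several times and keep the
    variables selected by a majority of runs (an assumed synthesis rule)."""
    rng = np.random.default_rng(rng)
    votes = np.mean([ga_subset(X, y, rng=rng) for _ in range(runs)], axis=0)
    return votes >= 0.5
```

Because the fitness function is just a model-selection criterion evaluated on a candidate subset, swapping `bic` for, say, a partial-likelihood criterion would extend the same sketch to Poisson or Cox regression.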
Journal: Communications for Statistical Applications and Methods
Published: 2022
Bibliographical note
Funding Information:
This work is funded by the National Research Foundation of Korea (NRF) grants (2018R1D1A1B07043034, 2019R1A4A1028134) and Korea University (K2000461).
© 2022 The Korean Statistical Society, and Korean International Statistical Society. All rights reserved.
Keywords
- Cox regression
- Ensemble learning
- Generalized linear model
- Genetic algorithm
ASJC Scopus subject areas
- Statistics and Probability
- Modelling and Simulation
- Statistics, Probability and Uncertainty
- Applied Mathematics