Logistic regression and the problem of separation
DOI:
https://doi.org/10.14720/aas.2024.120.3.19353
Keywords:
logistic regression, maximum likelihood, small datasets, sparse datasets, separation, Firth's penalized likelihood
Abstract
Logistic regression is used to study the relationship between a binary outcome variable (an event either occurs or does not) and a set of covariates. An individualized prognosis can be obtained by estimating the probability of the event given the covariates. Moreover, the regression coefficients, usually estimated by the method of maximum likelihood, can be interpreted as log odds ratios. When a dataset is small or sparse, the likelihood maximization algorithm may fail to converge, leading to implausible parameter estimates. In statistics, this situation is known as 'separation'. In practice, separation may go unnoticed because software often fails to flag the problem. The results of such analyses can be puzzling and may be misinterpreted. Therefore, in this manuscript, we aim to: motivate the use of logistic regression to study the relationship between a binary outcome and a set of covariates; demonstrate the problem of separation with a real-data example; and show how to overcome separation.
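To make the convergence failure concrete, the following is a minimal sketch rather than the analysis from the manuscript. It assumes a made-up, perfectly separated toy dataset and a one-parameter model without intercept, logit P(y = 1) = beta * x. The function names `sigmoid` and `fit_slope` are hypothetical. The penalized fit uses the modified score of Firth's penalized likelihood as given by Heinze and Schemper (2002), solved here with plain Fisher scoring, which is not necessarily how any particular package implements it.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_slope(x, y, firth=False, max_iter=25, tol=1e-8):
    """Fit logit P(y=1) = beta * x by Fisher scoring (hypothetical helper).

    Returns (beta, iterations_used, converged).
    """
    beta = 0.0
    for iteration in range(1, max_iter + 1):
        p = [sigmoid(beta * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information for the single coefficient.
        info = sum(xi * xi * wi for xi, wi in zip(x, w))
        if firth:
            # Diagonal of the hat matrix; sums to 1 with one parameter.
            h = [xi * xi * wi / info for xi, wi in zip(x, w)]
            # Firth-modified score: sum (y - p + h * (1/2 - p)) * x.
            score = sum(
                (yi - pi + hi * (0.5 - pi)) * xi
                for xi, yi, pi, hi in zip(x, y, p, h)
            )
        else:
            # Ordinary maximum likelihood score: sum (y - p) * x.
            score = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        step = score / info
        beta += step
        if abs(step) < tol:
            return beta, iteration, True
    return beta, max_iter, False


# Toy data (an assumption for illustration): every x < 0 has y = 0 and
# every x > 0 has y = 1, so the covariate separates the outcomes perfectly.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

beta_ml, iters_ml, converged_ml = fit_slope(x, y, firth=False)
beta_firth, iters_firth, converged_firth = fit_slope(x, y, firth=True)

print("ML:   ", beta_ml, "converged:", converged_ml)
print("Firth:", beta_firth, "converged:", converged_firth)
```

On these data the unpenalized estimate keeps growing with every iteration and the stopping rule is never met, whereas the penalized fit settles at a finite value. Raising `max_iter` for the unpenalized fit only makes the reported coefficient larger, which is one way separation can pass unnoticed when software simply stops at its iteration limit.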
License
Copyright (c) 2024 Hana Šinkovec, Damijana Kastelec, Luka Bitežnik
This work is licensed under a Creative Commons Attribution 4.0 International License.