Logistic regression and the problem of separation

DOI:

https://doi.org/10.14720/aas.2024.120.3.19353

Keywords:

logistic regression, maximum likelihood, small datasets, sparse datasets, separation, Firth's penalized likelihood

Abstract

Logistic regression is used to study the relationship between a binary outcome variable (an event either occurs or does not) and a set of covariates. An individualized prognosis can be obtained by estimating the probability of the event given the covariates. Moreover, the regression coefficients, usually estimated by maximum likelihood, can be interpreted as log odds ratios. When a dataset is small or sparse, a covariate or a combination of covariates may perfectly predict the outcome. The maximum likelihood estimates then do not exist as finite values, the likelihood maximization algorithm may fail to converge, and the reported parameter estimates are implausible. In statistics, this situation is known as 'separation'. In practice, separation may go unnoticed because software often fails to flag the problem. The results of such analyses can be puzzling and may be misinterpreted. In this manuscript, we therefore aim to: motivate the use of logistic regression to study the relationship between a binary outcome and a set of covariates; demonstrate the problem of separation with a real-data example; and show how to overcome separation.
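To make the idea of separation concrete, the following is a minimal sketch in Python, not the article's own analysis. It assumes a hypothetical four-point dataset and a one-covariate logistic model without intercept. All names (X, Y, log_likelihood, firth_log_likelihood) are illustrative. The penalty used is the one-parameter form of Firth's penalized likelihood named in the keywords, that is, half the log of the Fisher information added to the log-likelihood.

```python
import math

# Hypothetical toy data (not from the article): x perfectly separates y,
# because every x < 0 has y = 0 and every x > 0 has y = 1.
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 0, 1, 1]


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_likelihood(beta):
    """Log-likelihood of a one-covariate logistic model without intercept."""
    return sum(y * beta * x - log1pexp(beta * x) for x, y in zip(X, Y))


def firth_log_likelihood(beta):
    """Log-likelihood plus Firth's penalty, 0.5 * log(Fisher information)."""
    info = 0.0
    for x in X:
        p = 1.0 / (1.0 + math.exp(-beta * x))
        info += x * x * p * (1.0 - p)
    return log_likelihood(beta) + 0.5 * math.log(info)


# Evaluate both criteria on a grid of beta values from 0 to 10.
grid = [0.1 * k for k in range(0, 101)]
ml_values = [log_likelihood(b) for b in grid]
firth_values = [firth_log_likelihood(b) for b in grid]

# Under separation the ordinary log-likelihood keeps rising with beta,
# so no finite maximum likelihood estimate exists.
ml_is_increasing = all(a < b for a, b in zip(ml_values, ml_values[1:]))

# The penalized log-likelihood peaks at a finite beta inside the grid.
firth_argmax = grid[firth_values.index(max(firth_values))]

print("ML log-likelihood strictly increasing on grid:", ml_is_increasing)
print("Penalized log-likelihood maximized near beta =", round(firth_argmax, 1))
```

A grid search is used here only to keep the sketch short and transparent. A real analysis would rely on an established implementation of Firth's method rather than this illustration.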

Published

4 October 2024

Section

Original Scientific Article

How to Cite

Šinkovec, H., Kastelec, D., & Bitežnik, L. (2024). Logistic regression and the problem of separation. Acta Agriculturae Slovenica, 120(3), 1–10. https://doi.org/10.14720/aas.2024.120.3.19353
