Model choice, i.e., the choice of covariates to be included in a model, is a long-standing problem in statistics. Numerous solutions have been proposed. Among the oldest and most widely known are Akaike's information criterion (AIC; Akaike, 1973) and the Bayesian information criterion (BIC; Schwarz, 1978). Both are target functions that are minimized to arrive at a favorable model. AIC aims to choose the model with the best predictive power, whereas BIC aims to choose the true model if it is included in the tested subset. BIC is more conservative than AIC, as it penalizes each additional parameter more strictly, and thus BIC favors smaller models. However, as the derivations of AIC and BIC are based on asymptotic arguments (Akaike, 1973; Schwarz, 1978), it makes sense to compare their performance in finite data sets.
Regularized regression models offer another approach to the model choice problem. These are regression models in which the model fit is penalized by some norm of the parameter vector. Methods based on the L1 norm, most notably the Lasso (Tibshirani, 1996) and the Adaptive Lasso (Zou, 2006), produce a model in which a parsimonious subset of the available covariates is used, and thus they perform model choice. In contrast, regularized regressions based on the L2 norm (Hoerl & Kennard, 1970) assign a small contribution to all variables and thus do not perform model choice.
Logistic regression models are models for binary or binomial outcomes. They are a type of generalized linear model (McCullagh & Nelder, 1989) and perhaps the most widely used of them all. They are frequently used in areas as diverse as genetics (Ayers & Cordell, 2010), clinical medicine (Zhang et al., 2017), econometrics (Karhunen, 2019), consumer research (Bejaei et al., 2015) and environmental science (Saha & Pal, 2019), to name a few. Thus, it is important to study the model choice problem in the context of logistic regression. This paper uses simulated data to compare the performance of nine model choice algorithms. The results show that there is no one-size-fits-all solution.
The rest of this paper is organized as follows. First, the methods to be compared are introduced, then the data generation, and finally the measures of accuracy (specificity and two different kinds of sensitivity). In the Results section, the accuracy of the different model choice strategies is discussed first, followed by the optimal model choice. A discussion concludes the paper.
The logistic regression is here defined, without loss of generality, for a binary response variable. Formulations where each observational unit is a binomial trial are sometimes seen (e.g., Qian & Field, 2000), but they can be trivially transformed into the present definition. The logistic regression can be defined through the log-likelihood function

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i^\top \beta - \log\!\left(1 + \exp(\mathbf{x}_i^\top \beta)\right) \right],$$

where $y_i \in \{0, 1\}$ is the response of observation $i$, $\mathbf{x}_i$ is its covariate vector (including a constant for the intercept) and $\beta$ is the parameter vector.
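As a concrete illustration, the log-likelihood above can be evaluated directly. The following is a minimal sketch in plain NumPy; the function name and the example data are only illustrative, not part of the study.

```python
import numpy as np

def logistic_loglik(beta, X, y):
    """Log-likelihood of logistic regression: sum_i [y_i * eta_i - log(1 + exp(eta_i))]."""
    eta = X @ beta                          # linear predictor eta_i = x_i' beta
    # np.logaddexp(0, eta) evaluates log(1 + exp(eta)) without overflow
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# tiny illustrative check: n = 4 observations, intercept plus one covariate
X = np.array([[1.0, -0.5], [1.0, 0.3], [1.0, 1.2], [1.0, -1.1]])
y = np.array([0.0, 1.0, 1.0, 0.0])
print(logistic_loglik(np.array([0.0, 0.45]), X, y))
```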
The following methods are tested for their potential to detect the true model in simulated data:
1. Wald's tests. This is a one-step procedure where all variables that are univariately significant with P values < 0.05 are included in the final model specification. Statistical significance is calculated using the Wald test (e.g., Buse, 1982).
2. Likelihood-ratio test. Let $\ell_1$ be the maximized log-likelihood of a candidate model and $\ell_0$ that of the null model; all candidate models are scored and ranked by the statistic $2(\ell_1 - \ell_0)$ and its associated P value.
3. Akaike's AIC (1973). The model with the minimal $\mathrm{AIC} = -2\ell + 2k$ is chosen, where $\ell$ is the maximized log-likelihood and $k$ is the number of estimated parameters.
4. Schwarz's BIC (1978). The model with the minimal $\mathrm{BIC} = -2\ell + k \log n$ is chosen, where $n$ is the sample size.
5. AICc of Hurvich and Tsai (1989). The idea of this method is to correct for the fact that AIC has been derived assuming infinite data sets. Thus, the criterion to be minimized is $\mathrm{AICc} = \mathrm{AIC} + 2k(k+1)/(n-k-1)$. AICc, like BICc below (McQuarrie, 1999), has been derived for the Gaussian regression model. There is a corrected AIC for generalized linear models (Imori et al., 2014), but it was found too slow for the Monte Carlo simulations of this study.
6. BICc of McQuarrie (1999). The idea is to correct for the finite-sample bias of BIC. The model with the minimal BICc is chosen.
7. MDL of Zhou et al. (2005). This criterion was derived from Rissanen's minimum description length principle (1978) by Qian and Field (2002) following Qian and Künsch (1998). The model with the minimal MDL is chosen.
8. Tibshirani's Lasso (1996). A model with all covariates is fitted, but instead of maximizing the log-likelihood, one maximizes the penalized log-likelihood $\ell(\beta) - \lambda \sum_j |\beta_j|$, where $\lambda \ge 0$ is a tuning parameter.
9. Zou's Adaptive Lasso (2006). Otherwise the same as the Lasso, but the target function to be maximized is $\ell(\beta) - \lambda \sum_j w_j |\beta_j|$, where the weights $w_j$ are obtained from an initial unpenalized fit, e.g., $w_j = 1/|\hat\beta_j|^\gamma$; a sketch of both penalized fits is given after this list.
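For methods 8 and 9, a minimal sketch is given below, assuming scikit-learn's L1-penalized logistic regression. The penalty weight C and the exponent gamma are illustrative choices, and the Adaptive Lasso is emulated by the standard covariate-rescaling trick rather than by a dedicated solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_logistic(X, y, C=1.0):
    """Method 8: L1-penalized logistic regression; C is the inverse penalty weight."""
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    return fit.coef_.ravel()

def adaptive_lasso_logistic(X, y, C=1.0, gamma=1.0):
    """Method 9: Adaptive Lasso via weights w_j = 1/|beta_hat_j|^gamma from an initial fit."""
    # near-unpenalized initial fit (a very large C makes the L2 penalty negligible)
    init = LogisticRegression(penalty="l2", C=1e6, max_iter=1000).fit(X, y)
    scale = np.abs(init.coef_.ravel()) ** gamma   # rescale covariates by |beta_hat|^gamma
    beta_tilde = lasso_logistic(X * scale, y, C=C)
    return beta_tilde * scale                     # map coefficients back to original scale
```

The rescaling works because a plain Lasso fit on the scaled covariates penalizes $\sum_j |\beta_j| / \mathrm{scale}_j$, which equals the weighted Adaptive Lasso penalty.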
In this study, methods 2-7 are applied so that all candidate models are scored with each respective criterion. This is contrary to the practice of stepwise model selection but is appropriate for a small number of covariates.
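A minimal sketch of this exhaustive scoring, assuming statsmodels for the per-model fits; only AIC, BIC and AICc are computed here, following the formulas above, and all names are illustrative.

```python
import itertools
import numpy as np
import statsmodels.api as sm

def score_all_subsets(X, y):
    """Fit every covariate subset and score it with AIC, BIC and AICc."""
    n, p = X.shape
    scores = []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            # design matrix: intercept column plus the chosen covariates
            design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            fit = sm.Logit(y, design).fit(disp=0)
            kpar = design.shape[1]                          # parameters incl. intercept
            aic = -2 * fit.llf + 2 * kpar
            bic = -2 * fit.llf + kpar * np.log(n)
            aicc = aic + 2 * kpar * (kpar + 1) / (n - kpar - 1)
            scores.append({"subset": subset, "AIC": aic, "BIC": bic, "AICc": aicc})
    return scores

# the model minimizing a criterion is then chosen, e.g. by BIC:
# best = min(score_all_subsets(X, y), key=lambda s: s["BIC"])
```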
Test data were used to compare the performance of the different model choice algorithms. The data were generated by using Algorithm 1. Variations of Algorithm 1 are used in the Appendix to generate three additional scenarios: one with two true effects for each data set of type D=1, one with a larger effect size, and one with a misspecified model structure.
Algorithm 1:
Repeat nmc times:
    For sample size n=100,350,1000:
        For trial type D=0,1:
            Generate four covariates with sample size n, so that x1~N(0,1), x2~N(0,1), x3~Binary(0.5) and x4 is distributed as a three-level factor with each factor level occurring with probability 1/3.
            If D=0:
                Generate an independent response variable y distributed as y~Binary(0.5).
            If D=1:
                Randomize one of x1, x2, x3, x4 as the true covariate x.
                If x=x4:
                    Let z be the indicator for the third level, i.e., let z=1 if x=“3” and z=0 otherwise.
                    Calculate the linear predictor as p=0.45z.
                Else:
                    Calculate the linear predictor as p=0.45x.
                Calculate the expected value of y as μ=1/(1+exp(−p)).
                Generate the response variable y distributed as y~Binary(μ).
In Algorithm 1, nmc denotes the number of Monte Carlo replications and Binary(p) denotes a Bernoulli distribution with success probability p.
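A runnable sketch of Algorithm 1's data generator is given below (NumPy). The dummy coding of the factor x4 into two indicator columns and the returned index of the true covariate are implementation assumptions, not specified in the algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_data(n, d):
    """One data set of Algorithm 1: four covariates; the response depends on d (trial type)."""
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = rng.binomial(1, 0.5, size=n).astype(float)
    x4 = rng.integers(1, 4, size=n)                   # three-level factor, levels 1..3
    # dummy-code the factor: indicators for levels 2 and 3 (level 1 is the reference)
    X = np.column_stack([x1, x2, x3, (x4 == 2).astype(float), (x4 == 3).astype(float)])
    if d == 0:
        y = rng.binomial(1, 0.5, size=n)              # response independent of covariates
        true_idx = None
    else:
        true_idx = rng.integers(0, 4)                 # randomize the true covariate (0..3)
        if true_idx == 3:                             # x4: effect only through level 3
            eta = 0.45 * (x4 == 3).astype(float)
        else:
            eta = 0.45 * X[:, true_idx]
        mu = 1.0 / (1.0 + np.exp(-eta))               # expected value of y
        y = rng.binomial(1, mu)
    return X, y, true_idx
```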
For each model choice algorithm, three measures of accuracy were calculated: specificity, sensitivity-1 and sensitivity-2. Specificity was defined as the proportion of D=0 data sets in which the empty model (no covariates) was chosen.
Sensitivity-1 was defined as the proportion of D=1 data sets in which the chosen model contained the true covariate and nothing else. Sensitivity-2 was defined as the proportion of D=1 data sets in which the chosen model contained the true covariate, possibly together with noise covariates.
Thus, Sens2 ≥ Sens1 always, as the criterion involved in calculating sensitivity-2 is easier to satisfy. Sensitivity-1 measures the methods' ability to detect the true nonempty model, whereas sensitivity-2 measures their ability to detect a true covariate at the cost of including noise covariates in the model. Additionally, a loss function was defined as a weighted combination of the two error rates, $\mathrm{Loss}(c) = c\,(1 - \mathrm{Sens2}) + (1 - c)\,(1 - \mathrm{Spec})$, where the weight $c \in [0, 1]$ expresses the relative cost of missing a true covariate.
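Under these definitions, the three measures reduce to simple proportions over the Monte Carlo replications. A sketch follows, assuming each replication records the set of chosen covariates and, for D=1 data, the index of the true covariate; the data structures are illustrative.

```python
import numpy as np

def accuracy_measures(chosen_d0, chosen_d1):
    """chosen_d0: chosen covariate sets for the D=0 data sets.
    chosen_d1: (chosen covariate set, true covariate index) pairs for the D=1 data sets."""
    spec = np.mean([len(chosen) == 0 for chosen in chosen_d0])
    sens1 = np.mean([chosen == {true} for chosen, true in chosen_d1])  # exactly the true model
    sens2 = np.mean([true in chosen for chosen, true in chosen_d1])    # true covariate included
    return spec, sens1, sens2

def loss(spec, sens2, c):
    """Weighted loss: c penalizes missed true covariates, 1 - c penalizes nonempty D=0 models."""
    return c * (1.0 - sens2) + (1.0 - c) * (1.0 - spec)
```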
The main model comparison was run by using Algorithm 1. The results are presented in Table 1.
Table 1: Specificity (Spec), sensitivity-1 (Sens1) and sensitivity-2 (Sens2) of the nine model choice methods; Monte Carlo standard errors in brackets.

| | | Wald (1) | LR test (2) | AIC (3) | BIC (4) | AICc (5) | BICc (6) | MDL (7) | Lasso (8) | A. Lasso (9) |
|---|---|---|---|---|---|---|---|---|---|---|
| n=100 | Spec | .79 [.004] | .78 [.004] | .51 [.005] | .89 [.003] | .54 [.005] | .91 [.003] | .00 [.000] | .66 [.005] | .53 [.005] |
| | Sens1 | .31 [.005] | .27 [.004] | .33 [.005] | .28 [.004] | .34 [.005] | .27 [.004] | .01 [.001] | .17 [.004] | .20 [.004] |
| | Sens2 | .38 [.005] | .42 [.005] | .56 [.005] | .31 [.005] | .54 [.005] | .29 [.005] | .86 [.003] | .53 [.005] | .58 [.005] |
| n=350 | Spec | .78 [.004] | .80 [.004] | .51 [.005] | .95 [.002] | .52 [.005] | .95 [.002] | .00 [.000] | .67 [.005] | .54 [.005] |
| | Sens1 | .61 [.005] | .56 [.005] | .51 [.005] | .57 [.005] | .51 [.005] | .57 [.005] | .00 [.000] | .30 [.005] | .36 [.005] |
| | Sens2 | .73 [.004] | .76 [.004] | .84 [.004] | .59 [.005] | .83 [.004] | .59 [.005] | .96 [.002] | .82 [.004] | .87 [.003] |
| n=1000 | Spec | .78 [.004] | .79 [.004] | .52 [.005] | .97 [.002] | .52 [.005] | .97 [.002] | .00 [.000] | .67 [.005] | .54 [.005] |
| | Sens1 | .79 [.004] | .77 [.004] | .60 [.005] | .79 [.004] | .60 [.005] | .79 [.004] | .00 [.000] | .37 [.005] | .49 [.005] |
| | Sens2 | .95 [.002] | .96 [.002] | .98 [.001] | .80 [.004] | .98 [.001] | .80 [.004] | .99 [.001] | .97 [.002] | .99 [.001] |
There is always a trade-off between sensitivity and specificity. In this study, sensitivity-2 theoretically reaches 1.00 when specificity falls to 0.00, and vice versa; these extreme values can be produced by always choosing an empty model or by always including all covariates in the model. How to tackle this trade-off depends on the preferences of each research team. Figure 2 illustrates the dilemma: it shows the efficient frontier of the loss function defined in Methods, i.e., for each value of the weight c, the method attaining the smallest loss.
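The frontier can be traced numerically by evaluating the loss over a grid of weights. The sketch below uses the (Spec, Sens2) pairs of three methods from Table 1 at n = 350; the selection of methods and the grid are illustrative, and the loss follows the weighted form defined in Methods.

```python
import numpy as np

# (specificity, sensitivity-2) at n = 350, taken from Table 1
methods = {"BIC": (0.95, 0.59), "AIC": (0.51, 0.84), "Lasso": (0.67, 0.82)}

for c in np.linspace(0.0, 1.0, 11):
    losses = {m: c * (1 - s2) + (1 - c) * (1 - sp) for m, (sp, s2) in methods.items()}
    best = min(losses, key=losses.get)
    print(f"c = {c:.1f}: best = {best}, loss = {losses[best]:.3f}")
```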
The Appendix of this paper presents three alternative scenarios: one with a larger effect size, one with two effects and one with a misspecified model structure. As was expected, the stronger the effect, the easier it is to detect; thus, in the strong-effect scenario, all methods have a high sensitivity-2. In the two-effects scenario, the two effects are confounded by each other, and the sensitivities of all methods fall severely. The same applies, to a lesser degree, in the misspecified-model scenario. However, the big picture as measured by the choice of optimal model (Figure 2) changes surprisingly little. In all these scenarios, a conservative researcher should choose BIC or BICc, switching to a more liberal method only when the weight placed on sensitivity grows large.
Finally, the likelihood-ratio test and MDL were removed from the comparison because of their anomalous behavior in some of the scenarios, and the analysis of Figure 2 was re-run (see Appendix). In this analysis, the take-home message is to use BIC or BICc for a low weight on sensitivity and a more liberal method, such as Wald's tests or the Lasso, for a high one.
This paper has presented a systematic comparison of nine model choice algorithms for use with logistic regression models. Of the methods presented, four are information criteria (AIC, BIC, AICc and BICc), two are regularized regressions (Lasso and Adaptive Lasso) and two are based on commonly used statistical tests (Wald's tests and the likelihood-ratio test). The ninth, MDL, is derived from different principles but is used similarly to the information criteria, i.e., it is a target function to be minimized.
Wald's tests are not, strictly speaking, a model choice method but a heuristic to construct a single model. The likelihood-ratio test is typically used to test nested models against each other; in this paper, it is used to score and rank all model variants, which may lead to some strange behaviors, as noted in the Appendix. All methods presented here have their respective strengths and weaknesses. Whereas a conservative researcher might prefer BIC or BICc, a more liberally minded one might prefer the likelihood-ratio test, Wald's tests or the Lasso. It should be noted that these results concern small model choice problems, i.e., ones with few covariates. When the number of covariates increases, it soon becomes impractical to score all models, which limits the accuracy of the information criteria. On the other hand, in massively multivariate data sets, missing values often limit the accuracy of the Lasso and the Adaptive Lasso. Furthermore, the optimal methods are here calculated on the basis of sensitivity-2; if sensitivity-1 is used instead, a different pattern may arise.
There are many developments of Akaike's original ideas, among them the idea of correcting AIC for small-sample bias (Cavanaugh, 1997; Imori et al., 2014; Sugiura, 1978). Implementing the method of Imori et al. (2014) is left for future work, as is the extension of this study to other model types, e.g., Gaussian regressions or time-series models.
Finally, it should be noted that the inclusion of a covariate in a statistical model does not mean that it is significant or that the effect is true. On the contrary, significance should be assessed based on statistical tests and, equally importantly, the effect size. Thus, the multivariate models produced by model choice algorithms should be regarded only as starting points of long discussions.