1 Introduction
Separation commonly occurs in political science, usually when a binary explanatory variable perfectly predicts a binary outcome (e.g., Gustafson 2020; Mehltretter 2022; Owsiak and Vasquez 2021).Footnote 1 For example, Barrilleaux and Rainey (2014) find that being a Democrat perfectly predicts a governor supporting Medicaid expansion under the Affordable Care Act. Under separation, the usual maximum likelihood estimate is unreasonably large and the Wald test is highly misleading.
As a solution, some methodologists propose using a Bayesian prior distribution to regularize the estimates, which we can equivalently view as penalized maximum likelihood estimation. Zorn (2005; see also Heinze and Schemper 2002) points political scientists toward Firth's (1993) penalized maximum likelihood estimator, which is equivalent to Jeffreys prior distribution. Gelman et al. (2008), on the other hand, recommend a Cauchy prior distribution. Both methods ensure finite estimates in theory and usually produce reasonably sized estimates in practice. Methodologists continue to recommend these penalized or Bayesian estimators as a solution to separation (e.g., Anderson, Bagozzi, and Koren 2021; Cook, Hays, and Franzese 2020; Cook, Niehaus, and Zuhlke 2018; Crisman-Cox, Gasparyan, and Signorino 2023).
But Rainey (2016) points out that the estimates (and especially the confidence intervals) depend heavily on the chosen prior. Different priors that all produce finite estimates can nonetheless lead to meaningfully different conclusions. He argues that the set of a priori “reasonable” and “implausible” parameters depends on the substantive application, so context-free defaults (like Jeffreys and Cauchy priors) might not produce reasonable results. Starkly emphasizing this point, Beiser-McGrath (2022) shows that Jeffreys prior can lead to (statistically significant) estimates in the opposite direction of the separation. Rainey (2016) concludes that “[w]hen facing separation, researchers must carefully choose a prior distribution to nearly rule out implausibly large effects” (354). Given the sensitivity of the result to the chosen prior distribution, how can researchers make their analysis more compelling? In particular, can they obtain useful p-values to test hypotheses about model coefficients in the usual frequentist framework without injecting prior information into their model?
I show that while the popular Wald test produces misleading (even nonsensical) p-values under separation, likelihood ratio tests and score tests behave in the usual manner. Thus, researchers can produce meaningful p-values with standard frequentist tools under separation without the use of a penalty or prior. A complete analysis of a data set with separation will usually include penalized or Bayesian estimates to obtain reasonable estimates of quantities of interest, but a hypothesis test without a penalty or prior can more convincingly establish that the most basic claim holds: the separating variable has a positive (or negative) effect.
2 Hypothesis Tests under Separation
Maximum likelihood provides a general and powerful framework for obtaining estimates of model parameters and testing hypotheses. In our case of logistic regression, we write the probability $\pi_i$ that an event occurs for observation $i$ of $n$ (or that the outcome variable $y_i = 1$) as $\pi_i = \text{logit}^{-1}(X_i\beta)$ for $i = 1, 2, \ldots, n$, where $X$ represents a matrix of explanatory variables and $\beta$ represents a vector of coefficients. Then we have the likelihood function $L(\beta \mid y) = \prod_{i=1}^{n} \pi_{i}^{y_i}(1 - \pi_{i})^{(1 - y_i)}$ and the log-likelihood function $\ell(\beta \mid y) = \log L(\beta \mid y) = \sum_{i=1}^{n} [y_i \log(\pi_{i}) + (1 - y_i) \log(1 - \pi_{i})]$. Researchers typically use numerical algorithms to locate the maximum likelihood estimate $\hat{\beta}^{ML}$ that maximizes $\ell$ and then use certain features of $\ell$ to test hypotheses. To fix ideas, I focus on the point null hypothesis $H_0: \beta_s = 0$. However, the intuition and conclusions generalize to more complex hypotheses.
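To make these quantities concrete, the following minimal R sketch (with simulated data and illustrative names of my own choosing) computes the log-likelihood by hand and checks it against the logLik() value that glm() reports:

    # simulate a small logistic regression data set (illustrative values)
    set.seed(1234)
    n <- 100
    x <- rbinom(n, 1, 0.5)
    y <- rbinom(n, 1, plogis(-1 + 1.5 * x))  # plogis() is logit^{-1}

    # the log-likelihood of a logistic regression, written from the formula above
    log_lik <- function(beta, X, y) {
      p <- plogis(X %*% beta)
      sum(y * log(p) + (1 - y) * log(1 - p))
    }

    fit <- glm(y ~ x, family = binomial)
    X <- model.matrix(fit)
    log_lik(coef(fit), X, y)  # matches logLik(fit) at the ML estimate
    logLik(fit)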
The literature offers three common methods to assess the null hypothesis—the “holy trinity” of hypothesis tests: the Wald test, the likelihood ratio test, and the score test (also known as the Lagrange multiplier test). For practical reasons, most regression tables in political science report Wald p-values. However, the Wald test is uniquely ill-suited for testing hypotheses under separation. Because the usual Wald test fails, some researchers turn immediately to penalized estimators (e.g., Bell and Miller 2015) or Bayesian inference (e.g., Barrilleaux and Rainey 2014). However, the usual likelihood ratio and score tests work as expected under separation. Thus, researchers can use the likelihood ratio or score test to evaluate the core hypothesis that the separating variable has a positive (or negative) effect before turning to penalized or Bayesian methods to estimate quantities of interest. Below, I briefly describe each test, explain why the Wald test works poorly under separation, and describe why the likelihood ratio and score tests perform better.
2.1 Wald Test
The Wald test uses the shape of the log-likelihood function around the maximum to estimate the precision of the point estimate. If small changes in the parameter near the maximum lead to large changes in the log-likelihood function, then we can treat the maximum likelihood estimate as precise. We usually estimate the standard error $\widehat{\text{SE}}(\hat{\beta}_i^{ML})$ using the curvature of the log-likelihood function at the maximum, so that $\widehat{\text{SE}}(\hat{\beta}_i^{ML}) = \left[ \left( -\dfrac{\partial^2 \ell(\beta \mid y)}{\partial \beta_i^2} \right)^{-\frac{1}{2}} \right]_{\beta = \hat{\beta}^{ML}}$.
Wald (1943) advises us how to compare the estimate with the standard error: under the null hypothesis, the statistic $Z_w = \dfrac{\hat{\beta}_i^{ML}}{\widehat{\text{SE}}(\hat{\beta}_i^{ML})}$ approximately follows a standard normal distribution (Casella and Berger 2002, 492–493; Greene 2012, 527–529).
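Continuing the sketch above, one can reproduce the Wald test that summary() reports directly from the coefficient and standard error estimates; this is a sketch of the arithmetic, not a replacement for the built-in summary:

    # Wald test for the coefficient on x (reproduces the summary(fit) entries)
    beta_hat <- coef(fit)["x"]
    se_hat   <- sqrt(diag(vcov(fit)))["x"]  # from the curvature at the maximum
    z_w      <- beta_hat / se_hat
    2 * pnorm(-abs(z_w))                    # two-sided Wald p-value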
The Wald approach works poorly when dealing with separation. Under separation, the log-likelihood function is nearly flat at the numerical maximum. The flatness produces very large standard error estimates—much larger than the coefficient estimates. Figure 1 shows this intuition for a typical, non-monotonic log-likelihood function (i.e., without separation) and a monotonic log-likelihood function (i.e., with separation). In the absence of separation, the curvature of the log-likelihood function around the maximum speaks to the evidence against the null hypothesis. But under separation, the monotonic log-likelihood function is nearly flat at the numerical maximum, regardless of the relative likelihood of the data under the null hypothesis.
We can develop this intuition more precisely and formally. Suppose that a binary explanatory variable $s$ with coefficient $\beta_s$ perfectly predicts the outcome $y_i$, such that $y_i = 1$ whenever $s_i = 1$. Then the log-likelihood function increases in $\beta_s$. The standard error estimate associated with each value of $\beta_s$ increases as well. Critically, though, the estimated standard error increases faster than the associated coefficient, because $\lim_{\beta_s \to \infty} \left[ \left( -\dfrac{\partial^2 \ell(\beta_s \mid y)}{\partial \beta_s^2} \right)^{-\frac{1}{2}} - \beta_s \right] = \infty$. Thus, under separation, the estimated standard error will be much larger than the coefficient for the separating variable. This implies two conclusions. First, so long as the researcher uses a sufficiently precise algorithm, the Wald test will never reject the null hypothesis under separation, regardless of the data set. Second, if the Wald test can never reject the null hypothesis for any data set with separation, then the power of the test is strictly bounded by the chance of separation. In particular, the power of the test cannot exceed $1 - \Pr(\text{separation})$. If a data-generating process produces separation in nearly 100% of repeated samples, then the Wald test will have power near 0%.
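To see where this limit comes from, consider a simplified sketch of the one-parameter case (ignoring control variables and treating the intercept as fixed; the general argument is analogous). Only the $n_1$ observations with $s_i = 1$ contribute to the observed information in the $\beta_s$ direction, so $-\dfrac{\partial^2 \ell(\beta_s \mid y)}{\partial \beta_s^2} = \sum_{i:\, s_i = 1} \pi_i(1 - \pi_i) \approx n_1 e^{-(\beta_{\text{cons}} + \beta_s)}$ as $\beta_s \to \infty$, and the estimated standard error behaves like $e^{(\beta_{\text{cons}} + \beta_s)/2}/\sqrt{n_1}$. The standard error grows exponentially in $\beta_s$ while the coefficient grows only linearly, so their difference diverges.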
As a final illustration, consider an absurd example in which a binary treatment perfectly predicts 500 successes and 500 failures (i.e., $y = x$ always). Of course, this data set is extremely unlikely under the null hypothesis that the coefficient for the treatment indicator equals zero. The exact p-value for the null hypothesis that successes and failures are equally likely under both treatment and control equals $2 \times \left( \frac{1}{2} \right)^{500} \times \left( \frac{1}{2} \right)^{500} = \frac{2}{2^{1000}} \approx \frac{2}{10^{301}}$. (For comparison, there are about $10^{80}$ atoms in the known universe.) Yet the default glm() routine in R calculates a Wald p-value of 0.998 with the default precision (and 1.000 with the maximum precision). When dealing with separation, the Wald test breaks down; researchers cannot use the Wald test to obtain reasonable p-values for the coefficient of a separating variable.Footnote 2
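This absurd example is easy to reproduce. A minimal sketch in R:

    # perfectly separated data: the treatment predicts the outcome exactly
    x <- rep(c(0, 1), each = 500)
    y <- x
    m <- glm(y ~ x, family = binomial)  # R warns that fitted probabilities of 0 or 1 occurred
    summary(m)$coefficients["x", ]      # large coefficient, enormous SE, Wald p-value near 1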
2.2 Likelihood Ratio Test
The likelihood ratio test resolves the problem of the flat log-likelihood by comparing the maximum log-likelihood of two models: an “unrestricted” model $ML$ that imposes no bounds on the estimates and a “restricted” model $ML_0$ that constrains the estimates to the region suggested by the null hypothesis. If the data set is much more likely under the unrestricted estimate than under the restricted estimate, then the researcher can reject the null hypothesis. Wilks (1938) advises us how to compare the unrestricted log-likelihood $\ell(\hat{\beta}^{ML} \mid y)$ to the restricted log-likelihood $\ell(\hat{\beta}^{ML_0} \mid y)$: under the null hypothesis, $D = 2 \times \left[ \ell(\hat{\beta}^{ML} \mid y) - \ell(\hat{\beta}^{ML_0} \mid y) \right]$ approximately follows a $\chi^2$ distribution with degrees of freedom equal to the number of constrained dimensions (Casella and Berger 2002, 488–492; Greene 2012, 526–527).
Figure 1 shows the intuition of the likelihood ratio test. The gap between the unrestricted and restricted maximum summarizes the evidence against the null hypothesis. Importantly, the logic does not break down under separation. Unlike the Wald test, the likelihood ratio test can reject the null hypothesis under separation.Footnote 3
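In R, the likelihood ratio test requires no special software: fit the restricted and unrestricted models and compare their log-likelihoods. A sketch, continuing the separated example above:

    # likelihood ratio test of H0: beta_s = 0 for the separated data above
    m1 <- glm(y ~ x, family = binomial)  # unrestricted
    m0 <- glm(y ~ 1, family = binomial)  # restricted under the null hypothesis
    anova(m0, m1, test = "Chisq")        # built-in likelihood ratio (deviance) test

    # or by hand, from the definition of D
    D <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
    pchisq(D, df = 1, lower.tail = FALSE)

For the separated example, $D$ is about 1,386, and the p-value is astronomically small, in line with the exact calculation above.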
2.3 Score Test
The score test (or Lagrange multiplier test) resolves the problem of the flat log-likelihood by evaluating the gradient of the log-likelihood function at the null hypothesis. If the log-likelihood function is increasing rapidly at the null hypothesis, this casts doubt on the null hypothesis. The score test uses the score function $S(\beta) = \dfrac{\partial \ell(\beta \mid y)}{\partial \beta}$ and the Fisher information $I(\beta) = -E_\beta \left( \dfrac{\partial^2 \ell(\beta \mid y)}{\partial \beta^2} \right)$. When evaluated at the null hypothesis, the score function quantifies the slope of the log-likelihood and the Fisher information quantifies the variance of that slope in repeated samples. If the score at the null hypothesis is large relative to its variability, then the researcher can reject the null hypothesis. Rao (1948) advises us how to compare the score to its standard error: under the null hypothesis, $Z_s = \frac{S(\beta^0_s)}{\sqrt{I(\beta^0_s)}}$ approximately follows a standard normal distribution (Casella and Berger 2002, 494–495; Greene 2012, 529–530).
Figure 1 shows the intuition of the score test. The slope of the log-likelihood function under the null hypothesis summarizes the evidence against the null hypothesis. As with the likelihood ratio test, the logic works even under separation, and the score test can reject the null hypothesis under separation.
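In R, the score test is just as accessible: the anova() method for glm objects implements it via the Rao option (in recent versions of R). A sketch, again using the separated example and the models m0 and m1 from above:

    # score (Rao) test of the same null hypothesis
    anova(m0, m1, test = "Rao")

    # by hand, for this simple two-group case
    p0 <- mean(y)                       # restricted (intercept-only) fitted probability
    U  <- sum(x * (y - p0))             # score for beta_s evaluated at the null
    V  <- p0 * (1 - p0) * (sum(x^2) - sum(x)^2 / length(y))  # efficient information
    pchisq(U^2 / V, df = 1, lower.tail = FALSE)

For a single binary explanatory variable, this hand computation reduces to the Pearson chi-square statistic for the 2 × 2 table.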
Table 1 summarizes the three tests. For further discussion of the connections among the tests, see Buse (1982). Most importantly, the likelihood ratio and score tests rely on features of the log-likelihood function that remain informative even when the log-likelihood is monotonic. The Wald test, on the other hand, cannot provide a reasonable test under separation.
3 Simulations
To evaluate the performance of the three tests under separation, I use a Monte Carlo simulation to compute the power functions for a diverse collection of data-generating processes (DGPs). For each DGP, I compute the probability of rejecting the null hypothesis as the coefficient for the potentially separating explanatory variable varies from $-5$ to $5$. For a properly functioning test, the power function should be about 5% when $\beta_s = 0$ (i.e., the “size” of the test) and grow quickly toward 100% as $\beta_s$ moves away from zero (i.e., the “power” of the test).
Importantly, I cannot focus on data sets with separation because separation is a feature of a particular sample. Instead, I focus on DGPs that sometimes feature separation (e.g., in 15% of repeated samples or in 50% of repeated samples). To develop these DGPs, I imagine the logistic regression model $\Pr(y = 1) = \text{logit}^{-1}(\beta_{\text{cons}} + \beta_s s + \beta_{z_1} z_1 + \cdots + \beta_{z_k} z_k)$ and a researcher testing the null hypothesis that the binary explanatory variable $s$ (that might produce separation) has no effect on a binary outcome variable $y$ (i.e., that $\beta_s = 0$).
I generate a diverse collection of 150 DGPs using the following process. First, I choose the total number of observations randomly from $\{50, 100, 1000\}$. Then I choose the frequency that $s = 1$ ($\sum s$) from $\{5, 10, 25, 50, 100\}$ (subject to the constraint that $\sum s$ must be less than the total number of observations). Next, I draw the value of the constant term ($\beta_{\text{cons}}$) from a continuous uniform distribution from $-5$ to $0$, the number of control variables ($k$) from a discrete uniform distribution from 0 to 6, and the correlation among the explanatory variables ($\rho$) from a continuous uniform distribution from 0 to 0.5.Footnote 4 I simulate many of these DGPs and keep 150 that feature (1) separation in at least 30% of repeated samples for some $\beta_s \in [-5, 5]$ and (2) variation in the outcome variable in at least 99.9% of repeated samples. For each of the 150 DGPs, I use Monte Carlo simulations to compute the power function for each of the three tests discussed above.Footnote 5 For comparison, I also compute the power function for Wald tests using Firth's (1993) penalty and Gelman et al.'s (2008) Cauchy penalty.
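The following is a heavily simplified sketch of the Monte Carlo power calculation for a single DGP (no control variables and illustrative settings of my own choosing; the full design also varies the number of controls and their correlation):

    # hedged sketch: power of the Wald and likelihood ratio tests for one simple DGP
    power_sim <- function(beta_s, n = 100, n_s = 10, beta_cons = -1, reps = 1000) {
      s <- rep(c(1, 0), times = c(n_s, n - n_s))
      reject <- matrix(NA, reps, 2, dimnames = list(NULL, c("wald", "lr")))
      for (r in 1:reps) {
        y <- rbinom(n, 1, plogis(beta_cons + beta_s * s))
        if (var(y) == 0) { reject[r, ] <- c(FALSE, FALSE); next }  # no variation in y
        m1 <- suppressWarnings(glm(y ~ s, family = binomial))
        m0 <- suppressWarnings(glm(y ~ 1, family = binomial))
        p_wald <- summary(m1)$coefficients["s", "Pr(>|z|)"]
        p_lr   <- pchisq(m0$deviance - m1$deviance, df = 1, lower.tail = FALSE)
        reject[r, ] <- c(p_wald < 0.05, p_lr < 0.05)
      }
      colMeans(reject)  # rejection rate for each test
    }

    set.seed(5678)
    power_sim(beta_s = 0)  # size: both should be near 0.05
    power_sim(beta_s = 3)  # power: the Wald test suffers when separation is common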
3.1 A Close Look at a Single DGP
First, I describe the results for a single DGP. For this particular DGP, there are 1,000 total observations, $s = 1$ for only five of the observations and $s = 0$ for the other 995 observations, the constant term $\beta_{\text{cons}}$ equals $-4.1$, there are three control variables, and the latent correlation $\rho$ among the explanatory variables is 0.06. Table 2 shows the power function for each test and the chance of separation as $\beta_s$ varies. Separation is relatively rare when $\beta_s$—the coefficient for the potentially separating variable—is between $-0.5$ and $2.0$. But for $\beta_s$ above $2.0$ or below $-0.5$, separation becomes more common. For $\beta_s$ larger than about $4.0$ or smaller than about $-2.0$, though, a majority of the data sets feature separation.
These power functions clearly demonstrate the poor performance of the Wald test. Even though the data sets with separation should allow the researcher to reject the null hypothesis, at least occasionally, the power of the Wald test is low even for very large effects. This happens because the Wald test cannot reject the null hypothesis under separation. The cases where $\beta _s = 4.0$ and $\beta _s = 5.0$ show this clearly. The Wald test fails to reject when separation exists, but does reject the null hypothesis when separation does not exist (i.e., when the sample effects are smaller).
The likelihood ratio and score tests, on the other hand, perform as expected. For both, the power of the test when $\beta_s = 0$ is about 5%, as designed, and the power approaches 100% relatively quickly as $\beta_s$ moves away from zero. This table also shows the Wald tests using Firth's penalty and the Cauchy penalty. Compared to the likelihood ratio and score tests, the Wald test using the Cauchy penalty is under-powered, especially (but not only) for negative values of $\beta_s$, and the Wald test using Firth's penalty rejects the null hypothesis far too often when the null hypothesis is true. I emphasize that I selected this particular DGP to highlight tendencies in the larger collection, but this particular DGP is not necessarily representative in all respects. See Figure 3 for a more diverse collection.
3.2 A Broad Look at Many DGPs
Using the algorithm I describe above, I create a diverse collection of 150 DGPs. Figure 2 shows the power (i.e., the probability of rejecting the null hypothesis) of each test as the chance of separation varies across the many scenarios. Each point shows the power for a particular scenario (where $\beta_s \neq 0$, though some $\beta_s$ are small). Most starkly, the power of the Wald test is bounded above by $1 - \Pr(\text{separation})$, and many scenarios achieve the boundary. Intuitively, as the chance of separation increases, the power of a properly functioning test should increase as well, because separation is evidence of a large coefficient. But because a large coefficient makes separation more likely, a large coefficient decreases the power of the Wald test. The likelihood ratio test, the score test, and the two Wald tests using penalized estimates do not exhibit this pattern.
Figure 3 shows the power function for each of the 150 DGPs in the diverse collection. Each of the many lines shows the power function for a particular DGP as $\beta_s$ varies. First, the power functions for the Wald test show its consistently poor properties. For most of the Wald power functions, as the true coefficient grows beyond about two or three in magnitude, the test becomes less powerful. This occurs because separation becomes more likely and the test cannot reject the null hypothesis when separation occurs. Second, the likelihood ratio and score tests behave reasonably well. Most importantly, the size of the likelihood ratio and score tests is about 5% when the coefficient equals zero, and the power grows as the coefficient moves away from zero. Third, the Wald tests using the penalized estimates exhibit some troubling patterns. For the Cauchy penalty, the tests seem under-powered relative to the likelihood ratio and score tests. For Firth's penalty, the chances of rejection when $\beta_s = 0$ seem high (e.g., 25% or more) for many DGPs.
Figure 4 summarizes the many power functions in Figure 3 using the median power across all 150 DGPs. The solid, dark line shows the median power, and the two dashed lines show the 25th and 75th percentiles. This figure clearly shows the unusual behavior of the Wald test—the power decreases when the magnitude of the coefficient grows larger than about two or three. Further, it shows that both the likelihood ratio and score tests work well. For both tests, the chance of rejection is about 5% when the coefficient equals zero and grows quickly as the coefficient moves away from zero. This figure also shows that the likelihood ratio tests tend to be slightly more powerful than the score tests. The problematic patterns for the penalized estimators appear here as well. For the Cauchy penalty, the Wald tests can have relatively low power. For Firth’s penalty, the Wald tests can reject the null hypothesis far too often when the null hypothesis is true.
To further see the behavior under separation, I divide the scenarios into three categories: low chance of separation, where the chance of separation is less than 10%; moderate chance of separation, between 10% and 30%; and high chance of separation, greater than 30%. Figure 5 shows the power for the various scenarios.
The bottom panel of Figure 5 is particularly helpful. When the chance of separation is high, the Wald tests rarely reject the null hypothesis. The likelihood ratio and score tests, on the other hand, still function as expected. Again, the likelihood ratio tests tend to exhibit slightly greater power than the score tests. The penalized estimators perform noticeably worse. Using the Cauchy penalty, the Wald test is under-powered compared to both the likelihood ratio and score tests. Using Firth's penalty, the results are even worse: many of these tests reject the null in 50% or more of repeated samples when the null hypothesis that $\beta_s = 0$ is correct.
Finally, Figure 6 plots the size of the test (i.e., the chance of rejecting the null hypothesis that $\beta_s = 0$ when the null hypothesis is true) as the chance of separation varies for each of the 150 DGPs. Ideally, the size should be about 5%. The Wald, likelihood ratio, and score tests all have reasonable size. The size of the Wald tests falls between 2% and 5%, depending on the chance of separation. (The problem with the Wald tests is power, not size.) The size of the likelihood ratio tests falls between about 2.5% and about 10%. Notably, the likelihood ratio tests are over-sized when separation is relatively unlikely. The size of the score tests falls around 5% regardless of the chance of separation. The Wald tests using the penalized estimates perform worse. Using the Cauchy penalty, the Wald test is under-sized, around 2% regardless of the chance of separation. Using Firth's penalty, the Wald test performs surprisingly poorly. For some DGPs, the Wald tests using Firth's penalized estimates always reject the null hypothesis when separation occurs, even when separation is common under the null hypothesis.Footnote 6 This underscores the advice of Rainey (2016) and Beiser-McGrath (2022) to treat default penalties with care.
4 Concrete Recommendations
Given the arguments above, how should researchers proceed when facing separation? I offer the following four suggestions, which combine the arguments above with guidance from the larger literature. Importantly, I view the likelihood ratio and/or score tests as a supplement to (not a replacement for) penalized estimation (e.g., Bell and Miller 2015) or Bayesian inference with informative priors (e.g., Barrilleaux and Rainey 2014).
1. Identify separation. Software varies in how and whether it detects and reports the presence of separation. Become familiar with your preferred software.Footnote 7 (One detection option in R is sketched after this list.)
2. Do not drop the separating variable. If a variable creates separation, then researchers might be tempted to omit the offending variable from the model. This is poor practice. See Zorn (2005, 161–162) for more details.
3. Test the hypothesis about the coefficient of the separating variable. While the maximum likelihood estimate of the coefficient might be implausible (see Rainey 2016) and the Wald p-values nonsensical (see above), researchers can still use a likelihood ratio and/or score test to obtain a useful p-value and test the null hypothesis that the coefficient for the separating variable equals zero (see the sketch following this list).Footnote 8 The researcher can report this test in the text of the paper and/or in a regression table, carefully distinguishing the likelihood ratio and/or score tests from the Wald test readers expect. In particular, I recommend the following three changes to the standard regression table:
a. Replace the finite numerical maximum likelihood estimate with the theoretical maximum likelihood estimate of $\infty $ or $-\infty $ . Researchers can code the binary separating variable so that values of 1 perfectly predict the outcome. This makes the ML estimate of the intercept finite and interpretable.Footnote 9
b. Omit the standard error estimate for the separating variable.
c. Replace the Wald p-value for the coefficient of the separating variable with the likelihood ratio p-value.Footnote 10 Clearly indicate this change. The simulations above suggest that the likelihood ratio test works marginally better than the score test in scenarios that commonly feature separation, so I suggest that researchers report the likelihood ratio test by default. In the table note, clearly explain the rationale for using the alternative test and supply the p-value from the score test as additional information. There is no need to replace the Wald p-values for variables that do not create separation with likelihood ratio or score p-values. The usual standard errors are meaningful, and the Wald p-values work well for these variables that do not create separation, even when another variable in the model does create separation.Footnote 11
4. Estimate the coefficients and uncertainty using penalized maximum likelihood or Bayesian estimation and compute substantively meaningful quantities of interest. Firth (1993; see also Zorn 2005) and Gelman et al. (2008) offer reasonable default penalties or prior distributions that might work well for a given application. However, Rainey (2016) and Beiser-McGrath (2022) show that the inferences can meaningfully depend on the chosen penalty or prior. With this sensitivity in mind, the researcher should choose the penalty or prior carefully and demonstrate the robustness of their conclusions to alternative prior specifications. Researchers using penalized maximum likelihood can use the informal posterior simulation procedure suggested by King, Tomz, and Wittenberg (2000; see also Gelman and Hill 2006) to compute point estimates and confidence intervals for the quantities of interest. See Bell and Miller (2015) for an example. Researchers using full posterior simulation can transform the simulations of the model coefficients to obtain posterior simulations of the quantities of interest. See Barrilleaux and Rainey (2014) for an example. While researchers should rely primarily on a model with a thoughtful penalty or prior, it can be helpful to also report estimates using both Firth's (1993) and Gelman et al.'s (2008) default priors so that readers have a common benchmark.
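As promised in recommendations 1 and 3, here is a hedged sketch of how these steps might look in R, reusing the separated example data from Section 2 (the detectseparation package is one option for detection, assuming it is installed from CRAN):

    # 1. identify separation (one option: the detectseparation package)
    library(detectseparation)
    glm(y ~ x, family = binomial, method = "detect_separation")
    # reports whether separation occurs and which estimates are +/- infinity

    # 3. test the coefficient of the separating variable without a penalty or prior
    m1 <- glm(y ~ x, family = binomial)
    m0 <- glm(y ~ 1, family = binomial)
    anova(m0, m1, test = "Chisq")  # likelihood ratio p-value for the regression table
    anova(m0, m1, test = "Rao")    # score p-value, reported in the table note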
5 Re-Analysis of Barrilleaux and Rainey (2014)
To illustrate the power and simplicity of frequentist hypothesis tests under separation, I reanalyze data from Barrilleaux and Rainey (2014), who examine U.S. state governors' decisions to support or oppose the Medicaid expansion under the 2010 Affordable Care Act. Because no Democratic governors oppose the expansion, separation occurs—being a Democratic governor perfectly predicts non-opposition.
I focus on their first hypothesis: Republican governors are more likely to oppose the Medicaid expansion funds than Democratic governors. Barrilleaux and Rainey adopt a fully Bayesian approach, modeling the probability that a state’s governor opposes the Medicaid expansion as a function of the governor’s partisanship and several other covariates. Here, I re-estimate their logistic regression model and test their hypothesis using the likelihood ratio and score tests.
Table 3 illustrates how a researcher can implement the third concrete suggestion above (i.e., “test the hypothesis about the coefficient of the separating variable”). Table 3 does the following: (1) replaces the finite estimates returned by R’s glm() function with the correct estimate of $-\infty $ and describes this change in footnote (a); (2) omits the problematic standard error estimate; and (3) replaces the usual Wald p-value with the likelihood ratio p-value, clearly indicates this change, and explains the reason in footnote (b) (and leaves the remaining Wald p-values unchanged).
Notes for Table 3:
a. Being a Democratic governor perfectly predicts non-opposition, so these data feature separation. While the numerical maximum likelihood algorithm returns a finite estimate (about $-20$ for default precision and $-36$ for maximum precision), the maximum likelihood estimate is actually $-\infty$.
b. Following the advice I develop above, I replace the default Wald p-value with a likelihood ratio p-value for this particular coefficient. The Wald test for maximum likelihood estimates relies on unreasonable standard errors that produce nonsensical p-values. However, the likelihood ratio and score tests produce reasonable p-values. The score test is another suitable alternative and produces a p-value of 0.009. The remainder of the p-values for all three models are from Wald tests.
Substantively, Table 3 shows that the likelihood ratio test unambiguously rejects the null hypothesis that the coefficient for Democratic governors equals zero. That is, Democratic governors are less likely to oppose the Medicaid expansion than their Republican counterparts. The likelihood ratio and score p-values are 0.003 and 0.009, respectively.Footnote 12 This contrasts with the default penalized estimators, which produce a less-convincing pair of results: Firth's penalty gives a p-value of 0.060, and Gelman et al.'s (2008) suggested Cauchy penalty gives a p-value of 0.038.
In a complete analysis, the researcher should also compute substantively meaningful quantities of interest. While this (usually) requires a penalty or a prior, these estimates are a critical part of a complete analysis of a logistic regression model with separation. Barrilleaux and Rainey (2014), Bell and Miller (2015), and Rainey (2016) offer examples of this important component. As Rainey (2016) emphasizes, though, estimates and confidence intervals for quantities of interest can depend heavily on the penalty or prior, so the researcher must choose their prior carefully and explore the robustness of the results to other prior specifications.
6 Conclusion
Separation commonly occurs in political science. I show that when separation occurs, the usual p-values based on the Wald test are highly misleading. Zorn (2005) and Gelman et al. (2008) suggest that substantive researchers use penalized maximum likelihood to obtain reasonable point estimates and standard errors. However, Rainey (2016) and Beiser-McGrath (2022) urge substantive scholars to apply these default penalties cautiously. In this paper, I show that substantive researchers can use the usual likelihood ratio and score tests to test hypotheses about the coefficients, even under separation. While estimating quantities of interest (usually) requires a penalty or prior, researchers can use likelihood ratio or score tests to produce meaningful p-values under separation without using penalties or prior information.
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2023.28.
Data Availability Statement
All data and code for the paper are available on the Open Science Framework (OSF) at https://doi.org/10.17605/OSF.IO/WN2S4 (Rainey 2023a) and Dataverse at https://doi.org/10.7910/DVN/6EYRJG (Rainey 2023b). A computational companion that illustrates how one can compute the quantities I discuss in the paper is available in the Supplementary Material on the publisher's website and in the OSF and Dataverse repositories.
Acknowledgments
I am grateful to an especially thoughtful and careful pool of peer reviewers who helped me make this paper better.