Logistic Regression in Rare Events Data

Gary King; Langche Zeng

doi:10.1093/oxfordjournals.pan.a004868

Published online by Cambridge University Press: 04 January 2017

Gary King: Center for Basic Research in the Social Sciences, 34 Kirkland Street, Harvard University, Cambridge, MA 02138. e-mail: [email protected]; http://GKing.Harvard.Edu
Langche Zeng: Department of Political Science, George Washington University, Funger Hall, 2201 G Street NW, Washington, DC 20052. e-mail: [email protected]

Abstract

We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.
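To make the sampling idea concrete, here is a minimal sketch of one standard adjustment for logit models fit to samples that keep all events and only a fraction of nonevents: the prior correction of the intercept. The abstract does not state a formula, so the function names and the symbols `tau` (assumed known population fraction of events) and `ybar` (fraction of events in the sample) are illustrative assumptions, and this is not the authors' software. It addresses only the sampling design, not the small-sample underestimation of event probabilities that the abstract also mentions.

```python
import math


def logistic(z: float) -> float:
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def prior_corrected_intercept(b0_hat: float, tau: float, ybar: float) -> float:
    """Adjust a logit intercept estimated on an event-enriched sample.

    b0_hat: intercept estimated on the sample (hypothetical input)
    tau:    assumed population fraction of events
    ybar:   observed fraction of events in the sample
    Slope coefficients are left unchanged by this correction.
    """
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return b0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


# Illustrative numbers only: events are 1% of the population, but the
# sample keeps every event plus an equal number of nonevents.
tau = 0.01
ybar = 0.5
b0_hat = 0.0  # intercept-only logit on a 50/50 sample: log(0.5 / 0.5)

b0 = prior_corrected_intercept(b0_hat, tau, ybar)
p = logistic(b0)
print(round(p, 6))
```

In this intercept-only case the uncorrected fit would imply an event probability of 0.5, while the corrected intercept recovers the assumed population rate of about 0.01.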

Type: Research Article
Information: Political Analysis, Volume 9, Issue 2, 2001, pp. 137–163

DOI: https://doi.org/10.1093/oxfordjournals.pan.a004868
Copyright: Copyright © 2001 by the Society for Political Methodology
