Ecological Regression Versus Homogeneous Units: A Specification Analysis
Published online by Cambridge University Press: 04 January 2016
Extract
Writing more than twenty-five years ago, W. S. Robinson assailed the assumption implicit in much empirical work that statistical measures computed for aggregate units (states, provinces, counties, cities, wards, school districts) could be used in place of corresponding measures for the individuals comprising these units. Robinson was not the first scholar to expose the pitfalls of naive inferences from group level data to individual level behavior. But he first brought this problem forcefully to the attention of practicing social scientists. Focusing on the Pearson product moment correlation coefficient (r), Robinson proved that an aggregate level coefficient need not be equal in value to the corresponding individual level coefficient. Blending empirical examples with mathematical demonstration, Robinson showed that when the individual level correlation between being a native American and being able to read was computed for state level percentages of natives and literates, its value changed from .118 to -.526. Thus an investigator relying on correlations computed for the American states would not even correctly assess the direction of the relationship between nativity and literacy. Coining new jargon, Robinson used the term “ecological fallacy” to describe a naive inference from the group to the individual level of analysis.
The reverberations of Robinson’s work are still being felt by those interested in the past behavior of individuals. Historians without access to survey research and experimental techniques must rely on data that have already been collected. Because such data so often pertain to aggregate units, historians frequently must use cross-level inference to estimate the behavior of individuals. Having paid virtually no attention to the methodology of cross-level inference for almost twenty years after the publication of Robinson’s work, historians have suddenly discovered the “ecological fallacy.” Authors dread to see the term “ecological fallacy” scribbled in the margins of their work; to fall victim to this fallacy is to forfeit one’s claim to methodological legitimacy.
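Robinson's observation that an aggregate correlation can differ in sign from the individual correlation can be illustrated with a small simulation. The sketch below does not use Robinson's census data; the 48 hypothetical states, the shares of foreign-born residents, and the illiteracy rates are all invented for illustration, and the function names are assumptions of this example.

```python
import numpy as np

def simulate(n_states=48, n_per_state=2000, seed=0):
    """Simulated individuals (not Robinson's data): x = foreign-born, y = illiterate."""
    rng = np.random.default_rng(seed)
    # Assumed state-level shares of foreign-born residents.
    share_foreign = np.linspace(0.02, 0.40, n_states)
    # Assumption: states with more immigrants have lower baseline illiteracy.
    base_illiteracy = 0.25 - 0.5 * share_foreign
    x = rng.random((n_states, n_per_state)) < share_foreign[:, None]
    # Within every state, the foreign-born are 10 points MORE likely to be illiterate.
    p_illiterate = base_illiteracy[:, None] + 0.10 * x
    y = rng.random((n_states, n_per_state)) < p_illiterate
    return x.astype(float), y.astype(float)

def pearson_r(a, b):
    """Pearson product moment correlation of two arrays (flattened)."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

x, y = simulate()
r_individual = pearson_r(x, y)                             # computed over individuals
r_ecological = pearson_r(x.mean(axis=1), y.mean(axis=1))   # computed over state percentages

print(f"individual-level r: {r_individual:.3f}")
print(f"state-level r:      {r_ecological:.3f}")
```

Under these assumed parameters the individual-level coefficient is positive while the state-level coefficient is negative, so an inference from the state percentages alone would misjudge the direction of the individual relationship.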
- Type
- Research Article
- Copyright © Social Science History Association 1978
References
Notes
1 Robinson, William S., “Ecological Correlations and the Behavior of Individuals,” American Sociological Review, 15 (June 1950), 351-57.
2 For some methodological treatments of the “ecological fallacy” that have appeared in the historical literature, see Jones, E. Terrence, “Ecological Inference and Electoral Analysis,” The Journal of Interdisciplinary History, 2 (Winter 1972), 249-69; Kousser, J. Morgan, “Ecological Regression and the Analysis of Past Politics,” The Journal of Interdisciplinary History, 4 (Autumn 1973), 237-62; and Lichtman, Allan J., “Correlation, Regression, and the Ecological Fallacy: A Critique,” The Journal of Interdisciplinary History, 4 (Winter 1974), 417-33.
3 These methods fall into two traditions: the use of grouping procedures and the use of computational procedures. Economists were the first to concentrate on the properties of “optimal” grouping by examining which of several grouping strategies could minimize both the bias and variance of aggregate estimates of individual behavior. See Prais, S.J. and Aitchison, J., “The Grouping of Observations in Regression Analysis,” Review of the International Statistical Institute, 22 (1954), 1-22; Cramer, J.S., “Efficient Grouping, Regression and Correlation in Engel Curve Analysis,” Journal of the American Statistical Association, 59 (March 1964), 233-50; and Feige, Edgar L. and Watts, Harold W., “An Investigation of the Consequences of Partial Aggregation of Micro-economic Data,” Econometrica, 40 (March 1972), 343-60. Sociologists have been similarly concerned. See Blalock, Hubert M. Jr., Causal Inferences in Nonexperimental Research (Chapel Hill, 1964), 95-114; and Hannan, Michael T. and Burstein, Leigh, “Estimation From Grouped Observations,” American Sociological Review, 39 (June 1974), 374-92. Typical of theorists focusing on computational procedures is work done by Shively, W. Phillips, “‘Ecological Inference’: The Use of Aggregate Data to Study Individuals,” American Political Science Review (December 1969), 1183-96; as well as the classical work of Duncan, Otis Dudley and Davis, Beverly, “An Alternative to Ecological Correlation,” American Sociological Review, 18 (December 1953), 665-66; and of Goodman, Leo A., “Some Alternatives to Ecological Correlation,” American Journal of Sociology, 64 (May 1959), 610-25.
4 For more detailed reinterpretations of aggregate problems as errors of specification, see Hanushek, Eric A., Jackson, John E. and Kain, John F., “Model Specification, Use of Aggregate Data, and the Ecological Correlation Fallacy,” Political Methodology, 1 (Winter 1974), 89-107; Irwin, Laura and Lichtman, Allan J., “Across the Great Divide: Inferring Individual Level Behavior From Aggregate Data,” Political Methodology, 3 (Fall 1976), 411-39; and Langbein, Laura Irwin and Lichtman, Allan J., Ecological Inference (Sage University Paper series on Quantitative Applications in the Social Sciences, 1978).
5 A bar over a variable indicates that it is the within group average for each unit.
6 Random grouping also has this property; we do not consider it explicitly here since it is a trivial case unlikely to be encountered in practical situations.
7 A formal proof is straightforward. A statistic is said to be unbiased if its expected value, the mean of its sampling distribution, equals its parametric value. For any regression coefficient $b_{yx}$, the expected value is $B + E[s(x,u)]/s(x^2)$, where $s(x,u)$ is the covariance between the independent variable and the error term and $s(x^2)$ is the variance of the independent variable. By definition, perfect specification of an individual level model means that $E[s(x,u)] = 0$; therefore, $E(b_{yx}) = B$. Since grouping by x does not confound x and u at the aggregate level, $E[s(\bar{x},\bar{u})] = 0$. As a result, $E(b_{\bar{y}\bar{x}}) = B$.
8 More formally, aggregation bias occurs because $E[s(\bar{x},\bar{u})] \neq 0$. At the individual level, $E(b_{yx}) = B$, since the individual relation is perfectly specified. Grouping by y, however, confounds x and u. Therefore, $E(b_{\bar{y}\bar{x}}) \neq B$.
9 The amount of aggregation bias is expressed by the difference between the specification bias of aggregate and individual equations. At the aggregate level, specification bias results from the omission of a third variable, z, that is related to both x and y. Thus, the properly specified aggregate model is:
$$\bar{y} = a + B\bar{x} + B_z\bar{z} + \bar{u},$$
where $E[s(\bar{x},\bar{u})] = E[s(\bar{z},\bar{u})] = 0$. The misspecified model omits z:
$$\bar{y} = a' + b_{\bar{y}\bar{x}}\bar{x} + \bar{u}',$$
where $\bar{u}' = B_z\bar{z} + \bar{u}$. If there is no specification bias at the individual level, aggregation bias equals the specification bias in the aggregate equation, which is $B_z b_{\bar{z}\bar{x}}$. If the individual level model is also misspecified by the omission of z, aggregation bias is $B_z b_{\bar{z}\bar{x}} - B_z b_{zx}$, which is the difference between specification errors at the aggregate and individual levels. In general, the effects of any grouping procedure can be determined by examining how it affects the specification of an aggregate level model; aggregation bias can then be evaluated by comparing the specification of aggregate and individual models. See Irwin and Lichtman, “Across the Great Divide,” and Langbein and Lichtman, Ecological Inference, for fuller discussions.
10 Proper specification of the aggregate level model may also require the inclusion of additional variables, but this would excessively complicate the demonstration.
11 Thus, if A = bias in the aggregate model due to grouping, M = bias in the aggregate model due to misspecification of the micromodel, and T = total bias in the aggregate model, and A and M have opposite signs, then T = A + M. It follows that |T| < |M| if |A| < |2M|. (Note: these are all absolute values, hence the notation | |.)
12 That is, if T = 0, then T = A + M implies that A = -M.
13 We continue to assume that specification error at the aggregate level is less than twice the magnitude of the direct relationship between X and Y.
14 For example, see Allswang, John M., A House for All Peoples: Ethnic Politics in Chicago, 1890-1936 (Lexington, Ky., 1971); Burner, David, The Politics of Provincialism: The Democratic Party in Transition, 1918-1932 (New York, 1968); Formisano, Ronald P., The Birth of Mass Political Parties: Michigan, 1827-1861 (Princeton, New Jersey, 1971); and Kleppner, Paul, The Cross of Culture: A Social Analysis of Midwestern Politics, 1850-1900 (New York, 1970). For a critique of the “ethnocultural” literature that also treats its analysis of homogeneous groups see Kousser, J. Morgan, “The ‘New Political History’: A Methodological Critique,” Reviews in American History, 4 (March 1976), 1-14.
15 Jensen, Richard, “Aggregate versus Survey Data: The Psephologist’s Puzzle,” paper presented at the Social Science History Association, Philadelphia, Pennsylvania, 30 October 1976.
16 Burner, The Politics of Provincialism, and Allswang, A House for All Peoples.
17 Goodman, “Some Alternatives.”
18 Hanushek, Jackson, and Kain, “Model Specification,” offer a dramatic illustration of this point by reanalyzing Robinson’s data. Once proper controls are introduced in the state level regression equation, their aggregate regression estimate of white-nonwhite differences in literacy very nearly approximates the estimate obtained from individual level data. For similar demonstrations with actual data see Langbein and Lichtman, Ecological Inference.
19 Lichtman, Allan J., “Critical Election Theory and the Reality of American Presidential Politics, 1916-1940,” The American Historical Review, 81 (April 1976), 317-48.
20 See Irwin and Lichtman, “Across the Great Divide,” for the full explication.
21 For the dichotomous case, see Goodman, “Some Alternatives.” For the polytomous case, see Stokes, Donald E., “Cross-Level Inference as a Game Against Nature,” in Bernd, J., ed., Mathematical Applications in Political Science, 4 (Charlottesville, 1969), 62-83.
22 Logit and probit analysis produce probability estimates, but are inappropriate for interval level dependent variables.
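The argument of notes 7 and 8, that grouping by x leaves the aggregate regression coefficient unbiased while grouping by y does not, can be checked numerically. The sketch below is only an illustration under assumed conditions: a perfectly specified individual model y = a + Bx + u with B = 0.5, standard normal x and u, and 50 equal-sized groups formed by sorting. None of these values, nor the helper names, come from the article.

```python
import numpy as np

def ols_slope(x, y):
    """Bivariate least-squares slope of y on x: s(x,y) / s(x^2)."""
    x_c = x - x.mean()
    return float(np.dot(x_c, y - y.mean()) / np.dot(x_c, x_c))

def grouped_slope(x, y, sort_key, n_groups):
    """Sort individuals by sort_key, cut them into groups, regress group means."""
    order = np.argsort(sort_key)
    x_bar = np.array([g.mean() for g in np.array_split(x[order], n_groups)])
    y_bar = np.array([g.mean() for g in np.array_split(y[order], n_groups)])
    return ols_slope(x_bar, y_bar)

rng = np.random.default_rng(1)
n, n_groups = 20_000, 50
B = 0.5                          # assumed true individual-level coefficient
x = rng.normal(size=n)
u = rng.normal(size=n)           # error term, independent of x by construction
y = 1.0 + B * x + u              # perfectly specified individual model

b_individual = ols_slope(x, y)
b_group_by_x = grouped_slope(x, y, sort_key=x, n_groups=n_groups)
b_group_by_y = grouped_slope(x, y, sort_key=y, n_groups=n_groups)

print(f"individual level: {b_individual:.3f}")
print(f"grouped by x:     {b_group_by_x:.3f}")
print(f"grouped by y:     {b_group_by_y:.3f}")
```

With these assumed values the individual estimate and the estimate from units grouped by x both fall close to 0.5, whereas grouping by y sorts individuals partly on u, so the group means of x and u covary and the aggregate slope is inflated well above B.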