The use of digitized newspaper data by economic historians has become more prominent in recent years. We propose a novel use of such data to overcome measurement error, a problem that is pervasive in the statistical analysis of historical data. Given that regression coefficients of mismeasured variables are attenuated (Aigner Reference Aigner1973), measurement error can lead promising research to be abandoned. A solution to such attenuation bias for continuous variables with classical measurement error is to use an instrumental variables approach leveraging a second mismeasured data source as the instrument. In the absence of other endogeneity concerns, Footnote 1 as long as the measurement error in the two variables is uncorrelated, instrumenting for one mismeasured variable, X 1, with data from a second mismeasured source, X 2, recovers the true parameter (see Chalfin and McCrary Reference Chalfin and McCrary2018). Footnote 2 The main limitation of this approach is that it is difficult to find a second variable that is (i) measured with error, which is arguably uncorrelated with the error in X 1, and (ii) reasonably inexpensive to collect. Since economic historians often spend a significant amount of time and effort on original data collection, it is usually costly enough to just have X 1.
In this paper, we show how the second measure, $$X_2$$, can often be generated at a low cost from textual data available via digitized newspapers and how it can be used to resolve measurement error in the case where $$X_1$$ is continuous or binary. Footnote 3 The distinction between continuous and binary variables is important because using $$X_2$$ as an instrument for $$X_1$$ recovers the true parameter only under classical measurement error, which requires $$X_1$$ to be continuous (Bingley and Martinello 2017). Footnote 4 If $$X_1$$ is binary and mismeasured, then IV estimates will be inflated by the inverse of one minus the misclassification rate in $$X_1$$. This is true even when the instrument is generated by an otherwise perfectly valid natural experiment.
We provide three potential solutions when $$X_1$$ is binary. First, the treatment effect can be set identified. The OLS estimate using $$X_1$$ as treatment provides a lower bound, while the IV estimate using $$X_2$$ as an instrument for $$X_1$$ provides the upper bound, such that $$|\hat{\beta}_{OLS}| \le |\beta| \le |\hat{\beta}_{IV}|$$. Second, we show that restricting the analysis to an agreement sample where $$X_1 = X_2$$ can substantially reduce the OLS bias. The probability that both variables are jointly misclassified is the product of the two variables’ misclassification rates, and therefore the measurement error in the agreement sample tends to be much lower. Footnote 5 Third, we provide a parametric bias correction procedure that can recover the true parameter of interest as a nonlinear combination of the OLS and IV coefficients. All three procedures are fast and efficient, and given that newspaper data can be scraped in a reasonable amount of time, we hope to provide researchers who work with historical data with low-cost tools for dealing with measurement error. We begin the demonstration of our three procedures by replicating two recent papers that study the economic impact of the spread of the boll weevil across the U.S. South in the late nineteenth and early twentieth centuries, one by Clay, Schmick, and Troesken (2019) and one by Ager, Brueckner, and Herz (2017). Footnote 6
To date, the sole source of data used by analysts to measure the timing of the boll weevil’s arrival at the county-level comes from a U.S. Department of Agriculture (USDA) map by Hunter and Coad (Reference Hunter and Coad1923), which documents the arrival date of the pest across Southern counties between 1892 and 1922. While the map itself is mostly accurate, it does contain errors. Footnote 7 Further, it does not necessarily measure what economists are typically interested in, namely the timing of the economic damage caused by the arrival of the boll weevil. As an example, if the weevil arrived late in the summer, it would typically hibernate soon after arrival, and thus the actual economic damage would not occur until the following year. The arrival date from the USDA map is therefore a mismeasured proxy for the date of the actual economic impact. And, as we document, this mismeasurement can markedly attenuate estimated effect sizes.
To produce a second measure for the arrival of the boll weevil, we collect data from Newspapers.com by jointly searching the database for pages containing “boll weevil” and each county’s name in all newspapers in the county’s state for each year between 1882 and 1932. Our arrival measure is then the peak salience of the weevil in the news as measured by the maximum five-year moving average of boll weevil-related pages. Footnote 8 We argue that errors in this newspaper-based measure are likely to be uncorrelated with errors on the USDA map, which was generated by trained USDA entomologists who reported back to the federal agency, whereas local newspaper reporters mainly wrote about salient issues in their home counties. Using an event study design, we also show that the newspaper-based salience peaks a year after the official USDA arrival date on average.
Our replications of Clay, Schmick, and Troesken (2019) and Ager, Brueckner, and Herz (2017) show that using our newspaper-based arrival measure can reduce measurement error and strengthen the results in both papers. In particular, our theory suggests a ranked pattern between the three proposed solutions, where, in absolute terms, the OLS estimate is the smallest, followed by the agreement-sample estimate and the bias-corrected estimate, and the IV estimate is the largest. While we do not observe the true coefficient, the estimated coefficients largely follow the prescribed pattern in both replication exercises. We find evidence that measurement error led to lower coefficient estimates in both studies, a finding that is robust across alternative specifications of our newspaper-based arrival date. However, the difference in the coefficients produced by our procedures was only statistically significant for Ager, Brueckner, and Herz (2017). We discuss the frequency of the time dimension as a potential reason for this finding, as Clay, Schmick, and Troesken (2019) use annual data while Ager, Brueckner, and Herz (2017) use data over five-year intervals.
We provide a broader discussion of when data generation from newspaper articles is a promising avenue, what settings are suitable for our approach, and the value of historical newspapers to generate novel data for research in economic history. Even though our newspaper-based measures were generated in a fast and arguably unrefined way, using this noisier measure still produces smaller but significant effects that are comparable to those in Clay, Schmick, and Troesken (Reference Clay, Schmick and Troesken2019) and Ager, Brueckner, and Herz (Reference Ager, Brueckner and Herz2017). Lastly, to show that our approach extends to other settings, we further replicate a study by Hilt and Rahn (Reference Hilt and Rahn2020) of the liberty loan program’s effect on political outcomes, as well as a paper by Howard and Ornaghi (Reference Howard and Ornaghi2021), which studies the impact of the adoption of local prohibition policies on population, agricultural outcomes, and investment. Their treatment measures are different in nature from the boll weevil, an arrival time measure, to provide additional examples for when our strategies can be gainfully applied to deal with measurement error in historical data.
Our paper highlights the usefulness of digitized newspapers to generate additional data to address measurement error. We extend the secondary measure IV framework in Chalfin and McCrary (Reference Chalfin and McCrary2018) to the case where treatment is binary and when instrumenting ordinarily does not resolve measurement error (Bingley and Martinello Reference Bingley and Martinello2017). We also contribute to a recent literature that uses digitized newspapers to generate novel data for research in economic history. This includes measures of media competition and partisan influence (Gentzkow, Shapiro, and Sinkinson Reference Gentzkow, Shapiro and Sinkinson2014; Gentzkow et al. 2015), racial and anti-group sentiment (Ferrara and Fishback Reference Ferrara and Fishback2023; Ottinger and Winkler 2022; Bazzi et al. Reference Bazzi, Andreas Ferrara, Pearson and Testa2023), the spread of news relating to racial violence (Albright et al. 2021; Calderon, Fouka, and Tabellini Reference Calderon, Fouka and Tabellini2023), technology diffusion (Feigenbaum and Gross 2022), the 1918 influenza (Beach, Clay, and Saavedra Reference Beach, Clay and Saavedra2022), fertility restrictions (Beach and Hanlon Reference Beach and Hanlon2023), advertisements for the movie “Birth of a Nation” (Esposito et al. Reference Esposito, Rotesi, Saia and Thoenig2023; Ang Reference Ang2023), the price and types of available cotton seeds (Rhode Reference Rhode2021), among others.
BACKGROUND AND MEASUREMENT OF THE BOLL WEEVIL INFESTATION
We motivate the econometric theory by replicating two recent studies on the boll weevil infestation in the U.S. South to provide an example of how measurement error can be addressed with historical newspaper data. We first give a brief background on the boll weevil and measurement issues in the USDA data, which tracked the spread of the pest, followed by a discussion of how we use digitized newspaper data to generate a second boll weevil arrival measure before turning to the econometric theory.
The Spread of the Boll Weevil and Uses of the USDA Map
The boll weevil spread across the U.S. South starting in 1892 near Brownsville, Texas. The beetle, which gained its name because of its diet consisting mainly of cotton bolls and flowers, had infested all Southern cotton-growing regions by 1922. Given that cotton at the time was still the main cash crop in Southern agriculture (Wright Reference Wright2013), the arrival of the pest had a substantial impact on the areas it infested. Consequently, the USDA traced the arrival of the weevil on a map in an annual report by Hunter and Coad (Reference Hunter and Coad1923). A portion of this map is shown in Figure 1. During peak infestation in 1921, cotton acreage had declined by 31 percent (Ager, Brueckner, and Herz Reference Ager, Brueckner and Herz2017), and the USDA estimated the average economic loss per year to be 200 to 300 million USD between 1916 and 1920 (Hunter and Coad Reference Hunter and Coad1923). Footnote 9 Given this substantial economic shock, a well-developed literature has studied the various impacts of the boll weevil on different aspects of the Southern economy.
Lange, Olmstead, and Rhode (Reference Lange, Olmstead and Rhode2009) show the large negative impact of the pest on cotton production, yields, and land value. The drop in productivity also altered the structure of Southern agriculture with a reduced number of tenant farmers, farm wages, and female labor force participation (Ager, Brueckner, and Herz Reference Ager, Brueckner and Herz2017). Ager, Herz, and Brueckner (Reference Ager, Herz and Brueckner2020) provide evidence that the lower returns to agriculture reduced fertility due to the opportunity cost of children and the decreased value of child labor. Also, Black Southerners tended to marry later after the pest arrived for the same reasons (Bloome, Feigenbaum, and Muller Reference Bloome, Feigenbaum and Muller2017). This fertility transition and the decline in the value of child labor in agriculture have also been linked to increased educational attainment (Baker Reference Baker2015; Baker, Blanchette, and Eriksson Reference Baker, Blanchette and Eriksson2020). Another unintended consequence of the reduction in cotton production was increased food production. Clay, Schmick, and Troesken (Reference Clay, Schmick and Troesken2019) show that this significantly contributed to the reduction in pellagra deaths. In a later paper, the authors also found that the boll weevil spread reduced the racial income gap in the South (Clay, Schmick, and Troesken 2020). Similar to the population movements discussed in Lange, Olmstead, and Rhode (Reference Lange, Olmstead and Rhode2009), Feigenbaum, Mazumder, and Smith (2020) show that the decline in cotton reliance also resulted in less violence against Black Southerners who saw an increased ability to move away from overtly discriminatory behavior.
Most of the papers noted previously either assign the arrival date for a county whenever the USDA map’s first arrival year line crosses that county’s area, or the arrival date is selected for the year line that contains most of the county’s area (see Figure 1). What should be noted is that the solid lines in the map technically show the farthest extent of the boll weevil in any territory. This measure does not necessarily correlate with the exact timing of damage caused by the insect. Mature boll weevils hibernate during the winter and infest the cotton fields after the crop season in the subsequent year. Lange, Olmstead, and Rhode (Reference Lange, Olmstead and Rhode2009) explicitly mention this caveat in their paper: “First contact usually occurred during the August seasonal migration, too late to build up significant populations or do much damage in that year. Maximum damage occurred after the local weevil population became established and multiplied. Thus, the classic USDA maps detailing the spread of the weevil present a somewhat misleading picture of the area ravaged by the insect” (p. 689).
Measuring the Boll Weevil’s Arrival from Newspaper Data
Newspapers were the primary source of information in the late nineteenth and early twentieth centuries and mainly operated locally in the county where the paper was based (Gentzkow, Shapiro, and Sinkinson Reference Gentzkow, Shapiro and Sinkinson2014). Newspapers published articles about the boll weevil’s arrival as well as damages caused by the insects. An example of such reporting is shown in Online Appendix Figure A.1. Digitized newspaper data are a potential source to generate information on the arrival and damage extent caused by the pest, independent of the USDA map. We use Newspapers.com as our primary data source for digitized historical newspapers. To the best of our knowledge, this is the largest newspaper archive available online. Footnote 10
To construct our newspaper-based boll weevil arrival and salience measure for each county, we take all available newspapers from that county’s state and count the newspaper pages that include both the words “boll weevil” and the county’s name in each year. Footnote 11 We use all newspapers from an individual county’s state because no newspaper archive has information on the universe of newspaper pages. Thus, as described, our search not only considers pages in the county of interest but in all counties that are in the same state (e.g., Online Appendix Figure A.2). So, even if Autauga County in Alabama has no available newspaper pages for the search period but “Autauga County” and “boll weevil” are mentioned in a newspaper based in Barbour County, Alabama, we are able to obtain data for Autauga County. Some counties may feature more prominently in the news than others, which is why we need to adjust these counts for the overall number of pages that mention the county. Thus, we apply the same search logic to generate the denominator in our boll weevil measure, which we compute as
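$$\%BW_{ct} = \frac{\text{pages mentioning ``boll weevil'' and county } c \text{ in year } t}{\text{pages mentioning county } c \text{ in year } t} \times 100 \quad (1)$$
where $$\%BW_{ct}$$ captures the salience of the boll weevil for county c in year t in the news. Our sample includes 911 infested counties from 13 Southern states between 1882 and 1932, Footnote 12 which is ten years before and after the time periods covered by the USDA map.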
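The ratio described above can be sketched in a few lines. This is a minimal illustration, assuming the measure is expressed in percent and that page counts are already available per county-year; the function name `salience`, the dictionary layout, and the example counts are hypothetical and are not taken from the Newspapers.com data.

```python
def salience(weevil_pages, county_pages):
    """Percent of a county's pages in a year that also mention "boll weevil".

    Both arguments map (county, year) -> page count (hypothetical layout).
    County-years with zero pages mentioning the county are skipped,
    because the share is undefined there.
    """
    pct_bw = {}
    for key, total in county_pages.items():
        if total > 0:
            pct_bw[key] = 100.0 * weevil_pages.get(key, 0) / total
    return pct_bw


# Made-up counts for one county, for illustration only.
county_pages = {("Autauga", 1914): 400, ("Autauga", 1915): 500, ("Autauga", 1916): 0}
weevil_pages = {("Autauga", 1914): 10, ("Autauga", 1915): 50}

pct_bw = salience(weevil_pages, county_pages)
print(pct_bw)
```

Dividing by the county's overall page count is what keeps heavily covered counties from mechanically appearing more infested than thinly covered ones.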
How does our salience measure relate to the official arrival date on the USDA map? To answer this question formally, we use an event study design and estimate the following equation:
$$\%BW_{ct} = \sum_{\ell = -10,\, \ell \ne -1}^{10} \beta_\ell \, D\left( t - BW_c^{USDA} = \ell \right) + \pi_c + \gamma_{st} + \varepsilon_{ct} \quad (2)$$
where $$\%BW_{ct}$$ is our newspaper-based salience measure for county c in year t, and $$D\left( {t - BW_c^{USDA} = \ell } \right)$$ is an event indicator relative to the arrival of the boll weevil from the USDA map for the ten years before and after the official arrival date. The year before the arrival on the USDA map, $$\ell = -1$$, is omitted and serves as the baseline period. The county fixed effects $$\pi_c$$ capture time-invariant unobservable county characteristics, and aggregate time trends that affect counties jointly in each state are captured by state-by-year fixed effects $$\gamma_{st}$$. Standard errors are clustered at the county level. Given the recent literature on issues related to event study designs, we use the estimator developed by Sun and Abraham (2021).
Our main interest is in the lag coefficients β ℓ for ℓ ≥ 0. If salience in the news correlates highly with the USDA arrival date, then we should observe an immediate jump at the treatment date ℓ = 0, followed by an either constant or slowly decaying coefficient pattern. Conversely, if the weevil tends to arrive later in the summer and hibernates, the more salient economic damage would occur in the following year, which implies that the main effect on salience in the news should occur after ℓ = 0. The pattern of the coefficients should not only be informative about the decay in salience after arrival but also reveal potential anticipatory behavior if the lead coefficients are significant for ℓ < –1.
Figure 2 plots the dynamic treatment effects for the 20-year event window around each county’s boll weevil arrival date on the USDA map on our newspaper-based salience measure. The figure shows the coefficients from estimating Equation (2) via two-way fixed effects (TWFE) and with the estimator developed by Sun and Abraham (Reference Sun and Abraham2021). We find that the salience measure significantly increases in counties after the boll weevil’s arrival, based on the USDA map. More importantly, the effect is largest one year after the arrival date on the USDA map. This confirms the narrative that salience in the news and arrival are somewhat but not perfectly correlated due to the pests’ hibernation if they arrive later in the summer (see Harned Reference Harned1910). While the post-arrival coefficients slowly decay, they are still statistically significant even ten years after the arrival of the weevil. We find no evidence for anticipatory reporting in the four years prior to the USDA map’s arrival date. For earlier periods, there are significant coefficients in the TWFE results. We find no pre-trends using the estimator by Sun and Abraham (Reference Sun and Abraham2021).
Prediction of the Boll Weevil Infestation Using Historical Newspapers
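To generate a stable prediction of the boll weevil’s arrival based on newspaper data that is less prone to outliers or noise, we first apply a five-year moving average, $$MA(5)_{c,t}$$, to the salience measure $$\%BW_{c,t}$$, and then assign the year in which this smoothed variable reaches its maximum as the predicted year of infestation,
$$\arg\max_{t} \; MA(5)_{c,t} \quad (3)$$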
For robustness, we later test alternative specifications such as the three- and seven-year moving averages, as well as the maximum salience measure %BW c,t within a ten-year window around the USDA map arrival date. While our preferred specification is MA(5), the results in the replication exercises are robust across alternative specifications. More details are discussed next.
To illustrate how our approach based on newspapers can predict a county’s effective infestation, consider the following example for Marion County in Mississippi. The USDA map recorded that the boll weevil arrived in Marion in 1909. However, the damage caused by the insect was not severe. Harned (Reference Harned1910), the head of the department and entomologist for the Mississippi Agricultural Experiment Station, investigated the infestation in Mississippi during 1907 and 1909. For Marion County, he found that “boll weevils probably spread entirely over this county during September, 1909, although not in large enough number to do serious damage” (Harned Reference Harned1910, p. 22). For each year between 1882 and 1932, we first calculate the salience of the boll weevil of Marion County using pages mentioning “boll weevil” and “Marion County.” We calculate MA(5)Marion,t for each year and define the effective infestation of Marion County by choosing the year with the maximum MA(5)Marion,t. Our newspaper-based approach predicts that the effective infestation was in 1910 in Marion County, which is one year after the boll weevil’s arrival in 1909, according to the USDA map. In panel (a) of Figure 3, we plot our newspaper-based boll weevil salience measure (dashed line) and the smoothed version using its five-year moving average (solid line) over time for an example county. While our salience measure based on newspapers is noisy, the five-year moving average smooths out this noise. Peak salience in the news appears to be a reasonable approximation for the arrival of the pest. The raw correlation of the two measures is 0.7, and Online Appendix Figure A.3 provides a visualization of this correlation with a binned scatter plot.
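The smoothing and peak-selection steps can be sketched as follows. This is a rough illustration only: the text does not say whether the moving average is centered or trailing, so the centered window used here is an assumption, as is the choice to drop years without a full window. The function names and the salience values are hypothetical.

```python
def moving_average(series, window=5):
    """Centered moving average over a dict {year: salience}.

    Assumption: the window is centered on each year, and years without
    a complete window of observations are dropped.
    """
    half = window // 2
    smoothed = {}
    for year in series:
        values = [series.get(y) for y in range(year - half, year + half + 1)]
        if None not in values:
            smoothed[year] = sum(values) / window
    return smoothed


def predicted_arrival(series, window=5):
    """Year in which the smoothed salience peaks (first such year on ties)."""
    smoothed = moving_average(series, window)
    return max(smoothed, key=smoothed.get)


# Made-up salience values for one hypothetical county.
pct_bw = {1905: 0, 1906: 0, 1907: 1, 1908: 2, 1909: 4, 1910: 9,
          1911: 6, 1912: 3, 1913: 1, 1914: 1, 1915: 0}

print(predicted_arrival(pct_bw, window=5))
print(predicted_arrival(pct_bw, window=3))
```

Passing `window=3` or `window=7` gives the alternative specifications mentioned in the text without changing the peak-selection logic.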
Lastly, we provide a comparison between our predicted arrival date from Equation (3) and that provided by the USDA map. Panel (b) of Figure 3 plots the difference in the two arrival dates for the 911 counties in our sample. A positive difference means that the predicted year based on newspapers is later than the arrival of the boll weevil as presented on the USDA map. While the difference is typically small, less than four years for more than half of the sample counties (54.88 percent), we find that the difference is extreme for a small number of counties. This result is likely due to the noise in the newspaper data, such as cases where the search words appear in separate articles even though they appear on the same newspaper page. Footnote 13 It should be kept in mind that our measure is, in some ways, purposefully noisy simply to reduce the cost of collecting the data. More refined versions are possible by applying a visual inspection of the newspaper data, which would increase the cost and time of the data collection process.
Another reason for some of the extreme differences is newly created counties. An example is shown in Online Appendix Figure A.4. Dixie County, in Florida, was created in 1921 from the southern portion of Lafayette County. While the boll weevil arrived in Dixie County in 1916, according to the USDA map, our newspaper-based measure predicts its effective infestation in 1932. This is because our prediction is based on newspapers mentioning “Dixie County.” Since Dixie County did not exist before 1921, the prediction is mostly based on newspapers after 1921, which is shown in panel (a) of Online Appendix Figure A.4. One possible solution is to aggregate those counties as “multi-counties,” as in Lange, Olmstead, and Rhode (2009) and Ager, Brueckner, and Herz (2017), or to assign the predicted year from the original county. Footnote 14
RESOLVING BIAS FROM MEASUREMENT ERROR USING SECONDARY MEASURES
Classical Measurement Error
How can the second measure for the boll weevil arrival from newspaper data be used to correct for measurement error on the USDA map arrival date? First, consider the case where the data are used as a continuous exposure measure, such as years since the arrival of the pest. Suppose a researcher wants to estimate the following linear equation by OLS, which is assumed to be unconfounded with a clear direction of causality but where the years since the arrival of the boll weevil, $$X_1$$, are continuous and measured with error,
$$y = \beta X^* + \varepsilon, \qquad X_1 = X^* + u,$$
where Cov(X*,u) = 0, β is the true parameter, and X* is the true measure (i.e., measured without error). The estimated coefficient from regressing y on $$X_1$$ will then suffer from the typical attenuation bias. Now suppose there is a second variable that seeks to capture X* as well but that is also mismeasured, $$X_2 = X^* + e$$, and for which the same conditions apply as for $$X_1$$. We can then use $$X_2$$ as an instrument for $$X_1$$ to solve the measurement error problem (see Chalfin and McCrary 2018). The IV estimate will be
$$\hat{\beta}_{IV,X_1} = \frac{Cov(y, X_2)}{Cov(X_1, X_2)} = \frac{\beta \, Var(X^*)}{Var(X^*) + Cov(u,e)} \quad (4)$$
where we denote the estimator and treatment variable of interest in the subscripts of $$\hat{\beta}$$. In the absence of any other endogeneity problems and if the two measurement errors are uncorrelated such that Cov(u,e) = 0, the IV estimate will recover the true parameter. As with the exclusion restriction, one would then have to make an argument as to why the two errors should be uncorrelated or that this correlation is close to zero. In the case of the boll weevil, a possible argument would be that the USDA map was compiled by trained entomologists who primarily reported back to the agency, whereas the newspapers were written by journalists who reacted to local developments in their county. If journalists were basing their stories, and in particular the timing of their articles, on the USDA map, then this assumption fails, in which case Cov(u,e) > 0 and the estimated IV coefficient in Equation (4) would be biased downward. Footnote 15
Since applied economists tend to think hard about the exclusion restriction, we would like to highlight that this condition is satisfied in our case by assuming no endogeneity concerns other than measurement error. If $$X_2$$ affects y through channels other than $$X_1$$, such other channels must necessarily be in the error term of the outcome equation. Since $$X_2$$ and $$X_1$$ seek to measure the same quantity, this essentially also implies a correlation between $$X_1$$ and the error term. This is something that our approach cannot solve. At best, $$X_2$$ can remove biases relating to measurement error but not those stemming from omitted variables or reverse causality, for instance.
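A small simulation can make the classical case concrete. The sketch below is illustrative only: all variances are set to one, the measurement errors are drawn independently, and the true coefficient of 1.5 is arbitrary. None of these values come from the paper. Under these assumptions OLS on the first noisy measure should be attenuated toward roughly half the true coefficient, while instrumenting with the second noisy measure should land close to it.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def classical_demo(n=50_000, beta=1.5, seed=1):
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [x + rng.gauss(0, 1) for x in x_true]   # first noisy measure, error u
    x2 = [x + rng.gauss(0, 1) for x in x_true]   # second noisy measure, error e
    y = [beta * x + rng.gauss(0, 1) for x in x_true]
    b_ols = cov(y, x1) / cov(x1, x1)             # attenuated by Var(X*)/(Var(X*)+Var(u))
    b_iv = cov(y, x2) / cov(x1, x2)              # X2 instruments for X1
    return b_ols, b_iv


b_ols, b_iv = classical_demo()
print(f"OLS: {b_ols:.3f}  IV: {b_iv:.3f}  (true beta = 1.5)")
```

Adding a common component to both error draws would make Cov(u,e) positive and pull the IV estimate back below the true coefficient, which is the failure case discussed above.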
Non-Classical Measurement Error
Oftentimes, however, the arrival or presence of the boll weevil is coded as a binary variable (e.g., Clay, Schmick, and Troesken 2019, 2020; Ager, Brueckner, and Herz 2017). In this case, the IV coefficient will no longer be unbiased because when the treatment variable is discrete or binary, measurement error is no longer classical by construction (Bingley and Martinello 2017). Footnote 16 Suppose that $$X_1$$ is now binary. When regressing y on $$X_1$$, the estimated OLS coefficient is still attenuated with $$\hat{\beta}_{OLS,X_1} = \beta(1-\theta)$$, where θ is the misclassification rate in $$X_1$$ (Aigner 1973). If θ = 0, then there is no measurement error, whereas θ = 1 means that $$X_1$$ is entirely randomly misclassified, such that it is uncorrelated with X* and therefore contains no usable information. Now suppose that $$X_2$$ is also binary and misclassified, but with an error rate γ that is uncorrelated with θ, and γ < θ. If we then regress y on $$X_2$$, the estimated coefficient will also be biased, $$\hat{\beta}_{OLS,X_2} = \beta(1-\gamma)$$; however, this attenuation bias will be smaller than for $$X_1$$ since $$\beta(1-\gamma) > \beta(1-\theta)$$ in absolute terms.
If we instrument $$X_1$$ with $$X_2$$, or vice versa $$X_2$$ with $$X_1$$, the estimated coefficient for those two cases will be
$$\hat{\beta}_{IV,X_1} = \frac{\beta}{1-\theta} \quad \text{or} \quad \hat{\beta}_{IV,X_2} = \frac{\beta}{1-\gamma},$$
depending on which variable was used as the treatment and the instrument. The IV bias is the inverse of the respective OLS bias. Footnote 17 Unlike OLS, which suffers from attenuation bias, the IV estimate will be inflated instead with $$|\hat{\beta}_{IV}| > |\beta|$$. Footnote 18 Neither OLS nor IV yields an unbiased estimate; however, we now offer three potential approaches for identifying the treatment effect or for at least minimizing the attenuation coming from the misclassification.
Solution 1 - set identification: Even though the true parameter of interest cannot be directly point identified, the OLS and IV coefficients can be used as lower and upper bounds, respectively, to set identify β given that $$|\hat{\beta}_{OLS}| < |\beta| < |\hat{\beta}_{IV}|$$. While it is not known a priori whether $$X_1$$ or $$X_2$$ has the higher measurement error, the inequality previously noted suggests that the set order can be inferred from the relative magnitudes of the OLS and IV coefficients. In the previous example with γ < θ, set identification implies that $$|\hat{\beta}_{OLS,X_1}| < |\hat{\beta}_{OLS,X_2}| < |\beta| < |\hat{\beta}_{IV,X_2}| < |\hat{\beta}_{IV,X_1}|$$. Without additional assumptions, these bounds are tight and are informative as long as zero is not included in the set. To assess the latter condition, the OLS estimate provides the corresponding test that rejects non-informativeness when $$\hat{\beta}_{OLS}$$ is significantly different from zero.
Solution 2 - agreement sample: If instrumenting as described earlier is too complicated, for example, if researchers wish to estimate nonlinear treatment effects or their specification includes interactions of the treatment with other variables, the OLS bias can be reduced by considering only the part of the sample for which $$X_1$$ and $$X_2$$ both provide the same value. We call this an agreement sample. Footnote 19 The probability that both measures are jointly incorrect is θ × γ = δ. For example, suppose the error rates are θ = 0.3 and γ = 0.2, then δ = 0.06, which substantially reduces the bias of the OLS estimate in the agreement sample, which will therefore be closer to the true parameter.
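Solution 3 - parametric bias correction: While neither OLS nor IV on their own identify the true parameter, their estimates can be used jointly to recover β. The bias-corrected (BC) estimate is
$$\hat{\beta}_{BC} = \sqrt{\hat{\beta}_{OLS,X_1} \times \hat{\beta}_{IV,X_1}} \quad (5)$$
since the attenuation factor $$(1-\theta)$$ in the OLS estimate cancels against the inflation factor $$1/(1-\theta)$$ in the IV estimate. Estimation of Equation (5) is straightforward, as the product of two coefficients from different equations can be readily estimated in standard statistical software, with standard errors being estimated via the delta method or bootstrapping. One drawback of this bias correction is that it only works if both $$\hat{\beta}_{OLS}$$ and $$\hat{\beta}_{IV}$$ are of the same sign, which should be true in theory but may be violated in practice. This is another reason why we prefer the agreement sample as our main method of bias reduction. Taken together, and writing $$\hat{\beta}_{AS}$$ for the OLS estimate in the agreement sample, our three possible solutions yield the following relationship,
$$|\hat{\beta}_{OLS}| < |\hat{\beta}_{AS}| < |\hat{\beta}_{BC}| < |\hat{\beta}_{IV}| \quad (6)$$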
which is the pattern that we look for in the subsequent replication exercises.
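The ranked pattern can be checked in a simulation. The sketch below rests on several assumptions that are not taken from the paper: misclassification is modeled as replacing the true indicator, with probability θ (or γ), by an independent coin flip with the same mean, which is one mechanism under which OLS converges to β(1 − θ). The bias correction is taken to be the square root of the product of the OLS and IV coefficients. The parameter values are arbitrary.

```python
import math
import random


def cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def contaminate(x_true, rate, p, rng):
    """With probability `rate`, replace the true value by an independent Bernoulli(p) draw."""
    noisy = []
    for x in x_true:
        if rng.random() < rate:
            noisy.append(1 if rng.random() < p else 0)
        else:
            noisy.append(x)
    return noisy


def binary_demo(n=200_000, beta=2.0, theta=0.3, gamma=0.2, p=0.5, seed=7):
    rng = random.Random(seed)
    x_true = [1 if rng.random() < p else 0 for _ in range(n)]
    x1 = contaminate(x_true, theta, p, rng)      # e.g. the original binary treatment
    x2 = contaminate(x_true, gamma, p, rng)      # e.g. the newspaper-based treatment
    y = [beta * x + rng.gauss(0, 1) for x in x_true]

    b_ols = cov(y, x1) / cov(x1, x1)             # attenuated
    b_iv = cov(y, x2) / cov(x1, x2)              # inflated
    agree = [i for i in range(n) if x1[i] == x2[i]]
    y_a, x_a = [y[i] for i in agree], [x1[i] for i in agree]
    b_as = cov(y_a, x_a) / cov(x_a, x_a)         # OLS in the agreement sample
    b_bc = math.sqrt(b_ols * b_iv)               # requires both to share a sign
    return {"ols": b_ols, "agree": b_as, "bc": b_bc, "iv": b_iv}


est = binary_demo()
print({name: round(value, 3) for name, value in est.items()})
```

The OLS and IV entries also give the set-identification bounds of Solution 1, so one run illustrates all three approaches.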
Testing the Required Assumptions in Practice
A key assumption in our framework is that no other endogeneity issues aside from measurement error are present. A possible concern in this regard is that differential newspaper coverage could generate selectivity issues if such coverage correlates with problematic unobservables that also correlate with the outcome of interest. If this only affects the variable generated from the newspaper data, this has implications for the measurement error correction methods introduced in the previous section. Set identification remains valid as long as any downward bias in the IV still leaves $$|\hat{\beta}_{IV}| \ge |\beta|$$. Bias in the opposite direction would imply a widening of the upper bound, making it less informative. Likewise, the parametric bias correction in Equation (5) will not recover the true parameter but an attenuated estimate if $$|\hat{\beta}_{IV}| < |\beta|/(1-\theta)$$, and an inflated estimate if the converse is true.
The method least affected by such biases is the agreement sample, although it potentially generates a selected subsample that is not necessarily representative of the underlying population. One available correction is to apply inverse propensity score reweighting. Footnote 20 First, regress the indicator for being included in the agreement sample on a wide set of pre-treatment county characteristics using a Probit regression. Second, obtain the predicted probability from the previous Probit regression. Lastly, run the regression of interest, weighting observations with the inverse of the estimated propensity score. The weights ensure that the estimation sample is more representative of observations in the entire sample.
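The three reweighting steps can be sketched on simulated data. Everything here is an illustrative assumption: there is a single standardized covariate `w`, the treatment effect is made to vary with `w` so that selection into the agreement sample matters, and the Probit is fitted with a hand-written Fisher scoring loop only to keep the example self-contained. In applied work one would normally rely on a statistical package and a richer covariate set.

```python
import math
import random


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def fit_probit(d, w, iters=15):
    """Step 1: fit P(d = 1 | w) = Phi(a + b * w) by Fisher scoring."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for di, wi in zip(d, w):
            z = a + b * wi
            p = min(max(norm_cdf(z), 1e-9), 1.0 - 1e-9)
            f = norm_pdf(z)
            score = f * (di - p) / (p * (1.0 - p))
            info = f * f / (p * (1.0 - p))
            g0 += score
            g1 += score * wi
            h00 += info
            h01 += info * wi
            h11 += info * wi * wi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def weighted_slope(y, x, weights):
    """Slope from a weighted regression of y on x with an intercept."""
    total = sum(weights)
    mx = sum(k * xi for k, xi in zip(weights, x)) / total
    my = sum(k * yi for k, yi in zip(weights, y)) / total
    num = sum(k * (xi - mx) * (yi - my) for k, xi, yi in zip(weights, x, y))
    den = sum(k * (xi - mx) ** 2 for k, xi in zip(weights, x))
    return num / den


def ipw_demo(n=40_000, seed=3):
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n)]                  # pre-treatment covariate
    x = [1 if rng.random() < 0.5 else 0 for _ in range(n)]   # treatment
    y = [(1.0 + wi) * xi + rng.gauss(0, 1) for wi, xi in zip(w, x)]  # average effect is 1
    d = [1 if rng.random() < norm_cdf(0.3 + 0.5 * wi) else 0 for wi in w]  # in agreement sample
    a, b = fit_probit(d, w)
    keep = [i for i in range(n) if d[i] == 1]
    ys, xs = [y[i] for i in keep], [x[i] for i in keep]
    ipw = [1.0 / norm_cdf(a + b * w[i]) for i in keep]       # Step 2: inverse predicted probability
    unweighted = weighted_slope(ys, xs, [1.0] * len(keep))
    weighted = weighted_slope(ys, xs, ipw)                   # Step 3: reweighted regression
    return (a, b), unweighted, weighted


(a_hat, b_hat), unweighted, weighted = ipw_demo()
print(f"probit: a={a_hat:.2f}, b={b_hat:.2f}; unweighted={unweighted:.2f}, weighted={weighted:.2f}")
```

In this setup the unweighted estimate in the selected subsample overstates the average effect, while the reweighted estimate should move back toward the full-sample value of one.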
Whichever method for bias correction is chosen by practitioners, they should always study whether differences between their original and the newspaper-based treatments are systematic by testing if pre-treatment characteristics can predict such differences. Table 1 provides an example of a covariate balancing test where we regress the absolute value of the annual difference on the USDA map versus newspaper-based boll weevil arrival dates on various 1890 county-level characteristics. Variables that consistently generate significant coefficients in this exercise should be controlled for in the main regression of interest.Footnote 21 Depending on the context of their study, other tests and placebos may be applicable, and practitioners should think about possible implementations relevant to their setting.
Notes: Cross-sectional county-level regressions of the absolute difference in predicted boll weevil arrival year from the USDA map and the newspaper-based measure (Column (1)), and indicators for whether this difference is more than a year or two years (Columns (2) and (3)) on standardized county observables in 1890. Observable characteristics include total population (in 1,000), percent Black population, percent urban population, percent farmland in cotton, number of farms per capita, total acres in farms, an indicator for whether the boll weevil arrived late (i.e., if the arrival date is later than the average arrival across all counties), percent manufacturing employment, the log manufacturing wage per capita, and total newspaper pages in the Newspapers.com database by state (in 1,000,000) from 1882 to 1932 times 1890 county population. All observable characteristics are standardized to have mean zero and variance one, except indicator variables, such that coefficients can be interpreted in terms of a one standard deviation increase in the associated variable. Additional geographic controls include latitude, longitude, and state fixed effects. Robust standard errors in parentheses. Significance levels are denoted by * p < 0.10, ** p < 0.05, *** p < 0.01.
Sources: Authors’ calculations from data in Hunter and Coad (1923), Haines (2010), and Newspapers.com.
REPLICATION OF CLAY, SCHMICK, AND TROESKEN (2019) AND AGER, BRUECKNER, AND HERZ (2017)
In this section, we replicate two recent papers that study the boll weevil’s impacts on pellagra deaths (Clay, Schmick, and Troesken 2019) and cotton productivity (Ager, Brueckner, and Herz 2017). Implementing our suggested approaches to measurement error based on historical newspaper data, we demonstrate the potential for such data to markedly reduce attenuation bias. Our results suggest that the impact of the boll weevil was larger than previously documented. Further, our analysis largely confirms the ranked pattern for the different measurement error approaches as suggested by Equation (6) in the previous section. Results are robust across the alternative specifications discussed in the previous section.
Replication of Clay, Schmick, and Troesken (2019)
Using annual data between 1915 and 1925 for counties in North and South Carolina, Clay, Schmick, and Troesken (2019) show that pellagra deaths decreased following the boll weevil infestation. They argue that this outcome can be explained by the resulting diversification in food production. After the boll weevil infestation, farmers switched from the prevailing cotton monoculture to more niacin-rich crops such as corn and sweet potatoes. This led to a decline in pellagra, a disease related to insufficient niacin consumption. Clay, Schmick, and Troesken (2019) estimate the following regression equation,
where ln[pellagra]ct is the log number of pellagra deaths, or the log pellagra death rate in other specifications, and boll weevil ct is an indicator for whether or not the boll weevil has arrived in county c as of time t. They provide results with and without the additional interaction of the boll weevil variable and an intensity measure. The latter is an indicator for whether a county was in the top quartile of either (i) the pre-treatment pellagra death rates measured as average for 1915/16 or (ii) cotton acres per capita in 1909. County and year fixed effects are captured by θ c and θ t, and standard errors are clustered at the county-level.
Our Table 2 replicates the corresponding Table 3 in Clay, Schmick, and Troesken (2019) using the arrival date from the USDA map (X 2) and our predicted arrival from the newspaper data (X 1). We label the treatment variable used by Clay, Schmick, and Troesken (2019) as X 2, as the results presented in Table 2 suggest that, for their application, the map-based measure contains less measurement error than that based on our newspaper data. Footnote 22 Each column corresponds to a different specification in Table 3 of Clay, Schmick, and Troesken (2019). Columns (1)–(4) report the impact of the boll weevil on pellagra deaths, and Columns (5)–(8) repeat the same exercise using the log pellagra death rate as the outcome. The table reports estimates of θ 1 in Equation (7); we return to θ 2 below. The first row reports the OLS results for our newspaper-based arrival date treatment. These coefficient estimates are statistically significant and of the same sign as those provided by Clay, Schmick, and Troesken (2019), except for one statistically insignificant coefficient in Column (4) (same sign, p-value = 0.11). The second row, which uses the map-based measure X 2, is the replication of Table 3 in Clay, Schmick, and Troesken (2019). The following rows report the coefficient estimates for each specification using the agreement sample, the parametric bias correction, and the IV regressions, respectively. Because Columns (2) to (4) and Columns (6) to (8) include the interaction term, the bias-correction estimate using Equation (5) was only produced for the specifications in Columns (1) and (5). However, the agreement sample approach is still valid under the interaction term models. For the IV models, we follow the standard approach of using the interacted instrument to instrument for the interaction itself.
While the IV interaction models do not technically fit the analysis in the theoretical section, the basic intuition still holds, and we believe that a comparison of the IV coefficients remains informative.
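To build intuition for how the rows of such a table relate to one another, the following sketch simulates a binary treatment observed through two independently misclassified measures and compares OLS, the agreement sample, and IV. It is an illustration under assumed values (true coefficient of 1, misclassification rates of 20 and 10 percent, no interaction term or fixed effects), not a reproduction of Table 2.

```python
import random

def cov(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)

def ols_slope(x, y):
    """Slope from a regression of y on x and a constant."""
    return cov(x, y) / cov(x, x)

def iv_slope(x, z, y):
    """Just-identified IV estimate of the slope on x, using z as the instrument."""
    return cov(z, y) / cov(z, x)

def misclassify(truth, flip_prob, rng):
    """Flip each binary value independently with probability flip_prob."""
    return [1 - t if rng.random() < flip_prob else t for t in truth]

rng = random.Random(42)
n = 20_000
beta = 1.0                                    # assumed true effect
truth = [int(rng.random() < 0.5) for _ in range(n)]
x1 = misclassify(truth, 0.20, rng)            # noisier measure (e.g., newspapers)
x2 = misclassify(truth, 0.10, rng)            # less noisy measure (e.g., a map)
y = [beta * t + rng.gauss(0.0, 1.0) for t in truth]

# Agreement sample: keep observations where both measures coincide.
same = [i for i in range(n) if x1[i] == x2[i]]

b_x1 = ols_slope(x1, y)
b_x2 = ols_slope(x2, y)
b_agree = ols_slope([x1[i] for i in same], [y[i] for i in same])
iv_x2 = iv_slope(x2, x1, y)                   # X2 instrumented with X1
iv_x1 = iv_slope(x1, x2, y)                   # X1 instrumented with X2

for label, value in [("OLS, X1", b_x1), ("OLS, X2", b_x2),
                     ("OLS, X1 = X2", b_agree),
                     ("IV, X2 by X1", iv_x2), ("IV, X1 by X2", iv_x1)]:
    print(f"{label:<14}{value:7.3f}")
```

With these assumed error rates, both OLS estimates fall below the true coefficient, both IV estimates lie above it, and the agreement sample estimate sits between the two sets, closest to the truth.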
Notes: Replication of Equation (1) in Clay, Schmick, and Troesken (2019) using the boll weevil’s arrival from the USDA map (X 2) and the predicted arrival based on newspapers (X 1). Columns (1) and (5) report OLS and IV regressions of deaths by pellagra on an indicator for whether the boll weevil has arrived in county c. The coefficients β BC are estimated using Equation (5) and the delta method. The rest of the columns report OLS and IV regressions of deaths by pellagra on a boll weevil indicator and its interaction term with an indicator for whether county c was in the top 25 percent of cotton production in 1909 (Columns (3), (4), (7), and (8)) or a dummy variable equal to one if county c was in the top 25 percent of pellagra death rates in 1915/16 (Columns (2) and (6)). The coefficients $${\beta _{OLS,\;{X_1} = {X_2}}}$$ are estimated using a subset of the sample for which X 1 and X 2 both provide the same value (i.e., an agreement sample). In IV regressions, X 1 is instrumented with X 2 and vice versa. The sample is 141 counties in North Carolina and South Carolina between 1915 and 1925. All regressions include county and year fixed effects. Controls include county c’s malaria death rate in 1915 and the share of urban population in 1910, both of which are interacted with a full set of year dummies. Standard errors are clustered at the county-level. Significance levels are denoted by * p < 0.10, ** p < 0.05, *** p < 0.01.
Sources: Authors’ calculations from data in Hunter and Coad (1923), Clay, Schmick, and Troesken (2019), and Newspapers.com.
Focusing on the main effect, θ 1, we draw four main conclusions from our results. First, as might be expected, our newspaper-based arrival measure appears to be noisier than that provided by the map. Nonetheless, we obtain similar, though smaller, estimates compared to those of Clay, Schmick, and Troesken (2019). Thus, in the absence of the USDA map, Clay, Schmick, and Troesken (2019) could have successfully conducted their study using information from newspaper data alone, which highlights the usefulness of digitized historical newspapers as a potential data source for economic historians. Second, the relationship between the various coefficient estimates is consistent with the prediction provided in Equation (6) of our theoretical section. The pattern is more easily seen visually; hence, we provide a version of Column (1) of Table 2 as a bar chart in Online Appendix Figure A.7. Third, for all eight columns, coefficient estimates from the agreement sample and parametric bias correction models are on the order of 40–60 percent larger than the original estimates of Clay, Schmick, and Troesken (2019), suggesting marked gains from our measurement error corrections. Finally, we note that in the two cases where we can implement our parametric bias correction model, these coefficient estimates are quite similar in magnitude to the agreement sample estimates.
The earlier discussion focused on the estimated main effect, θ 1. To account for the interaction term, θ 2, we report in Table 3 the estimated marginal boll weevil impact for counties in the top quartile of cotton production (Columns (3), (4), (7), and (8)) and of pellagra death rates (Columns (2) and (6)). Footnote 23 These results mimic those from Table 2. In all models, we obtain slightly attenuated but significant results based solely on the newspaper data. The agreement sample estimates are highly significant and larger in magnitude than those reported by Clay, Schmick, and Troesken (2019). The pattern of the IV estimates exactly matches the predictions from our theoretical section. Additional results that implement propensity score reweighting for the agreement sample are provided in Online Appendix Table A.1.
Replication of Ager, Brueckner, and Herz (2017)
To further validate our approach, we replicate a second paper, that of Ager, Brueckner, and Herz (2017), which refines Lange, Olmstead, and Rhode (2009) by considering cotton intensity in each county. Footnote 24 They study the boll weevil’s effect on Southern agriculture in terms of output, labor arrangements, and labor market outcomes using data from 13 Southern states between 1889 and 1929 in five- and ten-year intervals. Footnote 25 The authors show that the boll weevil reduced cotton output and productivity, the number of tenant farms, farm wages, and female labor force participation. They estimate the following linear regression model,
where y ct is a given outcome variable for county c in a given five-year period t. As in the previous study, BollWeevil ct is an indicator of whether a county is infested in the current five-year period. Cotton c,1889 is the demeaned acreage share of cotton planted in 1889 as a measure of cotton intensity. County and time fixed effects are captured by α c and β t, and standard errors are again clustered at the county-level. Because Ager, Brueckner, and Herz (2017) estimate models incorporating interaction terms in all specifications, we are not able to implement the bias correction model, β BC, and we thus focus attention on the agreement sample results as our preferred model.
Table 4 reports the resulting γ coefficients from estimating Equation (8). Footnote 26 Ager, Brueckner, and Herz (2017) find significant main effects in seven of the 12 models that they estimate. Using only our newspaper data, we also find significant results in each of these seven models, with our newspaper-based coefficient estimates being larger in magnitude for all but two of these models. The newspaper data also leads to significant estimates of the main effect in three of the five models in which Ager, Brueckner, and Herz (2017) find no effect. For this reason, we keep the same notation in terms of X 1 and X 2 as in Tables 2 and 3 (with X 1 reflecting the newspaper-based data). In six of the seven models where Ager, Brueckner, and Herz (2017) find statistically significant main effects, the agreement sample point estimates, $${\beta _{{X_1} = {X_2}}}$$, are larger in magnitude than those based on either the map data or the newspaper data, the exception being the estimated effect on corn yield in Column (7). In six of these seven models, the overall pattern of the OLS and IV estimates also matches the predictions of Equation (6). The only exception is again Column (7), where the agreement sample estimate is slightly below that of the map-based OLS estimate.
Notes: Replication of Equation (1) in Ager, Brueckner, and Herz (2017) using the boll weevil’s arrival from the USDA map (X 2) and the predicted arrival based on newspapers (X 1). OLS and IV regressions of agricultural and demographic outcome variables on an indicator for whether the boll weevil has arrived in county c and its interaction term with county c’s demeaned acreage share of cotton in 1889. The coefficients $${\beta _{OLS,\;{X_1} = {X_2}}}$$ are estimated using a subset of the sample for which X 1 and X 2 both provide the same value (i.e., an agreement sample). In the IV regressions, X 1 is instrumented with X 2 and vice versa. The sample includes counties in the U.S. South between 1889 and 1929. All regressions include county and year fixed effects as well as weather controls. Weather controls are January’s mean temperature and average summer precipitation from May to July. Standard errors are clustered at the county-level. Significance levels are denoted by * p < 0.10, ** p < 0.05, *** p < 0.01.
Sources: Authors’ calculations from data in Hunter and Coad (1923), Ager, Brueckner, and Herz (2017), and Newspapers.com.
To account for the continuous interaction terms in Ager, Brueckner, and Herz (2017), in Table 5 we present estimated marginal effects at the 75th percentile of cotton production. Footnote 27 The newspaper-based treatment yields significant OLS results in eight of the nine cases where the map-based data gives significant results. In five of these cases, the newspaper-based data leads to larger OLS estimates. The newspaper data also leads to significant OLS results in the three models that were insignificant when using the map-based data. Estimates using the agreement sample were again larger in magnitude than either newspaper-based or map-based OLS estimates in ten of the 12 models. Within the eight models where both data sets have predictive power, agreement sample estimates are on average 37 percent larger than the original estimates of Ager, Brueckner, and Herz (2017). Additional results that implement the propensity score reweighting for the agreement sample are provided in Online Appendix Table A.2.
Notes: Replication of Equation (1) in Ager, Brueckner, and Herz (2017) using the boll weevil’s arrival from the USDA map (X 2) and the predicted arrival based on newspapers (X 1). OLS and IV regressions of agricultural and demographic outcome variables on an indicator for whether the boll weevil has arrived in county c and its interaction term with county c’s demeaned acreage share of cotton in 1889. The coefficients $${\beta _{OLS,\;{X_1} = {X_2}}}$$ are estimated using a subset of the sample for which X 1 and X 2 both provide the same value (i.e., an agreement sample). In the IV regressions, X 1 is instrumented with X 2 and vice versa. The sample includes counties in the U.S. South between 1889 and 1929. All regressions include county and year fixed effects as well as weather controls. Weather controls are January’s mean temperature and average summer precipitation from May to July. Standard errors are clustered at the county-level. Significance levels are denoted by * p < 0.10, ** p < 0.05, *** p < 0.01.
Sources: Authors’ calculations from data in Hunter and Coad (1923), Ager, Brueckner, and Herz (2017), and Newspapers.com.
DISCUSSION OF PRACTICAL ISSUES AND FURTHER APPLICATIONS
Potential Gains, Future Applications, and Drawbacks
The replications have shown that newspaper data can be gainfully used for bias reduction in statistical analyses using historical data. We also found that the predictions based on the inequality in Equation (6) tend to hold up in applied examples. The gains in bias reduction appear to have been larger in the replication of Clay, Schmick, and Troesken (2019) than in the replication of Ager, Brueckner, and Herz (2017): corrected estimates were on the order of 40–60 percent larger than the original estimates in the former, compared with 37 percent larger on average in the latter. While we cannot offer a definitive explanation for this finding, a possible reason is the difference in the frequency of the time dimension. Clay, Schmick, and Troesken (2019) use annual data, whereas Ager, Brueckner, and Herz (2017) use five-year intervals, a much lower frequency that potentially mitigated some of the measurement error bias. Nonetheless, results in both papers held up in our replications and could be strengthened using our methods.
Our newspaper-based boll weevil arrival measure was generated in a fast and low-cost way. Compared to the USDA measure used by Clay, Schmick, and Troesken (2019), it appears to be noisier, which is to be expected. It would certainly be possible to refine the measure, but doing so would increase the time and cost of collecting the information. What we want to highlight instead is that our very coarse measure still produced very similar results in the two replications, meaning that both studies could have been conducted had the USDA map never existed. For the purpose of the methods introduced in this paper, it does not matter whether the data from the newspapers or the original variable (here the USDA map arrival date) is noisier as long as the measurement errors in the two variables are uncorrelated. Like the exclusion restriction in instrumental variable regressions, this assumption cannot be directly tested, but institutional knowledge and the robustness checks suggested in previous sections should help to increase our confidence in it. In the boll weevil case, we also argued that this assumption holds because newspapers reported any boll weevil-related events that were observed by newspaper reporters, whereas the USDA map was created by federal entomologists. The report by Hunter and Coad (1923), for which the USDA map was created, does not contain the words “newspaper,” “news,” “article,” or “journalist.” Conversely, searching Newspapers.com jointly for “USDA” and “boll weevil” returned only 59 hits, and the majority of those hits were due to transcription errors by the character recognition software. Footnote 28
Our approach is particularly suited for measures that can be easily generated or extracted using textual data. Simple n-gram or bag-of-words approaches, as in Beach, Clay, and Saavedra (2022), Ferrara and Fishback (2023), Albright et al. (2021), Beach and Hanlon (2023), Bazzi et al. (2023), or Ottinger and Winkler (2022), are particularly promising. Anything that can be measured or extracted with a single search word or a combination of a few words lends itself to this approach and the generation of newspaper-based data. For variables such as prices, this approach is less promising because these can rarely be extracted in a low-cost way as they oftentimes require more careful extraction by hand. Generation of data from newspaper articles is likely impractical for variables that would not ordinarily be reported in the news or for which the non-random nature of the availability of digitized newspapers might be a concern. For example, measures relating to corruption or trade might be more difficult to find in newspapers. Large-scale or salient events tend to be covered in newspapers, and our boll weevil infestation example fits into this category as Lange, Olmstead, and Rhode (2009, p. 685) noted: “the boll weevil is America’s most celebrated agricultural pest.” Other examples of such salient events studied in previous literature are the 1918 influenza pandemic (Beach, Clay, and Saavedra 2022), natural disasters across the United States (Boustan et al. 2020), labor strikes (Schmick 2018), the Tulsa race massacre in 1921 (Albright et al. 2021), or the Bradlaugh-Besant trial of 1877 (Beach and Hanlon 2023). Footnote 29 Among these examples, studies of the 1918 influenza that likely could have gainfully applied our methods are Almond (2006), Hatchett, Mecher, and Lipsitch (2007), Hilt and Rahn (2020), or Beach, Clay, and Saavedra (2022), all of whom use an intensity measure of the flu at the local level. Footnote 30
Newspaper information can also be used to generate data at the sub-county-level. Most online archives report the city, town, or place of publication. The data can then be combined with newly available crosswalks to sub-county locations for every individual in the census and consistently defined place names that are provided by the Census Place Project (Berkes, Karger, and Nencka 2023). Also, practitioners can simply search for newspaper pages containing the names of any sub-county areas for which they need to collect data. We provide an example using 960 sub-county areas (hereinafter towns) in North Carolina from Berkes, Karger, and Nencka (2023). Using newly scraped data for all pages from North Carolina newspapers that mention “boll weevil” and each town’s name, we compute the following town-level measure,
where %BW ot now captures the newspaper-based salience measure for the boll weevil in town o in year t. We then estimate the following equation,
where the regressors of interest are event-time indicators, indexed by ℓ, measured relative to the arrival of the boll weevil in town o in county c from the USDA map. Since the USDA map only provides the arrival date at the county-level, we assume that the map-based arrival date for town o is the same as that of its county c. The year before the arrival from the USDA map, ℓ = –1, serves as the baseline period. We include town fixed effects λ o and year fixed effects θ t, as opposed to county and state-by-year fixed effects in Equation (2). Standard errors are clustered at the town-level.
The result is shown in Online Appendix Figure A.8. Similar to our county-level analysis, newspaper analysis at the town-level finds that salience increases significantly after the arrival of the boll weevil in a given town’s county (as shown on the USDA map). Footnote 31 We also replicate the analysis in Figure 3 using three distinct towns in Alamance County, North Carolina. We find that the town-level salience measures strikingly resemble their county-level salience measure. Online Appendix Figure A.9 shows that the salience measures of Melvile (a township), Burlington (a city), and Patterson (an unincorporated community) follow a similar pattern to that of their county. Each town-level salience measure shows a small increase around 1904 and a peak around 1923. Furthermore, the maximum of MA(5) predicts that all three towns were infested by the boll weevil between 1922 and 1923. These predicted years are comparable to the arrival of the boll weevil in Alamance County based on the USDA map (1922) and our newspaper-based approach (1923). Notice that we do not take a stance regarding the interpretation of a boll weevil measure at the town-level since the weevil was mainly an issue in the countryside. Newspapers in towns most likely reported on the surrounding areas and not the towns themselves. The exercise here is mainly to highlight the potential usefulness of generating town-level data from digitized newspaper archives.
Practitioners must also be aware of other flaws and shortcomings affecting digitized newspaper archives. These archives do not contain the universe of all newspapers in the United States, and they also do not contain the universe of all articles. Papers from more populated places, such as larger cities, tend to be overrepresented. Particular states, such as Massachusetts, are poorly represented on Newspapers.com. Beach and Hanlon (2023) discuss these issues in more detail and provide potential solutions for attrition and sample selectivity in the context of digitized newspaper archives using newspaper directories and other external sources. Even though newspaper data can be generated at the sub-county-level and at high time frequencies, increased granularity comes at the expense of more noise and a higher chance of missing data.
Generalizing the Method to Other Settings
Both of our replications have focused on the boll weevil. This was to demonstrate that success in reducing bias in one study by employing our newspaper-based approach was not merely a fluke. However, one remaining question is how well the methods developed in this paper extend to other settings. We therefore replicated two additional studies where the treatment variables of interest are conceptually different in nature from the arrival of the boll weevil. The first of these two additional examples is a replication of the reduced form regression in Hilt and Rahn (2020), whose right-hand side variable is a county-level measure for the average distance to the nearest military camps that seeks to proxy for the severity of the 1918 influenza epidemic. Footnote 32 Even though a distance- rather than an arrival-based measure is conceptually different from our first set of replications, one may wonder whether our setting is solely applicable to natural events, such as agricultural pests or diseases, that spread in potentially similar fashions. We therefore also consider a human-made policy, namely the county-level adoption of prohibition policies in the early twentieth century, which were studied by Howard and Ornaghi (2021) with regard to the impact of such policies on population, farming, and investment outcomes. The Hilt and Rahn (2020) measure was a proxy to start with; hence, an argument for how a secondary measure can be helpful is easy to imagine. The prohibition adoption data used by Howard and Ornaghi (2021) originally came from Sechrist (2012). In Online Appendix Figure A.10, we document cases of counties that were reported as being dry in newspaper articles but that were recorded as non-dry in the Sechrist data.
To generate a measure for Spanish flu severity from newspaper data, we searched Newspapers.com for articles containing the search words “flu” and the county name within each state, as before. After standardizing this measure by the total number of newspaper pages mentioning each county name, we then considered areas to be hotspots of the 1918 influenza if they were in the top decile of this measure. Lastly, we computed the average distance to influenza hotspots for each county to mimic the average distance to military camps used as a proxy by Hilt and Rahn (2020). We use both the continuous distance measure as well as a binarized version that is equal to one for distances above the median distance. For the prohibition measure, we use the search terms “prohibition” and “dry” together with the county name in each year between 1890 and 1919 and divide this count variable by the total search hits for the county name in each year. We then predict the adoption of prohibition in each county by using the maximum of the five-year moving average of the share. We provide more detailed descriptions of how we generated our newspaper-based measures for the Spanish flu intensity distance measure and prohibition adoption in Online Appendix A.2.
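The prediction step for the prohibition measure can be sketched as follows. The yearly counts are made up for illustration, the function name is hypothetical, and the use of a centered five-year window over consecutive years is an assumption, since the text does not state how the moving average is aligned.

```python
def predicted_event_year(hits, totals, window=5):
    """Return the year at which the centered moving average of the yearly
    share hits / totals is largest, together with the smoothed series.

    Assumes the keys of `hits` are consecutive years that also appear in
    `totals`; only years with a complete window are considered.
    """
    years = sorted(hits)
    share = {yr: hits[yr] / totals[yr] for yr in years}
    half = window // 2
    smoothed = {}
    for i in range(half, len(years) - half):
        block = years[i - half:i + half + 1]
        smoothed[years[i]] = sum(share[yr] for yr in block) / window
    best_year = max(smoothed, key=smoothed.get)
    return best_year, smoothed

# Hypothetical search counts for one county, 1900 to 1910.
years = range(1900, 1911)
# Pages mentioning the search terms together with the county name.
hits = dict(zip(years, [10, 10, 20, 20, 30, 80, 120, 90, 40, 20, 20]))
# Pages mentioning the county name at all.
totals = {yr: 1000 for yr in years}

year, ma5 = predicted_event_year(hits, totals)
print(year, round(ma5[year], 3))
```

Dividing by the total number of hits for the county name keeps the measure from simply tracking growth in the number of digitized pages over time.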
Table 6 reports the results of these two replications. Column (1) shows our results using the Chalfin and McCrary (2018) approach for the continuous distance measure, and Column (2) shows our approach using the binary median split variable. The X 2 row of Column (1) replicates the corresponding result in Online Appendix Table A.6, Column (2), of Hilt and Rahn (2020), with a coefficient of –0.473 (s.e. = 0.218). Row 1 of the same column, which uses our newspaper-based measure, yields a coefficient of –0.565 (s.e. = 0.235). This confirms that the original influenza proxy used by Hilt and Rahn (2020) was very close to other measures of influenza severity. When instrumenting their distance to military camp variable with our newspaper-based influenza measure, we estimate a coefficient of –0.627 (s.e. = 0.264), which is larger in absolute terms, as the theory in Chalfin and McCrary (2018) would predict. The same is true when using the binarized version of the distance measure, where we can now also apply our preferred approach to reduce measurement error by using an agreement sample. Here, the Democratic vote share is predicted to decline by 3.56 percentage points if a county had an above-median military camp distance. This coefficient is significant at the 1 percent level. When a county with an above-median military camp distance also had an above-median distance to the nearest influenza hotspots (i.e., if it was in the agreement sample), then the estimated reduction in the Democratic vote share was 4.91 percentage points.
Notes: Columns (1) and (2) replicate the reduced-form results in Online Appendix Table A.6 of Hilt and Rahn (2020) using the average distance to military camps (X 2) and the average distance to influenza hotspots based on newspapers (X 1). OLS and IV regressions of the Democratic Party vote share on a continuous (Column (1)) and binary (Column (2)) measure of the 1918 influenza epidemic. Both regressions are weighted by population in 1920. Columns (3)–(8) replicate Equation (1) in Howard and Ornaghi (2021) using the introduction of Prohibition from Sechrist (2012) (X 2) and the predicted year of adoption based on newspapers (X 1). The sample only includes counties that adopted Prohibition between 1900 and 1919, both in Sechrist (2012) and our newspaper data. Columns (3)–(8) report OLS and IV regressions of economic outcome variables on an indicator for whether county c adopted prohibition after 1900 but before 1910 interacted with an indicator for the post period. The coefficients β BC are estimated using Equation (5) and the delta method. All regressions include county and state-by-year fixed effects as well as controls. Controls in Columns (1) and (2) are the share of population in urban areas and home ownership rate. Controls in Columns (3)–(8) include baseline religiosity and demographics. See Howard and Ornaghi (2021) for details. Standard errors are clustered at the county-level. Significance levels are denoted by * p < 0.10, ** p < 0.05, *** p < 0.01.
Sources: Authors’ calculations from data in Hunter and Coad (Reference Hunter and Coad1923), Hilt and Rahn (Reference Hilt and Rahn2020) , Howard and Ornaghi (Reference Howard and Ornaghi2021) , and Newspapers.com.
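The continuous-variable logic behind Column (1) can be illustrated with a small simulation. This is only a sketch on synthetic data, not a replication of Table 6: the true effect `beta`, the noise scales, and the sample size are all assumed values, and both measures are given classical, mutually independent errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000          # assumed sample size (synthetic data)
beta = -0.5          # assumed true effect, chosen only for illustration

# True regressor and two noisy measures with independent classical errors.
x_true = rng.normal(size=n)
x1 = x_true + rng.normal(scale=0.7, size=n)   # primary measure
x2 = x_true + rng.normal(scale=0.7, size=n)   # second measure (instrument)
y = beta * x_true + rng.normal(size=n)


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


# OLS on the mismeasured regressor is attenuated toward zero.
beta_ols = cov(y, x1) / np.var(x1, ddof=1)

# Simple IV (Wald) estimator: instrument x1 with x2.
beta_iv = cov(y, x2) / cov(x1, x2)

print(f"true beta: {beta:.3f}")
print(f"OLS on x1: {beta_ols:.3f}")
print(f"IV (x2 as instrument): {beta_iv:.3f}")
```

With these assumed variances the OLS coefficient shrinks by the factor 1 / (1 + 0.7²), while the IV estimate is centered on the true value, which mirrors the direction of the difference between the OLS and IV rows discussed above.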
Columns (3) to (8) in Table 6 report the results from the replication of Howard and Ornaghi (2021). The main takeaway from this exercise is that the agreement sample generates significantly larger estimated coefficients. We focus on results from the agreement sample, given that many of the instrumented coefficients are only noisily estimated. This highlights that certain measures are more precisely approximated with newspaper data, such as measures of distance or arrival dates and locations, especially when they are saliently featured in the news. Turning to the results, the agreement sample estimates an effect of local prohibition policies of 0.209 (s.e. = 0.052) for the log value of farm implements and of 0.269 (s.e. = 0.064) for log farm values. These estimates are, respectively, 1.6 and 2.1 times larger than the estimates obtained with the Sechrist (2012) prohibition data. These are large effects that may seem implausible a priori, and Howard and Ornaghi (2021, p. 813) say little that puts their estimates in perspective other than the following: “The increase in productivity is consistent with increased investment in labor-saving technology. The early twentieth century was a time of increased mechanization.” However, when we dug deeper into the topic ourselves, we found a contemporaneous paper in the Quarterly Journal of Economics by Coulter (1912, p. 11), who found that: “In 1900 the average value of all farm property per acre of land in farms was $24.37; in 1910 it was $46.64. This is an increase of 91.4 per cent during the decade.” Considering the stark developments in American agriculture and land values at the time, a prohibition-induced farm value increase of 26.9 log points therefore appears much more reasonable.
In summary, Howard and Ornaghi (2021) were potentially able to explain much more of the change in farm values at the time with their prohibition hypothesis than what their initial study had shown.
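To compare the agreement-sample estimate with Coulter's figures on a common scale, the short calculation below converts between log points and percent changes. It uses only numbers quoted in the text; the variable names are ours, and setting a county-level treatment effect next to a national trend is a rough comparison, not a decomposition.

```python
import math

# Agreement-sample coefficient for log farm values (from the text).
log_effect = 0.269
pct_effect = math.exp(log_effect) - 1          # log points -> percent change

# Coulter (1912): average farm property value per acre, 1900 and 1910.
value_1900, value_1910 = 24.37, 46.64
total_growth = value_1910 / value_1900 - 1     # percent change over the decade
total_log = math.log(value_1910 / value_1900)  # same change in log points

print(f"26.9 log points is about {pct_effect:.1%}")
print(f"1900-1910 growth: {total_growth:.1%} ({total_log * 100:.1f} log points)")
```

The second line reproduces Coulter's 91.4 per cent increase, so both quantities can be read either in percent or in log points.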
CONCLUSION
Measurement error in historical data is a frequent source of attenuation bias in the relationships that researchers seek to identify. When measurement error is classical, it is known that this attenuation bias can be removed via an instrumental variable approach. A potential instrument is a second measure of the same variable that is also measured with error, as long as the errors in the two measures are uncorrelated (Chalfin and McCrary 2018). Generating such a second measure tends to be expensive, and therefore measurement error tends to be ignored as long as some conventional level of statistical significance is achieved.
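The standard textbook argument behind this result can be written out in a few lines. The notation is assumed for illustration: X* denotes the true variable, u_1 and u_2 are the errors in the two measures, and both errors are taken to be mean-zero and uncorrelated with X*, with ε, and with each other.

```latex
\begin{align*}
  Y &= \beta X^{*} + \varepsilon, \qquad
  X_1 = X^{*} + u_1, \qquad
  X_2 = X^{*} + u_2, \\[4pt]
  \operatorname{plim} \hat{\beta}_{\mathrm{OLS}}
    &= \frac{\operatorname{Cov}(Y, X_1)}{\operatorname{Var}(X_1)}
     = \beta \, \frac{\sigma^{2}_{X^{*}}}{\sigma^{2}_{X^{*}} + \sigma^{2}_{u_1}}, \\[4pt]
  \operatorname{plim} \hat{\beta}_{\mathrm{IV}}
    &= \frac{\operatorname{Cov}(Y, X_2)}{\operatorname{Cov}(X_1, X_2)}
     = \frac{\beta \, \sigma^{2}_{X^{*}}}{\sigma^{2}_{X^{*}}}
     = \beta .
\end{align*}
```

The OLS limit is scaled by a factor strictly between zero and one whenever the error variance is positive, whereas the error terms drop out of both covariances in the IV ratio.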
In this paper, we introduce the idea of inexpensively generating such a second measure from digitized newspapers, which can be scraped or downloaded at low cost. We show how a newspaper-based secondary measure can be used to deal with measurement error when the variable of interest is either continuous or binary. The latter case is more challenging since measurement error in a binary variable is non-classical by construction, and therefore, an instrumental variable approach alone does not remove the associated bias (Bingley and Martinello 2017). Instead, we propose three alternative methods for dealing with measurement error in this setting based on (i) set identification, (ii) using an agreement sample where both the primary and secondary measure give the same answer, and (iii) a parametric bias correction that can be obtained as a nonlinear combination of the OLS and IV coefficients. Our theory predicts that OLS and IV provide the lower and upper bounds of the identified set that includes the true parameter, and that the coefficients from the agreement sample and the parametric bias correction should lie between these bounds. In addition, the bias-corrected estimate should still be larger in magnitude than the OLS coefficient from the agreement sample.
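The predicted ordering can be checked in a stylized simulation. Everything here is an assumption made for illustration: a balanced binary treatment, two measures that each misclassify independently with the same symmetric probability `alpha`, and a true effect of 1. In this special symmetric case the product of the OLS and IV limits equals the squared true effect, so their geometric mean is used below as a stand-in for a nonlinear combination of the two; it is not claimed to be the paper's Equation (5).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000      # assumed sample size (synthetic data)
beta = 1.0       # assumed true effect
alpha = 0.15     # assumed symmetric misclassification rate in each measure

# Balanced true binary treatment and two independently misclassified copies.
x_true = rng.integers(0, 2, size=n)
x1 = np.where(rng.random(n) < alpha, 1 - x_true, x_true)
x2 = np.where(rng.random(n) < alpha, 1 - x_true, x_true)
y = beta * x_true + rng.normal(size=n)


def slope(outcome, regressor):
    """Bivariate OLS slope of outcome on regressor."""
    return np.cov(outcome, regressor)[0, 1] / np.var(regressor, ddof=1)


beta_ols = slope(y, x1)                                # attenuated lower bound
beta_iv = np.cov(y, x2)[0, 1] / np.cov(x1, x2)[0, 1]   # inflated upper bound

agree = x1 == x2                                       # both measures coincide
beta_agree = slope(y[agree], x1[agree])

# Geometric mean of OLS and IV: exact only under the symmetric assumptions above.
beta_geo = np.sqrt(beta_ols * beta_iv)

for name, value in [("OLS", beta_ols), ("agreement", beta_agree),
                    ("geometric mean", beta_geo), ("IV", beta_iv)]:
    print(f"{name:>15}: {value:.3f}")
```

Under these assumptions the estimates line up as OLS, agreement sample, combined estimate, IV, with the true effect lying between the agreement-sample and IV coefficients.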
To test this prediction as well as to showcase our methods, we replicate two recent papers by Clay, Schmick, and Troesken (2019) and Ager, Brueckner, and Herz (2017) on the impact of the boll weevil infestation in the U.S. South between 1892 and 1922. As in most studies on the boll weevil, the main treatment is taken from a map of the pest by Hunter and Coad (1923), which arguably contains measurement error because of crossing lines and because the arrival dates are an imperfect measure of the economic impact of the beetle. To produce a second measure for the boll weevil’s arrival from digitized newspaper data, we scrape Newspapers.com and search for pages that mention “boll weevil” and each county’s name from all newspapers in the county’s state. This approach maximizes the chance of finding articles related to the arrival of the weevil in that county. In both replications, we find larger coefficients than in the original studies, which shows the usefulness of our approach to dealing with measurement error and also reaffirms the main results of the two papers. In both cases, we also find the patterns predicted in the theoretical section, where plain OLS yields the smallest coefficient, followed by the agreement sample and the parametric bias correction.
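The search rule described here (pages from any newspaper in the county's state that mention both “boll weevil” and the county's name) can be sketched as a filter over page records that have already been collected. The record layout (`state`, `year`, `text`), the function names, and the toy pages are hypothetical; no Newspapers.com interface is assumed, and the step from yearly counts to a predicted arrival year is not shown.

```python
from collections import Counter


def pages_mentioning(pages, county, state, phrase="boll weevil"):
    """Pages from newspapers in `state` whose text mentions `phrase` and `county`."""
    phrase_l, county_l = phrase.lower(), county.lower()
    return [
        p for p in pages
        if p["state"] == state
        and phrase_l in p["text"].lower()
        and county_l in p["text"].lower()
    ]


def mentions_by_year(pages, county, state):
    """Count matching pages per publication year for one county."""
    return Counter(p["year"] for p in pages_mentioning(pages, county, state))


# Invented toy records, for illustration only (not real newspaper text).
toy_pages = [
    {"state": "MS", "year": 1908, "text": "Cotton prices steady in Adams County."},
    {"state": "MS", "year": 1909, "text": "Boll weevil reported in Adams County fields."},
    {"state": "MS", "year": 1909, "text": "Adams county farmers fear the boll weevil."},
    {"state": "LA", "year": 1909, "text": "Boll weevil seen near Adams County line."},
]

counts = mentions_by_year(toy_pages, county="Adams", state="MS")
print(dict(counts))
```

Matching is case-insensitive and restricted to the county's own state, so the Louisiana page in the toy data is excluded even though it mentions both terms.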
The main contribution of the paper is to provide an easy way to generate a secondary measure for a given mismeasured variable of interest and to show how this secondary measure can be used to remove attenuation bias resulting from measurement error. We extend the framework in Chalfin and McCrary (2018) for classical measurement error to the case where a variable is binary. The emphasis is on the newspaper data being easily available: this substantially reduces the cost of generating a secondary measure for bias correction, and that cost is usually what deters researchers from applying such methods. We also contribute to a recent literature that has highlighted the usefulness of historical newspapers for generating novel data for research in economic history.