Diet-related causes are the leading risk factor for death and disability globally(Reference Soriano, Abajobir and Abate1) and, as of 2017, were responsible for 11 million deaths and 255 million disability-adjusted life years(Reference Afshin, John Sur and Fay2). In an effort to address this growing burden, dietary behaviour research has focused on associations between the retail food environment and diet(Reference Kleinert and Horton3), including how the community distribution of retailers and consumer factors such as food price, access, availability and promotion can influence diet and disease(Reference Giskes, van Lenthe and Avendano-Pabon4–Reference Glanz, Sallis and Saelens6). Within and across studies, there remains a great deal of heterogeneity in terms of methods, outcomes and data quality(Reference McKinnon, Reedy and Morrissette7–Reference Lytle and Sokol9).
The majority of food environment studies have used geographic information systems to process community-level data(Reference McKinnon, Reedy and Morrissette7–Reference Lytle and Sokol9). An important methodological issue in this literature is the use of commercially produced, unvalidated secondary datasets in the assessment of community food environments, which is a potential source of misclassification bias(Reference Lebel, Daepp and Block10,Reference Fleischhacker, Evenson and Sharkey11). For instance, in a recent case study, Lebel et al. noted variable levels of correlation between per capita exposures defined by commercially available business (CAB) data and government data when stratified by store type(Reference Lebel, Daepp and Block10).
Studies of CAB data are often structured as a comparison between CAB data and a ‘gold-standard’ government and/or primary dataset(Reference Lebel, Daepp and Block10,Reference Fleischhacker, Evenson and Sharkey11). These studies have applied conventional epidemiological diagnostic measures such as sensitivity and positive-predictive value (PPV)(Reference Trevethan12) to assess the accuracy of a CAB dataset(Reference Lebel, Daepp and Block10). Previous validation studies in the North American context have demonstrated a wide range of agreement between CAB and other datasets(Reference Lebel, Daepp and Block10,Reference Clary and Kestens13,Reference Daepp and Black14). While some studies have reported high levels of agreement between commercial and governmental datasets in urban centres(Reference Lebel, Daepp and Block10), others have indicated that government datasets are less error-prone and may be better for specific food environment measures(Reference Daepp and Black14). As the literature has grown, data accuracy in rural contexts has increasingly been studied(Reference Lebel, Daepp and Block10,Reference Fleischhacker, Evenson and Sharkey11). For example, density measures(Reference Lebel, Daepp and Block10) and representativity(Reference Clary and Kestens13) are advanced diagnostic analyses that aim to assess the impact of data accuracy in relation to disease risk; these newer analyses can further highlight the differential impact of error on rural and urban exposures.
There is a relative dearth of literature that explores the rates of error across regions of variable rurality and stratifies by store type in Canada. Although several studies have focused on rural regions in the USA(Reference Sharkey15–Reference McGuirt, Jilcott and Vu19) and UK(Reference Lake, Burgoine and Stamp20–Reference Wilkins, Radley and Morris22), rural Canada remains an understudied jurisdiction(Reference Minaker, Shuh and Olstad23) with disproportionate levels of diet-related risk factors and poor health(Reference Bruner, Lawson and Pickett24–Reference Shearer, Blanchard and Kirk27). The objective of this paper is to compare government and commercial datasets at the provincial scale using diagnostic measures of agreement, across a spectrum of population centres, within industry-defined store classifications and to report associated rates of geospatial error.
Methods
Setting
Newfoundland and Labrador (NL) is the easternmost province of Canada. Residents of NL report among the highest burdens of diet-related noncommunicable diseases(28–30) and obesity in Canada, and population-based assessments indicate significantly poorer dietary intake(Reference Garriguet31–35) than elsewhere.
As of the 2016 census, NL is home to 519 716 people, and the largest share of residents (approximately 42 %) live in rural areas. Our definition of rurality in the current study employs the Statistics Canada population centre classification scheme(36). A population centre is a dissemination block or set of contiguous dissemination blocks with a population count of at least 1000 and a density of at least 400 persons per km² (approximately 1036 persons per square mile). Large population centres have population counts of 100 000 or more; medium centres, 30 000–99 999; and small centres, 1000–29 999 people. All areas outside these blocks are classified as rural(36). In NL, there are 27 small population centres (approximately 24 % of the population), no medium centres and 1 large centre (approximately 34 % of the population); the remainder are rural areas (approximately 42 % of the population). For comparison, the corresponding population proportions for Canada are 13 % in small population centres, 9 % in medium centres, 60 % in large centres and 19 % in rural areas.
Data
The enhanced points of interest (EPOI) are a set of geocoded business and recreation points across Canada, compiled by DMTI Spatial. Attributes for these data include North American Industry Classification System (NAICS) and Standard Industrial Classification (SIC) codes, street address and phone number. EPOI 2015 data were provided by the Dalhousie GIS Centre, and points located in NL with a primary NAICS code of 445110 (Grocery Stores and Other Grocery), 445120 (Convenience Stores) or 447190 and 447110 (Gas Stations) were extracted.
EPOI data from the year 2015 were compared with a 2015 dataset from the NL provincial government. The government dataset was administrative data consisting of all licensed food premises in NL as of March 2015, obtained from the department responsible for on-site food safety inspections, governed by food premises legislation. Official government listings are considered a ‘gold standard’ alternative to researcher ground-truthed data(Reference Lebel, Daepp and Block10). A detailed description of the dataset has been provided previously(Reference Mah, Pomeroy and Knox37). Briefly, the inventory was cleaned and coded by business ownership type and NAICS code in consultation with government and a subsample verified using Google street view. The same NAICS categories employed for the EPOI data were used to classify the NL data, although it is noteworthy that there was an observable divergence in coded stores (Table 1). Further, it is important to consider that this strategy applied NAICS definitions consistently at all levels of rurality.
NAICS, North American Industry Classification System.
Estimates are provided with CI.
* Values include outliers; for a discussion of the impact of outliers on mean estimates, see the Results section.
For all following calculations, the government dataset is the gold-standard, and the CAB dataset is the test.
Matching algorithm
Stores in the EPOI layer were matched to stores in the ground-truthed layer by first examining the name of the store. If the name was an exact match, a verification of matching addresses was performed, and the stores were considered matches. If the store names were similar, the address and coordinates were matched; if the addresses were then identical, the stores were considered a match. In the event of missing or incomplete data, the completed fields were verified through the Yellow Pages directory or through manual search engine verification. Stores included in one of the datasets but not the other were matched with a blank record for all fields.
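The matching cascade described above can be sketched in code. The following is a minimal illustration only, not the procedure actually run for the study: the record fields (`name`, `address`), the normalisation step, the similarity measure and its 0·8 threshold are all assumptions, since the text does not define how ‘similar’ names were judged, and the manual verification of incomplete records is not modelled.

```python
from difflib import SequenceMatcher


def norm(text):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def names_similar(a, b, threshold=0.8):
    # The 0.8 threshold is a hypothetical choice; the source does not define "similar".
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold


def match_stores(epoi, gov, threshold=0.8):
    """Pair EPOI records with government records; unmatched records pair with None."""
    pairs = []
    unmatched_gov = list(gov)
    for e in epoi:
        hit = None
        # Pass 1: exact name match; pass 2: similar name. Both require identical addresses.
        for exact_pass in (True, False):
            for g in unmatched_gov:
                if exact_pass:
                    name_ok = norm(e["name"]) == norm(g["name"])
                else:
                    name_ok = names_similar(e["name"], g["name"], threshold)
                if name_ok and norm(e["address"]) == norm(g["address"]):
                    hit = g
                    break
            if hit is not None:
                break
        if hit is not None:
            unmatched_gov.remove(hit)
        pairs.append((e, hit))
    # Government-only stores are paired with a blank (None) EPOI record.
    pairs.extend((None, g) for g in unmatched_gov)
    return pairs


def tally(pairs):
    """Count (TP, FP, FN), treating the government dataset as the gold standard."""
    tp = sum(1 for e, g in pairs if e is not None and g is not None)
    fp = sum(1 for e, g in pairs if e is not None and g is None)
    fn = sum(1 for e, g in pairs if e is None and g is not None)
    return tp, fp, fn


# Entirely hypothetical records, for illustration only.
epoi_example = [
    {"name": "Smith's Grocery", "address": "1 Main St"},
    {"name": "Corner Gas", "address": "5 Water St"},
    {"name": "Bay Mart", "address": "9 Hill Rd"},
]
gov_example = [
    {"name": "smith's grocery", "address": "1 main st"},
    {"name": "Corner Gas Bar", "address": "5 Water St"},
    {"name": "Harbour Store", "address": "2 Cove Rd"},
]
example_pairs = match_stores(epoi_example, gov_example)
```

Pairing unmatched records with `None` mirrors the blank-record convention in the text and keeps every store in a single table from which TP, FP and FN can be counted.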
Analysis
To perform our analysis, the CAB dataset was assessed for accuracy with respect to the gold-standard. We cross-tabulated all true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Other descriptive statistics were then performed for TP, FP and FN stores. These cross-tabulations were repeated for the data after stratification by store-type, ownership and rurality. Stratification is a commonly used method to assess differences in exposures employed in food environment studies(Reference Daepp and Black14). Rural–urban classifications as well as store type and size are well-described potential stratifiers in the rural and regional context(Reference Cummins, Smith and Aitken38), and ownership (chain/independent) is an important emerging attribute given food industry consolidation(Reference Wrigley, Coe and Currah39).
Similar diagnostic assessments of secondary food environment datasets have been reported in detail elsewhere(Reference Daepp and Black14). To assess the accuracy of the CAB dataset in relation to the government dataset, three conventional diagnostic indicators were used: sensitivity, calculated as $\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$; PPV, calculated as $\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$; and concordance, calculated as $\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$. Diagnostic values of less than 0·001 % (<0·001) were considered poor, below 20 % (95 % CI 0·001, 0·20) slight, 21–40 % (95 % CI 0·21, 0·40) fair, 41–60 % (95 % CI 0·41, 0·60) moderate, 61–80 % (95 % CI 0·61, 0·80) good and over 81 % (95 % CI 0·81, 1·00) almost perfect, similar to a previous report that employed the Landis scale(Reference Lebel, Daepp and Block10,Reference Landis and Koch40). All CI were calculated by approximating a normal distribution as reported previously(Reference Ma, Battersby and Bell41).
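To make the three indicators concrete, the sketch below computes them from TP, FP and FN counts. It assumes that the normal approximation refers to the standard Wald interval for a proportion, which the text does not state explicitly, and that each upper category bound is inclusive. The counts in the example are hypothetical and are not study results.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) 95 % interval for a proportion, clipped to [0, 1].

    Assumption: this is the interval meant by "approximating a normal distribution".
    """
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def diagnostics(tp, fp, fn):
    """Return {indicator: (estimate, ci_low, ci_high)} for the three indicators."""
    denominators = {
        "sensitivity": tp + fn,       # TP / (TP + FN)
        "ppv": tp + fp,               # TP / (TP + FP)
        "concordance": tp + fp + fn,  # TP / (TP + FP + FN)
    }
    out = {}
    for name, denom in denominators.items():
        p = tp / denom
        out[name] = (p, *wald_ci(p, denom))
    return out


def agreement_label(value):
    """Map a proportion to the agreement categories listed in the text.

    Assumption: upper bounds are inclusive (e.g. exactly 0.40 is "fair").
    """
    if value < 0.001:
        return "poor"
    if value <= 0.20:
        return "slight"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "good"
    return "almost perfect"


# Hypothetical counts, chosen only to exercise the functions.
example = diagnostics(tp=50, fp=25, fn=50)
example_labels = {name: agreement_label(vals[0]) for name, vals in example.items()}
```

Note that true negatives do not enter any of the three formulas, so the indicators can be computed from matched and unmatched store counts alone.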
Finally, of the true-positive stores found in both datasets, values for NAICS code, store community as defined in both datasets (city, town or community name as listed) and geocoded positional accuracy within 100 m by Euclidean distance were compared for the EPOI data and the provincial dataset. To determine geocoded point accuracy, the distance between the provincial dataset point $(x_1, y_1)$ and its corresponding EPOI point $(x_2, y_2)$ was determined with the formula $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. All units were converted to metres. Analyses were performed with and without outliers; any values that fell more than 1·5 times the interquartile range above the third quartile or below the first quartile were deemed outliers. To avoid multiple regional projections and calculate relatively conservative estimates of error, EPSG:3857 was the coordinate system employed for point assignment and distance analysis.
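The distance and outlier rules above can be illustrated as follows. This is a sketch under stated assumptions: coordinates are taken to be already projected to a metre-based system, and quartiles use Python's ‘inclusive’ method, which may not match the quartile definition used in the original analysis. All distances shown are hypothetical.

```python
import math
import statistics


def euclidean_m(x1, y1, x2, y2):
    """Planar Euclidean distance between two projected points (units: metres)."""
    return math.hypot(x1 - x2, y1 - y2)


def within_buffer(distance_m, buffer_m=100.0):
    """True if a matched pair counts as accurately geocoded at the given buffer."""
    return distance_m <= buffer_m


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles.

    Assumption: 'inclusive' quartiles; other quartile definitions shift the fences.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Partition values into (kept, outliers) using the fences above."""
    low, high = tukey_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical positional errors in metres.
errors_m = [10, 20, 30, 40, 5000]
kept, outliers = split_outliers(errors_m)
```

Reporting the mean with and without the flagged values, as in the text, shows how far a few extreme positional errors can pull the average.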
All spatial analyses were conducted in ArcMap (ESRI), and all statistical analyses were conducted in R version 3.4.2 ‘Short Summer’.
Results
Table 2 shows the overall and store-level sample size, sensitivity, PPV and concordance between the commercial and government datasets. Overall agreement between the datasets was fair to moderate (Table 2). Grocery stores demonstrated the greatest agreement between datasets, followed by convenience stores and then gas stations. This ordering held for all three diagnostic indicators employed in our analysis. Fewer than 20 % of all gas stations in both datasets were TP.
PPV, positive-predictive value.
Table 3 shows the number of TP, FP and FN by store-type (NAICS), ownership (government dataset only) and population centre. By store type, the majority of stores were convenience stores; by population centre, the majority were in rural areas (Table 3). Because the secondary dataset lacks ownership information, the ownership distribution of FP could not be calculated, but the bulk of both TP and FN were independently owned stores.
The true-positive store accuracy in terms of NAICS assignments, community agreement, geospatial accuracy and overall agreement is shown in Table 1. Although the majority of true-positive stores were in rural areas (Table 3), the industrial classification and both measures of spatial accuracy were lowest in rural areas (Table 1). Just 4 % of stores in rural areas were classified accurately in terms of spatial location and NAICS assignment.
Small population centres and large population centres had similar sample sizes; both industrial classification agreement and spatial accuracy were poorer for small population centres (Table 1).
Finally, of the 380 true-positive stores, the maximum positional error between a truth and test point was a striking 372 km. The minimum error was <0·001 km, and the mean error was 17·72 km; 11 % of the sample, forty-two values, were more than 1·5 times the interquartile range above the third quartile (positive skew). Following outlier removal, the maximum positional error fell to 15·95 km and the mean error fell to 2·79 km. A sensitivity analysis of the buffer threshold for geocoding error revealed an approximately logarithmic curve in the percent of stores considered accurately geocoded as the buffer was relaxed from 100 m to approximately 20 km, at which point 90 % of TP stores were considered accurate.
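A buffer sensitivity analysis of this kind amounts to recomputing the share of matched stores that fall within each candidate threshold. The sketch below shows the calculation only; the distances and thresholds are hypothetical and do not reproduce the study data or its curve.

```python
def share_within(distances_m, thresholds_m):
    """For each buffer threshold, return the proportion of distances at or below it."""
    n = len(distances_m)
    return {t: sum(d <= t for d in distances_m) / n for t in thresholds_m}


# Hypothetical positional errors (metres) and buffer thresholds (metres).
example_distances = [50, 150, 900, 4000, 25000]
example_shares = share_within(example_distances, [100, 1000, 20000])
```

Plotting these shares against the threshold on a logarithmic axis is one way to inspect how quickly apparent geocoding accuracy improves as the buffer is relaxed.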
Discussion
This investigation of the validity of a CAB dataset using a government ground-truthed dataset has demonstrated variable levels of accuracy across population centres and industrial classifications. Notably, we observed grocery stores were captured in the CAB dataset with the highest accuracy among store types tested. Stores in large population centres had the greatest levels of spatial and NAICS code agreement across NL.
CAB data are an obvious entry point to food environment research, largely because of their relative ease of access. However, our work, which confirms and expands on existing literature in the rural Canadian context, suggests that CAB data are prone to error, with potentially differential consequences for study outcomes depending on the environment-diet hypothesis under study(Reference Lebel, Daepp and Block10,Reference Fleischhacker, Evenson and Sharkey11). For instance, research focused on absolute access to major grocery stores in urban centres may find CAB data viable for its purpose. Yet research in rural settings, which may prioritise access to the nearest available grocery outlet – often small general stores, convenience stores or gas stations – may suffer from associated bias, leading to inconclusive or inexact findings regarding the health impacts of store access or lack thereof. This may be one of the myriad reasons that community food environment predictors of diet remain ambiguous(Reference McKinnon, Reedy and Morrissette7,Reference Lytle and Sokol9,Reference Minaker, Shuh and Olstad23).
In comparison with other Canadian provinces, NL has a relatively high proportion of stores in rural areas and a high proportion of convenience stores(42). Despite this, NAICS classification and geocoding accuracy were still higher in both small and large population centres than rural areas. Although few regional assessments of food environments have been conducted at a province-wide scale, several potential explanations may be at play. First, large population centres may have more stable business markets, reducing business turnover and minimising the associated error with openings/closings. Second, if the CAB dataset geocoding strategy is based on street address and/or address proxies such as postal codes, the spatial algorithms that assign business locations may suffer in sparsely connected rural regions(Reference Khan, Pinault and Tjepkema43) or differ by commercial vendors(Reference Whitsel, Rose and Wood44). Indeed, a potentially critical difference between CAB datasets and government datasets may be the address files employed to geocode business locations. Measures can be taken to improve geocoding, but these measures are only effective if the underlying business information and geocoding address files are contemporaneously accurate(Reference McDonald, Schwind and Goldberg45,Reference Faure, Danjou and Clavel-Chapelon46). Third, if our observation that grocery stores are captured by the CAB with the highest levels of accuracy is related to random and not systematic error, the decreased accuracy in rural areas may reflect a relative dearth of grocery stores in the jurisdiction.
Varying degrees of CAB data accuracy by store-type and location have significant implications for research design and data sourcing considerations. Food environment researchers in rural settings can pursue complementary strategies for data collection and triangulation(Reference Sharkey15,Reference Caspi and Friebur47–Reference Fleischhacker, Evenson and Sharkey49), and government data may provide a better point of departure(Reference Daepp and Black14). Further, although CAB data may be useful in urban settings, our data suggest that smaller population centres may still suffer from bias; the urban utility of CAB data may only extend to grocery stores in large population centres.
Limitations
Because sample size can affect rates of diagnostic error, stratification may have affected diagnostic outcomes(Reference Lebel, Daepp and Block10). Additionally, the government dataset we used as a gold standard is secondary administrative data; frequency of site visits is tailored to level of food safety risk and, in our jurisdiction, food inspection data are ideally reviewed and collected in 3-year cycles(Reference Mah, Pomeroy and Knox37). The collection and ground-truthing of these data are constrained by associated limitations in public health service budgets. To mitigate these issues, we partnered with government throughout the research to support data access and accuracy.
Conclusion
The current research is the first province-wide data validation analysis in Atlantic Canada and is potentially generalisable to other heterogeneous sub-national jurisdictions (regions, states, territories). The use of geographic scales that align with government administrative regions has significance for policymaking. Our findings suggest that CAB data are less accurate in rural regions and may identify and classify grocery stores with higher accuracy than convenience stores and gas stations. Researchers should evaluate multiple data collection strategies at the community level and partner with local institutions when possible. It is crucial to recognise that dataset errors may in fact be a function of policy-relevant service considerations. The current research emphasises the impact of systematic error in CAB data for researchers working in rural sampling frames or implementing an analytic strategy encompassing retailers not classified as grocers.
Acknowledgements
Acknowledgements: The Dalhousie GIS Centre provided access to the commercial data. Thanks to the Centre of Geographic Sciences for support throughout the project. The NL Statistics Agency and Service NL assisted with access to and coding of administrative data. Financial support: This work was supported in part by Health Canada’s Office of Nutrition Policy and Promotion (MOA #4500327812 to C.L.M.); the Canadian Institutes of Health Research (FRN PG1-144782 to C.L.M.); the Canada Research Chairs program (to C.L.M.) and the Nova Scotia Graduate Scholarship program (to N.G.A.T.). Conflict of interest: None to declare. Authorship: N.G.A.T. co-led the data processing, analysis and interpretation and led manuscript preparation; J.S. led data processing and analysis; C.M. formulated the research question and study design and led implementation, interpretation, co-led manuscript preparation and supervised all components of the project. Ethics of human subject participation: This research did not involve human participants. This project used institutionally and commercially available retailer data, and this work did not require the approval of a research ethics board.