Data Cleaning

doi:10.1017/9781009010054.022

21 - Data Cleaning

from Part IV - Statistical Approaches

Published online by Cambridge University Press: 25 May 2023

Solveig A. Cunningham and Jonathan A. Muir

Edited by Austin Lee Nichols and John Edlund

Austin Lee Nichols, Affiliation: Central European University, Vienna
John Edlund, Affiliation: Rochester Institute of Technology, New York

Summary

High-quality data are necessary for drawing valid research conclusions, yet errors can occur during data collection and processing. These errors can compromise the validity and generalizability of findings. To achieve high data quality, one must approach data collection and management by anticipating the errors that can occur and establishing procedures to address them. This chapter presents best practices for data cleaning to minimize errors during data collection and to identify and address errors in the resulting data sets. Data cleaning begins during the early stages of study design, when data quality procedures are set in place. During data collection, the focus is on preventing errors. When entering, managing, and analyzing data, it is important to be vigilant in identifying and reconciling errors. During manuscript development, reporting, and presentation of results, all data cleaning steps taken should be documented and reported. With these steps, we can ensure the validity, reliability, and representativeness of the results of our research.

Keywords

Data Cleaning; Data Management; Quality Control; Quantitative Methods

Type: Chapter
Information: The Cambridge Handbook of Research Methods and Statistics for the Social and Behavioral Sciences, Volume 1: Building a Program of Research, pp. 443–467

DOI: https://doi.org/10.1017/9781009010054.022

Publisher: Cambridge University Press

Print publication year: 2023
