21 - Data Cleaning

from Part IV - Statistical Approaches

Published online by Cambridge University Press:  25 May 2023

Austin Lee Nichols
Affiliation:
Central European University, Vienna
John Edlund
Affiliation:
Rochester Institute of Technology, New York

Summary

High-quality data are necessary for drawing valid research conclusions, yet errors can occur during data collection and processing. These errors can compromise the validity and generalizability of findings. To achieve high data quality, one must approach data collection and management by anticipating the errors that can occur and establishing procedures to address them. This chapter presents best practices for data cleaning, both to minimize errors during data collection and to identify and address errors in the resulting data sets. Data cleaning begins during the early stages of study design, when data quality procedures are put in place. During data collection, the focus is on preventing errors. When entering, managing, and analyzing data, it is important to be vigilant in identifying and reconciling errors. During manuscript development, reporting, and presentation of results, all data cleaning steps taken should be documented and reported. These steps help ensure the validity, reliability, and representativeness of research results.

Type
Chapter
Information
Publisher: Cambridge University Press
Print publication year: 2023
