
The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning

Published online by Cambridge University Press:  26 February 2021

Ranjit Lall*
Affiliation:
Department of International Relations, London School of Economics and Political Science, London, UK. Email: [email protected]
Thomas Robinson
Affiliation:
School of Government and International Affairs, Durham University, Durham, UK. Email: [email protected]
*Corresponding author: Ranjit Lall

Abstract

Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS’s accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
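To make the core idea concrete, the sketch below shows how a denoising autoencoder can be repurposed for multiple imputation in the spirit described above: missing cells are treated as corrupted inputs, the reconstruction loss is computed only on originally observed cells, and several completed datasets are drawn from the trained network. This is a minimal illustration written in Python with PyTorch, not the authors' released MIDAS software; the function name `multiple_impute`, the network architecture, and all hyperparameters are hypothetical, and it assumes a fully numeric data matrix with NaNs marking missing values.

```python
# Illustrative sketch only: a denoising autoencoder used for multiple imputation.
# Assumptions (not from the paper's software): numeric matrix X, NaN = missing,
# input corruption via dropout, draws kept stochastic by leaving dropout active.
import numpy as np
import torch
import torch.nn as nn

def multiple_impute(X, m=5, epochs=200, lr=1e-3, hidden=64, dropout=0.5):
    X = np.asarray(X, dtype=np.float32)
    observed = ~np.isnan(X)                       # mask of originally observed cells
    X_filled = np.where(observed, X, 0.0)         # zero out missing cells

    x = torch.tensor(X_filled, dtype=torch.float32)
    obs = torch.tensor(observed, dtype=torch.float32)

    n_cols = X.shape[1]
    model = nn.Sequential(
        nn.Dropout(dropout),                      # corrupt inputs (denoising step)
        nn.Linear(n_cols, hidden), nn.ELU(),
        nn.Dropout(dropout),
        nn.Linear(hidden, n_cols),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        opt.zero_grad()
        recon = model(x)
        # reconstruction error computed only on cells that were actually observed
        loss = (((recon - x) ** 2) * obs).sum() / obs.sum()
        loss.backward()
        opt.step()

    # Draw m completed datasets; dropout stays on, so draws differ across passes.
    imputations = []
    for _ in range(m):
        with torch.no_grad():
            draw = model(x).numpy()
        imputations.append(np.where(observed, X, draw))  # keep observed values
    return imputations
```

As in any multiple-imputation workflow, the m completed datasets returned by such a procedure would then be analyzed separately and the results combined across datasets.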

Type
Article
Copyright
© The Author(s) 2021. Published by Cambridge University Press on behalf of the Society for Political Methodology

Footnotes

Edited by Jeff Gill

Supplementary material

Lall and Robinson Dataset (link)
Lall and Robinson supplementary material (PDF, 396.9 KB)