An Intelligent Missing Data Imputation Techniques: A Review

Kimseth Seu - Department of Software Convergence, Soonchunhyang University, Asan, Republic of Korea
Mi-Sun Kang - Department of Computer Software Engineering, Soonchunhyang University, Republic of Korea
HwaMin Lee - Department of Medical Informatics, Korea University, Seoul, Republic of Korea

Incomplete datasets are an unavoidable problem in data preprocessing, because most machine learning algorithms cannot be trained on data that contain missing values. Many imputation approaches have been proposed, and compared against one another, to resolve this problem. These methods predict the most plausible value for each missing entry using different machine learning algorithms and underlying concepts. Accurate imputation is especially critical for some datasets, particularly medical data. The purpose of this paper is to review and compare four established state-of-the-art methods: K-Nearest Neighbors imputation (KNNImputer), Bayesian Principal Component Analysis (BPCA) imputation, Multiple Imputation by Chained Equations (MICE), and Multiple Imputation with Denoising Autoencoders (MIDAS). Each of these methods offers a practical way to estimate appropriate values for missing data points. We evaluate all four techniques on the same four datasets, which were collected from a hospital. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used to measure imputation quality and to compare the methods, in order to identify the most robust and appropriate one for missing data problems. In our experiments, KNNImputer and MICE performed better than BPCA and MIDAS, and BPCA performed better than MIDAS.
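The evaluation protocol described above (hide known values, impute them, then score the estimates with MAE and RMSE) can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce scikit-learn's KNNImputer. The toy table, the masked cells, the choice of k = 2, the column-mean baseline, and all function names are assumptions made for illustration only.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both rows,
    rescaled to account for the coordinates that are missing."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(squared * len(a) / len(pairs))

def knn_impute(rows, k=2):
    """Replace each None with the mean of that column over the k nearest
    rows (by nan_euclidean) in which the column is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def mean_impute(rows):
    """Baseline: replace each None with the mean of the observed column values."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        column_mean = sum(observed) / len(observed)
        for r in filled:
            if r[j] is None:
                r[j] = column_mean
    return filled

def mae(truth, estimate):
    """Mean Absolute Error over the masked cells."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)

def rmse(truth, estimate):
    """Root Mean Square Error over the masked cells."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))

# Hypothetical complete table with two clusters of similar rows.
complete = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [0.9, 1.9, 2.9],
    [8.0, 9.0, 10.0],
    [8.2, 9.2, 10.2],
    [7.9, 8.9, 9.9],
]
masked = [(1, 2), (3, 0)]  # cells hidden to simulate missing data

observed = [list(r) for r in complete]
for i, j in masked:
    observed[i][j] = None

truth = [complete[i][j] for i, j in masked]
knn_filled = knn_impute(observed, k=2)
mean_filled = mean_impute(observed)

knn_mae = mae(truth, [knn_filled[i][j] for i, j in masked])
knn_rmse = rmse(truth, [knn_filled[i][j] for i, j in masked])
mean_mae = mae(truth, [mean_filled[i][j] for i, j in masked])
mean_rmse = rmse(truth, [mean_filled[i][j] for i, j in masked])

print(f"KNN  imputation: MAE={knn_mae:.4f} RMSE={knn_rmse:.4f}")
print(f"Mean imputation: MAE={mean_mae:.4f} RMSE={mean_rmse:.4f}")
```

Scoring only the masked cells keeps the comparison fair, since every method is judged on exactly the same hidden values. On this toy table the nearest-neighbor estimates land much closer to the hidden values than the column means do, but that says nothing about how the four reviewed methods rank on real hospital data.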


Data imputation technique; missing data; machine learning; deep learning
