Missing Data Imputation Methods in Classification Contexts

Authors: Juheng Zhang
DIN: IJOER-APR-2016-21
Abstract

We examine different imputation methods for handling missing data in classification contexts and compare their performance in an experimental study. We first investigate the methods under the assumption that data are missing at random. We find that, as the number of missing values in the data increases, the imputation methods deteriorate and their misclassification rates increase. We also examine the scenario where data are missing because of strategic behavior by data providers. We find that imputation methods play an important role in deterring such strategic behavior and in minimizing the misclassification rate.

Keywords
missing data, imputation method, classification
Introduction

In many empirical studies, data are missing for a variety of reasons. Missing data may be caused by negligence on the part of data collectors, by poor experimental designs or procedures, or even by data providers deliberately hiding information. The two general assumptions about missing data are that data are missing at random and that data are missing strategically. The missing-at-random assumption holds that whether a value of an attribute is missing is related neither to the value itself nor to the values of other attributes. For instance, when a specific home address is missing from U.S. census data, the cause is most likely random. Under the strategically-missing assumption, data are missing for strategic reasons. For instance, an insurance applicant may purposely hide her or his smoking or drinking habits when applying for health insurance, in the hope of improving the chance of approval. Another example is limited information disclosure in financial markets [5, 6], where certain companies strategically hide information from investors. Missing data are a common problem in many research fields, such as economics, marketing, health, statistics, psychology, and education.

Missing data can lead to a number of problems [8]. Achieving a high level of statistical power requires a large amount of data. When data are missing, the sample size decreases dramatically if only observations with complete data are used. Empirical studies have found that if two percent of the values in a data set are missing at random, then as much as eighteen percent of the total data can be lost when every observation containing a missing value is removed. Missing data therefore decrease statistical power.
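The eighteen percent figure can be reproduced with simple arithmetic under assumptions that are ours, not the study's: if each value is missing independently with probability two percent and each observation has ten attributes, the chance that an observation contains at least one missing value is 1 - 0.98^10, or about 18.3 percent. The sketch below computes this and checks it against a small simulation; the function names and the choice of ten attributes are hypothetical and serve only as an illustration.

```python
import random


def expected_row_loss(cell_missing_rate: float, n_attributes: int) -> float:
    """Expected fraction of observations with at least one missing value,
    assuming each value is missing independently with the same probability."""
    return 1.0 - (1.0 - cell_missing_rate) ** n_attributes


def simulated_row_loss(cell_missing_rate: float, n_attributes: int,
                       n_rows: int = 20000, seed: int = 0) -> float:
    """Fraction of simulated observations that complete-case deletion drops."""
    rng = random.Random(seed)
    dropped = 0
    for _ in range(n_rows):
        # An observation is dropped if any of its attributes is missing.
        if any(rng.random() < cell_missing_rate for _ in range(n_attributes)):
            dropped += 1
    return dropped / n_rows


if __name__ == "__main__":
    # Assumed setting: 2% of values missing, 10 attributes per observation.
    print(round(expected_row_loss(0.02, 10), 3))   # about 0.183
    print(simulated_row_loss(0.02, 10))
```

Under these assumptions the loss grows quickly with the number of attributes, which is why deleting incomplete observations becomes costly even when the per-value missing rate is small.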

In this study, we consider different imputation methods that are designed either for randomly missing data or for strategically missing data. We compare the performance of the imputation methods in classification contexts under the assumption that data are missing at random. We also examine the imputation methods when data providers act strategically and data are hidden intentionally. In the following section, we review related work and briefly discuss the different imputation methods.

Conclusion

We compare eight imputation methods in the case where data are missing at random and in the case where data are missing strategically. We find that, as the percentage of missing data increases, the performance of all eight imputation methods decreases. When data are missing strategically, the D method or the DNeg method gives the lowest misclassification rate.
