Handling missing values
As you may know, one of the most important, or at least most time-consuming, parts of the whole data mining process is data preprocessing. One common task concerns missing values. In most databases, they are denoted NaN (Not a Number) or simply ?. Before normalizing or standardizing a data set, you should take care of these values.
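As a first step, it helps to find out how many values are actually missing. Here is a minimal sketch in plain Python, assuming a CSV file in which missing entries are written as ? or NaN. The token set, the column names and the sample rows are made up for illustration.

```python
import csv
import io

# Tokens treated as "missing". This set is an assumption; adjust it per data set.
MISSING_TOKENS = {"?", "NaN", "nan", ""}

def read_with_missing(text):
    """Parse CSV text, mapping missing-value tokens to None."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for raw in reader:
        rows.append({key: (None if value.strip() in MISSING_TOKENS else value.strip())
                     for key, value in raw.items()})
    return rows

def missing_counts(rows):
    """Count the missing entries in each column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts

# Made-up sample: one missing value per column.
sample = "age,income,city\n34,52000,Lyon\n?,48000,Geneva\n29,NaN,?\n"
rows = read_with_missing(sample)
print(missing_counts(rows))
```

Keeping the missing entries as None (rather than 0 or an empty string) makes it harder to normalize or standardize them by accident.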
For handling missing values, several techniques exist:
In many research papers, the technique used to handle missing values is not explicitly mentioned. A lot of data mining research papers use the UCI repository for testing algorithms. However, most of the data sets in this repository have (several) missing values. So what did the authors do? Did they ignore the records or use a specific method? And what do you use for handling missing values?
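To see why the answer matters, note that even "ignoring" can mean two different things: dropping the incomplete records, or dropping the incomplete attributes. The sketch below uses a tiny made-up data set, with None standing for a missing value; the function names are hypothetical.

```python
def drop_incomplete_rows(rows):
    """Keep only the records that have no missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def drop_incomplete_columns(rows):
    """Keep only the attributes that are never missing."""
    complete = [key for key in rows[0] if all(row[key] is not None for row in rows)]
    return [{key: row[key] for key in complete} for row in rows]

# Made-up data: "b" and "c" each have one missing value.
data = [
    {"a": 1.0, "b": 2.0, "c": None},
    {"a": 3.0, "b": None, "c": 1.0},
    {"a": 5.0, "b": 6.0, "c": 7.0},
]

by_rows = drop_incomplete_rows(data)
by_cols = drop_incomplete_columns(data)
print(len(by_rows), list(by_rows[0]))  # records kept, attributes kept
print(len(by_cols), list(by_cols[0]))
```

On this toy example, the first option keeps one record with all three attributes, while the second keeps all three records but only one attribute. Two papers that both "ignored" missing values could therefore be working on quite different data.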
6 comments:
This is actually a very well studied subject in statistics, so there is less need for guessing. I think this is mostly a question of how rigorous one wants to be. I list several resources on this subject in Missing Values and Special Values: The Plague of Data Analysis.
I would propose adding another criterion to Sandro's list.
Treat the missing value as "additional information" rather than substituting a value.
An illustration of this would be a response to the survey question "How many years of college did you attend?", with the choices being 0-4, 5 or more.
Someone who is easily bored might leave the answer to this (as well as many other questions) blank (missing), and it might be considered incorrect to impute a value for missing, when it would be more important to understand that this respondent loses interest easily and might be a bad target for a potential product.
Ralph Winters
While it may sometimes be useful to code a "shadow variable", indicating missing/non-missing, it will be important to understand, as best as possible, the mechanism for missingness. If this mechanism varies over time, for instance, use of such a variable could be ruinous.
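A minimal sketch of such a shadow variable, in plain Python and under a few assumptions: records are dicts, None marks a missing value, and the "_missing" suffix and the survey data are made up for illustration. Comparing the missing rate between two collection periods is one crude way to check whether the mechanism may have changed.

```python
def add_shadow_variable(rows, column):
    """Return copies of the rows with an extra 0/1 flag marking a missing `column`."""
    flag = column + "_missing"  # hypothetical naming convention
    return [{**row, flag: int(row.get(column) is None)} for row in rows]

def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(row.get(column) is None for row in rows) / len(rows)

# Made-up survey records from two collection periods.
period_1 = [{"years_college": 4}, {"years_college": None},
            {"years_college": 2}, {"years_college": 0}]
period_2 = [{"years_college": None}, {"years_college": None},
            {"years_college": 5}, {"years_college": None}]

flagged = add_shadow_variable(period_1, "years_college")
print(flagged[1])
print(missing_rate(period_1, "years_college"),
      missing_rate(period_2, "years_college"))
```

Here the missing rate jumps from 0.25 to 0.75 between the two periods. A model that learned what the flag means in the first period would probably be misled in the second.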
In a similar vein, numeric predictors are sometimes preprocessed through binning, with the original values being replaced by the bin mean of the dependent variable. Missing values may be accommodated by treating them as one more bin. This is a handy trick for linearizing the relationship between individual predictors and the dependent variable, but the same warning as above applies here: a change in the mechanism by which missing values are generated will disrupt such treatment.
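The binning trick described above can be sketched as follows. This is only an illustration under stated assumptions: the bin edges, the data and the function names are made up, None marks a missing predictor value, and the bin means are computed on the same data they are applied to (in practice they would come from a training set).

```python
from collections import defaultdict

def bin_label(value, edges):
    """Map a value to a bin index; a missing value (None) gets its own bin."""
    if value is None:
        return "missing"
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def fit_bin_means(xs, ys, edges):
    """Mean of the dependent variable within each bin of the predictor."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        label = bin_label(x, edges)
        sums[label] += y
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def encode(xs, edges, bin_means):
    """Replace each predictor value by the mean of its bin."""
    return [bin_means[bin_label(x, edges)] for x in xs]

# Made-up data: one bin edge at 5, two missing predictor values.
xs = [1, 2, 6, 7, None, None]
ys = [0, 1, 1, 1, 0, 0]
edges = [5]

means = fit_bin_means(xs, ys, edges)
encoded = encode(xs, edges, means)
print(means)
print(encoded)
```

The "missing" bin gets the mean 0.0 only because both missing cases here happen to have y = 0. If the reason for missingness changes later, that mean no longer describes the new missing cases, which is exactly the warning above.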
---
I'd like to suggest an additional reference: Sampling: Design and Analysis by Sharon Lohr (ISBN 0-534-35361-4), which is specifically on sampling and surveying, but which includes an entire chapter on "nonresponse".
Thanks for your comments Will and Ralph.
Concerning the "need for guessing", I was thinking more of guessing what people do in their work, since it is generally not clear. If you want to compare your results on some DM algorithm (using a standard UCI data set, for example) with other papers in the literature, you need to know how they treated missing values.
If nothing is explicitly stated, should we infer that the authors of the research simply ignore missing values?
"Ignored" them how, though? Logical methods may ignore missing values outright, but numerical methods need a number to perform calculations. If missing values are ignored with numerical methods, either observations or variables will be dropped, and which of the two it is matters a great deal.
In past research I have run into exactly the problem Sandro alludes to: often researchers don't indicate clearly how they handle missing data (or outliers or other data anomalies, for that matter).
Regarding the subject of missing data, as Will writes, it is a well-studied problem, though the solutions I have found are data- and application-dependent. It all depends on the information that one wants to convey to the learning algorithms. It is also a reason why I am very skeptical of any software that claims to require no data prep (because it does everything automatically for you).
There is much to be said for imputation methods like what CART does, for instance, but for principles of handling missing data, a good overview can be found here