Missing data

From Knowledge Discovery


Data can be missing at random, in which case a good first-order approximation is to replace each missing value with the mean of that feature.
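As a minimal sketch of mean imputation, assuming missing values are represented as None in a plain Python list (the function name impute_mean and the sample data are illustrative, not from the original page):

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        # With no observed values there is no mean to fall back on.
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical feature column with two missing entries.
ages = [20.0, None, 40.0, 30.0, None]
imputed = impute_mean(ages)
print(imputed)
```

The mean is computed from the observed values only, so the imputed entries leave the feature's mean unchanged.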

Often, however, data is not missing at random. In that case the usual approach is to add an indicator variable that records whether the value is missing or, for a categorical variable, to add one more category (e.g. {red, green, blue, missing}).
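For the categorical case, one possible sketch is shown here. It assumes missing values are stored as None and that the variable is one-hot encoded afterwards; the helper names and the sample data are hypothetical.

```python
def add_missing_category(values, label="missing"):
    """Map None to an explicit extra category."""
    return [label if v is None else v for v in values]


def one_hot(values, categories):
    """Encode each value as a 0/1 vector over the given category order."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical colour feature with two missing entries.
colours = ["red", None, "blue", "green", None]
categories = ["red", "green", "blue", "missing"]

encoded = one_hot(add_missing_category(colours), categories)
for row in encoded:
    print(row)
```

Each missing entry gets its own column, so a model can learn a separate effect for "missing" instead of having it folded into one of the real categories.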


In a regression setting, the indicator approach replaces a feature x with two features:

x_a = if (missing(x)) then 0 else x
x_b = if (missing(x)) then 1 else 0
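The two definitions of x_a and x_b can be written as runnable code roughly as follows. This is only a sketch: it assumes missing values are represented as None, and the function name split_with_indicator is made up for illustration.

```python
def split_with_indicator(column):
    """Return (x_a, x_b) for a numeric column that may contain None.

    x_a holds the original value, or 0 where the value is missing.
    x_b is 1 where the value is missing and 0 otherwise.
    """
    x_a = [0.0 if v is None else v for v in column]
    x_b = [1 if v is None else 0 for v in column]
    return x_a, x_b


# Hypothetical feature with two missing entries.
x = [2.5, None, 4.0, None]
x_a, x_b = split_with_indicator(x)
print(x_a)
print(x_b)
```

In a linear regression fitted on both features, x_a contributes nothing for rows where x is missing, so the coefficient on x_b captures the adjustment for those rows.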


See also: http://www.lshtm.ac.uk/msu/missingdata/start.html
