There are many ways to deal with missing values in data sets. For continuous data, we may impute missing values with the mean or the median. For categorical data, we may impute missing values with the most frequent category (the mode). For continuous data there is also an additional method worth testing out.
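As a quick refresher, here is a minimal sketch of those two baseline imputations in pandas. The data frame and the column names (income, city) are made-up placeholders, not from any particular data set:

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values in one numeric and one categorical column
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 48000, np.nan],
    "city":   ["Austin", "Dallas", None, "Austin", "Austin"],
})

# Continuous column: impute with the mean (swap in .median() if the column is skewed)
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical column: impute with the most frequent category (the mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```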
Consider a scenario where we have missing values in a column. We could fit a linear regression model that predicts those missing values from the other columns in the data set. One issue you will encounter is multicollinearity between the imputed column and the independent variables, which becomes problematic when a large percentage of the column is missing. In that case it may be worthwhile to leave that feature out of the predictive model entirely, because its imputed values are just linear combinations of the other features.
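Here is one sketch of that regression-based imputation using scikit-learn's LinearRegression. The columns and numbers are invented for illustration; in practice you would use whatever complete features your data set actually has:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented example: 'income' has gaps; 'age' and 'years_experience' are complete
df = pd.DataFrame({
    "age":              [25, 32, 47, 51, 29, 38],
    "years_experience": [2, 8, 20, 25, 5, 12],
    "income":           [40000, np.nan, 90000, np.nan, 48000, 70000],
})

predictors = ["age", "years_experience"]
observed = df["income"].notna()

# Fit the regression on rows where the column to impute is observed
reg = LinearRegression()
reg.fit(df.loc[observed, predictors], df.loc[observed, "income"])

# Fill the gaps with the model's predictions
df.loc[~observed, "income"] = reg.predict(df.loc[~observed, predictors])
print(df)
```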
The end state is to feed your final model enough complete examples that it can pick up on real variability in the target variable. Once that is done, the imputation model you fit on the training set must be the same model you use to fill in missing values in the validation set before validating; do not refit it on the validation data.
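One way to enforce that train/validation discipline is scikit-learn's IterativeImputer, which fills columns by chaining regressions over the other columns. This is a swapped-in tool rather than the exact model from this post, and the matrices below are made up, but the pattern is the point: fit on training data only, then reuse the fitted object on validation data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up train/validation matrices with missing entries (np.nan)
X_train = np.array([[25.0,  2.0, 40000.0],
                    [32.0,  8.0,  np.nan],
                    [47.0, 20.0, 90000.0],
                    [29.0,  5.0, 48000.0]])
X_valid = np.array([[51.0, np.nan, 85000.0],
                    [38.0, 12.0,   np.nan]])

# Fit the regression-based imputer on the training data only
imputer = IterativeImputer(random_state=0)
X_train_filled = imputer.fit_transform(X_train)

# Reuse the SAME fitted imputer on the validation data -- never refit here
X_valid_filled = imputer.transform(X_valid)
```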
With the option presented in this post, the challenge to the reader is to test, test, test. Find a model or imputation method that reduces your error rate in training and doesn't degrade during validation. You might also find additional value in dummy coding your categories and feeding them into the imputation model. Good luck!
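If you want a concrete starting point for that dummy-coding idea, here is a minimal sketch with pandas get_dummies; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a categorical feature with numeric ones
df = pd.DataFrame({
    "city":   ["Austin", "Dallas", "Austin", "Houston"],
    "age":    [25, 32, 47, 29],
    "income": [40000, np.nan, 90000, 48000],
})

# Dummy code the category so it can enter the imputation regression as predictors
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
design = pd.concat([df[["age"]], dummies], axis=1)

# 'design' can now serve as the predictor matrix for a regression imputer like the one above
print(design)
```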
#logistic
#programming
#streaming
#chaos