Missing value imputation in a dataset

I have two separate files for Testing and Training.

In the training data, I am dropping rows that contain too many missing values.

But in the test data, I cannot afford to drop rows, so I have chosen to impute the missing values using a KNN approach.

My question is: to impute missing values in the test data using KNN, is it enough to consider only the test data? That is, neighbors from the test data alone?
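For concreteness, here is a minimal sketch of what I mean, assuming scikit-learn's KNNImputer and a made-up test matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up test-set features with missing entries
X_test = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 5.0, 4.0],
])

# Fit the imputer on the test data alone -- i.e. neighbours are searched
# only among the test rows, which is the idea I am asking about
imputer = KNNImputer(n_neighbors=2)
X_test_imputed = imputer.fit_transform(X_test)
print(X_test_imputed)
```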

Topic k-nn data-imputation data-cleaning machine-learning

Category Data Science


I agree with the previous answer that you could use models that handle missing values.

But if you are committed to a particular model and NaNs are not handled by that model, you are forced to impute the data. kNN may not be the best way to impute; at least it is not a common way. Instead, you could use a simple neural net to predict the missing values. Alternatively, a mean based on similar groups can often do the trick (see, e.g., https://www.kaggle.com/c/titanic/discussion/157929 - "Missing Ages on the Titanic - Few perspectives from basic to the advanced" - for some more advanced strategies specific to the Titanic scenario).
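As a rough sketch of the group-mean idea, with made-up Titanic-style columns (Pclass and Age here are just for illustration):

```python
import pandas as pd

# Toy Titanic-style frame with made-up values: fill missing Age with the
# mean Age of passengers in the same Pclass group
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age":    [38.0, None, 30.0, None, 22.0, 24.0],
})

df["Age"] = df["Age"].fillna(df.groupby("Pclass")["Age"].transform("mean"))
print(df)
```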

If you are attempting a Kaggle competition, it is accepted practice to mix the train and test data when imputing values. However, if it is a non-competition application, I would not advise doing so, as it can introduce data leakage.
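A small sketch of the two fitting strategies, using scikit-learn's SimpleImputer on made-up data (the column name is arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})
test = pd.DataFrame({"x": [np.nan, 3.0]})

# Competition-style: fit the imputer on train + test combined
combined = pd.concat([train, test], ignore_index=True)
test_mixed = SimpleImputer(strategy="mean").fit(combined).transform(test)

# Leak-free: fit on the training data only, then apply to the test data
test_clean = SimpleImputer(strategy="mean").fit(train).transform(test)

print(test_mixed.ravel(), test_clean.ravel())
```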


As a general rule of thumb, you should avoid doing different things to your train and test datasets. As a second rule of thumb, you rarely want to use kNN for missing value imputation.

One efficient way to deal with missing values in your case would be to use a model that can handle them natively, like a tree-based model (decision tree, random forest, xgboost...).
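As a sketch of that approach, here is scikit-learn's HistGradientBoostingClassifier, which, like xgboost, accepts NaN inputs directly (the data below is made up):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Made-up features with missing entries; no imputation step is needed
X = np.array([
    [1.0, np.nan],
    [2.0, 3.0],
    [np.nan, 4.0],
    [5.0, 6.0],
])
y = np.array([0, 0, 1, 1])

# Histogram-based gradient boosting routes NaN values natively during
# tree splits, so the missing entries can stay in the data
model = HistGradientBoostingClassifier().fit(X, y)
print(model.predict(np.array([[np.nan, 5.0]])))
```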
