How to evaluate data imputation techniques

I have a dataset with 29 features, 8 of which have missing values.

I've tried scikit-learn's SimpleImputer with all its strategies, KNNImputer with several values of k, and IterativeImputer with various combinations of imputation order, estimators, and numbers of iterations.
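
For reference, a minimal sketch of that kind of setup (the toy array, the parameter values, and the choice of BayesianRidge as the iterative estimator are illustrative assumptions, not the actual configuration used):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Toy data with missing entries, standing in for the real 29-feature dataset.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])

imputers = {
    "simple_median": SimpleImputer(strategy="median"),
    "knn_2": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(
        estimator=BayesianRidge(), max_iter=10, imputation_order="ascending"
    ),
}

for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X)  # each strategy fills the NaNs differently
```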

My question is: how do I evaluate these imputation techniques and choose the best one for my data?

I can't run a baseline model and evaluate its performance because I'm not familiar with balancing the data and tuning the hyperparameters, and all models are giving poor scores.

That's why I'm looking for another way to evaluate data imputation techniques, something like comparing distributions. I'm very new to this, by the way, so pardon the basic question.

Topic imbalanced-learn data-imputation scikit-learn classification dataset

Category Data Science


Suggestion: you should never select an imputation method purely by optimizing a score unless you know (or at least have a hunch about) WHY the data is missing. Missing data is either missing at random (MAR), missing not at random (MNAR), or missing completely at random (MCAR). Ignoring this may help your optimization, but it will miss the mark when it comes to choosing an appropriate model. There are also statistical tests that can help you decide. The only exception might be a data science competition where the rules allow it, but even then it is not good data science or statistical practice.

Here is a Wikipedia article to help you get started.

https://en.wikipedia.org/wiki/Missing_data#Types
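
As a rough, hedged sketch of one such check: for each feature with missing values, look at whether its "is missing" indicator is associated with the other observed features; a clear association suggests the data is not MCAR. The file name is a placeholder, and simple correlations are only an informal screen, not a formal test such as Little's MCAR test.

```python
import pandas as pd

df = pd.read_csv("your_data.csv")  # placeholder path for your own dataset
num = df.select_dtypes(include="number")

for col in num.columns[num.isna().any()]:
    missing_flag = num[col].isna().astype(int)
    # Correlation between "this column is missing" and every other numeric column.
    assoc = num.drop(columns=[col]).corrwith(missing_flag)
    print(col, assoc.abs().sort_values(ascending=False).head(3))
```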


First, there is nothing wrong with asking such a question. Second, the most straightforward way to select an optimal preprocessing step (whether it is imputation or something else) is to use a validation set. Split your dataset into three parts: training (used to fit the model, i.e. estimate its parameters, e.g. the weights of a linear regression), validation (used to compare different models, e.g. one with one imputation strategy and another with a different one), and test (kept aside as a final check that you did not mess up somewhere miserably).

If you get drastically different results on the validation and test sets, you have most likely overfit.

If you are interested in selecting an optimal imputation technique, your candidate models should differ only in this particular step (everything else should be the same). In that case, the model with the best validation score is the one with the optimal imputation.
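
Here is a minimal sketch of that comparison, assuming a synthetic classification dataset and a plain logistic regression as the shared downstream model (both are placeholders for your own data and model); only the imputation step changes between pipelines:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the real data, with missing values injected.
X, y = make_classification(n_samples=1000, n_features=29, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "median": SimpleImputer(strategy="median"),
    "knn_5": KNNImputer(n_neighbors=5),
}

for name, imputer in candidates.items():
    # Pipelines are identical except for the imputation step.
    pipe = Pipeline([("impute", imputer), ("clf", LogisticRegression(max_iter=1000))])
    pipe.fit(X_train, y_train)
    score = roc_auc_score(y_val, pipe.predict_proba(X_val)[:, 1])
    print(f"{name}: validation ROC AUC = {score:.3f}")

# Pick the imputer with the best validation score, then confirm once on the test set.
```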

PS: In most real settings you do not want to use any imputation at all; instead, you want to encode the fact that a value is missing (most of the advanced implementations, e.g. XGBoost, do this job for you).
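
As a hedged illustration of that point: scikit-learn's HistGradientBoostingClassifier accepts NaN inputs directly (similar in spirit to XGBoost's native handling), and SimpleImputer with add_indicator=True keeps the missingness information as extra indicator columns. The toy data below is made up.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# Option 1: a tree-based model that handles NaN natively, no imputation needed.
clf = HistGradientBoostingClassifier().fit(X, y)

# Option 2: impute, but keep "was missing" as extra binary indicator columns.
X_with_flags = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)
```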
