Should I impute target values?

I am new to data science and I am currently playing around a bit. Data exploration and preparation are really annoying, even though I use pandas.

I managed to impute the missing values in the independent variables. For numerical data I used the Imputer with the mean strategy, and for one categorical variable I used the LabelEncoder and afterwards imputed with the mode (most frequent) strategy.

But now I face the issue that the dependent variable $y$ also contains missing values. Should I delete those rows, or should I impute $y$, which is numerical?

Topic data-imputation preprocessing regression data-cleaning machine-learning

Category Data Science


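Imputing the target variable biases the model, because the model is then trained partly on values produced by the imputation rule rather than on real observations. A small correction: do not use LabelEncoder for predictors. LabelEncoder should be used only for the target variable, and only if it is categorical.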
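As a rough illustration of handling predictors without LabelEncoder, here is a minimal sketch. It assumes a recent scikit-learn (where `SimpleImputer` replaces the older `Imputer` class) and uses `OneHotEncoder` as one common alternative for categorical predictors; the answer above does not name a specific replacement, so that choice is an assumption. The toy data and the column names `age` and `city` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy predictors: one numeric and one categorical, both with gaps.
X = pd.DataFrame({
    "age": pd.Series([25.0, np.nan, 40.0, 31.0]),
    "city": pd.Series(["Paris", "Rome", np.nan, "Paris"], dtype=object),
})

# Numeric column: fill gaps with the column mean.
numeric = SimpleImputer(strategy="mean")

# Categorical column: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

# One imputed numeric column plus one indicator column per city.
X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

Imputing before encoding keeps the missing category from being turned into an arbitrary integer code, which is one of the problems with applying LabelEncoder to predictors.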

Deleting the records that have a missing target value can be your last option. First see whether you could collect more data.


For the missing data problem, one thing to be aware of is the missingness mechanism. Depending on the dataset, the NAs (missing values) you have could be a result of a condition of the phenomenon itself, and in that case you shouldn't simply impute them using the mean.
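A quick, informal way to probe the missingness mechanism is to compare another variable between rows where $y$ is missing and rows where it is observed. The sketch below uses made-up data and a hypothetical column `income`; it is only a first look, not a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which y tends to be missing for high incomes.
df = pd.DataFrame({
    "income": [30.0, 45.0, 52.0, 61.0, 75.0, 88.0],
    "y": [1.2, 1.9, 2.1, np.nan, np.nan, np.nan],
})

# Average income in the rows where y is missing versus observed.
income_when_missing = df.loc[df["y"].isna(), "income"].mean()
income_when_observed = df.loc[df["y"].notna(), "income"].mean()

print(income_when_missing, income_when_observed)
```

A large gap between the two averages suggests that the missingness depends on other variables, so filling the gaps with a single mean would likely distort the data.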

Besides, for the dependent variable: if you want to train a model that uses the independent variables to predict it, let's say Y, you wouldn't train the model on observations where the dependent variable (the target) is NA. So you would drop these rows, or maybe use another technique that takes into account the dependence on the other variables.
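Dropping the rows with a missing target could look roughly like this. It is a sketch under assumptions: the column names `x1`, `x2` and `y`, the toy values, and the choice of a linear regression are all hypothetical, and `SimpleImputer` stands in for the mean imputation of predictors described in the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data with gaps in both the predictors and the target.
df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [2.0, np.nan, 1.0, 3.0, 5.0, 4.0],
    "y": [3.1, 4.0, np.nan, 7.2, np.nan, 10.1],
})

# Keep only rows where the target is observed; set the others aside.
labeled = df.dropna(subset=["y"])
unlabeled = df[df["y"].isna()]

# Impute predictors (not the target) inside the pipeline, then fit.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(labeled[["x1", "x2"]], labeled["y"])

# The rows without a target can still be scored by the fitted model.
predictions = model.predict(unlabeled[["x1", "x2"]])
print(len(labeled), len(unlabeled))
```

The rows set aside in `unlabeled` are not wasted: they are exactly the cases for which the trained model can produce predictions.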

I think a good start is to take a look at this: Missing-data imputation

It shows the limitations of using some approaches like yours and defines the mechanisms of missing data.
