How do I decide between XGBoost with imputation and XGBoost with the missing values left in place?

I have a large genetic dataset on which I am using XGBoost to score the genes most likely to cause disease, giving each gene a likelihood score between 0 and 1.

I try to avoid features with a lot of missing data, but this can be hard with genetic data. The highest missingness I have for any feature is roughly half of the values in that column.

Currently I run my XGBoost model in two versions: one with random forest imputation of the missing values, and one without imputation, where XGBoost handles the missing data directly. On nested cross-validation, the imputed model reaches an $R^2$ of 0.7 and the model with missing values reaches 0.8.

My question is: how do I choose which version to take forward for further work? Can I trust that the higher $R^2$ of 0.8 with missing data arises because XGBoost is finding useful patterns in the missingness? Are there rules around missing data I should be trying to abide by? I have a biology background, so I'm unsure what best practice is from a data science perspective. Most resources I've found online conclude that this is a case-by-case problem, which I find hard to translate into what I should specifically be looking into. Any help would be appreciated.

Topic missing-data xgboost bioinformatics regression machine-learning

Category Data Science


The first question about missing data is always: why is it missing?

Have you checked, or do you know, why the data is missing and whether it is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)?

If your data is MCAR, imputation is generally fine, and your lower test metric might simply indicate a suboptimal imputation strategy. In that case you could try MICE or a similar approach that is more advanced than simple median imputation.
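As a rough illustration of a MICE-style approach, here is a minimal sketch using scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the other features. The data is synthetic and the 20% missingness rate and `max_iter=10` are arbitrary assumptions, not values from the question. In real use the imputer should be fitted on the training folds only, otherwise information leaks into the validation data.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic feature matrix (assumption): column 3 is strongly related to column 0,
# so the imputer has something to learn from.
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)

# Remove roughly 20% of the entries completely at random (MCAR).
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# Each feature with missing entries is regressed on the others, round-robin.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

print("NaNs before:", int(np.isnan(X_missing).sum()))
print("NaNs after: ", int(np.isnan(X_imputed).sum()))
```

Observed values are left untouched; only the missing entries are filled in.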

However, the fact that not imputing actually improves your prediction might indicate that your data is not missing completely at random. In that case, explicitly encoding the missing values might improve your performance and would therefore be the best course of action.

As described above, besides simply looking at the performance metric on a properly validated test set, also try to understand why the data is missing and what that implies.
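One quick, informal way to start probing this is to check whether the missingness indicator of each feature is itself associated with the target. The sketch below uses synthetic data (an assumption for illustration): feature 0 is more likely to be missing when the target is high, while feature 1 is missing completely at random. A clear correlation between "is missing" and the target suggests the missingness carries signal, although it does not by itself distinguish MAR from MNAR.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic target and two features (assumption, for illustration only).
y = rng.normal(size=n)
X = rng.normal(size=(n, 2))

# Feature 0: probability of being missing rises with the target (informative).
p_missing = 1.0 / (1.0 + np.exp(-2.0 * y))
X[rng.random(n) < p_missing, 0] = np.nan

# Feature 1: missing completely at random in about half of the rows.
X[rng.random(n) < 0.5, 1] = np.nan

# Correlate each 0/1 missingness indicator with the target.
is_missing = np.isnan(X)
corr_with_target = np.array([
    np.corrcoef(is_missing[:, j].astype(float), y)[0, 1]
    for j in range(X.shape[1])
])

for j, r in enumerate(corr_with_target):
    print(f"feature {j}: missing rate={is_missing[:, j].mean():.2f}, "
          f"corr(missing, y)={r:+.2f}")
```

On real data, the same check can be run per feature, and any feature with a clear association is a candidate for domain follow-up on why it is missing.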


First, define a metric that suits the problem: $R^2$ in your case.

Set up correct cross-validation and train/test splits.

Then use the cross-validation results to choose whichever option works best for your model (imputing the missing values, or XGBoost without imputation). This way you are running an empirical experiment and selecting the best result.

You will probably want to look at scikit-learn's Pipeline to do that.
