SHAP value analysis gives different feature importances on the training and test sets
Should SHAP value analysis be done on the train or test set?
What does it mean if the feature importance based on mean |SHAP value| differs between the training and test sets of my LightGBM model?
I intend to use SHAP analysis to identify how each feature contributes to each individual prediction, and possibly to flag individual predictions that are anomalous. For instance, if an individual prediction's top positive/negative contributing features are vastly different from the model's overall feature importance, then that prediction may be less trustworthy. Does this approach make sense?
Topic shap lightgbm features predictor-importance
Category Data Science