Drastic shift in feature importance upon adding other features

I have a GBDT model in which feature X is not important on its own (according to SHAP), but when I add other features alongside X, feature X suddenly becomes the second most important!

What could explain that? How do I investigate what is going on?

Topic: predictor-importance, feature-engineering, xgboost, feature-selection

Category: Data Science


There's a good chance this is a sign of overfitting: the fact that the feature importances are not stable can be taken as an indication that the model itself is not stable, which typically happens when there isn't enough information in the data for the model to decide reliably how to use the features. As a result, minor variations in the features or data cause the model to change a lot, because it picks up on features whose apparent usefulness occurs by chance. One way to investigate is to reduce the number of features: if the model becomes more stable this way, that confirms overfitting (and performance on the test set should stay the same or improve).
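One rough way to quantify that instability is to retrain the model on several resamples of the training data and compare the SHAP importances it assigns each time; if the ranking of features jumps around between resamples, the importances are not trustworthy. A minimal sketch, assuming a pandas DataFrame `X` and target `y`, a regression setup, and the `xgboost` and `shap` packages installed (swap in the estimator and parameters you actually use):

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb
from sklearn.model_selection import KFold

def shap_importance_per_fold(X: pd.DataFrame, y: pd.Series, n_splits: int = 5) -> pd.DataFrame:
    """Train one model per fold and return the mean |SHAP| value per feature for each fold."""
    rows = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fold, (train_idx, _) in enumerate(kf.split(X)):
        model = xgb.XGBRegressor(n_estimators=200, max_depth=4, random_state=0)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        shap_values = shap.TreeExplainer(model).shap_values(X.iloc[train_idx])
        rows.append(pd.Series(np.abs(shap_values).mean(axis=0),
                              index=X.columns, name=f"fold_{fold}"))
    return pd.DataFrame(rows)

# importances = shap_importance_per_fold(X, y)
# print(importances.rank(axis=1, ascending=False))  # large rank swings across folds suggest instability
```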

[edited] There is also the possibility that the new features are genuinely useful to the model, causing it to use the whole feature set in a very different way because it can now exploit new combinations of features. For instance, suppose a model predicts a disease from features representing the patient's symptoms, and we then add features for the patient's age and gender. If having a particular symptom at a particular age is a strong indicator of the disease, that symptom feature would not be very useful on its own but would become much more important once the age feature is available. In such a case I would expect a significant improvement in performance on the test set when the new features are added.
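To see that interaction effect concretely, here is a hypothetical, self-contained toy example (the feature names, age threshold, and model parameters are made up for illustration): a synthetic "symptom" feature whose effect on the outcome only kicks in above a certain age, so its SHAP importance grows once "age" is added.

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n = 5000
symptom = rng.integers(0, 2, n)   # binary symptom indicator
age = rng.integers(20, 80, n)     # patient age in years
# The outcome is driven by the *combination*: symptom present AND age above 60
y = pd.Series(((symptom == 1) & (age > 60)).astype(float) + 0.1 * rng.normal(size=n))

def mean_abs_shap(X: pd.DataFrame) -> pd.Series:
    """Fit a small GBDT and return the mean |SHAP| value per feature."""
    model = xgb.XGBRegressor(n_estimators=200, max_depth=3, random_state=0).fit(X, y)
    sv = shap.TreeExplainer(model).shap_values(X)
    return pd.Series(np.abs(sv).mean(axis=0), index=X.columns)

print(mean_abs_shap(pd.DataFrame({"symptom": symptom})))              # symptom alone: modest importance
print(mean_abs_shap(pd.DataFrame({"symptom": symptom, "age": age})))  # with age added, symptom matters more
```

In this setup the symptom's mean |SHAP| value typically increases once age is available, because the model can now split on the combination that actually drives the outcome, which mirrors the behaviour described in the question.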
