Why is XGBClassifier in Python outputting different feature importance values with the same data across different repetitions?

I am fitting an XGBClassifier to a small dataset (32 subjects) and find that if I loop through the code 10 times, the feature importances (gain) assigned to the features in the model vary slightly.

I am using the same hyperparameter values on each iteration, and have subsample and colsample_bytree set to their default of 1 to prevent any random variation between executions. I am using the scikit-learn-style feature_importances_ attribute to extract the values from the fitted model.

Any ideas as to why this variation in feature importance could be occurring? Does this mean that some of my features may be correlated, and is there a way to ensure that XGBoost outputs the same importance values each time it is called? Note that the predictions and the predicted probabilities are constant across iterations: it is only the feature importances that vary.

Thanks in advance!

Topic feature-importances xgboost feature-selection python machine-learning

Category Data Science


According to the XGBClassifier parameters (link), several operations involve randomness, such as subsample, feature_selector, etc.

If no seed is set for the random number generator, different random values are drawn on each run and you get slightly different results (no abrupt change is expected, though).

So, to reproduce the same result, it is best practice to set the seed (random_state) parameter on the XGBoost classifier.

Most scikit-learn classes have a random_state parameter for the same purpose.


Look at the 'max_features' parameter of the GradientBoostingClassifier in scikit-learn, here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier

The algorithm fits successive trees to account for the residual information left after each stage, and when max_features is smaller than the number of features, the subset of features considered at each split is chosen at random. This randomness can shift how importance is distributed among correlated features from run to run.

Scikit-learn does say that this can be made deterministic by setting random_state.
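A short sketch of the scikit-learn analogue (toy data is illustrative only): with max_features smaller than the feature count, split selection is stochastic, but fixing random_state makes it deterministic:

```python
# Sketch: GradientBoostingClassifier with a random feature subset per
# split; pinning random_state makes repeated fits reproducible.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))
y = rng.integers(0, 2, size=32)

def importances(seed):
    clf = GradientBoostingClassifier(
        n_estimators=50,
        max_features=2,      # random subset of 2 features per split
        random_state=seed,   # pins the subsampling RNG
    )
    clf.fit(X, y)
    return clf.feature_importances_

print(np.allclose(importances(7), importances(7)))
```

The same pattern applies to any scikit-learn estimator that exposes random_state.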
