XGBoost Feature Importance, Permutation Importance, and Model Evaluation Criteria
I have built an XGBoost classification model in Python on an imbalanced dataset (~1 million positive examples and ~12 million negative examples), where the features are binary user interactions with web page elements (e.g. did the user scroll to the reviews or not) and the target is a binary retail action. My ultimate goal is not so much to achieve optimal decision-rule performance as to understand which user actions/features are important in determining the positive retail action.
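For reference, my setup looks roughly like the sketch below. The data and variable names are placeholders rather than my actual code, and I'm assuming xgboost >= 1.6 for the constructor-level `eval_metric`:

```python
# Minimal sketch of the setup (placeholder synthetic data, not my real features)
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# X: binary interaction features (e.g. "scrolled to reviews"), y: binary retail action
X = np.random.randint(0, 2, size=(100_000, 10))
y = np.random.binomial(1, 0.08, size=100_000)  # roughly 1:12 positive:negative

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = xgb.XGBClassifier(
    eval_metric="logloss",
    # scale_pos_weight ~ n_negative / n_positive is a common choice for
    # imbalance, though it distorts the predicted probabilities
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
)
model.fit(X_train, y_train)
```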
Now, I have read quite a bit in forums and the literature about evaluating/optimizing an XGBoost model and the subsequent decision rule, which I assume is required before I can achieve my ultimate goal. There seem to be many ways to evaluate the decision rule (e.g. area under the precision-recall curve, AUROC) and the model itself (e.g. log loss). I believe that both AUC and log loss are insensitive to class balance, so I don't think that is a concern. However, I am not sure which evaluation method is most appropriate for my ultimate goal, and I would appreciate guidance from someone with more experience in these matters.
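Concretely, these are the metrics I've been comparing on the held-out set (continuing from the sketch above):

```python
from sklearn.metrics import roc_auc_score, average_precision_score, log_loss

# Predicted probability of the positive retail action
proba = model.predict_proba(X_test)[:, 1]

print("AUROC:              ", roc_auc_score(y_test, proba))
print("Avg precision (PR): ", average_precision_score(y_test, proba))  # ~ AUPRC
print("Log loss:           ", log_loss(y_test, proba))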
Edit: I also tried permutation importance on my XGBoost model, as suggested in an answer, and saw results quite similar to XGBoost's native feature importance. Should I now trust the permutation importance, or should I first optimize the model by some evaluation criterion and then use XGBoost's native feature importance or permutation importance? In other words, do I need a reasonable model by some evaluation criterion before I can trust either feature importance or permutation importance?
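For context, this is roughly how I computed the permutation importance and compared it to the native importance (again a sketch continuing from the code above; scoring by AUROC is just one choice):

```python
import numpy as np
from sklearn.inspection import permutation_importance

# Permutation importance on held-out data, scored by AUROC
perm = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
native = model.feature_importances_  # gain-based by default for tree boosters

# Side-by-side comparison, sorted by permutation importance
for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {i}: perm={perm.importances_mean[i]:.4f}, "
          f"native={native[i]:.4f}")
```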
Tags: predictor-importance, xgboost, evaluation, classification
Category: Data Science