How to print feature names in conjunction with feature importance using the imbalanced-learn library?

I used BalancedBaggingClassifier from the imblearn library for an imbalanced classification task. How can I get the feature importances of the estimator in conjunction with the feature names, especially when max_features is less than the total number of features? For example, in the following code the total number of features is 20 but max_features is 8.

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedBaggingClassifier
from xgboost.sklearn import XGBClassifier

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0, n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))

ln = X.shape
names = ["x%s" % i for i in range(1, ln[1] + 1)]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
bbc = BalancedBaggingClassifier(base_estimator=XGBClassifier(n_estimators=10, min_child_weight=5, max_depth=3, learning_rate=0.02, colsample_bytree=1, subsample=1, scale_pos_weight=.26), ratio='all', random_state=0, max_features=8)
bbc.fit(X_train,y_train)
for estimator in bbc.estimators_:
    print(sorted(zip(map(lambda x: round(x, 4), estimator.steps[1][1].feature_importances_),names), reverse=True))

I think there is a problem with the above code, because the printed features are always named x1 to x8, while, for example, feature x19 may be among the most important features.

Thanks.

Topic imbalanced-learn xgboost scikit-learn classification python

Category Data Science


The issue is that you are not inspecting the feature importances properly.

# each estimator in the ensemble is a pipeline; the classifier is the last step
for x in bbc.estimators_:
    print(x.named_steps['classifier'].feature_importances_)

[ 0.  0.  0.  0.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  1.  0.  0.]
[ 0.  1.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  1.  0.  0.  0.  0.  0.]
[ 1.  0.  0.  0.  0.  0.  0.  0.]
[ 0.   0.   0.   0.5  0.   0.   0.   0.5]
[ 0.  0.  1.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  1.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  1.  0.  0.  0.]
[ 0.  0.  1.  0.  0.  0.  0.  0.]

As you can see, each XGBClassifier reports feature importances for only 8 features, as defined by the max_features parameter of the BalancedBaggingClassifier. Those 8 features presented to each XGBClassifier are in fact randomly selected for each estimator of the ensemble. So you first have to find which features were subsampled for each XGBClassifier (stored in estimators_features_) and then map the importances back to the original feature names.

import numpy as np

names = np.array(names)
# estimators_features_ holds the column indices given to each estimator,
# so we can map the non-zero importances back to the original feature names
for x, feat_sel in zip(bbc.estimators_, bbc.estimators_features_):
    feat_imp = np.nonzero(x.named_steps['classifier'].feature_importances_)
    print(names[feat_sel[feat_imp]])

['x11']
['x11']
['x7']
['x11']
['x11']
['x16' 'x3']
['x7']
['x6']
['x7']
['x7']
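
If you want a single ranking over all 20 original features rather than a per-estimator list, a minimal sketch (assuming the fitted bbc, X and names from above) is to scatter each sub-estimator's importances back onto the full feature vector via estimators_features_ and average them over the ensemble:

import numpy as np

# accumulate each estimator's importances on the original columns,
# then average over the ensemble
total_imp = np.zeros(X.shape[1])
for est, feat_sel in zip(bbc.estimators_, bbc.estimators_features_):
    total_imp[feat_sel] += est.named_steps['classifier'].feature_importances_
total_imp /= len(bbc.estimators_)

# print features from most to least important
for imp, name in sorted(zip(total_imp, names), reverse=True):
    print('%s: %.4f' % (name, imp))

Features that were never selected by any estimator simply keep an importance of 0.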
