Random Forest Stacking Experiment for an Imbalanced Dataset Problem
To address an imbalanced dataset problem, I experimented with Random Forests in the following manner (somewhat inspired by deep learning):
I trained a Random Forest on the input data, then used the predicted class probabilities from that trained model as additional input features to train a second Random Forest.
Pseudo-code for this:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Level 1: train a Random Forest on the raw features
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf_model = RandomForestClassifier()
rf_model.fit(train_X, train_y)
pred = rf_model.predict(test_X)
print('****************** RANDOM FOREST CM ******************')
print(confusion_matrix(test_y, pred))
print('*******************************************************')

# Append the level-1 predicted probabilities to the full feature matrix
predict_prob = rf_model.predict_proba(X)
X['first_level_0'] = predict_prob[:, 0]
X['first_level_1'] = predict_prob[:, 1]

# Level 2: re-split and train a second Random Forest on the
# original features plus the level-1 probabilities
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf_model = RandomForestClassifier()
rf_model.fit(train_X, train_y)
pred = rf_model.predict(test_X)
print('****************** RANDOM FOREST 2 CM ******************')
print(confusion_matrix(test_y, pred))
print('*********************************************************')
With this setup I saw a considerable improvement in recall. Is this approach mathematically sound? My intent was for the second-layer Random Forest to correct the errors made by the first layer, effectively combining the principle of boosting with the Random Forest bagging technique. Looking for thoughts.
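Note that in the snippet above, predict_proba is applied to all of X after the first forest was fit on 80% of it, so after the re-split some second-level test rows carry probabilities from a model that already saw their labels. For comparison, here is a variant of the same two-level idea that builds the second-level features from out-of-fold probabilities instead; this is only a minimal sketch, assuming X is a pandas DataFrame and y a binary target (the fold count and variable names are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, stratify=y)

# Out-of-fold level-1 probabilities for the training rows:
# each row is scored by a forest that did not see it during fitting
oof_prob = cross_val_predict(RandomForestClassifier(), train_X, train_y,
                             cv=5, method='predict_proba')

# Fit the level-1 forest on all training rows, then score the held-out test set
level1 = RandomForestClassifier().fit(train_X, train_y)
test_prob = level1.predict_proba(test_X)

# Level-2 feature matrices: original features plus level-1 probabilities
train_X2 = train_X.copy()
train_X2['first_level_0'] = oof_prob[:, 0]
train_X2['first_level_1'] = oof_prob[:, 1]
test_X2 = test_X.copy()
test_X2['first_level_0'] = test_prob[:, 0]
test_X2['first_level_1'] = test_prob[:, 1]

# Level 2: train on the augmented features and evaluate on held-out data
level2 = RandomForestClassifier().fit(train_X2, train_y)
print(confusion_matrix(test_y, level2.predict(test_X2)))

If recall still improves under this setup, the gain is more likely genuine. This out-of-fold pattern is essentially what scikit-learn's StackingClassifier does internally when passthrough=True is set.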
Topic bagging boosting random-forest scikit-learn
Category Data Science