Random Forest Stacking Experiment for an Imbalanced Dataset Problem
To address an imbalanced dataset problem, I experimented with Random Forests in the following manner (somewhat inspired by deep learning):
I trained a Random Forest on the input data, then used the predicted class probabilities from that trained model as additional input features to train a second Random Forest.
Pseudo-code for this:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Level 1: train a Random Forest on the raw features
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf_model = RandomForestClassifier()
rf_model.fit(train_X, train_y)
pred = rf_model.predict(test_X)
print('****************** RANDOM FOREST CM ******************')
print(confusion_matrix(test_y, pred))
print('*******************************************************')

# Append the level-1 predicted probabilities to the full feature matrix
predict_prob = rf_model.predict_proba(X)
X['first_level_0'] = predict_prob[:, 0]
X['first_level_1'] = predict_prob[:, 1]

# Level 2: re-split and train a second Random Forest on the
# original features plus the level-1 probabilities
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf_model = RandomForestClassifier()
rf_model.fit(train_X, train_y)
pred = rf_model.predict(test_X)
print('****************** RANDOM FOREST 2 CM ******************')
print(confusion_matrix(test_y, pred))
print('*********************************************************')
With this setup I saw a considerable improvement in recall. Is this approach mathematically sound? My intent was for the second-layer Random Forest to correct the errors made by the first layer, effectively combining the principle of boosting with the Random Forest bagging technique. Looking for thoughts.
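Note that in the snippet above, predict_proba is applied to all of X after the first forest was fit on 80% of it, so after the re-split some second-level test rows carry probabilities from a model that already saw their labels. For comparison, here is a variant of the same two-level idea that builds the second-level features from out-of-fold probabilities instead; this is only a minimal sketch, assuming X is a pandas DataFrame and y a binary target (the fold count and variable names are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, stratify=y)

# Out-of-fold level-1 probabilities for the training rows:
# each row is scored by a forest that did not see it during fitting
oof_prob = cross_val_predict(RandomForestClassifier(), train_X, train_y,
                             cv=5, method='predict_proba')

# Fit the level-1 forest on all training rows, then score the held-out test set
level1 = RandomForestClassifier().fit(train_X, train_y)
test_prob = level1.predict_proba(test_X)

# Level-2 feature matrices: original features plus level-1 probabilities
train_X2 = train_X.copy()
train_X2['first_level_0'] = oof_prob[:, 0]
train_X2['first_level_1'] = oof_prob[:, 1]
test_X2 = test_X.copy()
test_X2['first_level_0'] = test_prob[:, 0]
test_X2['first_level_1'] = test_prob[:, 1]

# Level 2: train on the augmented features and evaluate on held-out data
level2 = RandomForestClassifier().fit(train_X2, train_y)
print(confusion_matrix(test_y, level2.predict(test_X2)))

If recall still improves under this setup, the gain is more likely genuine. This out-of-fold pattern is essentially what scikit-learn's StackingClassifier does internally when passthrough=True is set.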
Topic bagging boosting random-forest scikit-learn
Category Data Science