How to use a set of pre-defined classifiers in Adaboost?

Suppose there are some classifiers as follows:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from xgboost import XGBClassifier

dt = DecisionTreeClassifier(max_depth=DT_max_depth, random_state=0)
rf = RandomForestClassifier(n_estimators=RF_n_est, random_state=0)
xgb = XGBClassifier(n_estimators=XGB_n_est, random_state=0)
knn = KNeighborsClassifier(n_neighbors=KNN_n_neigh)
svm1 = svm.SVC(kernel='linear')
svm2 = svm.SVC(kernel='rbf')
lr = LogisticRegression(random_state=0, penalty=LR_penalty, solver='saga')

In AdaBoost, I can define a base_estimator and also the number of estimators. However, I want to use these 7 classifiers, i.e. n_estimators=7 with the estimators being the ones above. How can I define such a model?

Tags: adaboost, scikit-learn, machine-learning



In practice, we never use any of the algorithms you list as base classifiers for AdaBoost, except for decision trees.

AdaBoost (and similar ensemble methods) was conceived with decision trees (DTs) as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is a good reason why, still today, if you don't explicitly specify the base_estimator argument in scikit-learn's AdaBoost implementation, it defaults to DecisionTreeClassifier(max_depth=1) (docs).
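To make that concrete, here is a minimal sketch (assuming a scikit-learn version that still uses the base_estimator parameter name; newer releases have renamed it to estimator) showing that spelling out the default is equivalent to relying on it:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Relying on the default base estimator (a depth-1 decision stump)...
ada_default = AdaBoostClassifier(n_estimators=50, random_state=0)

# ...is the same as spelling it out explicitly.
ada_explicit = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=0,
)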

DTs are suitable for such ensembling because they are essentially unstable classifiers (this is also the reason they succeed as base classifiers in Random Forests, while you have never heard of "Random kNNs" or "Random SVMs"); this is not the case with SVMs, kNN, or linear models, let alone models that are themselves ensembles, like Random Forests and boosted trees (XGBoost). Notice the following remark in the seminal paper on Bagging Predictors by the legendary statistician (and RF inventor) Leo Breiman:

Unstability was studied in Breiman [1994] where it was pointed out that neural nets, classification and regression trees, and subset selection in linear regression were unstable, while k-nearest neighbor methods were stable.

None of these algorithms (except decision trees) is expected to offer much when used as a base classifier for AdaBoost (something you seem to have already discovered yourself, judging by the comments under the other answer). Attempting to use them simply because the framework (here, scikit-learn) superficially allows it is not a good reason to do so.
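If you want to check this empirically, a rough sketch follows (hypothetical toy data from make_classification; again assuming the older base_estimator parameter name, and probability=True on the SVC so it exposes predict_proba) comparing the default stump-based AdaBoost against one built on an SVC base classifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Default AdaBoost: decision stumps as base classifiers
ada_stumps = AdaBoostClassifier(n_estimators=50, random_state=0)

# AdaBoost on top of an (already stable) SVC base classifier
ada_svc = AdaBoostClassifier(
    base_estimator=SVC(kernel='rbf', probability=True),
    n_estimators=10,
    random_state=0,
)

print(cross_val_score(ada_stumps, X, y, cv=5).mean())
print(cross_val_score(ada_svc, X, y, cv=5).mean())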

See also the related Stack Overflow threads:


One possible solution is using a Stacking classifier as follows:

from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
# Assuming you have already imported all the other models you are using

dt = DecisionTreeClassifier(max_depth=DT_max_depth, random_state=0)
rf = RandomForestClassifier(n_estimators=RF_n_est, random_state=0)
xgb = XGBClassifier(n_estimators=XGB_n_est, random_state=0)
knn = KNeighborsClassifier(n_neighbors=KNN_n_neigh)
svm1 = svm.SVC(kernel='linear')
svm2 = svm.SVC(kernel='rbf')
lr = LogisticRegression(random_state=0, penalty=LR_penalty, solver='saga')

estimators = [
    ('rf', rf),
    ('svm1', svm1), ('svm2', svm2), ('xgb', xgb), ('knn', knn), ('lr', lr), ('dt', dt)
]

Option 1:

stacker = AdaBoostClassifier()
model = StackingClassifier(
    estimators=estimators, final_estimator=stacker
)

Option 2:

base_model = StackingClassifier(
    estimators=estimators
)
model = AdaBoostClassifier(base_estimator=base_model)

Option 2 will without question be the more expensive of the two, and, as far as I understand, it is also the one that better matches what you are looking for.
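Either way, the resulting model is used like any other scikit-learn estimator. A hypothetical usage sketch, assuming you already have your feature matrix X and labels y:

from sklearn.model_selection import train_test_split

# X, y are assumed to be your own feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model.fit(X_train, y_train)          # fit the stacked/boosted ensemble
print(model.score(X_test, y_test))   # accuracy on the held-out split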

Strategy: (diagram omitted)
