Scikit-learn's implementation of AdaBoost

I am trying to implement the AdaBoost algorithm in pure Python (or using NumPy if necessary).

I loop over all weak classifiers (decision stumps in this case), then over all features, and then over all possible values of each feature to see which split divides the dataset best. This is my code:

for _ in range(self.n_classifiers):
    classifier = BaseClassifier()
    min_error = np.inf

    # greedy search to find best threshold and feature
    for feature_i in range(n_features):
        thresholds = np.unique(X[:, feature_i])

        for threshold in thresholds:
            # stump prediction: +1 if the feature value is >= threshold, else -1
            predictions = np.ones(len(y))
            predictions[X[:, feature_i] < threshold] = -1

            # weighted error of the stump under the current sample weights
            error = sum(w[y != predictions])
            if error < min_error:
                min_error = error

The first two loops are not a problem at all, since we normally have at most a few tens of classifiers and features. But the third loop makes the code very inefficient.
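To make the inner loop concrete, here is a rough NumPy sketch of what the threshold search computes for a single feature, with the stump predictions written out via broadcasting (X, y in {-1, +1}, w, and feature_i are assumed to exist already; this is just an illustration of the work done per feature, not the code I actually run):

import numpy as np

feature = X[:, feature_i]            # shape (n_samples,)
thresholds = np.unique(feature)      # shape (n_thresholds,)

# predictions[t, i] = stump prediction for sample i at threshold t
predictions = np.where(feature[None, :] >= thresholds[:, None], 1, -1)

# weighted error of every candidate threshold at once
errors = ((predictions != y[None, :]) * w[None, :]).sum(axis=1)

best = np.argmin(errors)
best_threshold, best_error = thresholds[best], errors[best]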

One way to solve this is to not insist on the best weak classifier and instead accept the first one that performs slightly better than a random classifier (as suggested in Boosting: Foundations and Algorithms by Robert E. Schapire and Yoav Freund, p. 6):

for _ in range(self.n_classifiers):
    classifier = BaseClassifier()
    min_error = np.inf

    # greedy search to find best threshold and feature
    for feature_i in range(n_features):
        thresholds = np.unique(X[:, feature_i])

        for threshold in thresholds:
            # stump prediction: +1 if the feature value is >= threshold, else -1
            predictions = np.ones(len(y))
            predictions[X[:, feature_i] < threshold] = -1

            # accept the first stump whose weighted error beats random by gamma
            error = sum(w[y != predictions])
            if error < 0.5 - gamma:
                min_error = error
                break

But in this case, the accuracy of my model is lower than Scikit-learn's, and the running time is still about three times as long.

I tried to see how Scikit-learn implements AdaBoost, but the code was not clear to me. I would really appreciate any comments.

Topic adaboost implementation decision-trees scikit-learn python

Category Data Science


The sklearn implementation of AdaBoost takes the base learner as an input parameter, with a decision tree as the default, so it cannot modify the tree-learning algorithm to short-circuit at a "good-enough" split; it will search all possible splits. It manages to be fast at that because the tree learning is done in Cython.
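For reference, a stump-based AdaBoost in sklearn looks roughly like this (the base-learner parameter is called estimator in recent versions and base_estimator in older ones, and the dataset here is only a placeholder):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Depth-1 trees are decision stumps; the split search is exhaustive,
# but it runs inside DecisionTreeClassifier's Cython code.
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # base_estimator= in older sklearn versions
    n_estimators=50,
)
model.fit(X, y)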

Another option for improved speed, if you want to stay in pure Python: do histogram splitting, as pioneered by LightGBM and now incorporated into XGBoost and sklearn's HistGradientBoosting models.
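A rough pure-NumPy sketch of the binning idea (not what LightGBM or sklearn literally do; the bin count, the quantile binning, and the function name are my own, and only one stump orientation is scored to keep it short):

import numpy as np

def best_stump_histogram(X, y, w, n_bins=32):
    # y is assumed to be in {-1, +1}; returns (feature, threshold, weighted error)
    n_features = X.shape[1]
    best = (None, None, np.inf)

    for feature_i in range(n_features):
        col = X[:, feature_i]
        # quantile bin edges: n_bins candidate thresholds instead of one per unique value
        edges = np.quantile(col, np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, n_bins - 1)

        # per-bin weighted mass of positive and negative samples
        w_pos = np.bincount(bins, weights=w * (y == 1), minlength=n_bins)
        w_neg = np.bincount(bins, weights=w * (y == -1), minlength=n_bins)

        # split after bin b: predict -1 for bins <= b and +1 for bins > b, so the
        # error is the positive mass on the left plus the negative mass on the right
        errors = np.cumsum(w_pos) + (w_neg.sum() - np.cumsum(w_neg))
        b = int(np.argmin(errors))
        if errors[b] < best[2]:
            best = (feature_i, edges[b + 1], errors[b])

    return best

The per-threshold loop collapses into a few array operations per feature, which is where most of the speedup of histogram-based splitting comes from.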
