Scikit-learn's implementation of AdaBoost
I am trying to implement the AdaBoost algorithm in pure Python (or using NumPy if necessary).
I loop over all weak classifiers (decision stumps in this case), then over all features, and then over all possible values of each feature to find the split that divides the dataset best. This is my code:
for _ in range(self.n_classifiers):
    classifier = BaseClassifier()
    min_error = np.inf
    # greedy search to find the best feature and threshold
    for feature_i in range(n_features):
        thresholds = np.unique(X[:, feature_i])
        for threshold in thresholds:
            # stump prediction for this split (one polarity, as a sketch)
            predictions = np.where(X[:, feature_i] < threshold, -1, 1)
            # weighted misclassification error of the stump
            error = np.sum(w[y != predictions])
            if error < min_error:
                min_error = error
The first two loops are not a problem, since we normally have at most a few tens of classifiers and features. But the third loop runs over every unique value of each feature, which makes the code very inefficient.
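One way I could keep the exact greedy search but remove the Python-level threshold loop is to score all thresholds of a feature at once with NumPy broadcasting. Here is a minimal sketch, assuming labels in {-1, +1} and the same stump polarity as above (the function name and array layout are illustrative):

import numpy as np

def stump_errors(x, y, w):
    # x: one feature column (n,), y: labels in {-1, +1} (n,),
    # w: current sample weights (n,); all names are illustrative
    thresholds = np.unique(x)                                        # (t,)
    # row i holds the stump predictions for thresholds[i]
    predictions = np.where(x[None, :] < thresholds[:, None], -1, 1)  # (t, n)
    # weighted error of every threshold in one vectorized pass
    errors = (w[None, :] * (predictions != y[None, :])).sum(axis=1)  # (t,)
    best = np.argmin(errors)
    return thresholds[best], errors[best]

This removes the inner loop, but it allocates a (t, n) temporary array per feature, so the memory cost grows with the number of unique values.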
One way to avoid this is to not insist on the best weak classifier and instead accept any stump that performs slightly better than random guessing (as suggested in Boosting: Foundations and Algorithms by Robert E. Schapire and Yoav Freund, p. 6):
for _ in range(self.n_classifiers):
    classifier = BaseClassifier()
    min_error = np.inf
    found = False
    # greedy search, but accept the first stump whose weighted
    # error beats random guessing by a margin gamma
    for feature_i in range(n_features):
        thresholds = np.unique(X[:, feature_i])
        for threshold in thresholds:
            predictions = np.where(X[:, feature_i] < threshold, -1, 1)
            error = np.sum(w[y != predictions])
            if error < 0.5 - gamma:
                min_error = error
                found = True
                break
        if found:
            break
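Here $\gamma > 0$ is the weak learner's edge: the weak-learning assumption in Schapire and Freund only requires each stump to reach weighted error $\epsilon_t \le \frac{1}{2} - \gamma$, i.e. to do slightly better than random guessing, which is what the early exit above checks.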
But in this case, the accuracy of my model is lower than Scikit-learn's, and the running time is still about three times as long.
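For reference, the Scikit-learn baseline needs only a few lines (the train/test split variables and n_estimators=50 are placeholders for my actual setup):

from sklearn.ensemble import AdaBoostClassifier

# by default AdaBoostClassifier boosts depth-1 decision trees,
# i.e. the same kind of stumps as in the code above
clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))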
I tried to see how Scikit-learn implements AdaBoost, but the code was not clear to me. I would really appreciate any comments.
Topic adaboost implementation decision-trees scikit-learn python
Category Data Science