The Differences Between Weka Random Forest and Scikit-Learn Random Forest

I have used both Weka random forest and sklearn random forest in my research, and I have realised that they combine the predictions of the base learners (the decision trees) in different ways to make the final prediction. To predict the class of an instance, Weka random forest uses a majority vote: the predicted class is the class predicted by the majority of the decision trees. The class probability of the instance is computed as the fraction of the trees in the forest that predict that class.

The sklearn random forest predicts the class of an instance as follows. The predicted class of an input instance is the class with the highest mean probability estimate across the trees. The predicted class probabilities of an input instance are computed as the mean of the predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of training samples of that class in the leaf that the instance falls into.
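To make the two aggregation rules described above concrete, here is a minimal pure-Python sketch. It does not call Weka or sklearn; the per-tree leaf distributions and the function names `soft_vote` and `hard_vote` are made up for illustration.

```python
from collections import Counter

def soft_vote(tree_probs):
    """Mean of the per-tree class distributions (the averaging rule)."""
    n_trees = len(tree_probs)
    return {c: sum(p[c] for p in tree_probs) / n_trees for c in tree_probs[0]}

def hard_vote(tree_probs):
    """Fraction of trees whose most probable class is c (the majority-vote rule)."""
    n_trees = len(tree_probs)
    votes = Counter(max(p, key=p.get) for p in tree_probs)
    return {c: votes.get(c, 0) / n_trees for c in tree_probs[0]}

# Hypothetical leaf distributions that four trees return for one instance.
tree_probs = [
    {"neg": 0.9, "pos": 0.1},
    {"neg": 0.6, "pos": 0.4},
    {"neg": 0.4, "pos": 0.6},
    {"neg": 0.6, "pos": 0.4},
]

soft = soft_vote(tree_probs)  # averages the four distributions
hard = hard_vote(tree_probs)  # counts one vote per tree
print("soft:", soft)
print("hard:", hard)
```

Both rules pick "neg" here, but the probability assigned to "pos" differs: the averaging rule keeps the information that three of the four trees were not very sure.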

I think the class probability of Weka random forest (majority vote) is correct, but the class probability of sklearn random forest does not seem correct. For sklearn random forest, the sizes of the classes in the training set determine the class probabilities of a single tree and hence the class probabilities of the random forest. So removing instances from or adding instances to the training set would change the class probabilities of the random forest, which does not seem correct. The majority vote is not affected by the sizes of the classes in the training set.

However, the cross-validation performances of Weka random forest and sklearn random forest are similar. For example, the 10-fold cross-validation results of these 2 random forest approaches on 5 datasets from the UCI database are as follows:

dataset       10-fold CV AUC (Weka RF)    10-fold CV AUC (sklearn RF)
diabetes      0.82                        0.84
ionosphere    0.98                        0.97
sonar         0.92                        0.93
wdbc          0.99                        0.99
spectf        0.99                        0.97

I did another experiment for the diabetes and sonar datasets using weka random forest and sklearn random forest respectively:

  1. Split the dataset into a training set (80%) and a test set (20%) using stratified sampling.
  2. Oversample the training set into a balanced data set of size 1000 using random sampling with replacement.
  3. Train a random forest with 50 trees.
  4. Test the random forest on the test set.
  5. Repeat steps 1 to 4 100 times.
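Steps 1 and 2 above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not the code used for the experiment: the helper names, the toy label vector, and the choice to draw an equal number of instances per class are all hypothetical. Steps 3 and 4 (training and scoring the forest) would be done with Weka or sklearn on the resulting index sets.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, rng):
    """Step 1: split indices so each class keeps roughly the same test share."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        idxs = idxs[:]            # copy before shuffling
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

def oversample_balanced(train_idx, labels, total_size, rng):
    """Step 2: sample with replacement, the same number from every class."""
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    per_class = total_size // len(by_class)
    sample = []
    for idxs in by_class.values():
        sample.extend(rng.choices(idxs, k=per_class))  # with replacement
    rng.shuffle(sample)
    return sample

rng = random.Random(0)
labels = [0] * 500 + [1] * 268    # toy imbalanced label vector (assumption)
train_idx, test_idx = stratified_split(labels, 0.2, rng)
balanced_idx = oversample_balanced(train_idx, labels, 1000, rng)
print(len(train_idx), len(test_idx), len(balanced_idx))
```

Only the training indices are oversampled; the test set keeps its original class proportions.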

The results of Weka random forest and sklearn random forest are very similar in terms of average test AUC over the 100 repetitions.

dataset       average test AUC (Weka RF)    average test AUC (sklearn RF)
diabetes      0.817                         0.818
sonar         0.933                         0.928

Q: Why are the performances of Weka random forest and sklearn random forest similar when they use different methods to compute the class probabilities of an input instance?

Another main difference is that sklearn random forest cannot be applied to discrete features. The discrete features must be transformed to numeric features by one-hot encoding or ordinal encoding before applying sklearn random forest. Weka random forest can be applied to both numeric features and discrete features directly.
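For reference, the two encodings mentioned above amount to something like the following. This is a hand-rolled sketch with hypothetical helper names and toy data; in practice one would more likely use `OrdinalEncoder` and `OneHotEncoder` from `sklearn.preprocessing`.

```python
def ordinal_encode(values):
    """Map each category to an integer index (alphabetical order here)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

colours = ["red", "green", "blue", "green"]   # toy discrete feature
ordinal, cats = ordinal_encode(colours)
one_hot, _ = one_hot_encode(colours)
print(cats)      # column order used by both encodings
print(ordinal)
print(one_hot)
```

Note that ordinal encoding imposes an arbitrary order on the categories, which a tree then splits on as if it were numeric.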

Thanks David

+1 to Craig for the answer to the actual question. But I want to address two other remarks from your post.

...the class probability of sklearn random forest does not seem correct. Because for sklearn random forest, the sizes of classes of the training set determines the class probabilities of a single tree and the class probabilities of the random forest. So, removing instances from or adding instances to the training set would change the class probabilities of the random forest which does not seem correct. The majority vote is not affected by the sizes of the classes of the training set.

Removing/adding enough instances to the training set will tip the majority in some leaves, which changes the votes of those trees and therefore the class probabilities of the Weka random forest as well. So the sklearn method is just "more continuous" than the Weka method. These two methods are sometimes referred to as "soft" and "hard" voting; e.g. both are available in the VotingClassifier class of sklearn. I suspect most of the time the final results are similar, although do note that in a highly imbalanced setting, the hard votes of each tree may never come up with the minority class, so soft voting may be preferable as a way to come up with unbiased probabilities.
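A toy illustration of the "more continuous" point, for a single leaf only (the counts, the function names and the tie-breaking towards the negative class are assumptions for the sketch): as positive training instances are added to a leaf, the soft estimate moves a little at every step, while the hard vote stays put until the majority flips.

```python
def leaf_soft_prob(n_neg, n_pos):
    """Soft estimate: fraction of positive training samples in the leaf."""
    return n_pos / (n_neg + n_pos)

def leaf_hard_vote(n_neg, n_pos):
    """Hard vote: 1 if positives are the strict majority, else 0."""
    return 1.0 if n_pos > n_neg else 0.0

n_neg = 4                      # negatives in the leaf stay fixed
rows = []
for n_pos in range(1, 8):      # add positive instances one at a time
    rows.append((n_pos, leaf_soft_prob(n_neg, n_pos), leaf_hard_vote(n_neg, n_pos)))

for n_pos, soft, hard in rows:
    print(f"pos={n_pos}  soft={soft:.2f}  hard={hard:.0f}")
```

So both methods depend on the class sizes in the training set; the hard vote just reacts in jumps rather than gradually.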

Another main difference is that sklearn random forest cannot be applied to discrete features. The discrete features must be transformed to numeric features by one-hot encoding or ordinal encoding before applying sklearn random forest. Weka random forest can be applied to both numeric features and discrete features directly.

There are actually two layers to this difference. The first is the historical development of two main tree algorithms, CART and the Quinlan family (ID3, then C4.5 and C5.0). Quinlan family trees split categorical features by creating one child per category, whereas CART always produces binary trees. However, there are implementations of CART that produce binary splits of categorical data without encoding; the second layer of the difference is that sklearn does not implement that in random forests. They do have an implementation of that categorical splitting in their HistGradientBoosting classes, which presumably at some point will get ported over to a random forest setting as well.
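To show what a CART-style binary split on an unencoded categorical feature can look like, here is a brute-force sketch: it tries every way of sending a subset of categories left and the rest right, and keeps the partition with the lowest weighted Gini impurity. This is only an illustration with toy data and hypothetical function names; it is exponential in the number of categories and is not how sklearn's HistGradientBoosting classes (or any particular CART implementation) do it.

```python
from itertools import combinations

def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_categorical_split(values, labels):
    """Return (weighted impurity, set of categories sent to the left child)."""
    categories = sorted(set(values))
    n = len(values)
    # Keep the first category on the left so mirrored partitions are not tried twice.
    first, rest = categories[0], categories[1:]
    best = None
    for k in range(len(rest)):            # k < len(rest) keeps the right child non-empty
        for extra in combinations(rest, k):
            left_set = {first, *extra}
            left = [y for v, y in zip(values, labels) if v in left_set]
            right = [y for v, y in zip(values, labels) if v not in left_set]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, left_set)
    return best

values = ["a", "a", "b", "b", "c", "c", "d", "d"]   # toy categorical feature
labels = [1, 1, 0, 0, 1, 1, 0, 0]                     # toy binary target
score, left_set = best_categorical_split(values, labels)
print(score, sorted(left_set))
```

A Quinlan-style split on the same feature would instead create four children, one per category.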

For the question:

Why the performances of weka random forest and sklearn random forest are similar but they use different methods to compute class probabilities of an input instance?

Often different algorithms will have similar results. This is not surprising. If you run the data through a GBM or logistic regression (with the proper feature engineering) you may get very similar results, at least at the level of the overall model metric.

The data is separable. Many algorithms will find something very similar. They are all looking for the separation.

The question, though, is where you want to use the model and how the results look there. For example, the AUROC might be very similar, but AUROC is not sensitive to where you want to use the model or to the relative size of errors. Size of errors matters if, say, you are predicting something involving money: being wrong on a large transaction might be much worse than being wrong on a small one. Sometimes the different algorithms might do a little better on individual observations even though the overall model metric is the same.
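A small made-up example of that point: two sets of scores with the same AUROC but very different money-weighted error. The data, the 0.5 threshold and the cost definition (sum of the amounts of misclassified transactions) are all assumptions chosen for illustration.

```python
def auc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def error_cost(labels, scores, amounts, threshold=0.5):
    """Hypothetical cost: total amount of the transactions that are misclassified."""
    return sum(
        a for y, s, a in zip(labels, scores, amounts)
        if (s >= threshold) != (y == 1)
    )

labels  = [1, 1, 0, 0]
amounts = [1000, 10, 10, 10]        # the first transaction is the large one
scores_a = [0.9, 0.4, 0.6, 0.1]     # model A gets the large transaction right
scores_b = [0.4, 0.9, 0.6, 0.1]     # model B gets the large transaction wrong

print("AUC :", auc(labels, scores_a), auc(labels, scores_b))
print("cost:", error_cost(labels, scores_a, amounts),
      error_cost(labels, scores_b, amounts))
```

Both models rank three of the four positive/negative pairs correctly, so the metric cannot tell them apart, but the one that misses the large transaction is far more expensive.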

Specific problems, such as NLP and image classification, very often work better with algorithms designed for them. Neural nets with the appropriate architecture should generally do better than an RF or GBM on these problems.

An interesting paper on this subject is by David J. Hand here.

