Why can't we sample without replacement for each tree in a random forest if the subsample size is large enough?

Question

user9343456

May 23, 2021, 18:41

Usually, if we have $n$ observations, then for each tree we form a bootstrapped sample of size $n$, drawn with replacement. One common explanation I've seen when searching is that sampling with replacement is necessary for the individual trees to be independent.

But why can't we just resample as follows: for tree 1, randomly sample $m$ observations without replacement out of the $n$, where $m$ is still large enough (of course, provided that $n$ is large enough in the first place). Then replenish all observations and repeat the resampling for tree 2, and so on.
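To make the two schemes concrete, here is a minimal sketch that draws the training indices for a few trees both ways. The sizes `n`, `m`, and `n_trees`, the seed, and the helper names `bootstrap_indices` and `subsample_indices` are illustrative assumptions, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n, m, n_trees = 1000, 800, 5     # hypothetical sizes chosen for illustration


def bootstrap_indices(rng, n):
    """Standard bagging: draw n indices out of n WITH replacement."""
    return rng.integers(0, n, size=n)


def subsample_indices(rng, n, m):
    """Proposed alternative: draw m indices out of n WITHOUT replacement."""
    return rng.choice(n, size=m, replace=False)


# Every tree draws from the full data set again, i.e. the pool is
# "replenished" between trees in both schemes.
bootstrap_sets = [bootstrap_indices(rng, n) for _ in range(n_trees)]
subsample_sets = [subsample_indices(rng, n, m) for _ in range(n_trees)]

for t in range(n_trees):
    print(
        f"tree {t}: "
        f"bootstrap distinct = {np.unique(bootstrap_sets[t]).size}, "
        f"subsample distinct = {np.unique(subsample_sets[t]).size}"
    )
```

In this sketch a subsample always contains exactly `m` distinct observations, while a bootstrap sample has `n` entries but fewer than `n` distinct ones, because some observations are drawn more than once.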

Even in this case, I'd imagine that the individual subsamples would be independent. So is there an additional reason for resampling with replacement in bagging?

Topic bagging random-forest

Category Data Science

Nikos M. · Accepted Answer · May 23, 2021, 18:41

No, the samples will not be independent; there is a possibility that the data samples will be skewed.

For example, imagine a class-imbalanced binary problem. Once the minority class has already been sampled (which is quite likely, given $n$ and $m$), then, without replacement, the remaining trees will sample only from the majority class, which will produce skewed trees.

Some references:

“Bagging based on resampling with and without replacement is equivalent”, is it?

For random forests, generally, the concept of replacement is considered essential. This is because the underlying concept of random forests is bagging to prevent overfitting, i.e., bagging builds an ensemble of estimators trained on data with a high variance (with regard to the training data they have seen).

Why does random forest use sampling with replacement instead of without replacement?

The basic idea of bootstrapping is that you use your sample as a population. And from it you sample repeatedly, with replacement, to build other samples of the same size as your original sample.

Replacement is an integral part of this process because you are trying to create other possible sample distributions which could have come from your original population, based on the sample you have.
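As a rough illustration of the bootstrap described in the passage above, the sketch below resamples $n$ indices with replacement and measures what fraction of the original observations appears at least once. The probability that a given observation is included is $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ for large $n$. The function name `unique_fraction` and its parameters are hypothetical, chosen only for this example.

```python
import numpy as np


def unique_fraction(n, n_repeats=50, seed=0):
    """Average fraction of distinct observations in a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_repeats):
        # One bootstrap sample: n draws from {0, ..., n-1} with replacement.
        sample = rng.integers(0, n, size=n)
        fractions.append(np.unique(sample).size / n)
    return float(np.mean(fractions))


n = 10_000                               # illustrative sample size
simulated = unique_fraction(n)
theoretical = 1 - (1 - 1 / n) ** n       # inclusion probability per observation

print(f"simulated fraction:   {simulated:.4f}")
print(f"theoretical fraction: {theoretical:.4f}")
```

The observations left out of a given bootstrap sample are the ones usually called out-of-bag for that tree.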
