Random Forest but keep only leaves with impurities below a threshold

Is there an algorithm out there that creates a random forest but then prunes all the leaves that have an impurity measure above a certain threshold that I would determine?

In other words, if I set the minimum samples per leaf to 500 and require leaves to be at least 90% pure, for example, the algorithm would keep only the leaves that satisfy both constraints.

My dataset is extremely noisy, so most leaves have a Gini impurity around 0.5, but a few have an impurity close to 0. Only the latter matter for my use case. Is there an algorithm that does something like what I described?

Topic lightgbm xgboost gbm random-forest machine-learning

Category Data Science


If some leaves are pruned, then the model cannot predict for the instances that would normally have fallen into them. A decision tree is meant to partition the entire input space: like any supervised learning method, it must be able to produce a prediction for every possible input. In a case where all the leaves have low purity, your condition would even leave the model unable to predict anything at all. This is why no such algorithm exists, at least not as a regular supervised learning algorithm.
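To make the "full coverage" point concrete, here is a minimal sketch (assuming scikit-learn, since the question is tagged with tree-based libraries): every input, even one far outside the training data's range, is routed to exactly one leaf, so removing leaves would leave some inputs with no prediction at all.

```python
# Illustration (assumed scikit-learn API): a fitted tree assigns
# every possible input to some leaf, so pruning leaves would create
# inputs with no prediction.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # simple learnable label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Even a point far outside the training range lands in a leaf;
# apply() returns the id of the leaf each sample falls into.
leaf_id = tree.apply(np.array([[100.0, -100.0, 0.0]]))
print(leaf_id)
```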

However, you can rely on the predicted probability for each instance instead: thresholding it gives you exactly the same information, in a way that is consistent with standard ML practice.
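A minimal sketch of this idea, assuming scikit-learn's `RandomForestClassifier` (the same probability-thresholding approach works with lightgbm or xgboost via their `predict_proba`/`predict` methods). The dataset, the 90% confidence cutoff, and `min_samples_leaf=500` mirror the numbers in the question and are illustrative, not prescriptive:

```python
# Sketch: instead of pruning impure leaves, threshold the forest's
# predicted probability and abstain on low-confidence rows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic noisy data (flip_y adds label noise, as an assumption
# standing in for the questioner's dataset).
X, y = make_classification(n_samples=5000, n_features=20,
                           flip_y=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(min_samples_leaf=500, random_state=0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)       # shape (n_samples, n_classes)
confidence = proba.max(axis=1)        # highest class probability
mask = confidence >= 0.9              # ~ "at least 90% purity"

# Predict only on the confident subset; abstain on the rest.
confident_preds = clf.predict(X_te[mask]) if mask.any() else np.array([])
print(f"kept {mask.mean():.1%} of test rows")
```

On a very noisy dataset the kept fraction may be small or even zero, which is exactly the behaviour the leaf-pruning idea would have produced, but without breaking the model's ability to predict everywhere.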
