Boosting with highly correlated features

I have a conceptual question. My understanding is that Random Forest can be applied even when features are (highly) correlated. This is because with bagging, the influence of a few highly correlated features is moderated, since each feature only occurs in some of the trees that are combined into the overall model.

My question: With boosting, usually even smaller trees (basically "stumps") are used. Is it a problem to have many (highly) correlated features in a boosting approach?

Topic bagging boosting random-forest

Category Data Science


Actually, your understanding of a random forest is not 100 percent correct. Variables are sampled per split, not per tree. So every tree still has access to all variables.
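To make the distinction concrete, here is a minimal sketch of per-split feature sampling; the values `n_features=10` and `mtry=3` are illustrative, not from the question:

```python
import random

def sample_features_per_split(n_features, mtry, rng):
    """A random forest draws a fresh candidate-feature subset at EVERY split,
    so across its many splits a single tree can still end up using all features."""
    return rng.sample(range(n_features), mtry)

rng = random.Random(0)
# Three consecutive splits in the same tree each get their own candidate subset:
splits = [sample_features_per_split(10, 3, rng) for _ in range(3)]
print(splits)
```

Because a new subset is drawn at each split, a highly correlated feature that is excluded at one split can still be picked at the next one, which is what spreads the splits across correlated copies.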

In general, tree-based models are not strongly affected by highly correlated features. Unlike least squares, there are no numerical stability issues: you can even add a variable twice without causing numerical problems. Note, however, that most interpretability tools, such as split importance or partial dependence plots, are affected by multicollinearity, so be careful with them in such cases.
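This can be illustrated with a tiny decision stump fit by exhaustive search. The toy dataset below is made up for illustration: feature 0 perfectly separates the classes, and we then append an exact duplicate of feature 0, i.e. a perfectly correlated feature. The stump's fit is unchanged, which is the "no numerical problems" point; which of the two copies receives the split (and hence the importance) is, however, arbitrary.

```python
def best_stump(X, y):
    """Exhaustively find the (error, feature, threshold) split that
    minimizes the number of misclassified points under majority vote."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            # errors = minority-class count on each side
            err = min(left.count(0), left.count(1)) + min(right.count(0), right.count(1))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best

# Toy data: y is determined by feature 0 (threshold 0.5); feature 1 is noise.
X = [[0.1, 5.0], [0.4, 3.0], [0.6, 1.0], [0.9, 2.0]]
y = [0, 0, 1, 1]
X_dup = [row + [row[0]] for row in X]  # append an exact copy of feature 0

print(best_stump(X, y))      # → (0, 0, 0.5): perfect split on feature 0
print(best_stump(X_dup, y))  # → (0, 0, 0.5): duplicate feature changes nothing
```

The duplicated column gives the same split quality at the same threshold, so the fit is identical; but since features 0 and 2 are interchangeable here, any importance measure splits credit between them arbitrarily, which is why importance and partial dependence plots are unreliable under multicollinearity.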
