Problems with decision tree labeling of nodes

Decision trees, as we know, assign a label to each leaf node based on majority class voting. I am curious what the problems with such a labeling scheme could be. Does it lead to overfitting the data?

Tags: cart, decision-trees, random-forest



A decision tree does assign the label by majority vote among the training instances that satisfy the attribute test conditions leading to a node.

Regarding the class label assignment:

When the tree grows deep, there may not be enough instances left at a certain branch/test condition/node, so the majority vote there is not a statistically reliable estimate of the class label. This is known as the data fragmentation problem.

For example, in a tree with 50 nodes, at depth 10 the branch for day = Humid might have only one instance left, which happens to be negative. The node is labeled negative, but there is not enough data to support that label. The sketch below illustrates this.
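
Here is a minimal sketch of data fragmentation, assuming scikit-learn and an illustrative synthetic dataset from `make_classification` (not data from the question). A tree grown without any limit ends up with many leaves holding only a sample or two, whose majority labels are statistically unreliable:

```python
# Sketch: count how small the leaves of an unpruned tree get.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # grown with no stopping condition
tree.fit(X, y)

# tree.apply(X) returns the leaf index each training sample falls into.
leaf_ids, leaf_sizes = np.unique(tree.apply(X), return_counts=True)
print("number of leaves:", tree.get_n_leaves())
print("leaves with <= 2 samples:", np.sum(leaf_sizes <= 2))
```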

One way to address this is to disallow growing the tree beyond a certain threshold on the number of nodes, i.e., to impose a stopping condition (pre-pruning).
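
In scikit-learn terms, one way to sketch such a stopping condition is through size thresholds like `max_leaf_nodes` and `min_samples_leaf` (the specific threshold values here are illustrative, not prescriptive):

```python
# Sketch of pre-pruning: cap tree size and leaf population.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pruned = DecisionTreeClassifier(
    max_leaf_nodes=50,   # cap on the number of leaves (node-count threshold)
    min_samples_leaf=5,  # refuse any leaf with fewer than 5 instances
    random_state=0,
)
pruned.fit(X, y)
print("leaves in the pruned tree:", pruned.get_n_leaves())
```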

This also brings us to overfitting. There is a classic plot of error versus number of nodes, on both training and test data, that shows how overfitting happens in a decision tree.

As the graph below shows, a tree with more nodes has lower training error, but its test error is higher. The growing gap between test and training error tells us the tree is overfitting, i.e., capturing noise, as its size grows.

[Figure: training and test error versus number of tree nodes; training error keeps decreasing as the tree grows, while test error rises again past a certain tree size]
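
A rough way to reproduce that curve numerically, again assuming scikit-learn and a synthetic dataset with some label noise (`flip_y`) so the effect shows up:

```python
# Sketch: training error falls with tree size, test error turns back up.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_leaves in (2, 8, 32, 128, 512):
    t = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    t.fit(X_tr, y_tr)
    print(f"leaves={n_leaves:4d}  "
          f"train error={1 - t.score(X_tr, y_tr):.3f}  "
          f"test error={1 - t.score(X_te, y_te):.3f}")
```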

Now, a random forest is an ensemble/forest of multiple decision trees. To classify an example, we take a majority vote over the trees.
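
A quick sketch of that idea with scikit-learn (note that scikit-learn's `RandomForestClassifier` technically averages predicted class probabilities rather than taking a hard majority vote, but the effect is the same: combining many trees damps the variance a single overgrown tree suffers from):

```python
# Sketch: compare one unpruned tree against a forest on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree test error:", round(1 - single.score(X_te, y_te), 3))
print("random forest test error:", round(1 - forest.score(X_te, y_te), 3))
```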
