Problems with decision tree labeling of nodes

Decision trees, as we know, assign a label to each leaf node based on majority class voting. I am curious what the problems with such a labeling scheme could be. Does it lead to overfitting the data?

Tags: cart, decision-trees, random-forest



A decision tree does assign the label by majority vote among the training instances that satisfy the attribute test conditions leading to a node.

Regarding the class label assignment:

When the tree grows deep, there may not be enough instances left at a certain branch/test condition/node, so the majority vote there is not a statistically reliable estimate of the class label. This is known as the data fragmentation problem.

For example, in a tree with 50 nodes, at depth 10 the branch for day = Humid might have only one instance left, which happens to be negative. The node is labeled negative, but there is not enough data to support that label. The sketch below illustrates this.
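
Here is a minimal sketch of data fragmentation, assuming scikit-learn and an illustrative synthetic dataset from `make_classification` (not data from the question). A tree grown without any limit ends up with many leaves holding only a sample or two, whose majority labels are statistically unreliable:

```python
# Sketch: count how small the leaves of an unpruned tree get.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # grown with no stopping condition
tree.fit(X, y)

# tree.apply(X) returns the leaf index each training sample falls into.
leaf_ids, leaf_sizes = np.unique(tree.apply(X), return_counts=True)
print("number of leaves:", tree.get_n_leaves())
print("leaves with <= 2 samples:", np.sum(leaf_sizes <= 2))
```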

One way to address this is to disallow growing the tree beyond a certain threshold on the number of nodes, i.e., to impose a stopping condition (pre-pruning).
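
In scikit-learn terms, one way to sketch such a stopping condition is through size thresholds like `max_leaf_nodes` and `min_samples_leaf` (the specific threshold values here are illustrative, not prescriptive):

```python
# Sketch of pre-pruning: cap tree size and leaf population.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pruned = DecisionTreeClassifier(
    max_leaf_nodes=50,   # cap on the number of leaves (node-count threshold)
    min_samples_leaf=5,  # refuse any leaf with fewer than 5 instances
    random_state=0,
)
pruned.fit(X, y)
print("leaves in the pruned tree:", pruned.get_n_leaves())
```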

This also brings us to overfitting. There is a classic plot of error versus number of nodes, on both training and test data, that shows how overfitting happens in a decision tree.

As the graph below shows, a tree with more nodes has lower training error, but its test error is higher. The growing gap between test and training error tells us the tree is overfitting, i.e., capturing noise, as its size grows.

[Figure: training and test error versus number of tree nodes; training error keeps decreasing as the tree grows, while test error rises again past a certain tree size]
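
A rough way to reproduce that curve numerically, again assuming scikit-learn and a synthetic dataset with some label noise (`flip_y`) so the effect shows up:

```python
# Sketch: training error falls with tree size, test error turns back up.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_leaves in (2, 8, 32, 128, 512):
    t = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    t.fit(X_tr, y_tr)
    print(f"leaves={n_leaves:4d}  "
          f"train error={1 - t.score(X_tr, y_tr):.3f}  "
          f"test error={1 - t.score(X_te, y_te):.3f}")
```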

Now, a random forest is an ensemble/forest of multiple decision trees. To classify an example, we take a majority vote over the trees.
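
A quick sketch of that idea with scikit-learn (note that scikit-learn's `RandomForestClassifier` technically averages predicted class probabilities rather than taking a hard majority vote, but the effect is the same: combining many trees damps the variance a single overgrown tree suffers from):

```python
# Sketch: compare one unpruned tree against a forest on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree test error:", round(1 - single.score(X_te, y_te), 3))
print("random forest test error:", round(1 - forest.score(X_te, y_te), 3))
```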
