How are split decisions for observations (not features) made in decision trees?

I have read a lot of articles about decision trees, and every one of them focused only on explaining how a feature/column is chosen for a split, based on criteria like Gini index, entropy, chi-square and information gain. But not one of them talked about the observation part.

Example: Let's say I have a dataset with 3 independent features and 1 discrete target variable: height_in_cm (like 130, 140), performance_in_class (like below average, average, very good), class (like 7th, 8th or 10th class), and plays_cricket (1 for yes, 0 for no) as the target variable. For a split, I may try all the features iteratively and settle on the one that gives me the most information gain, or the purest nodes. For example, let's start with the first variable, height_in_cm: after a split I get two child nodes, say height < 120 in one child node and height >= 120 in the other, and I then calculate the Gini impurity for each, which turns out to be 0.45 and 0.49 respectively.

Questions:

  1. Just like I iteratively try all the feature combinations, do I also need to try all the combinations of a feature split? For the above case: height < 100 & height >= 100, then height < 110 & height >= 110, height < 90 & height >= 90, and so on. How do I do this, and what's the efficient way?
  2. Just like there are metrics such as Gini impurity and entropy to measure the quality of a feature split, are there any metrics to measure the quality of a split based on observations?

Topic cart machine-learning-model decision-trees scikit-learn machine-learning

Category Data Science


"Do I also need to try all the combinations of a feature split? For the above case, height < 100 & height >= 100, then height < 110 & height >= 110, height < 90 & height >= 90, and so on."

Suppose that you only have 3 possible values of height: 80, 90, 100. Then the decision tree will only try to split on X >= 80, X >= 90 and X >= 100. If an instance does not satisfy the condition, it falls into the other leaf.

So if the value is above the threshold the instance falls into one leaf; if it's below, it falls into the other. This is then applied recursively.
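A minimal sketch of that search, assuming the common CART convention of trying midpoints between sorted unique feature values (the function names and toy data here are my own illustration, not from a specific library):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try each candidate threshold (midpoint between adjacent sorted
    unique values of x) and return the one with the lowest weighted
    Gini impurity of the two resulting child nodes."""
    best_t, best_score = None, float("inf")
    values = np.sort(np.unique(x))
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2.0                    # midpoint candidate
        left, right = y[x < t], y[x >= t]      # one condition, two leaves
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# toy data echoing the question: height_in_cm vs plays_cricket
x = np.array([80, 90, 90, 100, 100, 100])
y = np.array([0, 0, 1, 1, 1, 1])
print(best_threshold(x, y))
```

So there is no need to try `height < t` and `height > t` as separate splits: a single threshold already partitions every observation into one of the two children.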

[Decision tree splits example]

So the answer to this is NO. It will only try one of the logical conditions -- >= for example. I believe this depends on the implementation.
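Since the question is tagged scikit-learn: you can verify that each internal node stores exactly one `<=` condition by inspecting a fitted tree (the toy data below is just an illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical toy data echoing the question's height example
X = np.array([[80], [90], [90], [100], [100], [100]])
y = np.array([0, 0, 1, 1, 1, 1])

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# each internal node prints a single "feature <= threshold" test;
# instances failing it go to the other branch automatically
print(export_text(clf, feature_names=["height_in_cm"]))
```

The printed rules show thresholds at midpoints between observed values (e.g. 95 between 90 and 100), confirming that only one condition per node is ever evaluated.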

I encourage you to build a decision tree from scratch to fully understand the algorithm.

I recommend this video: https://www.youtube.com/watch?v=y6DmpG_PtN0 but there are a lot of tutorials out there. Choose the one that suits you best.
