Why is rpart not splitting this data even when there is gain in gini?

library(tibble)
library(rpart)
df <- tibble(x1 = factor(c("S1", "S1", "S2", "S2")), y = factor(c(1, 1, 0, 1)))
md <- rpart(formula = y ~ ., data = df, method = "class", control = rpart.control(minsplit = 2, cp = 0))
nrow(md$frame) # outputs 1

Consider the split on x1:

Left child node: (S1, 1), (S1, 1)

Right child node: (S2, 0), (S2, 1)

Here the gain in Gini impurity would be $\frac{1}{8} = 0.125$.
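
For reference, the 0.125 figure can be reproduced by hand. This is a sketch of the arithmetic, assuming the usual Gini impurity $G = 1 - \sum_k p_k^2$ and a gain defined as the parent impurity minus the size-weighted child impurities (the symbols $G$ and $\Delta$ are introduced here only for illustration):

$$
\begin{aligned}
G_{\text{parent}} &= 1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2 = \tfrac{3}{8}, \\
G_{\text{left}} &= 1 - 1^2 - 0^2 = 0, \\
G_{\text{right}} &= 1 - \left(\tfrac{1}{2}\right)^2 - \left(\tfrac{1}{2}\right)^2 = \tfrac{1}{2}, \\
\Delta &= \tfrac{3}{8} - \left(\tfrac{2}{4} \cdot 0 + \tfrac{2}{4} \cdot \tfrac{1}{2}\right) = \tfrac{1}{8}.
\end{aligned}
$$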

Why is rpart not doing this split?



It seems that rpart actually uses the misclassification error (accuracy) rather than the Gini impurity in its cost-complexity pruning; see e.g. https://stats.stackexchange.com/a/223211/232706

Your split doesn't improve the misclassification rate: one of the four observations is misclassified both before and after it. So rpart doesn't make the split, even with cp=0. With cp=-1, the split is performed.
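
To make this concrete, here is a minimal sketch that counts the misclassified observations before and after the split and then refits the tree with both cp values. It assumes the same toy data as in the question, built with a base data.frame instead of a tibble. The helper `misclass` and the variable names are hypothetical and introduced only for this illustration.

```r
library(rpart)

# Same toy data as in the question (base data.frame instead of tibble).
df <- data.frame(
  x1 = factor(c("S1", "S1", "S2", "S2")),
  y  = factor(c(1, 1, 0, 1))
)

# Hypothetical helper: number of observations that a majority-vote
# prediction gets wrong for a given set of class labels.
misclass <- function(y) length(y) - max(table(y))

err_root  <- misclass(df$y)                      # errors with no split
err_split <- sum(tapply(df$y, df$x1, misclass))  # errors after splitting on x1

# Refit with the two cp values discussed above.
fit_cp0 <- rpart(y ~ ., data = df, method = "class",
                 control = rpart.control(minsplit = 2, cp = 0))
fit_cpneg <- rpart(y ~ ., data = df, method = "class",
                   control = rpart.control(minsplit = 2, cp = -1))

print(c(root = err_root, split = err_split))
print(c(cp0 = nrow(fit_cp0$frame), cpneg = nrow(fit_cpneg$frame)))
```

Both error counts come out equal, which is consistent with the explanation above: the split lowers the Gini impurity but not the number of misclassified observations. The number of rows in `frame` for the cp=-1 fit should be larger than for the cp=0 fit if the split is performed as described.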
