Fix first two levels of decision tree?

I am trying to build a regression tree with 70 attributes. The business team wants the first two levels fixed to two specific variables, country and product type. To achieve this, I have two proposals:

  1. Build a separate tree for each combination of country and product type, subset the training data accordingly, and route each prediction to the corresponding tree (an approach I have seen here in comments). I have 88 levels in country and 3 levels in product type, so this generates 264 trees.

  2. Build a basic tree using only the two variables country and product type, with a cp value low enough that every combination becomes a leaf node (264 leaves). Build a second tree with all remaining variables, then stack the second tree beneath the first to form a single decision tree.

I do not think the first approach is the right way to do it. I am also stuck on how to stack the trees in the second approach; even if it is not the right way, I would love to know how to achieve it.
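For reference, the first proposal can be sketched as a dictionary of per-combination regressors plus a routing function. This is a minimal sketch using scikit-learn rather than rpart, with synthetic data and only two countries and two product types; the real version would hold 264 entries:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (real data: 88 countries, 3 product types, 70 attributes)
rng = np.random.default_rng(0)
n = 200
countries = rng.choice(["US", "DE"], size=n)
products = rng.choice(["A", "B"], size=n)
x = rng.normal(size=(n, 3))
y = x[:, 0] * 2 + (countries == "US") * 5 + rng.normal(scale=0.1, size=n)

# One tree per (country, product type) combination, fit on that subset only
models = {}
for c in np.unique(countries):
    for p in np.unique(products):
        mask = (countries == c) & (products == p)
        models[(c, p)] = DecisionTreeRegressor(max_depth=3).fit(x[mask], y[mask])

def predict(country, product, features):
    """Route a single observation to the tree for its combination."""
    return models[(country, product)].predict(np.atleast_2d(features))[0]
```

The routing step is what makes the 264 trees behave like one model at prediction time: the two "fixed levels" are simply the dictionary lookup.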

Please guide me on how to approach this problem.

Topic decision-trees predictive-modeling r machine-learning

Category Data Science


I think you could do this fairly automatically if you're open to using Python. A library called auto_ml* has a feature called categorical ensembling, where you can explicitly say "I want a model built for each level of this feature". If you create a combined country-product-type feature and use that as your category, the rest should be pretty easy.
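The combined category is just a string concatenation of the two columns. A minimal sketch with pandas (column names are assumptions; I am not showing the auto_ml training call itself, whose exact API you should check in its docs):

```python
import pandas as pd

# Hypothetical frame with the two business-mandated columns
df = pd.DataFrame({
    "country": ["US", "US", "DE"],
    "product_type": ["A", "B", "A"],
})

# One combined categorical feature, one level per (country, product type) pair;
# with 88 countries and 3 product types this yields up to 264 levels.
df["country_product"] = df["country"] + "__" + df["product_type"]
```

You would then point categorical ensembling at `country_product`, so one sub-model is trained per level.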

*Disclosure: I've made minor contributions to auto_ml. It is FOSS under the MIT license.


Depending on which tree algorithm you want to use, you could manually construct the first two levels of the tree. You can follow the pseudo code explained, for example, here for the C4.5 tree. Once you have done this, remove the two features from the data set and build trees for the remaining part of the tree. If you want to create an rpart object, you would need to adapt parts of its source, which may be more demanding. Depending on the tree algorithm, you will just have a binary split at each of the two levels, so you will only need to build 4 separate trees, not 264. Note that the result may not be the optimal decision tree, since after stepping through the first two levels, country and product type may still be the variables that would cause a split. But without seeing the data it is impossible to tell.
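To make the "4 trees, not 264" point concrete, here is a sketch with hand-fixed binary splits. The country and product groupings (`EU`, `GROUP_A`) are hypothetical placeholders; in practice you would choose them by information gain as in the C4.5 pseudo code. Each of the 4 branches then gets its own tree fit on the remaining features only (scikit-learn stands in for rpart):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hand-fixed binary splits for the first two levels (placeholder groupings)
EU = {"DE", "FR"}       # level 1: country in EU vs. not
GROUP_A = {"A"}         # level 2: product type A vs. rest

def branch(country, product):
    """Walk the two fixed levels; returns one of 4 branch keys."""
    return (country in EU, product in GROUP_A)

# Synthetic stand-in data
rng = np.random.default_rng(1)
n = 400
countries = rng.choice(["DE", "FR", "US", "JP"], size=n)
products = rng.choice(["A", "B", "C"], size=n)
x = rng.normal(size=(n, 2))           # the "remaining" features
y = x[:, 0] + np.isin(countries, list(EU)) * 3.0

# One subtree per fixed branch, trained without country/product type
branches = np.array([branch(c, p) for c, p in zip(countries, products)])
leaves = {}
for key in [(False, False), (False, True), (True, False), (True, True)]:
    mask = (branches[:, 0] == key[0]) & (branches[:, 1] == key[1])
    leaves[key] = DecisionTreeRegressor(max_depth=4).fit(x[mask], y[mask])

def predict(country, product, features):
    """Fixed two-level walk, then the branch's own subtree."""
    return leaves[branch(country, product)].predict(np.atleast_2d(features))[0]
```

Stacking here is just function composition: the fixed levels select a subtree, and the subtree handles everything below them.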

Side note: it may be valuable to explain to the business that country and product type are not the most sensible variables to have at the top of the decision tree. Sometimes it is better to educate the end users than to force machine learning to do something inaccurate. In my experience, end users prefer a correct solution over one that merely matches a gut feeling about how it should work.
