Random selection of variables in each run of python sklearn decision tree (regression)

When I set random_state=None and run a decision tree for regression in Python's sklearn, why does it use different variables to build the tree each time?

Shouldn't there be only a few top variables used for splitting, which would give me similar trees every time?

Also, if I use an integer for random_state and run the decision tree, it gives me a different tree for each random_state setting. Which tree should I select out of so many?

Tags: cart, decision-trees, regression, scikit-learn


The documentation says:

random_state : int, RandomState instance, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer.

So:

why does it use different variables to build the tree each time?

Indeed, the randomness comes from picking different variables while building the tree.
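As a quick illustration (synthetic data, purely for demonstration): fitting the same DecisionTreeRegressor repeatedly with random_state=None may yield different trees, while fixing the random state to an integer makes the fit reproducible.

```python
# Minimal sketch on synthetic data: random_state=None can give a different
# tree on each fit, while a fixed integer gives the same tree every time.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# With random_state=None the feature chosen at the root may vary across runs
# (it only varies if several candidate splits are tied, as explained below).
for run in range(3):
    tree = DecisionTreeRegressor(random_state=None).fit(X, y)
    print("run", run, "root split on feature", tree.tree_.feature[0])

# With a fixed integer the fitted structure is identical on every run.
tree_a = DecisionTreeRegressor(random_state=42).fit(X, y)
tree_b = DecisionTreeRegressor(random_state=42).fit(X, y)
print(np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature))  # True
```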

Shouldn't there be only a few top variables used for splitting, which would give me similar trees every time?

Not necessarily; this depends on the parameters and the data:

  • If an int is given as the value for the parameter random_state, a particular random state is used every time, so the tree is fixed, even though it's not necessarily the only tree possible. Let's assume from here on that the default value None is provided.
  • If the parameter max_features is anything other than None or "auto", then a random subset of the features is selected before every split. This means that a particular variable might not even be considered at a given split, which can cause differences between runs. Let's also assume the default value None is provided (all features are considered at every split).
  • If all the features can be used at every split, there's no randomness anymore in the selection before the split. But even if the default value "best" is provided for splitter, it would be wrong to assume that there is a single best variable at each node: it's possible that two (or more) different variables are tied, i.e. they would provide an equal improvement in the chosen criterion (mse by default). A small sketch illustrating such a tie follows this list.
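Here's that sketch (synthetic data, constructed so that a tie is guaranteed: the second feature is an exact copy of the first). With random_state=None, the tree picks one of the tied variables essentially at random on each fit.

```python
# Sketch: feature 1 is a copy of feature 0, so every split on one of them is
# exactly tied with the same split on the other. Across repeated fits with
# random_state=None (the default), the root split lands on either feature.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
X = np.hstack([x, x])                 # two identical columns
y = x.ravel() + rng.normal(scale=0.1, size=300)

root_feature_counts = Counter(
    int(DecisionTreeRegressor(max_depth=2).fit(X, y).tree_.feature[0])
    for _ in range(20)
)
print(root_feature_counts)  # typically a mix of feature 0 and feature 1
```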

Conclusion: assuming you used the default None for max_features, the fact that you obtain different trees on every run means that ties between variables occur frequently with your data. If there are only minor differences in the nodes at the bottom of the tree, it's not significant; but if there are many changes, including close to the root of the tree, then the model is unstable. The latter case is usually a bad sign: possibly none of the features is really helpful for predicting the target, and the model overfits.

Also, if I use an integer for random_state and run the decision tree, it gives me a different tree for each random_state setting. Which tree should I select out of so many?

The possibility to set a random state is not intended as a way to select a particular tree; it's intended to make the experiment reproducible. So, in my opinion, you shouldn't pick one of the trees like this. Instead of selecting an arbitrary tree, you could try to investigate why the model is unstable and fix it if possible. The first step is to actually evaluate the model(s) on some held-out test data: if there is a problem such as overfitting, it will be visible in the performance.
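As a hedged sketch of that first step (synthetic data, names purely illustrative): fit the tree with several seeds and look at the spread of test scores, rather than hand-picking a seed.

```python
# Sketch: check stability and generalisation across seeds instead of
# hand-picking a random_state.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = [
    r2_score(y_test,
             DecisionTreeRegressor(random_state=seed)
             .fit(X_train, y_train)
             .predict(X_test))
    for seed in range(10)
]

# A large spread means the trees genuinely differ in quality (unstable model);
# a low mean suggests the tree overfits or the features aren't very predictive.
print("test R^2: mean %.3f, std %.3f" % (np.mean(scores), np.std(scores)))
```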
