Decision tree results change on every run; how can I trust my results?

Given a dataset, I split the data into a train and a test set. I want to use a decision-tree classifier (sklearn) for a binary classification problem. Even though I have already found the best hyper-parameters for my model, every time I run it on the test set (with the same hyper-parameters) I obtain a different result. Why is that? Using accuracy as the metric, the score varies between 0.5 and 0.8. Which result should I take as correct? I am not sure whether it is correct to take the best result on the test set, or whether I should take an average of the results. For example, with accuracy as the scoring for GridSearchCV (a simplified sketch of the code is at the end of the question), I obtain these grid scores on the development set:

0.627 (+/-0.129) for {'max_features': 2}
0.558 (+/-0.152) for {'max_features': 3}
-- Best parameters: {'max_features': 2}
    Best score: 0.626876876876877 (this is the accuracy)

Using the best estimator on the test set, I obtain an accuracy of 0.83, which I think is only due to chance. In fact, when I try again, the result is this:

Grid scores on development set:
0.584 (+/-0.126) for {'max_features': 2}
0.572 (+/-0.168) for {'max_features': 3}
-- Best parameters: {'max_features': 2}
    Best score: 0.5840215215215215
Accuracy on the test set: 0.62!

So, how can I trust my results? Second, wouldn't it be better to use CV on all the data, instead of splitting the dataset into train and test just once at the beginning?

I read about random_state, but the problem is that the results depend on which value I use. For example, random_state=2 gives accuracy 0.6 on the test set, while random_state=6 gives 0.79, so this does not really solve my problem. How can I validate my model if I don't know which value to use?
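
For reference, roughly my code looks like this (a simplified sketch: data loading is omitted, and the split size and number of CV folds are just placeholders):

    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # X, y are the feature matrix and binary labels (loading omitted)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # grid search over max_features, scored by accuracy
    grid = GridSearchCV(DecisionTreeClassifier(),
                        param_grid={'max_features': [2, 3]},
                        scoring='accuracy', cv=5)
    grid.fit(X_train, y_train)
    print("Best parameters:", grid.best_params_)
    print("Best score:", grid.best_score_)

    # evaluate the selected model on the held-out test set
    y_pred = grid.best_estimator_.predict(X_test)
    print("Test accuracy:", accuracy_score(y_test, y_pred))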

Topic decision-trees cross-validation machine-learning

Category Data Science


how can I trust my results?

You probably shouldn't trust your results, because the large variation is likely caused by overfitting. Basically your model is not very reliable.

My guess is that you have too many features, not enough instances, or some combination of the two.

I am not sure whether it is correct to take the best result on the test set, or whether I should take an average of the results

It is definitely not correct to take the best result. The average result is more representative of the true performance, but you should also provide the standard deviation.
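
For example, here is a minimal sketch of reporting the mean and standard deviation over repeated random train/test splits (X and y are assumed to hold the features and labels; the number of repetitions, the split size and the max_features value are illustrative choices):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # X, y assumed to be the feature matrix and binary labels
    scores = []
    for seed in range(20):  # 20 repetitions, chosen arbitrarily
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        clf = DecisionTreeClassifier(max_features=2, random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, clf.predict(X_test)))

    print("Accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))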

Second, wouldn't it be better to use CV on all the data, instead of splitting the dataset into train and test just once at the beginning?

Yes, that's a good idea: as far as I understand, you're currently just running the program manually several times and getting a different performance each time. You can indeed use cross-validation instead.
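
A minimal sketch of what that could look like with scikit-learn, assuming X and y hold the full dataset (the number of folds and the fixed max_features value are illustrative choices):

    from sklearn.model_selection import cross_val_score, StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    # 10-fold stratified cross-validation on the whole dataset
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(max_features=2), X, y,
                             scoring='accuracy', cv=cv)
    print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

Note that if you also tune hyper-parameters, the tuning should happen inside each training fold (nested cross-validation), otherwise the reported score is optimistic.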

I read about random_state, but the problem is that the results depend on which value I use. For example, random_state=2 gives accuracy 0.6 on the test set, while random_state=6 gives 0.79, so this does not really solve my problem. How can I validate my model if I don't know which value to use?

That's normal: the split function separates the data into training and test sets randomly. Setting the random seed to a particular value guarantees exactly the same split every time, but you don't want that, since it just fixes one arbitrary split instead of giving you a representative estimate.

Instead, you should work on reducing the variation. It's usually not possible to add instances, so you should probably try to reduce the number of features or simplify them.
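
As an illustration, one possible way to do this is univariate feature selection inside a pipeline, so the selection is refit on each training fold (the number of features kept, k=5, is an arbitrary placeholder; X and y are assumed as before):

    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # keep only the k best features before fitting the tree
    model = make_pipeline(SelectKBest(f_classif, k=5),
                          DecisionTreeClassifier(max_features=2))
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
    print("CV accuracy with fewer features: %.3f +/- %.3f"
          % (scores.mean(), scores.std()))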
