Difference between DataSampler and TestAndScore widgets in Orange

I'm trying to figure out the difference between using the data sampler to get a 70/30 train/test split and directly using the test and score widget to do so via random sampling. My workflow as follows

This is how my test and score widget looks for the case without the data sampler

and this is how my data sampler widget looks

I see very different results in the confusion matrix at the end between the two. Using the data sampler, I get a much better model than without it. However, if I directly try and leverage the train_test_split function with LogisticRegression in scikit-learn with similar hyper params as Orange(e.g. for solver, C, class_weights etc.) My results there are much closer to what I see in Orange without using data sampler.

  • Could someone help me figure out what I'm missing?

  • What is the difference between the two widgets in how I'm using them?

  • Does the 70% in DataSampler not correspond to a train_test_split function in scikit-learn with train_size=0.7?

Topic orange3 orange

Category Data Science


Figured out the issue, I'd mis-configured the channels between the 'data sampler' and 'test and score' widgets to use only training data for testing. It should in fact be configured this way instead-->

enter image description here

Earlier I had 'Data Sample' from the 'Data Sampler' mapped to both 'Data' and 'Test Data' in the 'Test and Score' widget. 'Remaining Data' is what needs to be mapped to 'Test Data' as that is the 'out of sample' data set for testing.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.