Checking model stability - performance for different classes

I am working on a multi-class classification problem. The goal is to predict whether a match will be won by the HomeTeam or the AwayTeam, or end in a Draw. I did feature engineering on the attributes and came up with a final dataset to train a classifier. I made sure that the data is balanced across all 3 classes.

To train a classifier I tried an XGB Classifier, Logistic Regression, an SGD Classifier and a plain DNN (TensorFlow Estimator). I checked the metrics for all the classifiers and I am picking the best one.
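
As a hedged sketch of how such a comparison might be wired up in scikit-learn (the synthetic data, model settings and macro-F1 metric here are illustrative stand-ins, not the asker's actual pipeline):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Placeholder data standing in for the engineered match features;
    # classes 0/1/2 stand in for A/D/H (XGBClassifier expects integer labels)
    X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                               random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                      random_state=0)

    models = {
        "xgb": XGBClassifier(),
        "logreg": LogisticRegression(max_iter=1000),
        "sgd": SGDClassifier(),  # linear model trained with SGD
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Macro-averaged F1 weights the three classes equally
        scores[name] = f1_score(y_val, model.predict(X_val), average="macro")

    print(scores, "-> best:", max(scores, key=scores.get))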

Linear SGD Classifier Performance on Validation Set

     class      pre       rec       spe       f1        geo       iba       sup

       A        0.58      0.69      0.79      0.63      0.74      0.54      275
       D        0.51      0.61      0.66      0.55      0.63      0.40      338
       H        0.81      0.50      0.94      0.62      0.69      0.45      315

    avg/mean    0.63      0.60      0.79      0.60      0.68      0.46      928

Model Performance for Test Dataset

     class      pre       rec       spe       f1        geo       iba       sup

       A        0.87      0.55      0.97      0.67      0.73      0.51       84
       D        0.43      0.69      0.66      0.53      0.67      0.45       83
       H        0.80      0.69      0.86      0.74      0.77      0.58      139
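
The per-class columns above, pre / rec / spe / f1 / geo / iba / sup (precision, recall, specificity, F1, geometric mean, index balanced accuracy and support), look like the output of imbalanced-learn's classification_report_imbalanced. A minimal sketch, assuming that is the report being used and with made-up labels standing in for the real test split:

    from imblearn.metrics import classification_report_imbalanced

    # Made-up labels/predictions standing in for the real test split
    y_test = ["H", "A", "D", "H", "A", "D", "H", "H"]
    y_pred = ["H", "A", "D", "D", "A", "A", "H", "H"]

    # Prints the same pre/rec/spe/f1/geo/iba/sup columns per class
    print(classification_report_imbalanced(y_test, y_pred))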

We can see that this model is stable for classes A and H, but the precision is very poor for class D. I think the model is not performing well for class D because of a lack of informative features, even though I did several rounds of EDA and feature engineering to try to increase the recall for class D.

My question is: can this model be considered stable?

Topic: data-science-model, machine-learning-model, model-selection, ensemble-modeling

Category: Data Science


A consideration: I don't think class A is stable, since there is a big difference between its validation and test results.

Some questions before answering:

  1. Are you using cross-validation, and with how many folds? Is the best result the mean over the fold results? What is the standard deviation? If the standard deviation is big, that tells us something (see the sketch after this list).

  2. Are your folds shuffled? I see a lot of people building folds in scikit-learn without shuffling, since shuffling is not the default behavior (also shown in the sketch below).

  3. Is your test data unbalanced? If so, a difference between test and validation results is normal.

  4. How many examples are you using for training and how many for testing? Sometimes the stability of the model requires more examples.
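
A minimal sketch of what questions 1 and 2 point at, using scikit-learn (the data and classifier are placeholders, not the asker's actual features):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Placeholder data; swap in the engineered match features
    X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                               random_state=0)

    # shuffle=True matters: StratifiedKFold does NOT shuffle by default, so data
    # ordered by season/date would otherwise end up in contiguous folds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    scores = cross_val_score(SGDClassifier(), X, y, cv=cv, scoring="f1_macro")
    print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")
    # A large std across folds is itself a sign that the model is not stable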
