How do I know that model performance improvement is significant?

Say I am running a machine learning model that produces a certain result (say, an accuracy of 80%). I now change a minor detail in my model (say, in a deep learning model, increase the kernel size in one convolutional layer) and run the model again, leading to an accuracy of 0.8 + x.

My question is: what increase or decrease in performance allows me to say that the new network architecture is better than my old one? I assume that x = 0.0001 falls within a reasonable margin of error, whereas x = -0.2 is a significant decrease in performance; however, my use of "significant" here is purely colloquial, without any scientific backing.

I understand that some kind of hypothesis testing would in theory be appropriate here, but as far as I know, such tests require multiple samples (i.e. running the network many times), which isn't really feasible for large ML models that sometimes take days to train.

Topic hypothesis-testing statistics machine-learning

Category Data Science


Your last paragraph is actually the answer, but there is no need to train your model several times. You only need to validate/evaluate your model several times, which can of course be done on smaller sets:

  1. Have several validation sets. The number of sets is not the main point, but make sure they cover the distribution of the data, especially all classes.
  2. Run both models on those sets and keep the results in two lists/arrays.
  3. Simple/practical solution: if the difference between their means is larger than their individual standard deviations, you are good to go.
  4. A bit more rigorous: you now have two sets of numbers (the performance of each model). Use a simple ANOVA (equivalent to a t-test for two groups) and check the p-value to see whether their difference is significant. A sketch of both checks follows this list.
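Here is a minimal sketch of steps 1-4, assuming two already-trained scikit-learn-style models (the names `model_a`, `model_b`, and the helper `evaluate_on_splits` are hypothetical) and a labelled held-out pool `X_pool`, `y_pool` as NumPy arrays. `StratifiedKFold` is used only to carve the pool into class-balanced validation subsets:

```python
# Sketch: compare two trained models on several validation subsets.
# Assumes models with a .predict() method and a held-out labelled pool;
# model_a, model_b, X_pool, y_pool are placeholder names.
import numpy as np
from scipy import stats
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def evaluate_on_splits(model, X_pool, y_pool, n_splits=10, seed=0):
    """Score one trained model on several stratified validation subsets."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for _, val_idx in skf.split(X_pool, y_pool):
        # Stratification keeps every class represented in each subset (step 1).
        y_pred = model.predict(X_pool[val_idx])
        scores.append(accuracy_score(y_pool[val_idx], y_pred))
    return np.array(scores)

# Step 2: score both trained models on the same subsets.
scores_a = evaluate_on_splits(model_a, X_pool, y_pool)
scores_b = evaluate_on_splits(model_b, X_pool, y_pool)

# Step 3: the quick heuristic -- means further apart than either spread.
gap = abs(scores_a.mean() - scores_b.mean())
print("practical check passed:", gap > max(scores_a.std(), scores_b.std()))

# Step 4: one-way ANOVA on the two score arrays.
f_stat, p_value = stats.f_oneway(scores_a, scores_b)
print(f"p-value = {p_value:.4f}  (< 0.05 suggests a real difference)")
```

Since both models are scored on the same subsets, a paired t-test (`scipy.stats.ttest_rel`) would arguably be a slightly more powerful choice than the one-way ANOVA, but either gives you a defensible p-value instead of a gut feeling about x.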
