Rate of convergence - comparison of supervised ML methods

I am working on a project with sparse labelled datasets, and am looking for references regarding the rate of convergence of different supervised ML techniques with respect to dataset size.

I know that, in general, boosting algorithms and other models available in Scikit-learn, such as SVMs, need less data than neural networks to reach a given level of performance. However, I cannot find any academic papers that explore, empirically or theoretically, the difference in how much data different methods need before they reach n% accuracy. I only know this from experience and various blog posts.
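For example, the kind of empirical comparison I have in mind is a learning-curve experiment like the sketch below (the synthetic dataset and the estimator settings are just placeholders, not a claim about what a fair comparison looks like):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a sparse labelled dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "SVM": SVC(),
    "Boosting": GradientBoostingClassifier(),
    "Neural net": MLPClassifier(max_iter=1000),
}

for name, model in models.items():
    # Cross-validated accuracy as a function of training-set size.
    sizes, _, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
    )
    print(name, dict(zip(sizes, test_scores.mean(axis=1).round(3))))
```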

For this question I am ignoring semi-supervised and weakly supervised methods. I am also ignoring transfer learning.

Tags: convergence, supervised-learning, reference-request, machine-learning

Category: Data Science


This is a pretty involved question, since it touches an active area of research. The first point is that the architecture (or number of parameters) usually matters before we can say anything to the effect of "we need $O(n^{k} \log(\frac{1}{\delta^{i}}))$ samples, with $i, k \ge 1$, to converge to a local optimum." Guaranteeing a given accuracy also depends on the data, so any such statement is likely specific to the dataset itself. You can therefore break your question down into the analysis of stochastic gradient descent in general and its analysis in the context of neural networks. Unfortunately, neither of these deals with specific datasets, so the part of your question about reaching accuracy $\ge n\%$ cannot be answered by results of this kind; to the best of my knowledge, there are no claims in that direction. For the former (analysis of SGD / specific neural architectures), however, there are some results and papers that I can link below.
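To give a sense of the shape such sample-complexity statements typically take, the classical VC-style generalization bound (the textbook form, not a result from any of the papers below) says that with probability at least $1-\delta$ over an i.i.d. sample of size $m$, every hypothesis $h$ in a class of VC dimension $d$ satisfies

$$ R(h) \;\le\; \hat{R}_m(h) + O\!\left(\sqrt{\frac{d\,\log(m/d) + \log(1/\delta)}{m}}\right), $$

where $R$ is the true risk and $\hat{R}_m$ the empirical risk. Note that this says nothing about reaching a specific accuracy on a specific dataset.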

  1. On the convergence rate of training recurrent neural networks -- while I haven't read this paper closely, the results appear to be specific to recurrent neural network architectures, and they analyze a regression-type error for certain parameterized forms of RNNs. Their results also concern the training error only. (https://arxiv.org/abs/1810.12065)
  2. On the rate of convergence of fully connected very deep neural network regression estimates -- This paper deals with a specific loss function not often found in classification ($\ell_2$ norm between the predictions and targets) but might serve as a useful reference. (https://arxiv.org/abs/1908.11133)
  3. Optimization Methods for Large-Scale Machine Learning -- a fairly comprehensive document on techniques and guarantees for optimization methods, though probably lacking results specific to neural nets; also a great reference point for earlier work on the subject (a representative rate from this reference is sketched just after this list). (https://arxiv.org/abs/1606.04838)
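For orientation, the flavour of guarantee treated in (3) for SGD with diminishing step sizes on a strongly convex objective $F$ is, stated loosely,

$$ \mathbb{E}\big[F(w_k) - F(w^*)\big] \;=\; O(1/k), $$

where $k$ is the iteration count. This is a statement about optimization error per iteration (or per sample processed), not about test accuracy on a given dataset.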

Note that this isn't supposed to be a comprehensive list by any means; in fact, it probably just scratches the surface.

There are also papers of a different flavour that use Langevin dynamics to analyze the descent trajectory of SGD and provide bounds for it. Here's one (https://arxiv.org/abs/1707.06618), but there are of course several more in that paper's references.
