When it comes to how much data a model M needs to accurately model a problem P, there are three factors to consider:
- What is the dimensionality of model M?
The power of neural networks is that they can model functions up to a very high number of dimensions, which means they can also model any function with fewer dimensions. Indeed, a quadratic function can always be approximated by a cubic function, whereas the converse isn't true (see the sketch after this list).
Support vector machines are an interesting type of model because the kernel trick allows them to capture arbitrarily high-dimensional relationships, which lets you tune your kernel to the amount of data at your disposal.
- What is the dimensionality of the space represented by the data at your disposal?
Not all data are created equal, and the amount of data is only a proxy for estimating the dimensionality of the problem space. Always consider the possibility that your data are an unrepresentative sample of that space. The amount and type of noise present in the data is also a huge factor, one that gets mitigated the more data you have.
- What is the true dimensionality of problem P?
Easily the most important factor in terms of how much data you need, and sadly, the only one we can never truly know. In a perfect world, the dimensionality of the model should be greater than or equal to the dimensionality of the problem, with equality being, of course, the ideal scenario.
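To make the "cubic can always estimate a quadratic" point concrete, here is a minimal sketch (my own illustration, not from the original text) using plain NumPy polynomial fits as stand-ins for models of different dimensionality. The target function, noise level, and degrees are all arbitrary choices for demonstration.

```python
# Sketch: a higher-capacity model (degree 3) can represent a lower-dimensional
# relationship (a degree-2 target); the extra capacity simply goes unused.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 2.0 * x**2 - x + 1.0 + rng.normal(scale=0.5, size=x.shape)  # quadratic target + noise

quadratic_fit = np.polyfit(x, y, deg=2)  # capacity matches the problem's true dimensionality
cubic_fit = np.polyfit(x, y, deg=3)      # higher capacity than the problem needs

print("degree-2 coefficients:", np.round(quadratic_fit, 3))
print("degree-3 coefficients:", np.round(cubic_fit, 3))  # leading (cubic) term ends up near zero
```

The cubic fit recovers essentially the same curve, with its leading coefficient hovering near zero; a quadratic, on the other hand, has no way to represent a genuinely cubic target. That asymmetry is the sense in which a model's dimensionality should be at least that of the problem, with equality being ideal.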
Do deep learning approaches really need more data?
Theoretically, they need the amount of data that appropriately matches the true dimensionality of the problem. Because they have a higher number of parameters, more data generally helps reduce the chance of overfitting. However, as previously stated, a model with a dimensionality much higher than the problem's is ill-suited to begin with.
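A rough sketch of that trade-off, again my own illustration rather than anything from the original text: a degree-9 polynomial stands in for an over-parameterised model fitted to a quadratic problem. With few samples it overfits the noise; as the sample size grows, the held-out error shrinks. All sizes, degrees, and noise levels here are arbitrary assumptions.

```python
# Sketch: an over-parameterised model (degree 9) fitted to a quadratic problem.
# More training data reduces overfitting, but the mismatch in capacity remains.
import numpy as np

rng = np.random.default_rng(1)

def validation_error(n_train: int, degree: int = 9) -> float:
    """Fit a polynomial of the given degree on n_train noisy quadratic samples
    and return its mean squared error on a clean held-out grid."""
    x_train = rng.uniform(-3, 3, n_train)
    y_train = x_train**2 + rng.normal(scale=0.5, size=n_train)
    coeffs = np.polyfit(x_train, y_train, deg=degree)

    x_val = np.linspace(-3, 3, 500)
    y_val = x_val**2
    return float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

for n in (15, 100, 1000):
    print(f"n_train={n:4d}  validation MSE={validation_error(n):.3f}")
```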
The power behind neural networks is that they have such a large solution space that the chance of converging to a satisfying spot is quite high.