Why are deep learning models unstable compared to machine learning models?

I would like to understand why deep learning models are so unstable. Suppose I use the same dataset to train a machine learning model multiple times (for example, logistic regression) and a deep learning model multiple times as well (for example, an LSTM). I then compute the mean and standard deviation of each model's performance across the runs. The standard deviation of the deep learning model is much higher than that of the machine learning model. Why is this so?

Does this have anything to do with weight initialization in deep learning approaches? If so, why does the model not always converge to the same point?

Topic weight-initialization cnn deep-learning logistic-regression machine-learning

Category Data Science

As you are looking for information from reputable resources, here are four:

  1. Tutorial on why machine learning produces different results each time: gives reasons why simple ML algorithms can perform better and be more stable than neural networks.
  2. Paper on industrial recognition tasks: with small amounts of training data, classical classifiers performed better than neural networks that were not pre-trained.
  3. Paper on heart failure: compares the performance of deep learning and logistic regression.
  4. Paper on landslide susceptibility assessment: the proposed DLNN model performed better than four benchmark models: a Multi Layer Perceptron Neural Network, a Support Vector Machine, a C4.5 Decision Tree and a Random Forest.

RESOURCE 1 - Tutorial:

This tutorial, "Different results each time in machine learning", compares simpler ML algorithms (linear regression and logistic regression) with neural networks and explains why results vary:

  1. Algorithm's sensitivity to specific data

    • High Variance: Algorithm is more sensitive to the specific data used during training.
    • Low Variance: Algorithm is less sensitive to the specific data used during training.

    It is said that simpler algorithms like linear regression and logistic regression have a lower variance than other types of algorithms.

    In your observation, the LSTM's high standard deviation suggests that it is more sensitive than a classic machine learning model (logistic regression).

    To reduce variance: tune the hyperparameters, increase the size of the training dataset, or switch to a simpler algorithm.

  2. Nature of algorithm

    • Deterministic machine learning algorithms: given the same dataset, the algorithm learns the same model every time. Examples are linear regression and logistic regression.

    • Stochastic algorithms (not deterministic): their behaviour incorporates elements of randomness. An example of an algorithm that uses randomness during learning is a neural network. It uses randomness in two ways: random initial weights (model coefficients) and a random shuffle of the samples each epoch.

    Neural networks (deep learning) are stochastic machine learning algorithms. The random initial weights let the model start learning from a different point in the search space on each run and allow the learning algorithm to “break symmetry” during learning. The random shuffle of examples during training ensures that each gradient estimate and weight update is slightly different.

    Solution: control the randomness used by the algorithm (for example, by fixing the random seed), so that each run gets the same sequence of random numbers.

  3. Evaluation Procedure

    The two most common evaluation procedures are a train-test split and k-fold cross-validation. These model evaluation procedures are stochastic: small decisions made in the process, such as which rows end up in which split, involve randomness.

  4. Observation Order

    The order in which the observations are exposed to the model affects internal decisions. Some algorithms, like neural networks, are especially susceptible to this.
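The random initial weights and per-epoch shuffle from point 2, together with the seed-fixing remedy, can be sketched as follows. This is a minimal illustration under assumptions, not the tutorial's code: the function `train_logistic_sgd`, the toy data and the seed values are all hypothetical, and a one-feature model trained with SGD stands in for a neural network.

```python
import math
import random


def train_logistic_sgd(data, seed, epochs=20, lr=0.1):
    """Train a one-feature logistic model with SGD (hypothetical helper).

    Both sources of randomness (initial weights and the per-epoch
    shuffle) draw from one generator controlled by `seed`.
    """
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # random initial weight
    b = rng.uniform(-1.0, 1.0)  # random initial bias
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # random sample order each epoch
        for x, y in samples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x  # gradient step on the log loss
            b -= lr * (p - y)
    return w, b


# Toy data, made up for illustration: (feature, label) pairs.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

run_a = train_logistic_sgd(data, seed=42)
run_b = train_logistic_sgd(data, seed=42)  # same seed as run_a
run_c = train_logistic_sgd(data, seed=7)   # different seed

print("same seed, same parameters:", run_a == run_b)
print("different seed, same parameters:", run_a == run_c)
```

With the seed fixed, the two runs are bit-for-bit identical; changing the seed changes the starting point and the sample order, so the learned parameters differ after the same number of epochs.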

RESOURCE 2 - Research paper on industrial recognition tasks

Comparison of the performance of innovative deep learning and classical methods of machine learning to solve industrial recognition tasks

Comparisons were made using the recognition rates achieved with five real data sets from industrial applications. The results showed that neural networks that were not pre-trained produced worse results than classical classifiers, given the small amounts of training data.

Deep neural networks require an extremely large amount of data for training in order to develop a good generalization capability and thus deliver good results.

Even if many training objects are available, deep neural networks remain susceptible to overfitting. Due to the large amount of data to be processed, the training also takes up a lot of time and can take days or even weeks [10]. Very high computing power is required for the effective use of these methods [22]. For this reason, such applications should be processed by the Graphics Processing Unit (GPU) instead of the Central Processing Unit (CPU) to save time [10]. In addition, many parameters have to be set and optimized, e.g. initial weights, activation function, learning rate, batch size [23]. According to Bengio [21], the quality of the results depends on the initial values. Since neural networks are black boxes, the decision-making process is not user-comprehensible.


RESOURCE 3 - Research paper on Heart Failure

Neural networks versus Logistic regression for 30 days all-cause readmission prediction

The paper addresses the question of deep learning versus logistic regression for readmission prediction in heart failure, and shows that logistic regression with regularization matches the best neural network performance.


RESOURCE 4 - Research paper on landslide susceptibility assessment

Comparing the prediction performance of a Deep Learning Neural Network model with conventional machine learning models in landslide susceptibility assessment

The learning ability of the DLNN model has been evaluated and compared with a Multi Layer Perceptron Neural Network, a Support Vector Machine, a C4.5 Decision Tree model and a Random Forest model using the training dataset, whereas the predictive performance of each model has been evaluated and compared using the validation datasets. To evaluate the learning and predictive capacity of each model, the classification accuracy, the sensitivity, the specificity and the area under the success and predictive rate curves (AUC) were calculated. Results showed that the proposed DLNN model had a higher performance than the four benchmark models. Although DLNN has seldom been used in landslide susceptibility assessments, the study highlights that a deep learning approach could be considered a satisfactory alternative for landslide susceptibility mapping.


  1. Different results each time in machine learning
  2. How to reduce model variance
  3. Randomness in machine learning
  4. Comparison of the performance of innovative deep learning and classical methods of machine learning to solve industrial recognition tasks
  5. Comparing the prediction performance of a Deep Learning Neural Network model with conventional machine learning models in landslide susceptibility assessment
  6. Neural networks versus Logistic regression for 30 days all-cause readmission prediction

Consider an ordinary linear regression (OLS, omitting index $i$ for convenience)

$$y=\beta_0+\beta_1X+u. $$

You can solve this using matrix algebra: $\hat{\beta}=(X'X)^{-1}X'y$. Given some data $y,X$, the resulting coefficients $\hat{\beta}$ will always be the same. There is no random element to it, as you simply minimize the sum of squared residuals.
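As a small check of this determinism, here is a sketch that solves the normal equations by hand for the one-regressor case. The function name `ols_fit` and the toy data are made up for illustration; with an intercept column, $X'X$ is a 2x2 matrix and can be inverted explicitly.

```python
def ols_fit(x, y):
    """Closed-form OLS for y = beta0 + beta1 * x (hypothetical helper).

    Computes (X'X)^{-1} X'y for X = [1, x] by writing out the 2x2
    inverse. There is no random number anywhere in the computation.
    """
    n = len(x)
    sx = sum(x)
    sy = sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx  # determinant of X'X
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1


# Toy data lying exactly on y = 1 + 2x, made up for illustration.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

first_fit = ols_fit(x, y)
second_fit = ols_fit(x, y)
print(first_fit, first_fit == second_fit)
```

Calling the function repeatedly on the same data returns the same coefficients, because nothing in the calculation depends on a starting point or a sample order.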

Now if you look at the definition of neural nets (see "Elements of Statistical Learning" ESL, Ch. 11), you see that there are "derived features" $Z$ ($\sigma$ is the activation function)

$$Z = \sigma(\alpha_0 + \alpha^{T}X) ,$$

which are used in a linear-like model

$$ T = \beta_0 + \beta^{T}Z,$$

where some output function $g(T)$ is used to finally transform the vector of outputs (e.g. softmax in classification, identity function for regression). See ESL, eq. 11.5, p. 392.

This process is much more demanding than an "ordinary" linear regression. When you skip the hidden units and plug $X$ into the second equation $T...$, you essentially have a linear-like model, very similar to the linear (OLS) model presented above.

However, once you invoke the first equation (derived features) $Z...$, you perform a kind of basis expansion to find a representation of the data which fits your target "well" (to "explain" it).

So once you have this "deep" aspect of learning (using "derived features"), you can end up in different situations, depending on the learning path, the chosen hyperparameters, the model specification, etc. It is hard to control or trace this process, let alone to understand the parameters, of which neural nets often have a large number (this is easy in linear regression, for instance).

So essentially, the problem you describe seems to stem from the way the features $Z$ are derived. Unless you fix all random elements during learning, you may end up with features that are "derived" in different ways, which has consequences for the final outcome $g(T)$.
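To make the role of the derived features concrete, the two equations above can be sketched as a forward pass through a single hidden layer. This is only an illustration under assumptions: the helper names `init_params` and `forward`, the layer sizes, the seeds and the input vector are hypothetical, the activation $\sigma$ is taken to be the sigmoid, and `alpha0` denotes the intercept of the first equation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_params(n_inputs, n_hidden, seed):
    """Draw random starting parameters (hypothetical helper)."""
    rng = random.Random(seed)
    alpha0 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    alpha = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
             for _ in range(n_hidden)]
    beta0 = rng.gauss(0.0, 1.0)
    beta = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    return alpha0, alpha, beta0, beta


def forward(x, params):
    """Compute the derived features Z and the linear-like output T."""
    alpha0, alpha, beta0, beta = params
    # Z_m = sigma(alpha0_m + alpha_m^T x): one derived feature per hidden unit
    z = [sigmoid(a0 + sum(a_j * x_j for a_j, x_j in zip(a, x)))
         for a0, a in zip(alpha0, alpha)]
    # T = beta0 + beta^T Z
    t = beta0 + sum(b_m * z_m for b_m, z_m in zip(beta, z))
    return z, t


x = [0.5, -1.2]  # one made-up observation with two inputs
z_seed1, t_seed1 = forward(x, init_params(2, 3, seed=1))
z_seed2, t_seed2 = forward(x, init_params(2, 3, seed=2))
z_again, t_again = forward(x, init_params(2, 3, seed=1))

print("derived features, seed 1:", z_seed1)
print("derived features, seed 2:", z_seed2)
```

The same observation is mapped to different derived features under different random starting parameters, and to identical ones when the seed is repeated. Training then proceeds from these different starting representations.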

May I suggest setting the random parameters (seeds) of your ML and DL models to constants wherever possible and then comparing the two models. You can also use GridSearch to find the best parameters for both models and see what changes you observe, provided the data you split into train/test sets remains the same.

When it comes to stability, DL solvers are less stable because of their relatively large number of hyperparameters and their random nature. For example, SGD depends heavily on the step size, which means it may or may not converge depending on the initialization. You can try different values of these parameters to find where the model fits well without overfitting or underfitting. From my experience, it helps to take a small random subsample of your data that can be trained on quickly, and use it as described above to find the best parameters for the ML and DL models.
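Keeping the train/test split identical across runs, as suggested above, could look like the sketch below. The helper `train_test_split_fixed`, its parameters and the stand-in dataset are hypothetical and written with the standard library only; a library splitter with a fixed seed would serve the same purpose.

```python
import random


def train_test_split_fixed(rows, test_fraction=0.25, seed=0):
    """Split rows into train/test with a fixed seed (hypothetical helper).

    The same seed always selects the same test rows, so every model
    is trained and evaluated on exactly the same data.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(20))  # stand-in for a real dataset
train_1, test_1 = train_test_split_fixed(rows, seed=123)
train_2, test_2 = train_test_split_fixed(rows, seed=123)

print("identical splits:", (train_1, test_1) == (train_2, test_2))
print("test rows:", test_1)
```

Any difference that remains between repeated runs is then attributable to the model and its optimizer rather than to the evaluation data.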

There could be many reasons for deep learning to have high variance in evaluation metric performance. Here are a couple of ideas:

  • Initialization: Deep learning models are initialized with random parameter values. Different starting parameters could result in different final parameter values, especially if there are few epochs. Traditional machine learning might not have random parameter initialization.

  • Optimization: Deep learning is often optimized with stochastic gradient descent (SGD), which on a non-convex loss is not guaranteed to converge to the same (or the global) optimum. Traditional machine learning algorithms can often be optimized with methods that have convergence guarantees; for example, regularized logistic regression has a convex objective.

  • Depth: A deep learning model is a stack of non-linearities, a complex system that could result in different solutions in different runs. Traditional machine learning might not have the same complexity.
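The initialization and optimization points can be illustrated with a toy experiment in the spirit of the question: run gradient descent from several random starting points and compare the spread of the results. This is a sketch under assumptions, not a model of any real network. The two one-dimensional objectives, the step size, the number of steps and the seed are all chosen for illustration: $(w-1)^2$ is convex, while $w^4 - 3w^2 + w$ has two separate local minima.

```python
import random
import statistics


def gradient_descent(grad, w0, lr=0.01, steps=2000):
    """Plain gradient descent on a one-dimensional objective."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def convex_grad(w):
    # derivative of (w - 1)^2, which has a single minimum at w = 1
    return 2.0 * (w - 1.0)


def nonconvex_grad(w):
    # derivative of w^4 - 3w^2 + w, which has two local minima
    return 4.0 * w ** 3 - 6.0 * w + 1.0


rng = random.Random(0)
starts = [rng.uniform(-2.0, 2.0) for _ in range(20)]  # random initializations

convex_finals = [gradient_descent(convex_grad, w0) for w0 in starts]
nonconvex_finals = [gradient_descent(nonconvex_grad, w0) for w0 in starts]

convex_std = statistics.pstdev(convex_finals)
nonconvex_std = statistics.pstdev(nonconvex_finals)

print("std of final w, convex objective:    ", convex_std)
print("std of final w, non-convex objective:", nonconvex_std)
```

On the convex objective every start ends at the same point, so the spread across runs is essentially zero. On the non-convex objective the final point depends on which side of the local maximum the run started, so the spread is large even though the procedure itself is identical.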

