Consider an ordinary linear regression (OLS, omitting the index $i$ for convenience)
$$y=\beta_0+\beta_1X+u. $$
You can solve this in closed form using matrix algebra: $\hat{\beta}=(X'X)^{-1}X'y$. Given some data $y, X$, the resulting coefficients $\hat{\beta}$ will always be the same; there is no random element to it, since you simply minimize the sum of squared residuals.
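To see this concretely, here is a minimal sketch (numpy and the simulated data are my own additions, purely for illustration): refitting OLS on the same data returns exactly the same coefficients.

```python
# OLS is a deterministic function of the data: refitting gives identical results.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)      # y = beta0 + beta1*x + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # (X'X)^{-1} X'y
beta_hat_again = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)
assert np.array_equal(beta_hat, beta_hat_again)          # identical every time
```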
Now if you look at the definition of neural nets (see "Elements of Statistical Learning" ESL, Ch. 11), you see that there are "derived features" $Z$ ($\sigma$ is the activation function)
$$Z = \sigma(\alpha_0 + \alpha^{T}X) ,$$
which are used in a linear-like model
$$ T = \beta_0 + \beta^{T}Z,$$
where an output function $g(T)$ is used to transform the vector of outputs (e.g. the softmax in classification, the identity function in regression). See ESL, eq. 11.5, p. 392.
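As a minimal sketch of this structure (numpy, a sigmoid $\sigma$, an identity $g$, and made-up dimensions are my own choices for illustration), the forward pass looks like this:

```python
# Forward pass of a single-hidden-layer net: derived features Z,
# linear combinations T, and an (identity) output function g.
import numpy as np

def forward(X, alpha0, alpha, beta0, beta):
    """X: (n, p); alpha: (p, M); beta: (M, K)."""
    Z = 1.0 / (1.0 + np.exp(-(alpha0 + X @ alpha)))  # Z = sigma(alpha0 + alpha'X)
    T = beta0 + Z @ beta                             # T = beta0 + beta'Z
    return T                                         # g(T) = T for regression

rng = np.random.default_rng(0)
n, p, M, K = 50, 3, 5, 1
X = rng.normal(size=(n, p))
alpha0, alpha = rng.normal(size=M), rng.normal(size=(p, M))
beta0, beta = rng.normal(size=K), rng.normal(size=(M, K))

print(forward(X, alpha0, alpha, beta0, beta).shape)   # (50, 1)
```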
This process is computationally much more demanding than "ordinary" linear regression. If you skip the hidden units and plug $X$ directly into the second equation for $T$, you essentially have a linear(-like) model, very similar to the OLS model presented above.
However, once you invoke the first equation (the derived features $Z$), you perform a kind of basis expansion to find a representation of the data that fits your target "well" (that "explains" it).
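To illustrate the contrast, here is a numpy sketch with simulated data (my own example; the hidden-layer weights are fixed at random rather than learned, just to show the basis-expansion idea): fitting the second equation on $X$ directly is plain OLS, while fitting it on derived features $Z$ is a regression on transformed inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)            # nonlinear target

# "Skip the hidden units": plain linear fit on X
X_lin = np.column_stack([np.ones(n), x])
beta_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]

# "Derived features": Z = sigma(alpha0 + alpha*x) with M hidden units
M = 10
alpha0, alpha = rng.normal(size=M), rng.normal(size=M)
Z = 1.0 / (1.0 + np.exp(-(alpha0 + np.outer(x, alpha))))
X_nn = np.column_stack([np.ones(n), Z])
beta_nn = np.linalg.lstsq(X_nn, y, rcond=None)[0]

print("linear fit RSS:         ", np.sum((y - X_lin @ beta_lin) ** 2))
print("basis-expansion fit RSS:", np.sum((y - X_nn @ beta_nn) ** 2))
```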
So once you have this "deep" aspect of learning (using derived features), you can end up in different situations, contingent on the learning path, the chosen hyperparameters, the model specification, etc. It is hard to control or trace this process, let alone to interpret the (in the case of neural nets often very large number of) parameters, which is easy in linear regression, for instance.
So essentially, the problem you describe seems to stem from the way the features $Z$ are derived. Unless you fully fix all random elements during learning (e.g. by setting seeds for weight initialization and data shuffling), you may end up with different "derived" features, which has consequences for the final outcome $g(T)$.
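As a rough illustration (using scikit-learn's `MLPRegressor`, which is my choice of tool, not something from your setup): two fits of the same network on the same data with different random initializations will typically give different predictions, while fixing the seed makes the fit reproducible.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# Same data and architecture, different random initializations / seeds
fit_a = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=1).fit(X, y)
fit_b = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=2).fit(X, y)
fit_c = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=1).fit(X, y)

x_new = np.array([[0.5]])
print(fit_a.predict(x_new), fit_b.predict(x_new))                # typically differ
print(np.allclose(fit_a.predict(x_new), fit_c.predict(x_new)))   # True: same seed
```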