The error can have different forms depending on the application. For example, for a simple regression problem we often use the sum of squared deviations between the actual output $y_n$ for the input $x_n$ and the predicted output $\hat{y}(x_n)$. The total loss $J_\text{Gauss}$, also known as the Gaussian loss, is then given as the sum of the squared errors over all observations.
$$J_\text{Gauss}= \sum_{n=1}^N\left[y_n-\hat{y}(x_n)\right]^2$$
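To make this concrete, here is a minimal NumPy sketch of the Gaussian loss; the function name `gaussian_loss` and the example values are illustrative choices, not part of any particular library.

```python
import numpy as np

def gaussian_loss(y, y_hat):
    """Sum of squared deviations between observed and predicted outputs."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sum((y - y_hat) ** 2)

# Example: three observations and their predictions
print(gaussian_loss([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # 0.01 + 0.01 + 0.04 ≈ 0.06
```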
If we use absolute values instead of squares, we obtain the Laplacian loss function $J_\text{Laplace}$, which is given by
$$J_\text{Laplace}=\sum_{n=1}^N\left|y_n-\hat{y}(x_n)\right|$$
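The corresponding sketch only swaps the square for an absolute value (again, `laplacian_loss` is just an illustrative name):

```python
import numpy as np

def laplacian_loss(y, y_hat):
    """Sum of absolute deviations between observed and predicted outputs."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sum(np.abs(y - y_hat))

print(laplacian_loss([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # 0.1 + 0.1 + 0.2 ≈ 0.4
```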
If we instead want to compare two probability distributions $p(x)$ and $q(x)$, we can use an asymmetric distance measure called the Kullback-Leibler divergence
$$
D_\text{KL}(P\parallel Q)=\int_{-\infty}^{\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx.
$$
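As a rough numerical illustration, the integral can be approximated by a Riemann sum on a grid. The sketch below assumes SciPy's `norm` for the example densities; the function name `kl_divergence_grid` and the grid settings are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import norm

def kl_divergence_grid(p_pdf, q_pdf, grid):
    """Approximate D_KL(P || Q) = integral of p(x) ln(p(x)/q(x)) dx by a Riemann sum."""
    p, q = p_pdf(grid), q_pdf(grid)
    dx = grid[1] - grid[0]
    return np.sum(p * np.log(p / q)) * dx

# Example: two unit-variance Gaussians with means 0 and 1
x = np.linspace(-10.0, 10.0, 10001)
print(kl_divergence_grid(norm(0, 1).pdf, norm(1, 1).pdf, x))  # analytic value is 0.5
```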
For binary classification, we can use the hinge loss
$$J_\text{hinge}=\sum_{n=1}^N\max \{0, 1- t_n \hat{y}(x_n)\},$$
in which $t_n=+1$ if observation $x_n$ is from the positive class and $t_n=-1$ if it is from the negative class.
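A small sketch of the hinge loss, where `t` holds the labels $\pm 1$ and `y_hat` the raw model scores (the function name `hinge_loss` and the example numbers are illustrative):

```python
import numpy as np

def hinge_loss(t, y_hat):
    """Sum of hinge losses; t contains labels +1/-1, y_hat the raw model outputs."""
    t, y_hat = np.asarray(t), np.asarray(y_hat)
    return np.sum(np.maximum(0.0, 1.0 - t * y_hat))

# A correctly classified point with margin >= 1 contributes 0; the rest are penalised linearly
print(hinge_loss([+1, -1, +1], [2.0, -0.5, -0.3]))  # 0 + 0.5 + 1.3 = 1.8
```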
For support vector regression, the $\varepsilon$-insensitive loss $J_\varepsilon$ is used. Summed over all observations, as with the losses above, it is defined by the following equation.
$$J_\varepsilon=\sum_{n=1}^N\max\{0,|y_n-\hat{y}(x_n)|-\varepsilon\}$$
This loss acts like a threshold: a deviation is only counted as an error if it is larger than $\varepsilon$.
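The thresholding behaviour is easy to see in a sketch (the function name `epsilon_insensitive_loss` and the chosen $\varepsilon$ are illustrative):

```python
import numpy as np

def epsilon_insensitive_loss(y, y_hat, eps=0.1):
    """Sum of epsilon-insensitive losses: residuals smaller than eps are ignored."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sum(np.maximum(0.0, np.abs(y - y_hat) - eps))

# The first residual (0.05) lies inside the epsilon-tube and is not penalised
print(epsilon_insensitive_loss([1.0, 2.0], [1.05, 2.3], eps=0.1))  # 0 + 0.2 ≈ 0.2
```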
As you can see, there are several measures of error for comparing the predicted output with the observed output (see the Wikipedia article on loss functions for classification for further examples).