The general idea is that there is a true joint distribution $p(x,y)$ which can be factorized as
$$p(x,y) = p(y|x)p(x)$$
$p(x)$ basically measures the probability of observing the input $x$, while $p(y|x)$ represents the true relationship between $x$ and $y$.
We would like to learn a model distribution $p_\theta(x, y)$ that factorizes in the same way
$$p_\theta(x,y) = p_\theta(y|x)p(x)$$
$p(x)$ can be considered the same in both cases, so what we are trying to find is $p_\theta(y|x)$ as an approximation to $p(y|x)$ by optimizing the model parameters $\theta$.
In the case of regression, we make the assumption that there is a function $f_\theta(x)$ and the observed $y$ is distributed normally around $f_\theta(x)$ with noise variance $\sigma^2$, i.e.
$$p_\theta(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y-f_\theta(x))^2}{2\sigma^2} \right]$$
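As a quick illustration of this noise model, here is a minimal sketch (assuming numpy and a hypothetical linear model $f_\theta(x) = \theta x$, which stands in for any parametric regressor) that evaluates $\log p_\theta(y|x)$ for a single observation:

```python
import numpy as np

def f_theta(x, theta):
    # Hypothetical linear model; stands in for any parametric regressor.
    return theta * x

def log_p_y_given_x(y, x, theta, sigma=1.0):
    # Log of the Gaussian density p_theta(y|x), centered at f_theta(x)
    # with noise variance sigma^2.
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - f_theta(x, theta)) ** 2 / (2 * sigma**2)

# Example: one observation (x, y) = (1.0, 2.1) under theta = 2.0, sigma = 0.5.
print(log_p_y_given_x(y=2.1, x=1.0, theta=2.0, sigma=0.5))
```

Note the squared error already sitting in the exponent; the rest of the derivation makes precise why only this term matters when fitting $\theta$.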
Since $p_\theta(x,y)$ should approximate $p(x,y)$ as closely as possible, we want to minimize the KL-divergence between the two
$$
\begin{align}
&D_{KL} \left[ p(x,y) \,\|\, p_\theta(x,y) \right] \\
&= \int \int p(x,y) \log\left[ \frac{p(x,y)}{p_\theta(x,y)} \right] dx dy \\
&= \int \int p(x,y) \log p(x,y) dx dy - \int \int p(x,y) \log p_\theta(x,y) dx dy
\end{align}
$$
Note that the first integral does not depend on $\theta$. The second term can be simplified further using the factorization
$$\int \int p(x,y) \log p_\theta(x,y) dx dy = \int \int p(x,y) \log p_\theta(y|x) dx dy + \int \int p(x,y) \log p(x) dx dy$$
Here the second term does not depend on $\theta$. Using all this, we can write the optimization problem for the KL-divergence as
$$
\begin{align}
\theta^* &= \text{argmin}_\theta D_{KL} \left[ p(x,y) \,\|\, p_\theta(x,y) \right] \\
&= \text{argmin}_\theta \left[ - \int \int p(x,y) \log p_\theta(y|x) dx dy \right]
\end{align}
$$
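To see this equivalence concretely, here is a small numerical sketch (assuming numpy, a toy discrete joint $p(x,y)$ on a $2 \times 2$ grid, and a hypothetical one-parameter model family, none of which are part of the derivation above) checking that the KL-divergence and the expected negative log-likelihood are minimized by the same $\theta$; in the discrete case the integrals become sums:

```python
import numpy as np

# Toy discrete joint p(x, y) on a 2x2 grid (rows: x, columns: y).
p = np.array([[0.2, 0.1],
              [0.3, 0.4]])
p_x = p.sum(axis=1, keepdims=True)  # marginal p(x), shared by truth and model

def model_conditional(theta):
    # Hypothetical one-parameter family: p_theta(y=1 | x) = theta for every x.
    return np.column_stack([(1 - theta) * np.ones(2), theta * np.ones(2)])

def kl_divergence(theta):
    p_model = model_conditional(theta) * p_x  # p_theta(x, y) = p_theta(y|x) p(x)
    return np.sum(p * np.log(p / p_model))

def expected_nll(theta):
    # -E_p[log p_theta(y|x)], the only theta-dependent part of the KL-divergence.
    return -np.sum(p * np.log(model_conditional(theta)))

thetas = np.linspace(0.01, 0.99, 99)
kl_values = [kl_divergence(t) for t in thetas]
nll_values = [expected_nll(t) for t in thetas]

# Both objectives are minimized by the same theta (here theta* = p(y=1) = 0.5).
assert np.argmin(kl_values) == np.argmin(nll_values)
print(thetas[np.argmin(kl_values)])
```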
Plugging in the model assumption about $p_\theta(y|x)$ this gives
$$
\begin{align}
\theta^* &= \text{argmin}_\theta \left[ \int \int p(x,y) \frac{(y-f_\theta(x))^2}{2\sigma^2} dx dy + \log\sqrt{2\pi\sigma^2}\right] \\
&= \text{argmin}_\theta \left[ \int \int p(x,y) (y-f_\theta(x))^2 dx dy \right]
\end{align}
$$
where we use that neither the constant shift nor the constant scale changes the solution of the $\text{argmin}_\theta$.
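This invariance is easy to check numerically. Below is a small sketch (assuming numpy, synthetic data from a hypothetical linear model, and a simple grid search over $\theta$) showing that the objective with shift and scale and the plain squared-error objective pick the same $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)   # synthetic data, true slope 3

sigma = 0.5
thetas = np.linspace(0.0, 6.0, 601)             # grid search over the slope

# Objective with the constant scale and shift kept in place ...
full = np.array([np.mean((y - t * x) ** 2) / (2 * sigma**2)
                 + np.log(np.sqrt(2 * np.pi * sigma**2)) for t in thetas])
# ... and the plain squared-error objective.
plain = np.array([np.mean((y - t * x) ** 2) for t in thetas])

# Both are minimized at the same theta on the grid.
assert np.argmin(full) == np.argmin(plain)
print(thetas[np.argmin(plain)])
```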
This way we have derived the mean squared error as the loss for regression from probabilities and their factorizations. The final step is to approximate the expectation / integral by the empirical risk, i.e.
$$
\int \int p(x,y) (y-f_\theta(x))^2 dx dy \approx \frac{1}{N} \sum_{n=1}^N (y_n - f_\theta(x_n))^2
$$
which reproduces the mean squared error as a loss.
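As a final sketch (assuming numpy and a toy $p(x,y)$ chosen so that the exact expectation is known in closed form, neither of which is part of the derivation), the empirical mean over a sample approaches the integral as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.5   # fixed model parameter for the illustration
sigma = 0.5   # noise standard deviation of the toy p(y|x)

def empirical_mse(n):
    # Draw n samples (x_n, y_n) from the toy p(x, y) and average the squared error.
    x = rng.normal(size=n)                         # p(x) = N(0, 1)
    y = 3.0 * x + rng.normal(scale=sigma, size=n)  # p(y|x) = N(3x, sigma^2)
    return np.mean((y - theta * x) ** 2)

# Exact value of the integral for this toy distribution:
# E[(y - theta*x)^2] = (3 - theta)^2 * E[x^2] + sigma^2
exact = (3.0 - theta) ** 2 * 1.0 + sigma**2

for n in (10, 1_000, 100_000):
    print(n, empirical_mse(n), "vs exact", exact)
```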