The general idea is that there is a true joint distribution $p(x,y)$ which can be factorized as
$$p(x,y) = p(y|x)p(x)$$
$p(x)$ basically measures the probability of observing the input $x$, while $p(y|x)$ represents the true relationship between $x$ and $y$.
We would like to learn a model distribution $p_\theta(x, y)$ that factorizes in the same way
$$p_\theta(x,y) = p_\theta(y|x)p(x)$$
$p(x)$ can be considered the same in both cases, so what we are trying to find is $p_\theta(y|x)$ as an approximation to $p(y|x)$ by optimizing the model parameters $\theta$.
In the case of regression, we make the assumption that there is a function $f_\theta(x)$ and the observed $y$ is distributed normally around $f_\theta(x)$ with noise variance $\sigma^2$, i.e.
$$p_\theta(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y-f_\theta(x))^2}{2\sigma^2} \right]$$
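As a quick illustration of this noise model, here is a minimal sketch (assuming numpy and a hypothetical linear model $f_\theta(x) = \theta x$, which stands in for any parametric regressor) that evaluates $\log p_\theta(y|x)$ for a single observation:

```python
import numpy as np

def f_theta(x, theta):
    # Hypothetical linear model; stands in for any parametric regressor.
    return theta * x

def log_p_y_given_x(y, x, theta, sigma=1.0):
    # Log of the Gaussian density p_theta(y|x), centered at f_theta(x)
    # with noise variance sigma^2.
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - f_theta(x, theta)) ** 2 / (2 * sigma**2)

# Example: one observation (x, y) = (1.0, 2.1) under theta = 2.0, sigma = 0.5.
print(log_p_y_given_x(y=2.1, x=1.0, theta=2.0, sigma=0.5))
```

Note the squared error already sitting in the exponent; the rest of the derivation makes precise why only this term matters when fitting $\theta$.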
Since $p_\theta(x,y)$ should approximate $p(x,y)$ as closely as possible, we want to minimize the KL-divergence between the two
$$
\begin{align}
&D_{KL} \left[ p(x,y) \,\|\, p_\theta(x,y) \right] \\
&= \int \int p(x,y) \log\left[ \frac{p(x,y)}{p_\theta(x,y)} \right] dx dy \\
&= \int \int p(x,y) \log p(x,y) dx dy - \int \int p(x,y) \log p_\theta(x,y) dx dy
\end{align}
$$
Note that the first integral does not depend on $\theta$. The second term can be simplified further using the factorization
$$\int \int p(x,y) \log p_\theta(x,y) dx dy = \int \int p(x,y) \log p_\theta(y|x) dx dy + \int \int p(x,y) \log p(x) dx dy$$
Here the second term does not depend on $\theta$. Using all this, we can write the optimization problem for the KL-divergence as
$$
\begin{align}
\theta^* &= \text{argmin}_\theta D_{KL} \left[ p(x,y) \,\|\, p_\theta(x,y) \right] \\
&= \text{argmin}_\theta \left[ - \int \int p(x,y) \log p_\theta(y|x) dx dy \right]
\end{align}
$$
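To see this equivalence concretely, here is a small numerical sketch (assuming numpy, a toy discrete joint $p(x,y)$ on a $2 \times 2$ grid, and a hypothetical one-parameter model family, none of which are part of the derivation above) checking that the KL-divergence and the expected negative log-likelihood are minimized by the same $\theta$; in the discrete case the integrals become sums:

```python
import numpy as np

# Toy discrete joint p(x, y) on a 2x2 grid (rows: x, columns: y).
p = np.array([[0.2, 0.1],
              [0.3, 0.4]])
p_x = p.sum(axis=1, keepdims=True)  # marginal p(x), shared by truth and model

def model_conditional(theta):
    # Hypothetical one-parameter family: p_theta(y=1 | x) = theta for every x.
    return np.column_stack([(1 - theta) * np.ones(2), theta * np.ones(2)])

def kl_divergence(theta):
    p_model = model_conditional(theta) * p_x  # p_theta(x, y) = p_theta(y|x) p(x)
    return np.sum(p * np.log(p / p_model))

def expected_nll(theta):
    # -E_p[log p_theta(y|x)], the only theta-dependent part of the KL-divergence.
    return -np.sum(p * np.log(model_conditional(theta)))

thetas = np.linspace(0.01, 0.99, 99)
kl_values = [kl_divergence(t) for t in thetas]
nll_values = [expected_nll(t) for t in thetas]

# Both objectives are minimized by the same theta (here theta* = p(y=1) = 0.5).
assert np.argmin(kl_values) == np.argmin(nll_values)
print(thetas[np.argmin(kl_values)])
```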
Plugging in the model assumption about $p_\theta(y|x)$ this gives
$$
\begin{align}
\theta^* &= \text{argmin}_\theta \left[ \int \int p(x,y) \frac{(y-f_\theta(x))^2}{2\sigma^2} dx dy + \log\sqrt{2\pi\sigma^2}\right] \\
&= \text{argmin}_\theta \left[ \int \int p(x,y) (y-f_\theta(x))^2 dx dy \right]
\end{align}
$$
where we use that neither the constant shift nor the constant scale changes the solution of the $\text{argmin}_\theta$.
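This invariance is easy to check numerically. Below is a small sketch (assuming numpy, synthetic data from a hypothetical linear model, and a simple grid search over $\theta$) showing that the objective with shift and scale and the plain squared-error objective pick the same $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)   # synthetic data, true slope 3

sigma = 0.5
thetas = np.linspace(0.0, 6.0, 601)             # grid search over the slope

# Objective with the constant scale and shift kept in place ...
full = np.array([np.mean((y - t * x) ** 2) / (2 * sigma**2)
                 + np.log(np.sqrt(2 * np.pi * sigma**2)) for t in thetas])
# ... and the plain squared-error objective.
plain = np.array([np.mean((y - t * x) ** 2) for t in thetas])

# Both are minimized at the same theta on the grid.
assert np.argmin(full) == np.argmin(plain)
print(thetas[np.argmin(plain)])
```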
This way we have derived the mean squared error as the loss for regression from probabilities and their factorizations. The final step is to approximate the expectation / integral by the empirical risk, i.e.
$$
\int \int p(x,y) (y-f_\theta(x))^2 dx dy \approx \frac{1}{N} \sum_{n=1}^N (y_n - f_\theta(x_n))^2
$$
which reproduces the mean squared error as a loss.
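As a final sketch (assuming numpy and a toy $p(x,y)$ chosen so that the exact expectation is known in closed form, neither of which is part of the derivation), the empirical mean over a sample approaches the integral as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.5   # fixed model parameter for the illustration
sigma = 0.5   # noise standard deviation of the toy p(y|x)

def empirical_mse(n):
    # Draw n samples (x_n, y_n) from the toy p(x, y) and average the squared error.
    x = rng.normal(size=n)                         # p(x) = N(0, 1)
    y = 3.0 * x + rng.normal(scale=sigma, size=n)  # p(y|x) = N(3x, sigma^2)
    return np.mean((y - theta * x) ** 2)

# Exact value of the integral for this toy distribution:
# E[(y - theta*x)^2] = (3 - theta)^2 * E[x^2] + sigma^2
exact = (3.0 - theta) ** 2 * 1.0 + sigma**2

for n in (10, 1_000, 100_000):
    print(n, empirical_mse(n), "vs exact", exact)
```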