Linear Regression

I'm starting to learn machine learning, and one of the first things that comes up is linear regression. Basically, we have a bunch of data points and we want to fit a line such that the errors between the line and the actual data points are minimized.

I understand the theory and why we would use, for example, something like gradient search methods to find the global minimum. What I don't understand is why we have to use something as complicated as gradient search or the least mean squares (LMS) algorithm when, in introductory statistics, we were given a very simple formula (still lots of calculating, but thankfully no partial derivatives) to find this line:

$$y=mx+b$$

For $m$, we have:

$$m = \frac{S_{xy}}{S_{xx}}= \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}$$

For $b$, we have:

$$ b= \frac{1}{n}\left(\sum y_i-m\sum x_i\right)$$

The $i$ subscripts refer to the individual $x$ and $y$ values in the data set, $\bar{x}$ and $\bar{y}$ are their averages, and $n$ is the number of data points in the set.
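To make this concrete, here is a quick sketch of those two formulas in NumPy; the data points are made up purely for illustration:

```python
import numpy as np

# Made-up data, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# m = S_xy / S_xx
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# b = (1/n) * (sum(y_i) - m * sum(x_i))
b = (y.sum() - m * x.sum()) / len(x)

print(m, b)  # slope and intercept of the fitted line
```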

Am I missing something? Or am I trying to use one method in statistics that isn't allowed in the machine learning world?



The example that you have given is a very trivial case of linear regression, but it can still lead to computational problems. Imagine we have sensor inputs and want to estimate the temperature from them, and that the sensor gives us $100$ measurements per second. What will happen to your parameter estimates (especially the sums) when the system has been running for a week?

As you can see, the problem with this procedure is that we have to carry out all the computations again at each time step. For the general linear regression model

$$y_i = \boldsymbol{w}^T\boldsymbol{x}_i + \varepsilon_i$$

we can determine the least squares estimate $\hat{\boldsymbol{w}}$ for the weights $\boldsymbol{w}$ by the following formula

$$\hat{\boldsymbol{w}}=\left[\boldsymbol{X}^T\boldsymbol{X} \right]^{-1}\boldsymbol{X}^T\boldsymbol{y},$$

in which $\boldsymbol{X}$ is the data matrix and $\boldsymbol{y}$ is the vector of the outputs of all observations. As you can see, with an increasing amount of data the data matrix $\boldsymbol{X}$ gets an increasing number of rows. Additionally, we have to invert the product $\boldsymbol{X}^T\boldsymbol{X}$, which becomes more difficult if we have a very high-dimensional input vector $\boldsymbol{x}$. The direct calculation is very nice from the theoretical point of view but very demanding from the practical point of view.
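As a rough sketch (with made-up data), the direct estimate can be written in a few lines of NumPy; note that forming the explicit inverse is exactly the part that becomes expensive and numerically delicate as $\boldsymbol{X}$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 1000 observations, 3 features plus a column of ones for the intercept.
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 3))])
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + 0.1 * rng.normal(size=1000)

# Direct least squares estimate: w_hat = (X^T X)^{-1} X^T y.
# (In practice np.linalg.solve or lstsq is preferable to an explicit inverse.)
w_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(w_hat)
```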

To avoid these huge calculations, recursive methods like the least mean squares (LMS) algorithm or the recursive least squares (RLS) algorithm were invented: they update the estimate with each new observation instead of recomputing everything from scratch.
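As a rough sketch of the idea behind such recursive methods (not the exact algorithms referenced above), an LMS-style update touches only the newest sample, so old data never has to be revisited; the step size `eta` here is an arbitrary choice:

```python
import numpy as np

def lms_step(w, x_new, y_new, eta=0.01):
    """One LMS-style update: adjust the weights using only the newest sample."""
    error = y_new - w @ x_new       # prediction error on the new observation
    return w + eta * error * x_new  # gradient step on this single sample
```

Each update costs a fixed amount of work per new sample, no matter how long the sensor has been running.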

To see the difference between the direct and the recursive calculation, look at the running mean:

$$\bar{x}_N = \dfrac{1}{N}\sum_{n=1}^N x_n$$

$$\text{direct: } \bar{x}_{N+1} = \dfrac{1}{N+1}\sum_{n=1}^{N+1} x_n = \dfrac{N}{N+1}\left[\dfrac{1}{N}\sum_{n=1}^{N+1} x_n\right] = \dfrac{N}{N+1}\left[\bar{x}_N + \dfrac{1}{N}x_{N+1}\right]$$

$$\implies \text{recursive: } \bar{x}_{N+1} = \dfrac{N}{N+1}\bar{x}_N + \dfrac{1}{N+1}x_{N+1}.$$

If you have already calculated the previous mean, it is clear that the mean at the next time step is much easier to obtain with the recursive formula.
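A tiny sketch of that comparison (with arbitrary data) makes the point: the running mean is updated in constant time per new value, yet ends up at the same result as the full sum:

```python
import numpy as np

xs = np.random.default_rng(1).normal(size=1000)  # arbitrary stream of values

running_mean = 0.0
for N, x_new in enumerate(xs):  # N values have been averaged so far
    running_mean = (N / (N + 1)) * running_mean + x_new / (N + 1)

print(running_mean, xs.mean())  # both agree up to rounding
```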


The link posted by Simon in a comment is really informative. However, for linear regression, minimizing the sum of squared residuals (OLS) still plays a big role, especially in causal modelling. As far as I know, many programs (R, Stata) still use the $(X'X)^{-1}X'y$ approach to solve linear regression problems.

Advanced algorithms often have a hard time beating OLS, so I think OLS is often a good (and fast) benchmark. Also, the coefficients (betas) are easy to understand, so you can learn a lot about the problem you have at hand. This can be extremely helpful.
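As a hedged example of what such a benchmark might look like (statsmodels is one common choice; the data and column names are invented here):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + 0.1 * rng.normal(size=200)

X = sm.add_constant(df[["x1", "x2"]])  # add the intercept column
result = sm.OLS(df["y"], X).fit()
print(result.params)  # the betas are directly interpretable
```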

If you have the time, have a look at "Introductory Econometrics: A Modern Approach" by J. Wooldridge. Also "An Introduction to Statistical Learning with Applications in R" is a good source.

If you are brave, go for Russell Davidson and James G. MacKinnon: Econometric Theory and Methods.

I see that gradient descent is important. However, having a sound understanding of basic statistics will pay off massively in the long run, in my opinion.
