Linear vs non-linear regression (basic beginner)

So my first doubt: in linear regression, we try to fit a straight line or a curve to a given training set. Now, I believe that whenever the number of features (independent variables) increases, the number of parameters also increases, so computing these parameters becomes computationally expensive. I guess that's the reason we move to non-linear regression!? Is my understanding right?

And my next doubt is about overfitting in linear regression, where we say that the model memorizes. What I understand is that the parameters are adjusted in such a way that the model can only predict well for the given training inputs and will give bad results for new data, because we haven't fitted a generalized line or curve; instead we are fitting a line or curve that passes through all the training outputs. Is my understanding correct?

Topic non-parametric linear-regression neural-network machine-learning

Category Data Science


Linear regression is extremely easy to compute. The model is defined in matrix form:

$$ y = X \beta + u. $$

Here $X$ contains the explanatory variables and $u$ is the statistical error term. The "columns" of the matrix $X$ are the variables or features, and the "rows" are the observations. As long as there are more rows (observations) than columns (variables/features), it is extremely easy to compute the coefficients $\beta$. You just need to solve the equation:

$$ \hat{\beta} = (X'X)^{-1} X'y.$$

Even with relatively large datasets, computers can solve this equation very fast. Alternatively, you can use gradient descent to find the optimum.

How is the optimum obtained (i.e. how is the vector $\beta$ found)? The criterion is simply to find the vector $\beta$ that minimises the sum of squared errors $\sum_i u_i^2$ for the given $X$.
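For concreteness, here is a minimal NumPy sketch of both routes mentioned above: the closed-form solution $(X'X)^{-1}X'y$ and a plain gradient-descent loop minimising the sum of squared errors. The synthetic data, learning rate, and iteration count are arbitrary illustrative choices.

```python
# Minimal sketch of OLS: closed-form normal equations vs. plain gradient descent.
# Data, learning rate and iteration count are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)           # y = X beta + u

# Closed form: beta_hat = (X'X)^{-1} X'y (solve is numerically safer than inverting)
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the sum of squared errors sum(u^2), with u = y - X beta
beta_gd = np.zeros(X.shape[1])
lr = 0.01
for _ in range(2000):
    grad = -2 * X.T @ (y - X @ beta_gd)   # gradient of the sum of squared errors
    beta_gd -= lr * grad / n              # averaged step for numerical stability

print(beta_closed)  # both should be close to (1.0, 2.0, -0.5)
print(beta_gd)
```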

You can address non-linearity (in $X$) to some extent using linear regression. Say there is one explanatory variable $x$; we can write a linear model like:

$$ y = \beta_0 + \beta_1 x + u. $$

This basically is a linear function with intercept $\beta_0$ and slope $\beta_1$. Now suppose that the true data are not perfectly linear but show some non-linear pattern. In this case you can add additional terms to your model to try to capture this non-linearity. Any transformation of $x$ is allowed as long as the model stays linear in the coefficients $\beta$; for example, you could write models like:

$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + u. $$

$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_n x^n + u. $$

$$ y = \beta_0 + \beta_1 \log(x) + u. $$

Using such transformations of $x$, it is often possible to capture quite a lot of "non-linearity" in $x$.
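As a small sketch of the idea, the quadratic model above can be fitted by ordinary least squares on a design matrix with columns $1, x, x^2$; the data-generating process below is made up purely for the example.

```python
# Sketch: linear regression on transformed features captures a non-linear pattern in x.
# The quadratic data-generating process is invented purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, size=200)
y = 2.0 + 1.5 * x - 0.4 * x**2 + rng.normal(scale=0.2, size=200)

# Design matrix [1, x, x^2]: non-linear in x, but still linear in the coefficients
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)              # roughly (2.0, 1.5, -0.4)

# The same mechanism works for other transformations, e.g. a log model [1, log(x)]
X_log = np.column_stack([np.ones_like(x), np.log(x)])
beta_log, *_ = np.linalg.lstsq(X_log, y, rcond=None)
print(beta_log)
```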

You can even combine several such terms into one model in order to capture quite wild non-linearity in $x$; these models are called "generalised additive models" (GAMs).
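A full GAM involves smooth basis functions and a roughness penalty, but the additive idea can be roughly sketched with a basis of simple "hinge" terms fitted by least squares. The knot placement and data below are arbitrary choices for illustration; this is not a real GAM implementation.

```python
# Rough sketch of the additive-basis idea behind GAMs: sum several simple pieces of x.
# Knot placement and data are arbitrary; this is not a full GAM (no smoothing penalty).
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=300))
y = np.sin(x) + rng.normal(scale=0.1, size=300)       # strongly non-linear target

knots = np.linspace(1, 9, 8)
# Basis: intercept, x, and one hinge max(x - knot, 0) per knot -> piecewise-linear fit
B = np.column_stack([np.ones_like(x), x] + [np.maximum(x - k, 0) for k in knots])
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
y_fit = B @ beta_hat                                   # bends to follow sin(x)
```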

Regarding your second question: linear models can also capture a high degree of complexity in the data, especially when many $x$ variables are included in the regression. So there is no ex-ante reason to believe that linear models will in general perform worse than other models (such as tree-based models or neural networks). Also, the principle of parsimony says that we should prefer a less complex model over a more complex one if the more complex model does not yield better results.

If the information in your training set comes from the same "data generating process" (DGP) as the data you want to predict, you may be able to produce proper predictions. Suppose you employ an extremely fancy model to learn a DGP, but when it comes to prediction, the data come from an entirely different DGP. In this case you will likely produce very bad predictions, regardless of the model you employed in the first place. So first of all, make sure that you choose a model which is able to describe/learn/predict the outcome well on the training data (without being unnecessarily complex). Second, make sure that the data you want to predict come from the same DGP as the data you used for training. In essence, your model can only predict what it has learned before.

Have a look at "An Introduction to Statistical Learning"; it's a great book that is also available online as a PDF and comes with R labs.


Whether you use a linear model or a non-linear model depends on whether the relationship between the features and the target is (approximately) linear, not on the number of features. If a linear function of the features can describe the dependent variable well, you would use linear regression irrespective of the number of features. Since real-world relationships are rarely perfectly linear and linear regression does not provide accurate results on such data, non-linear regression is used.

And to answer your second question: yes, you are correct. Overfitting occurs when your model has very high accuracy on the training data but considerably lower accuracy on validation/test data. It happens because the parameters are adjusted entirely to fit your training set, so the fitted curve is not generalised; hence, on new data, the model performs badly. The sketch below illustrates this pattern.
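A quick way to see this is to compare training and test error for a modest and a very flexible fit; the polynomial degrees and sample sizes below are arbitrary illustrative choices.

```python
# Sketch of overfitting: a very flexible polynomial fits the training points almost
# perfectly but predicts held-out data poorly. Degrees and sample sizes are illustrative.
import numpy as np

rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(-1, 1, size=12))
y_train = x_train**2 + rng.normal(scale=0.1, size=12)
x_test = np.sort(rng.uniform(-1, 1, size=100))
y_test = x_test**2 + rng.normal(scale=0.1, size=100)

for degree in (2, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    # The high-degree fit drives training error toward zero while test error grows.
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```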
