Linear regression is extremely easy to compute. The model is defined in matrix form:
$$ y = X \beta + u. $$
Here $X$ is the matrix of explanatory variables and $u$ is the statistical error term. The columns of $X$ are the variables or features and the rows are the observations. As long as $X$ has full column rank (which in particular requires at least as many rows/observations as columns/variables/features), the coefficients $\beta$ are extremely easy to compute: the least-squares estimate is given in closed form by
$$ \hat{\beta} = (X'X)^{-1} X'y.$$
Even with relatively large datasets, computers can evaluate this expression very fast. Alternatively, you can use gradient descent to find the optimum.
How is the optimum obtained (i.e. how is the vector $\beta$ found)? The criterion simply is to find the vector $\beta$ which minimises the sum of squared errors $\sum_i u_i^2$ for given $X$ and $y$.
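As a minimal sketch of the closed-form solution (using NumPy and simulated data, so all names and numbers here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 100 observations, 3 features plus an intercept column
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'y
# (solving the normal equations is numerically preferable to inverting X'X explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```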
You can address non-linearity (in $X$) to some extent using linear regression. Say there is one explanatory variable $x$; we can write a linear model like:
$$ y = \beta_0 + \beta_1 x + u. $$
This basically is a linear function with intercept $\beta_0$ and slope $\beta_1$. Now suppose that the true data are not perfectly linear but show some non-linear pattern. In this case you can add additional terms to your model to try to capture this non-linearity. Any transformations of $x$ are allowed as long as the model stays linear in the coefficients $\beta$, e.g. you could write models like:
$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + u. $$
$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_n x^n + u. $$
$$ y = \beta_0 + \beta_1 \log(x) + u. $$
Using such transformations of $x$, it is often possible to capture quite a lot of "non-linearity" in $x$.
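A short sketch of the quadratic model above, again with simulated data (the ground-truth coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Non-linear ground truth: y = 1 + 2x - 0.8x^2 + noise
x = rng.uniform(0, 3, size=200)
y = 1 + 2 * x - 0.8 * x**2 + rng.normal(scale=0.2, size=200)

# The model stays linear in the coefficients: the design matrix columns are 1, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # roughly [1, 2, -0.8]
```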
You can even combine several smooth functions of the predictors into one model in order to capture quite wild non-linearity in $x$; these models are called "generalised additive models" (GAM).
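One way to sketch the GAM idea in Python is to expand the predictor into a spline basis and fit a single linear regression on the expanded features (this uses scikit-learn's `SplineTransformer`; dedicated GAM libraries exist as well, and the data below is simulated purely for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Wildly non-linear ground truth in a single predictor
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x[:, 0]) + 0.1 * x[:, 0] + rng.normal(scale=0.2, size=300)

# Spline basis expansion + linear regression approximates a smooth function of x,
# which is the building block of a generalised additive model
gam_like = make_pipeline(SplineTransformer(n_knots=8, degree=3), LinearRegression())
gam_like.fit(x, y)
print(gam_like.predict(np.array([[2.5], [7.5]])))
```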
Regarding your second question: linear models, too, can capture a high degree of complexity in the data, especially when many $x$ are included in the regression. So there is no ex-ante reason to believe that linear models will in general perform worse than other models (like tree-based models or neural nets). Also, the principle of parsimony says that we should prefer a less complex model over a more complex one if the more complex model does not yield better results than the less complex one.
If the information in your training set comes from the same "data generating process" (DGP) as the information in the data you want to predict, you may be able to produce proper predictions. Suppose you employ an extremely fancy model to learn a DGP, but when it comes to predictions, the data for which you have to predict some outcome come from an entirely different DGP. In this case, you will likely produce very bad predictions, regardless of what model you employed in the first place. So first of all, make sure that you choose a model which is able to describe/learn/predict the outcome well in the training data (without being unnecessarily complex). Second, make sure that the data you want to predict come from the same DGP as the data you used for training. In essence, your model can only predict what it has learned before.
Have a look at "Introduction to Statistical Learning"; it's a great book, which is also available online as a PDF. It also comes with R labs.