Why is the transpose of the independent feature matrix necessary in linear regression?

I can follow classical linear regression steps:

$Xw=y$

$X^{-1}Xw=X^{-1}y$

$Iw=X^{-1}y$

$w=X^{-1}y$

However, when implementing this in Python, I see that instead of simply using

w = inv(X).dot(y)

they apply

w = inv(X.T.dot(X)).dot(X.T).dot(y)

What is the explanation for the transposes and the extra multiplications here? I'm confused...



OLS (linear regression) would be solved by:

$$ (X^TX)^{-1} X^Ty = \hat{\beta}. $$

Assuming a matrix $X$ (with the first column set to 1 in every row to model the intercept) and a vector $y$ in Python, you can solve for $\hat{\beta}$ with:

np.linalg.inv(X.T @ X) @ X.T @ y

Your procedure is not correct: you are inverting $X$ directly, and that inverse does not exist in the general case. You have to use the transpose first.

$$y=X\beta$$ $$X^Ty=X^TX\beta$$ $$ \hat{\beta} = [X^TX]^{-1}X^Ty$$

You might be asking why we multiplied by the transpose. In general your data matrix $X$ is not square, hence it is not invertible. To obtain a square matrix we multiply by $X^T$. If $X$ has $n$ rows (observations) and $m$ columns (features, inputs), then the transpose has $m$ rows and $n$ columns. Hence $X^TX$ is square with dimension $m \times m$. In most situations we can invert this product.
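
For concreteness, here is a minimal sketch (with made-up numbers, so the exact values are only illustrative) showing that $X$ itself cannot be inverted when it is not square, while $X^TX$ can:

import numpy as np

# Made-up data: n = 5 observations, m = 2 columns (intercept + one feature)
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.linalg.inv(X) would raise LinAlgError here, because X is 5x2 (not square).
# X.T @ X is 2x2 and invertible, so the normal equations work:
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # roughly [intercept, slope]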


That answer comes from the set of weights $w$ (or $\theta$) that analytically minimizes the cost function, which is defined as

$J(\theta) = (X\theta - y)^T (X\theta - y)$

(See here for more info)

Expanding the cost function we get

$J(\theta) = \theta^TX^TX\theta - 2 y^TX\theta + y^Ty$

(Note that all three terms come out to be scalars)
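
As a quick numerical sanity check (my own sketch with random arrays, not part of the original derivation), the expanded form agrees with the original quadratic:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
theta = rng.normal(size=3)

# Original form of the cost: (X theta - y)^T (X theta - y)
J1 = (X @ theta - y) @ (X @ theta - y)
# Expanded form: theta^T X^T X theta - 2 y^T X theta + y^T y
J2 = theta @ X.T @ X @ theta - 2 * y @ X @ theta + y @ y
print(np.isclose(J1, J2))  # True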

Before we take the next step, we need to brush up on derivatives of matrices

Some common matrix derivative formulas for reference:

$\frac{\partial (AX)}{\partial X} = A^T \ ;\ \frac{\partial (X^TA)}{\partial X} = A \ ;\ \frac{\partial (X^TX)}{\partial X} = 2X \ ;\ \frac{\partial (X^TAX)}{\partial X} = AX+A^TX$
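
As an illustration (a finite-difference check I added, assuming the $X$ in the last rule is a vector $x$ and $A$ is a square matrix), the rule $\frac{\partial (x^TAx)}{\partial x} = Ax+A^Tx$ can be verified numerically:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)
eps = 1e-6

analytic = A @ x + A.T @ x  # claimed gradient of x^T A x
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])  # central differences along each coordinate
print(np.allclose(analytic, numeric))  # True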

Using those rules we can take the derivative of the cost function with respect to $\theta$

$\frac{\partial J(\theta)}{\partial \theta} = 2 X^TX \theta - 2 X^Ty$

Setting this to 0 we get

$2 X^TX\theta - 2 X^T y = 0$

solving for $\theta$ we get

$\theta = (X^TX)^{-1} X^T y$

Written in code that's

w = inv(X.T.dot(X)).dot(X.T).dot(y)
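
To double-check the result (a small sketch with synthetic data; np.linalg.lstsq is only used here as an independent reference), the closed-form weights make the gradient vanish and match NumPy's least-squares solver:

import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # intercept + 2 features
y = rng.normal(size=20)

w = inv(X.T.dot(X)).dot(X.T).dot(y)

grad = 2 * X.T @ X @ w - 2 * X.T @ y  # should be ~0 at the minimum
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]  # reference solution
print(np.allclose(grad, 0), np.allclose(w, w_ref))  # True True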

Hope this helps.
