Why is the transpose of the independent feature matrix necessary in linear regression?

I can follow classical linear regression steps:

$Xw=y$

$X^{-1}Xw=X^{-1}y$

$Iw=X^{-1}y$

$w=X^{-1}y$

However, when implementing this in Python, I see that instead of simply using

w = inv(X).dot(y)

they apply

w = inv(X.T.dot(X)).dot(X.T).dot(y)

What is the explanation for the transposes and the extra multiplications here? I'm confused...



OLS (linear regression) would be solved by:

$$ (X^TX)^{-1} X^Ty = \hat{\beta}. $$

Assuming a matrix $X$ (with the first column set to 1 in every row to model the intercept) and a vector $y$ in Python, you can solve for $\hat{\beta}$ with:

np.linalg.inv(X.T @ X) @ X.T @ y

Your procedure is not correct: you are inverting $X$ directly, and that inverse does not exist in the general case. You have to use the transpose first.

$$y=X\beta$$ $$X^Ty=X^TX\beta$$ $$ \hat{\beta} = [X^TX]^{-1}X^Ty$$

You might be asking why we multiplied by the transpose. In general your data matrix $X$ is not square, hence it is not invertible. To obtain a square matrix we multiply by $X^T$. If $X$ has $n$ rows (observations) and $m$ columns (features, inputs), then the transpose has $m$ rows and $n$ columns. Hence $X^TX$ is square with dimension $m \times m$. In most situations we can invert this product.
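
For concreteness, here is a minimal sketch (with made-up numbers, so the exact values are only illustrative) showing that $X$ itself cannot be inverted when it is not square, while $X^TX$ can:

import numpy as np

# Made-up data: n = 5 observations, m = 2 columns (intercept + one feature)
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.linalg.inv(X) would raise LinAlgError here, because X is 5x2 (not square).
# X.T @ X is 2x2 and invertible, so the normal equations work:
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # roughly [intercept, slope]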


That answer comes from the set of weights $w$ (or $\theta$) that analytically minimizes the cost function, which is defined as

$J(\theta) = (X\theta - y)^T (X\theta - y)$

(See here for more info)

Expanding the cost function we get

$J(\theta) = \theta^TX^TX\theta - 2 y^TX\theta + y^Ty$

(Note that all three terms come out to be scalars)
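
As a quick numerical sanity check (my own sketch with random arrays, not part of the original derivation), the expanded form agrees with the original quadratic:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
theta = rng.normal(size=3)

# Original form of the cost: (X theta - y)^T (X theta - y)
J1 = (X @ theta - y) @ (X @ theta - y)
# Expanded form: theta^T X^T X theta - 2 y^T X theta + y^T y
J2 = theta @ X.T @ X @ theta - 2 * y @ X @ theta + y @ y
print(np.isclose(J1, J2))  # True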

Before we take the next step, we need to brush up on derivatives of matrices

Some common matrix derivative formulas for reference:

$\frac{\partial (AX)}{\partial X} = A^T \ ;\ \frac{\partial (X^TA)}{\partial X} = A \ ;\ \frac{\partial (X^TX)}{\partial X} = 2X \ ;\ \frac{\partial (X^TAX)}{\partial X} = AX+A^TX$
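
As an illustration (a finite-difference check I added, assuming the $X$ in the last rule is a vector $x$ and $A$ is a square matrix), the rule $\frac{\partial (x^TAx)}{\partial x} = Ax+A^Tx$ can be verified numerically:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)
eps = 1e-6

analytic = A @ x + A.T @ x  # claimed gradient of x^T A x
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])  # central differences along each coordinate
print(np.allclose(analytic, numeric))  # True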

Using those rules we can take the derivative of the cost function with respect to $\theta$

$\frac{\partial J(\theta)}{\partial \theta} = 2 X^TX \theta - 2 X^Ty$

Setting this to 0 we get

$2 X^TX\theta - 2 X^T y = 0$

solving for $\theta$ we get

$\theta = (X^TX)^{-1} X^T y$

Written in code that's

w = inv(X.T.dot(X)).dot(X.T).dot(y)
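
To double-check the result (a small sketch with synthetic data; np.linalg.lstsq is only used here as an independent reference), the closed-form weights make the gradient vanish and match NumPy's least-squares solver:

import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # intercept + 2 features
y = rng.normal(size=20)

w = inv(X.T.dot(X)).dot(X.T).dot(y)

grad = 2 * X.T @ X @ w - 2 * X.T @ y  # should be ~0 at the minimum
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]  # reference solution
print(np.allclose(grad, 0), np.allclose(w, w_ref))  # True True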

Hope this helps.
