Kernel trick derivation: why is this simplification incorrect?

I am trying to derive the kernel trick from linear regression, but I make a mistake at the very end that leads to an expression that is too simple.

Basic linear regression

For a basic linear regression (with no regularisation, for simplicity), let ${\bf x_i}$ be row vectors of data of length $p$ (for instance, each coordinate $x_{i,j}$ might be the expression level of gene $j$ in patient $i$).

Let the corresponding data matrix $X = \begin{pmatrix} {\bf x_1} \\ {\bf x_2} \\ ... \\ {\bf x_n} \end{pmatrix}$ be the $n \times p$ matrix of data (e.g. $n$ is the number of patients and $p$ the number of genes), and let ${\bf w}$ be the vector of weights (e.g. weight $w_j$ is the contribution of the expression of gene $j$ to a patient's body mass index).

In the case of regular regression, we would try to estimate the body mass index of patient $i$ as:

$h({\bf x_i}) = {\bf x_i} {\bf w}$

Fitting this to the measurement vector ${\bf y}^T = (y_1, ..., y_n)$ amounts to minimizing the sum of squared errors:

$\hat{\bf w} = \underset{\bf w}{\arg \min} ({\bf y} - X {\bf w })^{T} ({\bf y} - X {\bf w})$ which results in:

${\hat{\bf w}} = (X^T X)^{-1} \cdot X^T{\bf y}$
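As a sanity check, a small numpy experiment (the data here is purely synthetic and illustrative) reproduces this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                          # e.g. 50 patients, 3 genes
X = rng.normal(size=(n, p))           # data matrix, one row per patient
y = X @ np.array([1.5, -2.0, 0.3]) + 0.1 * rng.normal(size=n)

# Closed-form normal-equations solution: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with a generic least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))    # True
```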

Kernel regression

Now, we replace the $n \times p$ matrix $X$ of basic features with an $n \times P$ matrix $\Phi = \begin{pmatrix} {\bf \varphi(x_1)} \\ {\bf \varphi(x_2)} \\ ... \\ {\bf \varphi(x_n)} \end{pmatrix}$ of feature maps, where, again, each row ${\bf \varphi(x_i)}$ corresponds to a single patient, but now has a different, possibly infinite, length $P$ instead of $p$.

Accordingly, our weight vector ${\bf w}$ is now a $P$-vector instead of a $p$-vector.

Hence, our estimate of BMI is $h({\bf \varphi({\bf x_i})}) = {\bf \varphi(x_i)} \cdot {\bf w} = {\bf \varphi(x_i)} \cdot (\Phi^T \Phi)^{-1} \cdot \Phi^T{\bf y}$.

Now we change the representation of our pseudo-inverse matrix: $(\Phi^T \Phi)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T)^{-1}$.

This results in a different representation of $h({\bf \varphi({\bf x_i})})$:

$h({\bf \varphi({\bf x_i})}) = {\bf \varphi(x_i)} \cdot \Phi^T (\Phi \Phi^T)^{-1} {\bf y}$

My question

What I don't quite understand here is the following.

Denote matrix $\Phi \Phi^T = K$.

Consider the expression $h({\bf \varphi({\bf x_i})}) = {\bf \varphi(x_i)} \cdot \Phi^T (\Phi \Phi^T)^{-1} {\bf y}$.

The first multiplier, ${\bf \varphi(x_i)}\,\Phi^T$, is simply the $i$-th row of the matrix $K$.

The other multiplier is $(\Phi \Phi^T)^{-1} {\bf y} = K^{-1} {\bf y}$.

What happens if we multiply the $i$-th row of the matrix $K$ by $K^{-1}$? We should get the one-hot vector $(0, 0, ..., \underbrace{1}_{i\text{-th position}}, ..., 0)$, right?

So I assume $h({\bf \varphi({\bf x_i})}) = (0, 0, ..., \underbrace{1}_{i\text{-th position}}, ..., 0) \cdot {\bf y} = y_i$.
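Indeed, a quick numerical check (a small synthetic dataset with a hypothetical degree-2 feature map, using numpy; everything here is illustrative and assumes $K$ is invertible) seems to confirm this:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# A hypothetical explicit feature map (degree-2 monomials), P = 6 > n = 5
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

Phi = np.vstack([phi(x) for x in X])   # n x P feature matrix
K = Phi @ Phi.T                        # n x n kernel matrix, invertible here

# Unregularised "kernel" prediction evaluated at the training points
h_train = K @ np.linalg.inv(K) @ y
print(np.allclose(h_train, y))         # True: the model just returns y_i
```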

I must be wrong somewhere, but I cannot find mistakes in my reasoning.

The correct answer does not allow for such a simplification and instead takes the more general form $h({\bf \varphi({\bf x_i})}) = \sum \limits_{j=1}^n \alpha_j \, {\bf \varphi(x_i)} {\bf \varphi(x_j)}^T$, where the vector ${\bf \alpha} = (\Phi \Phi^T)^{-1} {\bf y}$.



After additional research I found out that my general direction was right, but that a more elaborate approach was required.

Indeed, if $P \gg n$, then $\Phi^T \Phi$ is a rank-deficient $P \times P$ matrix (its rank is at most $n$) with many zero eigenvalues, so its inverse $(\Phi^T \Phi)^{-1}$ does not exist.

However, to overcome this problem we can add a Tikhonov regularisation term, replacing $\Phi^T \Phi$ with $(\lambda I + \Phi^T \Phi)$, which shifts every eigenvalue up by $\lambda$ and thus makes the matrix invertible.

After that we apply the Sherman–Morrison–Woodbury formula to invert the regularised matrix and express the solution of kernel ridge regression through the kernel matrix $K$:

$h(\varphi({\bf x_i})) = \varphi({\bf x_i}) \cdot (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \cdot {\bf y} = \varphi({\bf x_i}) \cdot \Phi^T (\Phi \Phi^T + \lambda I)^{-1} \cdot {\bf y} = $

$= K_i \cdot (K + \lambda I)^{-1} \cdot {\bf y}$, where $K_i = {\bf \varphi(x_i)}\,\Phi^T$ is the $i$-th row of $K$.
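As a side note, the key identity used above, $(\lambda I + \Phi^T \Phi)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T + \lambda I)^{-1}$, can also be obtained without the full Woodbury formula, via what is often called the push-through identity:

$\Phi^T (\Phi \Phi^T + \lambda I_n) = \Phi^T \Phi \,\Phi^T + \lambda \Phi^T = (\Phi^T \Phi + \lambda I_P)\, \Phi^T$,

and since both regularised matrices are invertible for $\lambda > 0$, multiplying by $(\Phi^T \Phi + \lambda I_P)^{-1}$ on the left and by $(\Phi \Phi^T + \lambda I_n)^{-1}$ on the right gives exactly that identity (note that the two identity matrices have different sizes, $P \times P$ and $n \times n$).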

So, yes, if not for the regularisation term, this formula would simply have selected $y_i$ from the output vector as the estimate of $h(\varphi({\bf x_i}))$. However, adding regularisation to the kernel matrix changes this.

Anyway, we end up with an explicit solution of kernel ridge regression in terms of the kernel matrix $K$ and the regression outputs ${\bf y}$, and we never have to touch the (possibly infinite-dimensional) feature-space vectors $\varphi({\bf x_i})$ explicitly.
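To make this concrete, here is a minimal sketch of kernel ridge regression prediction in numpy; the RBF kernel, the `rbf_kernel` helper, and the data are all hypothetical and only meant to show that predictions require nothing but kernel evaluations:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(2)
X_train = rng.normal(size=(30, 4))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=30)
X_new = rng.normal(size=(5, 4))

lam = 0.1
K = rbf_kernel(X_train, X_train)                            # n x n kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)  # (K + lambda*I)^{-1} y

# Prediction uses only kernel evaluations, never the feature map itself
y_pred = rbf_kernel(X_new, X_train) @ alpha
print(y_pred.shape)   # (5,)
```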




"Now we change the representation of our pseudo-inverse matrix: $(\Phi^T\Phi)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T)^{-1}$."

The above cannot be done, since the two representations rely on different assumptions about column rank and row rank, and in general they are not equivalent.

In particular, when $A$ has linearly independent columns (and thus matrix $A^{*}A$ is invertible), $A^{+}$ can be computed as $A^{+}=\left(A^{*}A\right)^{-1}A^{*}$.

This particular pseudoinverse constitutes a left inverse, since, in this case, $A^{+}A=I$.

When $A$ has linearly independent rows (matrix $AA^{*}$ is invertible), $A^{+}$ can be computed as $A^{+}=A^{*}\left(AA^{*}\right)^{-1}$.

This is a right inverse, as $AA^{+}=I$.

The left and right pseudo-inverses coincide only in special cases (e.g. when the matrix $A$ itself is invertible).
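A small numerical illustration of the two cases, assuming real matrices (so $A^{*} = A^{T}$) and numpy; the matrices are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Tall matrix: linearly independent columns -> left inverse exists
A_tall = rng.normal(size=(6, 3))
left = np.linalg.solve(A_tall.T @ A_tall, A_tall.T)       # (A^T A)^{-1} A^T
print(np.allclose(left, np.linalg.pinv(A_tall)))          # True
print(np.allclose(left @ A_tall, np.eye(3)))              # A^+ A = I

# Wide matrix: linearly independent rows -> right inverse exists
A_wide = rng.normal(size=(3, 6))
right = A_wide.T @ np.linalg.inv(A_wide @ A_wide.T)       # A^T (A A^T)^{-1}
print(np.allclose(right, np.linalg.pinv(A_wide)))         # True
print(np.allclose(A_wide @ right, np.eye(3)))             # A A^+ = I
```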

References: Definition of Moore-Penrose Pseudo-Inverse
