Why is the GP posterior variance the worst-case error? (exact proof)
I am reading this paper, which explains in detail the connection between Gaussian processes and kernel methods. I am impressed by the insightful exposition, but I am stuck on one part of Chapter 3, Section 3.4, "Error Estimates: Posterior Variance and Worst-Case Error".
In this section (p. 24) the authors suggest that Proposition 3.8 can be proved using Lemma 3.9.
Proposition 3.8. Let $\bar{k}$ be the posterior covariance function (17) with noise variance $\sigma^2$. Then, for any $x\in\mathcal{X}$ with $x\neq x_i$, $i=1,\dots,n$, we have \begin{align}\sqrt{\bar{k}(x,x)+\sigma^2}=\sup_{g \in \mathcal{H}_k^{\sigma}:\,\|g\|_{\mathcal{H}_k^{\sigma}}\leq 1}\Bigg(g(x)-\sum_{i=1}^n w_i^{\sigma}(x)g(x_i)\Bigg).\label{eqn37}\tag{37}\end{align}
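For reference, my understanding of the notation (from the earlier parts of Section 3 of the paper; please correct me if I am misreading it) is that $k^\sigma$ is the kernel $k$ with a noise term added on the diagonal and $w^\sigma(x)$ collects the posterior-mean weights:
\begin{align*}
k^\sigma(x,x') := k(x,x') + \sigma^2\,\delta_{xx'},\qquad
w^\sigma(x) := (k_{XX}+\sigma^2 I_n)^{-1}k_X(x)\in\mathbb{R}^n,
\end{align*}
where $k_{XX}:=(k(x_i,x_j))_{i,j=1}^n$ and $k_X(x):=(k(x,x_1),\dots,k(x,x_n))^\top$, so that the posterior mean is $\bar{m}(x)=\sum_{i=1}^n w_i^\sigma(x)y_i$ and the posterior variance is $\bar{k}(x,x)=k(x,x)-k_X(x)^\top(k_{XX}+\sigma^2 I_n)^{-1}k_X(x)$.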
and
Lemma 3.9. Let $k$ be a kernel on $\mathcal{X}$ and $\mathcal{H}_k$ be its RKHS. Then for any $m\in\mathbb{N}$, $x_1,\dots,x_m\in\mathcal{X}$ and $c_1,\dots,c_m\in\mathbb{R}$, we have \begin{align}\Bigg|\Bigg|\sum_{i=1}^m c_i k(\cdot,x_i)\Bigg|\Bigg|_{\mathcal{H}_k}=\sup_{f \in \mathcal{H}_k:\,\|f\|_{\mathcal{H}_k}\leq 1}\sum_{i=1}^m c_i f(x_i).\label{eqn38}\tag{38}\end{align}
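For what it's worth, I do follow Lemma 3.9 itself: writing $h:=\sum_{i=1}^m c_i k(\cdot,x_i)$, the reproducing property and the Cauchy–Schwarz inequality give, for any $f$ with $\|f\|_{\mathcal{H}_k}\leq 1$,
\begin{align*}
\sum_{i=1}^m c_i f(x_i)=\Big\langle f,\sum_{i=1}^m c_i k(\cdot,x_i)\Big\rangle_{\mathcal{H}_k}=\langle f,h\rangle_{\mathcal{H}_k}\leq\|f\|_{\mathcal{H}_k}\,\|h\|_{\mathcal{H}_k}\leq\|h\|_{\mathcal{H}_k},
\end{align*}
with equality attained at $f=h/\|h\|_{\mathcal{H}_k}$ (assuming $h\neq 0$), so the supremum over the unit ball equals $\|h\|_{\mathcal{H}_k}$.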
The paper then says \begin{align}\Bigg|\Bigg|k^\sigma(\cdot,x)-\sum_{i=1}^n w_i^\sigma(x) k^\sigma(\cdot, x_i)\Bigg|\Bigg|_{\mathcal{H}_k^{\sigma}}=\sup_{g \in \mathcal{H}_k^{\sigma}:\,\|g\|_{\mathcal{H}_k^{\sigma}}\leq 1}\Bigg(g(x)-\sum_{i=1}^n w_i^{\sigma}(x)g(x_i)\Bigg).\label{eqn39}\tag{39}\end{align}
However, I think the RHS can be at most $$\sup_{g \in \mathcal{H}_k^{\sigma}:\,\|g\|_{\mathcal{H}_k^{\sigma}}\leq 1}\Bigg(k^\sigma(\cdot,x)-\sum_{i=1}^n w_i^{\sigma}(x)g(x_i)\Bigg),$$ since the only part of (\ref{eqn39}) that depends on the $x_i$'s is the $k^\sigma(\cdot,x_i)$ term.
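In case it helps, here is a quick numerical sanity check (using my assumed definitions of $k^\sigma$, $w^\sigma(x)$ and $\bar{k}$ above, which may not match the paper exactly) confirming that the left-hand side of (\ref{eqn39}), i.e. the $\mathcal{H}_k^\sigma$-norm, does equal $\sqrt{\bar{k}(x,x)+\sigma^2}$; it is only the step from that norm to the supremum on the right-hand side that I cannot reproduce:

```python
import numpy as np

# Sanity check under my reading of the definitions:
#   k^sigma(x, x') = k(x, x') + sigma^2 * 1[x == x']
#   w^sigma(x)     = (K + sigma^2 I)^{-1} k_X(x),  with K_ij = k(x_i, x_j)
#   kbar(x, x)     = k(x, x) - k_X(x)^T (K + sigma^2 I)^{-1} k_X(x)
# The RKHS norm on the LHS of (39) is expanded via kernel evaluations.

rng = np.random.default_rng(0)

def k(a, b, ell=0.7):
    """Gaussian (RBF) kernel."""
    return np.exp(-(a - b) ** 2 / (2 * ell ** 2))

sigma2 = 0.3
X = rng.uniform(-2, 2, size=5)   # training inputs x_1, ..., x_n
x = 0.123                        # test input, distinct from all x_i

K = k(X[:, None], X[None, :])    # n x n Gram matrix k_XX
kx = k(X, x)                     # vector (k(x_1, x), ..., k(x_n, x))
A = K + sigma2 * np.eye(len(X))  # k_XX + sigma^2 I_n

w = np.linalg.solve(A, kx)       # posterior-mean weights w_i^sigma(x)
kbar_xx = k(x, x) - kx @ w       # posterior variance kbar(x, x)

# Squared H_{k^sigma}-norm of  k^sigma(., x) - sum_i w_i k^sigma(., x_i),
# expanded with <k^sigma(., a), k^sigma(., b)> = k^sigma(a, b).
# Since x != x_i:  k^sigma(x, x) = k(x, x) + sigma2,
#                  k^sigma(x, x_i) = k(x, x_i),
#                  k^sigma(x_i, x_j) = K_ij + sigma2 * delta_ij = A_ij.
norm_sq = (k(x, x) + sigma2) - 2 * w @ kx + w @ A @ w

print(np.sqrt(kbar_xx + sigma2))  # sqrt(posterior variance + noise)
print(np.sqrt(norm_sq))           # RKHS norm of the LHS; the two agree
```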
Could anyone help me understand the authors' reasoning here?
Topic gaussian-process deep-learning statistics machine-learning
Category Data Science