A trick used in a Rademacher-complexity-related theorem

I am currently working through the proof of Theorem 3.1 in the book "Foundations of Machine Learning" (page 35, first edition), and there is a key trick used in the proof (equations 3.10 and 3.11):

$$\begin{align*} E_{S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} g(z'_i)-g(z_i)\right]=E_{\boldsymbol{\sigma},S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} \sigma_i(g(z'_i)-g(z_i))\right] \\ \text{where } {\Bbb P}(\sigma_i=1)={\Bbb P}(\sigma_i=-1)=\frac{1}{2} \end{align*}$$

It is also shown on page 8 of this lecture PDF: https://cs.nyu.edu/~mohri/mls/lecture_3.pdf

This is possible because $z_i$ and $z'_i$ can be swapped. My question is: why can we swap $z_i$ and $z'_i$? The book says this is possible because "we are taking the expectation over all possible $S$ and $S'$, it will not affect the overall expectation." Unfortunately, I don't understand the meaning or intuition of this part.

Thank you very much for reading the question and for your time!



This requires a hell of a derivation, but I liked the question :)

My question is, why can we swap $z_i$ and $z'_i$?

The key insight is that the notation $S \sim \mathcal{D}^m$ is equivalent to $Z_1 \sim \mathcal{D}, \cdots, Z_m \sim \mathcal{D}$, all independent. This translates to $$E_{S,S'}[.]=E_{Z_1,\cdots,Z_m,Z'_1,\cdots,Z'_m}[.]$$ which is unchanged by reordering the $Z$s, since they are i.i.d.
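To see this i.i.d. exchangeability concretely, here is a small sanity-check sketch (a toy example, not from the book): for $Z, Z'$ i.i.d. over a hypothetical finite distribution, $E[h(Z,Z')] = E[h(Z',Z)]$ for any function $h$, even an asymmetric one, because the joint distribution of $(Z, Z')$ is symmetric.

```python
import itertools

# Hypothetical toy distribution for Z (and independently Z').
values = [0.0, 1.0, 2.0]
probs = [0.2, 0.5, 0.3]

def h(z, zp):
    # Deliberately asymmetric in (z, zp).
    return (zp - z) ** 3 + zp

def expectation(f):
    # E_{Z,Z'}[f(Z, Z')] for Z, Z' i.i.d., by exact enumeration.
    return sum(p * q * f(z, zp)
               for (z, p), (zp, q) in itertools.product(zip(values, probs),
                                                        zip(values, probs)))

# Swapping the roles of Z and Z' leaves the expectation unchanged --
# this is exactly the exchangeability used in the proof.
e_direct = expectation(h)
e_swapped = expectation(lambda z, zp: h(zp, z))
print(e_direct, e_swapped)
```

The two numbers agree because summing over all pairs $(z, z')$ weighted by $p(z)p(z')$ is symmetric in the two arguments, even though $h$ itself is not.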

Therefore the swap argument can be shown by

  1. Singling out an arbitrary term $k$ from $\sum_{i=1}^{m}$,
  2. Switching $E_{Z_k,Z'_k}$ to $E_{Z'_k,Z_k}$, and
  3. Renaming the variables.

That is, $$\begin{align*} &E_{S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} g(z'_i)-g(z_i)\right] \\ &\overset{\text{expand}}{=} E_{\color{blue}{Z_1,\cdots,Z_m,Z'_1,\cdots,Z'_m}}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} g(z'_i)-g(z_i)\right]\\ &\overset{\text{separate}}{=} E_{\cdots,Z_k,\cdots,Z'_k,\cdots}\left[\underset{g \in \mathcal{G}}{\text{sup}} \frac{1}{m}\left(\color{blue}{g(z'_k)-g(z_k)}+\sum_{i=1; \neq k}^{m} g(z'_i)-g(z_i)\right)\right] \\ &\overset{\text{switch}}{=} E_{\cdots,\color{blue}{Z'_k},\cdots,\color{blue}{Z_k},\cdots}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\left(g(z'_k)-g(z_k)+\sum_{i=1; \neq k}^{m} g(z'_i)-g(z_i)\right)\right] \\ &\overset{\text{reorder}}{=} E_{\cdots,Z'_k,\cdots,Z_k,\cdots}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\left(\color{blue}{-(g(z_k)-g(z'_k))}+\sum_{i=1; \neq k}^{m} g(z'_i)-g(z_i)\right)\right] (*) \end{align*}$$ Now by renaming $Z_k \leftrightarrow Z'_k$, and thus $z_k \leftrightarrow z'_k$ in $(*)$ we have: $$\begin{align*} &E_{S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} g(z'_i)-g(z_i)\right] \\ &\overset{\text{rename}}{=} E_{\cdots,\color{blue}{Z_k},\cdots,\color{blue}{Z'_k},\cdots}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\left(-(g(\color{blue}{z'_k})-g(\color{blue}{z_k}))+\sum_{i=1; \neq k}^{m} g(z'_i)-g(z_i)\right)\right]\\ &\overset{\text{collapse}}{=} E_{S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\left(-(g(z'_k)-g(z_k))+\sum_{i=1; \neq k}^{m} g(z'_i)-g(z_i)\right)\right]\\ \end{align*}$$

which corresponds to $\boldsymbol{\sigma}=(\sigma_1=1,\cdots,\sigma_k = -1,\cdots,\sigma_m=1)$.

We have proved this equality by switching the sign at an arbitrary index $k$. Therefore, defining $$f(\boldsymbol{\sigma}):=E_{S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} \sigma_i(g(z'_i)-g(z_i))\right]$$ we have shown that flipping any single coordinate of $\boldsymbol{\sigma}$ leaves $f$ unchanged; since any sign vector can be reached from any other by a sequence of single-coordinate flips, it follows that $$\forall \boldsymbol{\sigma_1}, \boldsymbol{\sigma_2},f(\boldsymbol{\sigma_1})=f(\boldsymbol{\sigma_2}).$$ For example: $$\begin{align*} f(\boldsymbol{1})&=f(\sigma_1=1,\cdots,\sigma_k=1, \cdots,\sigma_m=1)\\ &=f(\sigma_1=1,\cdots,\sigma_k=-1, \cdots,\sigma_m=1) \end{align*}$$ Since there are $2^m$ equiprobable sign vectors $\boldsymbol{\sigma_i}$, we finally have $$\begin{align*} E_{S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} g(z'_i)-g(z_i)\right]&=f(\boldsymbol{1}) \\ &=\frac{1}{2^m} f(\boldsymbol{\sigma_1})+\cdots+\frac{1}{2^m}f(\boldsymbol{\sigma_{2^m}})\\ &=E_{\boldsymbol{\sigma}}[f(\boldsymbol{\sigma})] \\ &=E_{\boldsymbol{\sigma},S,S'}\left[\underset{g \in \mathcal{G}}{\text{sup}}\frac{1}{m}\sum_{i=1}^{m} \sigma_i(g(z'_i)-g(z_i))\right] \end{align*}$$ $\square$
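The whole argument can be verified numerically on a toy instance (a hypothetical sketch I made up, not from the book): with a tiny $m$, a finite distribution, and a small class $\mathcal{G}$, we can compute $f(\boldsymbol{\sigma})$ by exact enumeration for every sign vector and check that $f$ is constant, so $f(\boldsymbol{1})$ equals the average over all $2^m$ sign vectors.

```python
import itertools

# Hypothetical toy instance: m = 2 samples, Z uniform on {0, 1},
# and a small class G of three functions.
m = 2
values = [0, 1]
prob = {0: 0.5, 1: 0.5}
G = [lambda z: z, lambda z: 1 - z, lambda z: z * z - 0.5]

def f(sigma):
    # f(sigma) = E_{S,S'}[ sup_g (1/m) sum_i sigma_i (g(z'_i) - g(z_i)) ],
    # computed by exact enumeration over the finite sample space.
    total = 0.0
    for S in itertools.product(values, repeat=m):
        for Sp in itertools.product(values, repeat=m):
            p = 1.0
            for z in S + Sp:
                p *= prob[z]  # i.i.d. product probability of (S, S')
            sup = max(sum(s * (g(zp) - g(z))
                          for s, z, zp in zip(sigma, S, Sp)) / m
                      for g in G)
            total += p * sup
    return total

all_sigma = list(itertools.product([1, -1], repeat=m))
vals = [f(s) for s in all_sigma]
# f is constant over sign vectors, so f(1,...,1) equals the sigma-average.
print(vals)
print(sum(vals) / len(vals))
```

All $2^m = 4$ values of $f$ coincide, which is exactly the constancy that lets us replace $f(\boldsymbol{1})$ by $E_{\boldsymbol{\sigma}}[f(\boldsymbol{\sigma})]$ in the last chain of equalities.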
