Generalization Error Definition

I was reading about the PAC framework and came across the definition of generalization error. The book defines it as:

Given a hypothesis $h \in H$, a target concept $c \in C$, and an underlying distribution $D$, the generalization error or risk of $h$ is defined by

$$R(h) = \underset{x \sim D}{\mathbb{P}}\left[h(x) \neq c(x)\right] = \underset{x \sim D}{\mathbb{E}}\left[1_{h(x) \neq c(x)}\right],$$

where $1_{\omega}$ denotes the indicator function of the event $\omega$.

The generalization error of a hypothesis is not directly accessible to the learner, since both the distribution $D$ and the target concept $c$ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample $S$.

I cannot understand the equation. Can anyone please tell me how it should be interpreted? Also, what is $x \sim D$?

Edit: How do I formally write this term? Is something like $$\mathbb{E}_{x \sim D} [1_{h(x)\neq c(x)}] = \int_X 1_{h(\cdot) \neq c(\cdot)} (\omega) dD(\omega)$$ correct or do I need to define some random variable? Also, to show that the empirical error $$ \hat{R}(h) = \frac{1}{m} \sum_{i =1}^m 1_{h(x_i)\neq c(x_i)} $$ is unbiased, we have $$\mathbb{E}_{S \sim D^m} [\hat{R}(h)] = \frac{1}{m} \sum_{i =1}^m \mathbb{E}_{S \sim D^m} ~ \left[ 1_{h(x_i)\neq c(x_i)} \right] = \frac{1}{m} \sum_{i =1}^m \mathbb{E}_{S \sim D^m} ~ \left[ 1_{h(x)\neq c(x)} \right]$$, but how do we formally get $$ \mathbb{E}_{S \sim D^m} ~ \left[ 1_{h(x)\neq c(x)} \right]= \mathbb{E}_{X \sim D} ~ \left[ 1_{h(x)\neq c(x)} \right] = R(h)$$
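
In other words, I would like to justify formally that, because the coordinates of $S = (x_1, \ldots, x_m) \sim D^m$ are drawn i.i.d. with marginal distribution $D$, the expectation of a quantity that depends only on $x_i$ reduces to an expectation over a single draw from $D$:

$$\mathbb{E}_{S \sim D^m}\left[ 1_{h(x_i) \neq c(x_i)} \right] = \mathbb{E}_{x_i \sim D}\left[ 1_{h(x_i) \neq c(x_i)} \right] = R(h).$$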

I think I understand it intuitively, but I can't write it down formally. Any help is much appreciated!

Topic: pac-learning, deep-learning, machine-learning

Category: Data Science


There exists somewhere in the world a distribution $D$ from which you can draw samples $x$. The notation $x \sim D$ simply states that the sample $x$ was drawn from the distribution denoted $D$ (e.g. a Normal or Poisson distribution, but also something like the distribution over pixel values of images of beaches).

Say you have some ground-truth function, call it $c$, which, given a sample $x$, gives you its true label (say the value 1). Furthermore, you have a function of your own, $h$, which, given some input, outputs a label of its own.

Now, given that, the risk definition is quite intuitive: it simply "counts" the number of times that $c$ and $h$ didn't agree on the label. In order to do that, you would (ideally):

  • go over every sample $x$ in your distribution (i.e. $x \sim D$).
  • run it through $c$ (i.e. $c(x)$) and obtain some label $y$.
  • run it through $h$ (i.e. $h(x)$) and obtain some label $\hat{y}$.
  • check if $y \neq \hat{y}$. If so, you add 1 to your count (that is what the indicator function $1_{h(x) \neq c(x)}$ denotes: it equals 1 when the labels disagree and 0 otherwise).

One last thing to note: I wrote "count" above, but we don't really care whether the number is 500 or 100; we care about the relative fraction of mistakes (e.g. 40% or 5% of the checked samples were classified differently). That is why it is written as an expectation ($\mathbb{E}$).
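
To make that concrete, here is a minimal sketch in Python (the distribution, the target $c$, and the hypothesis $h$ are all made up for illustration): the empirical error is just the fraction of disagreements on a finite sample, and for a large sample it approaches the generalization error $R(h)$.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def c(x):
    """Hypothetical target concept: label is 1 iff x > 0."""
    return (x > 0.0).astype(int)

def h(x):
    """Hypothetical hypothesis: label is 1 iff x > 0.3 (slightly wrong threshold)."""
    return (x > 0.3).astype(int)

# Draw a sample S = (x_1, ..., x_m) i.i.d. from D (here D = standard normal).
m = 100_000
x = rng.normal(size=m)

# Empirical error: the fraction of sample points where h and c disagree,
# i.e. (1/m) * sum_i 1_{h(x_i) != c(x_i)}.
empirical_error = np.mean(h(x) != c(x))

# For this toy D the generalization error R(h) = P_{x~D}[h(x) != c(x)] has a
# closed form: h and c disagree exactly when 0 < x <= 0.3, so
# R(h) = Phi(0.3) - Phi(0), with Phi the standard normal CDF.
phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
true_risk = phi(0.3) - phi(0.0)

print(f"empirical error: {empirical_error:.4f}   true risk R(h): {true_risk:.4f}")
```

With $m$ this large the two printed numbers should be close, which is just the law of large numbers: averaging the indicator over more and more samples from $D$ converges to its expectation, i.e. to $R(h)$.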

Let me know if that was clear enough :-)
