Intuition behind Occam's Learner Algorithm using VC-Dimension

So I'm learning about Occam's Learning algorithm and PAC learning, where, for a given hypothesis space $H$, if we want a model/hypothesis $h$ with true error $error_D(h) \leq \epsilon$, with probability at least $(1-\delta)$ for a given $\delta$, we need to train it on $m$ examples, with $m$ defined as:

$$ m \geq \frac{1}{2\epsilon^2}\{\log(|H|)+\log(\frac{1}{\delta})\} $$
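Just to get a feel for how the terms trade off, here is a minimal numeric sketch of that bound in Python. The values of $|H|$, $\epsilon$ and $\delta$ are made up, and I'm assuming natural logs, since changing the log base only rescales the bound by a constant:

```python
import math

def sample_complexity_finite(h_size, eps, delta):
    """Smallest integer m with m >= 1/(2*eps^2) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

# Made-up values, just to see how each term moves the bound:
print(sample_complexity_finite(h_size=1000, eps=0.1, delta=0.05))    # 496
print(sample_complexity_finite(h_size=10**6, eps=0.1, delta=0.05))   # 841  (bigger H)
print(sample_complexity_finite(h_size=1000, eps=0.05, delta=0.05))   # 1981 (tighter eps)
```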

Now I'm looking for some way to explain the terms of this equation in very simple terms, to gain some intuition into why it is the way it is, without resorting to a complicated mathematical proof. Along the way I was introduced to the concept of VC-dimension, which is, as I understand it, a measure of how large a set of examples from $X$ can be while still being shattered (i.e., labeled in every possible way) by hypotheses in $H$.

Substituting $VC(H)$ for $\log(|H|)$ (and adding a few constants), I found the following equation:

$$ m \geq \frac{1}{\epsilon}\{8\,VC(H)\log(\frac{13}{\epsilon})+4\log(\frac{2}{\delta})\} $$
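And the same kind of sanity check for this VC-based bound, again with made-up values (the constants 8, 13, 4 and 2 are the ones from the formula above; I'm assuming the logs here are base 2):

```python
import math

def sample_complexity_vc(vc_dim, eps, delta):
    """Smallest integer m with m >= (1/eps)*(8*VC(H)*log2(13/eps) + 4*log2(2/delta))."""
    return math.ceil((8 * vc_dim * math.log2(13 / eps)
                      + 4 * math.log2(2 / delta)) / eps)

# Made-up values: a class with VC dimension 10, eps = 0.1, delta = 0.05
print(sample_complexity_vc(vc_dim=10, eps=0.1, delta=0.05))  # 5831
```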

This VC-based bound then made sense to me, because for $H$ to shatter $m$ examples it must contain at least one hypothesis for each of the $2^m$ possible labelings, i.e. $|H| \geq 2^m$, or in other words $\log_2(|H|) \geq VC(H)$.
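To convince myself of that last inequality I brute-forced it on a toy class. This is only a sketch: the small finite class of threshold classifiers and the point set below are made up, and the check simply enumerates labelings:

```python
import math
from itertools import combinations

# A tiny finite hypothesis class: threshold classifiers h_t(x) = 1 if x >= t,
# restricted to a handful of thresholds (purely illustrative).
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5]
H = [lambda x, t=t: int(x >= t) for t in thresholds]
X = [1, 2, 3, 4]

def is_shattered(points, hypotheses):
    """True if every possible labeling of `points` is realized by some hypothesis."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Largest m such that some m-point subset of X is shattered, i.e. VC(H) on X
vc = max((m for m in range(1, len(X) + 1)
          if any(is_shattered(S, H) for S in combinations(X, m))),
         default=0)

print(vc, math.log2(len(H)))  # 1 2.32...  ->  log2(|H|) >= VC(H)
```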

So far I've just been restating things I've learned online in my own words.

Now, if you will allow me to very hand-wavingly re-derive the first equation above, here is my restatement of Occam's Learning Algorithm in simple terms:

In order to have an algorithm that performs perfectly over hypothesis space $H$, we must train it on at least every possible labeling (partition) of the $m$ examples, of which there are $2^m$, where $2^m \leq |H|$.

Question 1: Are all my assumptions so far correct?

Question 2: Does this mean my intuition is correct?

If so, then the rest is just algebraic manipulation, where we first account for some probability that $|H|$ is not expressive enough by some constant, which we call $\delta$:

$$ 2^m \geq \frac{|H|}{\delta} $$

Isolate $m$, split up the $\log$ terms using logarithmic identities:

$$ m \geq \log(|H|) + \log(\frac{1}{\delta}) $$
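Spelling that step out (assuming the logs are base 2):

$$ 2^m \geq \frac{|H|}{\delta} \;\Longrightarrow\; m \geq \log_2\left(\frac{|H|}{\delta}\right) = \log_2(|H|) + \log_2\left(\frac{1}{\delta}\right) $$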

Adding some margin of error $\epsilon$ to the size of $H$, squaring it, and adding a few constants, we get:

$$ m \geq \frac{1}{2\epsilon^2} \{\log(|H|) + \log(\frac{1}{\delta})\} $$

Question 3: Did I just understand this in simple terms or am I way off?

Tags: vc-theory, theory, pac-learning, machine-learning
