Why does PAC learning focus on learnability of the hypothesis class and not the target function?

The definition of PAC learning is roughly:

An algorithm is a PAC learning algorithm if, given enough data, for any target function, it asymptotically does as well as it possibly could given the functions it is capable of representing.
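For reference, a common formal version of this statement (agnostic PAC learnability, following standard textbook notation rather than anything in the question) goes roughly as follows: a hypothesis class $\mathcal H$ is agnostic PAC learnable if there is a sample-complexity function $m_{\mathcal H}(\epsilon,\delta)$ and an algorithm such that, for every $\epsilon,\delta\in(0,1)$ and every distribution $\mathcal D$ over $\mathcal X\times\mathcal Y$, running the algorithm on $m\geq m_{\mathcal H}(\epsilon,\delta)$ i.i.d. samples from $\mathcal D$ returns a hypothesis $h$ satisfying, with probability at least $1-\delta$ over the samples,

$$L_{\mathcal D}(h)\;\leq\;\min_{h'\in\mathcal H} L_{\mathcal D}(h') + \epsilon,$$

where $L_{\mathcal D}(h)$ denotes the expected loss of $h$ on a fresh draw from $\mathcal D$.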

This definition seems somewhat unambitious. In reality, I care more about approximating the target function well in an absolute sense, not just approximating it as well as my hypothesis class can muster. By some kind of no-free-lunch principle, it's probably not possible to do this for all target functions, so we have to settle for algorithms that learn well on a class of target functions. If I were to take a shot at defining a notion of learning, it would look like this:

An algorithm learns a class $\mathcal V$ of random variables in $\mathcal X\times\mathbb R$ iff for any $(X, Y)\in\mathcal V$ and any $\delta, \epsilon > 0$, given enough i.i.d. samples of $(X, Y)$, the algorithm returns a function $h$ such that $P(|h(X)-Y|\leq\epsilon)\geq 1-\delta$.

A class $\mathcal V$ of random variables is learnable iff there exists an algorithm that learns it.

I have two questions:

  1. Why the emphasis on getting good bounds relative to the best function in the hypothesis class, rather than on approximating the target function/distribution, when surely it's the latter we care about?
  2. Is something like my definition also studied?

Fair warning: this is just an intuition, and I'm not really an expert on this kind of question. Nice question anyway :)

Theoretical models of learning like PAC are meant to be used to prove learnability results. Therefore, what matters is not only that the definition corresponds to the intuition of what "learning" means, but also that it's actually technically possible to prove anything with this definition. I suspect that this is why (or one of the reasons why) the PAC definition is stated relative to the particular class of functions that the algorithm works with: it makes it possible to pin down, theoretically, what the best function in this class is. This is needed to prove (or disprove) that the algorithm can always come close to it.
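As a concrete illustration of the kind of result this makes possible (this example is mine, not part of the original answer): for a finite hypothesis class $\mathcal H$ and a loss bounded in $[0,1]$, a standard Hoeffding-plus-union-bound argument shows that empirical risk minimization over $m$ i.i.d. samples returns $h_{\mathrm{ERM}}$ satisfying, with probability at least $1-\delta$,

$$L_{\mathcal D}(h_{\mathrm{ERM}})\;\leq\;\min_{h\in\mathcal H} L_{\mathcal D}(h)\;+\;2\sqrt{\frac{\ln(2|\mathcal H|/\delta)}{2m}},$$

so the gap to the best function in the class can be driven below any $\epsilon$ by taking $m$ large enough. No bound of this form is available if the comparison is against arbitrary target functions rather than against the best member of $\mathcal H$.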

I also suspect that in the universe of all possible functions, the optimal function could always be some kind of infinite, non-generalizing lookup table which assigns exactly the true $Y$ to every $X$ (at least to the extent possible, i.e. when $Y$ is a deterministic function of $X$: $X = X' \Rightarrow Y = Y'$). From this perspective, restricting the class of functions the algorithm can return forces it to generalize, since it must represent the data within the finite space of the parameters of that class.
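This intuition matches the no-free-lunch principle alluded to in the question; one textbook version (for binary classification with 0-1 loss; added here for context, not part of the original answer) says roughly: for any learning algorithm $A$ and any sample size $m < |\mathcal X|/2$, there exists a distribution $\mathcal D$ over $\mathcal X\times\{0,1\}$ and a function $f$ with $L_{\mathcal D}(f)=0$, yet with probability at least $1/7$ over the draw of $S\sim\mathcal D^m$ we have $L_{\mathcal D}(A(S))\geq 1/8$. Without restricting something, there is always a target the algorithm fails to approximate in the absolute sense, which is why the class-relative guarantee is the one that can actually be proved.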

This could explain why the definition proposed by OP is not usable: if I'm not mistaken, there's simply no way to prove any learnability result with it.
