The probably approximately correct (PAC) learning model
PAC learnability A formal, mathematical model of learnability It origins from a paper by Valiant (“A theory of the learnable”, 1984) It is very theoretical Has only very few results that are usable in practice It formalizes the concept learning task as follows: X: the space from which of the objects (examples) come. E.g. if the objects are described by two continuous features, then X=R2 c is a concept: cX. but c can also be interpreted as an X{0, 1} mapping (each point of X either belongs to c or not) C concept class: a set of concepts. In the following we always assume that the c concept we have to learn is a member of C
PAC learnability L denotes the learning algorithm: its goal is to learn a given c. As its output, it return a hypothesis h from concept class C. As a help, it has access to a function Ex(c,D), which given training examples in the form of <x,c(x)> pairs. The examples are random, independent, and follow a fixed (but arbitrary) probability distribution D. Example: X=R2 (the 2D plane) C=all such rectangles of the plain that are parallel with the axes c: one fixed rectangle h: another given rectangle D: a probability distribution defined over the plain Ex(c,D): It gives positive and negative examples from c
Defining the error After learning, how can we measure the final error of h? We could define it as error(h)=c∆h=(c\h)(h\c) (the symmetric difference of c and h – the size of the blue area in the figure) It is not good - why? D is the probability of picking a given point of the plain randomly We would like our method to work for any possible distribution D If D is 0 in a given area, we won’t get samples from there we cannot learn in this area So we cannot guarantee error(h)=c∆h to become 0 for any arbitrary D However, D is the same during testing we won’t get samples from that area during testing either no problem if we couldn’t learn there!
Defining the error Solution: let’s define the error as c∆h, but weight it by D! That is, instead of c∆h, we calculate the area under D on c∆h „Sparser” areas: they are harder to learn, but also count less in the error, as the weight, that is the value of D is smaller there As D is the probability distribution of picking the points of X, the shaded are under D is the probability of randomly picking a point from c∆h. Because of this, the error of h is defined as:
PAC learnability - definition Let C be concept class over X. C is PAC learnable, if there exists an algorithm L that satisfies the following: using the samples given by Ex(c,D) for any concept c C, any distribution D(x) and any constants 0<ε<1/2, 0<δ<1/2, with probability P≥1-δ it finds a hypothesis h C for which error(h)≤ε. Briefly: With a large probability, L finds a hypothesis with a small error (this is why L is „probably approximately correct”). 1. remark: the definition does not help find L, it just lists the conditions it has to satisfy 2. remark: notice that ε and δ can be arbitrarily close to 0 3. remark: it should work for any arbitrarily chosen c C and distribution D(x)
Why „probably approximately” ? Why do we need the two thresholds, ε and δ ? We have to allow the error to be ε because the set of examples is always finite, and so we cannot guarantee the perfect learning of a concept. The simplest example is when X is infinite, as in this case a finite amount of examples won’t ever be able to perfectly cover each point of an infinite space. Allowing a large error with probability δ is required because we cannot guarantee the error to become small for each run. For example, it may happen that each randomly picked training sample will be the same point – and we won’t be able to learn from just one sample. This is why we can guarantee to find a good hypothesis only with a certain probability.
Example X=R2 (The training examples are the points of the plane) C=the set of those rectangles which are parallel with the axes c: one given rectangle h: another fixed rectangle error(h): the area of h∆c weighted with D(x) Ex(c,D): it gives positive and negative examples Let L be the algorithm that always returns as h the smallest rectangle that contains all the positive examples! Notice that h c As the number of examples increases, the error can only decrease
Examples We prove that by increasing the number of examples the error becomes smaller and smaller with an increasing probability; that is, algorithm L learns C in a PAC sense Let 0<ε, δ≤1/2 be fixed Consider hypothesis h’ obtained as shown in the figure: ε/4 is the probability that a randomly pulled sample falls in T Then, 1- ε/4 is the probability that a randomly pulled sample does NOT fall in T The probability that m randomly pulled samples do NOT fall in T : The weighted area of the four bands is <ε (there are overlaps) So if we already pulled samples from T on each side, then error(h)< ε for the actual hypothesis h
Example So error(h) ≥ ε is possible only if we haven’t yet pulled a sample from T on at least one of the sides error(h) ≥ ε no sample from T1, T2, T3 or T4 P(error(h) ≥ ε) ≤ P(no sample from T1, T2, T3 or T4) ≤ So we got this: P(error(h) ≥ ε) ≤ We want to push this probability under a threshold δ: How does the above curve look like as a function of m (ε is fixed)? It is an exponential function with a base between 0 and 1 It goes to 0 if m goes to infinity So it can be pushed under arbitrarily small δ by selecting a large enough m
Example – expressing m We already proved that L is a PAC learner, but as an addition, we can also specify m as a function of ε and δ: We make use of the formula 1-x≤e-x in defining a threshold: That is, for a given ε and δ if m is larger than the threshold above, then P(error(h) ≥ ε) ≤ δ Notice that m grows relatively slowly as a function of ε and δ , which might be important for practical usability
Efficient PAC learning For practical considerations it would be good if m increased slowly with the decrease of ε and δ „slowly” = polynomially Besides the number of training samples, whet else may influence the speed of training? The processing cost of the training samples usually depends on the number of features, n During its operation, the algorithm modifies its actual hypothesis h after receiving each training sample. Because of this, the cost of operating L largely depends on the complexity of the hypothesis, and so the complexity of concept c (as we represent c with h) In the following we define the complexity of the concept c, or more precisely, the complexity of the representation of c
Concept vs. Representation What is the difference between a concept and its representation? Concept: an abstract (eg. mathematical) notion Representation: its concrete representation in a computer programme 1st example: the concept is a logical formula It can be represented by a disjunctive normal form But also by using only NAND operations 2nd example: the concept is a polygon on the plane It can be represented by the series of its vertices But also by the equations of its edges So by “representation” we mean a concrete algorithmic solution In the following we assume that L operates over a representation class H (it selects h from H) H is large enough in the sense that it is able to represent any c C
Efficient PAC learning We define the size of a representation H by some complexity measure for , for example the number of bits required for its storage. We define the size of a concept cC as the size of the smallest representation which is able to represent it: where c means that is a possible representation of c Efficient PAC learning: Let C denote a concept class, and H denote a representation class that is able to represent each element of C. C is efficiently learnable over H if C is PAC-learnable by some algorithm L that operates over H, and L runs in polynomial time in n, size(c), 1/ and 1/ for any cC. Remark: besides efficient training, for efficient applicability it is also required that the evaluation of h(x) should also run in polynomial time in n and size(h) for any arbitrary point x – but usually this is much easier to satisfy in practice
Example 3-term disjunctive normal form: T1vT2vT3, where Ti is the conjuction of literals (variables with or without negation) The representation class of 3-term disjunctive normal forms is not efficiently PAC-learnable. That is, if we want to learn 3-term disjunctive normal forms, and we represent them by themselves, then their PAC learning is not possible However, the concept class of 3-term disjunctive normal forms is efficiently PAC-learnable! That is, the concepts class itself is PAC-learnable, but for this we must use a larger representation class which can be manipulated easier. With a cleverly chosen representation we can solve to keep the operation time of the learning algorithm within the polynomial time limit.
Occam learning Earlier we mentioned Occam’s razor as a general machine learning heuristics: a simpler explanation usually generalizes better. No we define it more formally within the framework of PAC learning. By the simplicity of a hypothesis we will mean its size. As we will see, seeking a small hypothesis will actually mean that the algorithm has to compress the class labels of the training a samples Definition: Let 0 and 0≤β<1 be constants. An L algorithm (that works over hypothesis space H) is an (,β) –Occam-algorithm over the concept class C if, given m samples from cC it is able to find a hypothesis hH for which: h is consistent with the samples size(h)≤(n.size(c)).m (that is, size(h) increasese slowly as a function of size(c) and m ) L is an efficient (,β) –Occam-algorithm if its running time is a polynomial function of n, m and size(c)
Occam learning Why do we say that an Occam algorithm compresses the samples? For a given task, n and size(c) are fixed, and in practice m>>n So we can say that in practice the above formula reduces to size(h)<m, where <1 Hypothesis h is consistent with the m examples, which means that it can exactly return the label of the m examples. We have 0-1 labels, which corresponds to m bits of information We defined size(h) as the number of bits required for storing h, so size(h)<m means that the learner stored m bits of information in h using only m bits (remember that <1), so it stored the labels in a compressed form Theorem: An efficient Occam learner is an efficient PAC learner at the same time Which means that compression (according to a certain definition) guarantees learning (according to a certain definition)!
Sample complexity By sample complexity we mean the number of training samples required for PAC learning. It practice it would be very useful if we could tell in advance the number of training samples required to satisfy a given pair of ε and δ thresholds. Sample complexity in the case of a finite hypothesis space: Theorem: If the hypothesis space H is finite, then any cH concept is PAC-learnable by a consistent learner, and the number of required training samples is: where |H| is the number of hypotheses in H Proof: consistent learner it can only select a consistent hypothesis from H. Thus, the basis of the proof is that the error of all consistent hypotheses can be pushed under the threshold This is the simplest possible proof, but the bound it gives will not be tight (that is, the m we obtain won’t be the smallest possible bound)
Proof Let’s assume that there are k such hypotheses in H for which error>. For a hypothesis h’, the probability of consistency with the m samples is P(h’ is consistent with the m samples)≤ (1-)m For error> the hypothesis h selected for output should be one of the above k hypotheses: P(error(h)>) ≤P(any of the k hypotheses is consistent with the m samples)≤k(1-)m≤|H|(1-)m≤|H|e-m Where we used the inequalities k≤|H| and (1-x)≤e-x. For a given , |H|e-m is an exponentially decreasing function So P(error(h)>) can be pushed below any arbitrarily small δ:
Example Is the threshold obtained from the proof practically usable? Let X be a finite space of n binary features. If we want to use a learner that contains all the possible logical formulas above X, then the size of the hypothesis space is (the number of all possible logical formulas with n variables) Let’s substitute this into the formula : This is exponential in n, so in practice it usually gives an unusable (too large) number for the minimal limit of m There are more complex variants of the above proof which achieve a lower limit for m, but these are also not really usable in practice
The case of infinite hypothesis spaces For the case of finite H we obtained the threshold Unfortunately, the above formula is not usable when |H| is infinite In this case we will use the Vapnik-Chervonenkis dimenison, but before that we have to introduce the concept of “shattering”: Let X be an object space and H a hypothesis space above X (both may be infinite!). Let S an arbitrary finite subset of X. We say that H shatters S, if for any possible {+,-} labeling of S there exists a hH which partitions the points of S according to the labeling. 1st remark: We try to say something about an infinite set by saying something about all of its finite subsets 2nd remark: This measures the representation ability of H, because if H cannot shatter some finite set S, then by extending S we can define a concept over X-en which is not learnable by H
The Vapnik-Chervonenkis dimension The Vapnik-Chervonenkis dimension of H is d if there exists such an S, |S|=d which it can shatter, but it cannot shatter any S for |S|=d+1 (If it can shatter any finite S then VCD=.) Theorem: Let C be a concept class,a and H a representation set for which VCD(H)=d. Let L be a learning algorithm that learns cC by getting a set S of training samples with |S|=m, and it outputs a hypothesis hH which is consistent with S. The learning of C over H is PAC learning if (where c0 is a proper constant) Remark: In contrary to the finite case, the bound obtained here is tight (that is, m samples are not only sufficient, but in certain cases they are necessary as well).
The Vapnik-Chervonenkis dimension Let’s compare the bounds obtained for the finite and infinite cases: Finite case: Infinite case: The two formulas look quite similar, but the role of |H| is taken by the Vapnik-Chervonenkis dimension in the infinite case Both formulas increase relatively slowly as a function of ε and δ, so in this sense these are not bad boundaries…
Examples of VC-dimension Finite intervals over the line: VCD=2 VCD≥2, as these two points can be shattered: (=separated for all label configurations) VCD<3, as no 3 points can be shattered: Separating the two classes by lines on the plane: VCD=3 (in d-dimensional space: VCD=d+1) VCD ≥3, as these 3 points can be shattered: (all labeling configurations should be tried!) VCD<4, as no 4 points can be shattered: (all point arrangements should be tried!)
Examples of VC-dimension Axis-aligned rectangles on the plane: VCD=4 VCD≥4, as these 4 points can be shattered: VCD<5, as no 5 points can be shattered: Convex poligons on the plane: VCD=2d+1 (d is the number of vertices) (the book proofs only one of the directions) Conjunctions of literals over {0,1}n : VCD=n (See Mitchell’s book, only one direction is proved)