The probably approximately correct (PAC) learning model

PAC learnability
A formal, mathematical model of learnability. It originates from a paper by Valiant (“A theory of the learnable”, 1984). It is very theoretical, and has only very few results that are usable in practice.
It formalizes the concept learning task as follows:
X: the space from which the objects (examples) come. E.g. if the objects are described by two continuous features, then X=R².
c is a concept: c ⊆ X, but c can also be interpreted as an X → {0, 1} mapping (each point of X either belongs to c or not).
C is a concept class: a set of concepts. In the following we always assume that the concept c we have to learn is a member of C.

PAC learnability
L denotes the learning algorithm: its goal is to learn a given c. As its output, it returns a hypothesis h from the concept class C. As a help, it has access to a function Ex(c,D), which gives training examples in the form of <x,c(x)> pairs. The examples are random, independent, and follow a fixed (but arbitrary) probability distribution D.
Example:
X=R² (the 2D plane)
C = all rectangles of the plane that are parallel with the axes
c: one fixed rectangle
h: another given rectangle
D: a probability distribution defined over the plane
Ex(c,D): it gives positive and negative examples from c
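
To make the setup concrete, here is a minimal Python sketch of the rectangle example: the concept c is an axis-parallel rectangle used as a {0,1}-valued function, and Ex(c,D) draws independent labeled examples from a distribution D. All names (RectConcept, make_oracle, the uniform D) are illustrative choices, not part of the original slides.

```python
import random

class RectConcept:
    """An axis-parallel rectangle [x1, x2] x [y1, y2], viewed as a concept c over R^2."""
    def __init__(self, x1, x2, y1, y2):
        self.x1, self.x2, self.y1, self.y2 = x1, x2, y1, y2

    def __call__(self, point):
        """c as an X -> {0,1} mapping: 1 if the point belongs to the concept."""
        x, y = point
        return int(self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2)

def make_oracle(c, draw_from_D):
    """Ex(c, D): returns a function that yields random, independent <x, c(x)> pairs."""
    def Ex():
        x = draw_from_D()
        return x, c(x)
    return Ex

# Example usage with a uniform distribution D on [0,10] x [0,10]
c = RectConcept(2, 6, 3, 8)
D = lambda: (random.uniform(0, 10), random.uniform(0, 10))
Ex = make_oracle(c, D)
samples = [Ex() for _ in range(5)]   # five labeled training examples
```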

Defining the error
After learning, how can we measure the final error of h? We could define it as error(h) = |c∆h| = |(c\h) ∪ (h\c)| (the symmetric difference of c and h – the size of the blue area in the figure).
Why is this not a good definition?
D is the probability of picking a given point of the plane randomly. We would like our method to work for any possible distribution D. If D is 0 in a given area, we won’t get samples from there, so we cannot learn in this area. Hence we cannot guarantee error(h)=|c∆h| to become 0 for an arbitrary D.
However, D is the same during testing, so we won’t get samples from that area during testing either – no problem if we couldn’t learn there!

Defining the error
Solution: let’s define the error as c∆h, but weight it by D! That is, instead of the area of c∆h, we calculate the area under D on c∆h.
„Sparser” areas are harder to learn, but they also count less in the error, as the weight – that is, the value of D – is smaller there.
As D is the probability distribution of picking the points of X, the shaded area under D is the probability of randomly picking a point from c∆h. Because of this, the error of h is defined as:
error(h) = Pr_{x∼D}(x ∈ c∆h)
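
Since error(h) is just the D-probability of the symmetric difference, it can be approximated by sampling. A minimal Monte Carlo sketch, reusing the hypothetical RectConcept and D from the previous example:

```python
def estimate_error(c, h, draw_from_D, n_samples=100_000):
    """Monte Carlo estimate of error(h) = Pr_{x~D}(x in c Δ h),
    i.e. the probability that c and h disagree on a random point."""
    disagreements = sum(c(x) != h(x) for x in (draw_from_D() for _ in range(n_samples)))
    return disagreements / n_samples

# Example usage with the c and D defined in the previous sketch:
# h = RectConcept(2.5, 6, 3, 7.5)        # a slightly too small hypothesis
# print(estimate_error(c, h, D))         # ≈ D-weight of the symmetric difference
```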

PAC learnability – definition
Let C be a concept class over X. C is PAC learnable if there exists an algorithm L that satisfies the following: using the samples given by Ex(c,D), for any concept c ∈ C, any distribution D(x) and any constants 0<ε<1/2, 0<δ<1/2, with probability P ≥ 1-δ it finds a hypothesis h ∈ C for which error(h) ≤ ε.
Briefly: with a large probability, L finds a hypothesis with a small error (this is why L is „probably approximately correct”).
Remark 1: the definition does not help find L, it just lists the conditions L has to satisfy.
Remark 2: notice that ε and δ can be arbitrarily close to 0.
Remark 3: it should work for any arbitrarily chosen c ∈ C and distribution D(x).
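
The same condition written out as a formula (a compact restatement of the definition above, not copied from the slides):

```latex
% C is PAC learnable if there is an algorithm L such that for every c \in C,
% every distribution D, and every 0 < \varepsilon, \delta < 1/2, after seeing
% enough samples from Ex(c,D), L outputs a hypothesis h with
\Pr\bigl[\operatorname{error}(h) \le \varepsilon\bigr] \;\ge\; 1 - \delta,
\qquad \text{where } \operatorname{error}(h) = \Pr_{x \sim D}\bigl[h(x) \ne c(x)\bigr].
```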

Why „probably approximately”?
Why do we need the two thresholds, ε and δ?
We have to allow an error of ε because the set of examples is always finite, so we cannot guarantee the perfect learning of a concept. The simplest example is when X is infinite: in this case a finite number of examples will never be able to perfectly cover each point of an infinite space.
Allowing a large error with probability δ is required because we cannot guarantee the error to become small on every run. For example, it may happen that each randomly picked training sample is the same point – and we won’t be able to learn from just one sample. This is why we can guarantee to find a good hypothesis only with a certain probability.

Example
X=R² (the training examples are points of the plane)
C = the set of those rectangles which are parallel with the axes
c: one given rectangle
h: another fixed rectangle
error(h): the area of h∆c weighted by D(x)
Ex(c,D): it gives positive and negative examples
Let L be the algorithm that always returns as h the smallest rectangle that contains all the positive examples! Notice that h ⊆ c. As the number of examples increases, the error can only decrease. A sketch of this learner is shown below.
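
A minimal sketch of this tightest-fit learner (the function name is ours; it consumes the (point, label) pairs produced by the Ex oracle sketched earlier):

```python
def tightest_fit_rectangle(samples):
    """L: return the smallest axis-parallel rectangle containing all positive examples.
    `samples` is a list of ((x, y), label) pairs with label in {0, 1}."""
    positives = [p for p, label in samples if label == 1]
    if not positives:
        return None  # no positive example seen yet: the empty hypothesis
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return min(xs), max(xs), min(ys), max(ys)   # (x1, x2, y1, y2); always h ⊆ c

# Example usage:
# samples = [Ex() for _ in range(100)]
# h = tightest_fit_rectangle(samples)
```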

Example
We prove that by increasing the number of examples the error becomes smaller and smaller with increasing probability; that is, algorithm L learns C in the PAC sense.
Let 0<ε, δ≤1/2 be fixed. Consider the hypothesis h’ obtained as shown in the figure: along each side of c we take a band T such that ε/4 is the probability that a randomly drawn sample falls in T.
Then 1-ε/4 is the probability that a randomly drawn sample does NOT fall in T.
The probability that m randomly drawn samples do NOT fall in T is (1-ε/4)^m.
The weighted area of the four bands is ≤ ε (there are overlaps, which can only decrease it).
So if we have already drawn samples from the band T on each of the four sides, then error(h) < ε for the actual hypothesis h.

Example
So error(h) ≥ ε is possible only if we haven’t yet drawn a sample from the band T on at least one of the sides:
error(h) ≥ ε ⇒ no sample from at least one of T1, T2, T3, T4
P(error(h) ≥ ε) ≤ P(no sample from T1, T2, T3 or T4) ≤ 4(1-ε/4)^m
So we got: P(error(h) ≥ ε) ≤ 4(1-ε/4)^m
We want to push this probability under a threshold δ: 4(1-ε/4)^m ≤ δ
What does the left-hand side look like as a function of m (ε is fixed)? It is an exponential function with a base between 0 and 1, so it goes to 0 as m goes to infinity. Hence it can be pushed under an arbitrarily small δ by selecting a large enough m.

Example – expressing m
We already proved that L is a PAC learner, but in addition we can also specify m as a function of ε and δ.
Using the inequality 1-x ≤ e^(-x), we get 4(1-ε/4)^m ≤ 4e^(-εm/4), so it is enough to require 4e^(-εm/4) ≤ δ, which gives the threshold m ≥ (4/ε)·ln(4/δ).
That is, for a given ε and δ, if m is larger than the threshold above, then P(error(h) ≥ ε) ≤ δ.
Notice that m grows relatively slowly as a function of 1/ε and 1/δ, which might be important for practical usability. (A small numeric illustration is given below.)
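
A tiny sketch that evaluates this threshold for a few (ε, δ) pairs; the bound is the one derived above, only the function name is ours:

```python
import math

def rectangle_sample_bound(epsilon, delta):
    """Smallest integer m with m >= (4/epsilon) * ln(4/delta), which guarantees
    P(error(h) >= epsilon) <= delta for the tightest-fit rectangle learner."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

for eps, delta in [(0.1, 0.05), (0.01, 0.05), (0.01, 0.001)]:
    print(eps, delta, rectangle_sample_bound(eps, delta))
# e.g. eps=0.1, delta=0.05 already needs only m = 176 samples
```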

Efficient PAC learning
For practical purposes it would be good if m increased slowly as ε and δ decrease; „slowly” = polynomially (in 1/ε and 1/δ).
Besides the number of training samples, what else may influence the speed of training?
The processing cost of the training samples usually depends on the number of features, n.
During its operation, the algorithm modifies its actual hypothesis h after receiving each training sample. Because of this, the cost of running L largely depends on the complexity of the hypothesis, and thus on the complexity of the concept c (as we represent c with h).
In the following we define the complexity of the concept c, or more precisely, the complexity of the representation of c.

Concept vs. Representation
What is the difference between a concept and its representation?
Concept: an abstract (e.g. mathematical) notion.
Representation: its concrete representation in a computer program.
1st example: the concept is a logical formula. It can be represented by a disjunctive normal form, but also by using only NAND operations.
2nd example: the concept is a polygon on the plane. It can be represented by the sequence of its vertices, but also by the equations of its edges.
So by “representation” we mean a concrete algorithmic solution.
In the following we assume that L operates over a representation class H (it selects h from H). H is large enough in the sense that it is able to represent any c ∈ C.

Efficient PAC learning
We define the size of a representation σ ∈ H by some complexity measure size(σ), for example the number of bits required for its storage.
We define the size of a concept c ∈ C as the size of the smallest representation which is able to represent it: size(c) = min_{σ∼c} size(σ), where σ ∼ c means that σ is a possible representation of c.
Efficient PAC learning: Let C denote a concept class, and H a representation class that is able to represent each element of C. C is efficiently learnable over H if C is PAC-learnable by some algorithm L that operates over H, and L runs in polynomial time in n, size(c), 1/ε and 1/δ for any c ∈ C.
Remark: besides efficient training, for efficient applicability it is also required that the evaluation of h(x) runs in polynomial time in n and size(h) for any arbitrary point x – but usually this is much easier to satisfy in practice.
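
The two requirements in compact form (a LaTeX restatement, using σ for a representation as in the text above):

```latex
\operatorname{size}(c) \;=\; \min_{\sigma \sim c} \operatorname{size}(\sigma),
\qquad
\text{running time of } L \;=\; \operatorname{poly}\!\bigl(n,\ \operatorname{size}(c),\ 1/\varepsilon,\ 1/\delta\bigr).
```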

Example
3-term disjunctive normal form: T1 ∨ T2 ∨ T3, where each Ti is a conjunction of literals (variables with or without negation).
The representation class of 3-term disjunctive normal forms is not efficiently PAC-learnable. That is, if we want to learn 3-term disjunctive normal forms and we represent them by themselves, then efficient PAC learning is not possible.
However, the concept class of 3-term disjunctive normal forms is efficiently PAC-learnable! That is, the concept class itself is efficiently PAC-learnable, but for this we must use a larger representation class that is easier to manipulate. With a cleverly chosen representation we can keep the running time of the learning algorithm within the polynomial limit.
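
One standard choice of such a larger representation class in the literature (e.g. in Kearns and Vazirani's treatment; the slides do not name it explicitly) is 3-CNF: by the distributive law, every 3-term DNF can be rewritten as a conjunction of 3-literal clauses,

```latex
T_1 \vee T_2 \vee T_3 \;=\; \bigwedge_{u \in T_1,\; v \in T_2,\; w \in T_3} (u \vee v \vee w),
```

and each clause (u ∨ v ∨ w) can then be treated as a single new Boolean variable, which reduces the task to learning a conjunction over O(n³) variables.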

Occam learning
Earlier we mentioned Occam’s razor as a general machine learning heuristic: a simpler explanation usually generalizes better. Now we define it more formally within the framework of PAC learning. By the simplicity of a hypothesis we will mean its size. As we will see, seeking a small hypothesis actually means that the algorithm has to compress the class labels of the training samples.
Definition: Let α ≥ 0 and 0 ≤ β < 1 be constants. An algorithm L (that works over hypothesis space H) is an (α,β)-Occam algorithm over the concept class C if, given m samples of a concept c ∈ C, it is able to find a hypothesis h ∈ H for which:
h is consistent with the samples
size(h) ≤ (n·size(c))^α · m^β (that is, size(h) increases slowly as a function of size(c) and m)
L is an efficient (α,β)-Occam algorithm if its running time is a polynomial function of n, m and size(c).

Occam learning
Why do we say that an Occam algorithm compresses the samples? For a given task, n and size(c) are fixed, and in practice m>>n. So we can say that in practice the above bound reduces to size(h) < k·m^β for a constant k, where β<1.
Hypothesis h is consistent with the m examples, which means that it can exactly return the labels of the m examples. We have 0-1 labels, which corresponds to m bits of information. We defined size(h) as the number of bits required for storing h, so size(h) < k·m^β means that the learner stored m bits of information in h using only on the order of m^β bits (remember that β<1), so it stored the labels in a compressed form.
Theorem: An efficient Occam learner is at the same time an efficient PAC learner.
This means that compression (according to a certain definition) guarantees learning (according to a certain definition)!
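
A tiny numeric illustration of the compression argument; the values of n, size(c), α and β below are arbitrary choices, not from the slides:

```python
# The Occam bound (n*size(c))**alpha * m**beta grows like m**beta with beta < 1,
# while the labels to encode grow like m, so the bits-per-label ratio shrinks.
n, size_c = 20, 50        # fixed for a given task (arbitrary values)
alpha, beta = 1, 0.5      # arbitrary Occam constants, beta < 1

for m in [10**6, 10**8, 10**10]:
    bound = (n * size_c) ** alpha * m ** beta
    print(m, int(bound), bound / m)   # ratio: 1.0, 0.1, 0.01 -> compression for large m
```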

Sample complexity
By sample complexity we mean the number of training samples required for PAC learning. In practice it would be very useful if we could tell in advance the number of training samples required to satisfy a given pair of ε and δ thresholds.
Sample complexity in the case of a finite hypothesis space:
Theorem: If the hypothesis space H is finite, then any concept c ∈ H is PAC-learnable by a consistent learner, and the number of required training samples is m ≥ (1/ε)(ln|H| + ln(1/δ)), where |H| is the number of hypotheses in H.
Proof: a consistent learner can only select a consistent hypothesis from H. Thus, the basis of the proof is that the error of every consistent hypothesis can be pushed under the threshold ε.
This is the simplest possible proof, but the bound it gives will not be tight (that is, the m we obtain won’t be the smallest possible bound).
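
A short sketch that evaluates this finite-|H| bound (the function name is ours):

```python
import math

def finite_H_sample_bound(H_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**30 hypotheses, epsilon = 0.05, delta = 0.01
print(finite_H_sample_bound(2**30, 0.05, 0.01))   # 508 samples
```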

Proof
Let’s assume that there are k hypotheses in H for which error > ε. For such a hypothesis h’, the probability of consistency with the m samples is P(h’ is consistent with the m samples) ≤ (1-ε)^m.
For error(h) > ε, the hypothesis h selected for output must be one of the above k hypotheses:
P(error(h) > ε) ≤ P(any of the k hypotheses is consistent with the m samples) ≤ k(1-ε)^m ≤ |H|(1-ε)^m ≤ |H|e^(-εm)
where we used the inequalities k ≤ |H| and (1-x) ≤ e^(-x).
For a given ε, |H|e^(-εm) is an exponentially decreasing function of m, so P(error(h) > ε) can be pushed below any arbitrarily small δ: requiring |H|e^(-εm) ≤ δ gives m ≥ (1/ε)(ln|H| + ln(1/δ)).

Example
Is the threshold obtained from the proof practically usable? Let X be a finite space of n binary features. If we want to use a learner whose hypothesis space contains all the possible logical formulas over X, then the size of the hypothesis space is |H| = 2^(2^n) (the number of all possible logical formulas with n variables).
Let’s substitute this into the formula: m ≥ (1/ε)(2^n·ln2 + ln(1/δ)).
This is exponential in n, so in practice it usually gives an unusable (too large) number for the minimal value of m.
There are more refined variants of the above proof which give a smaller bound for m, but these are also not really usable in practice.
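
Plugging this into the previous bound (same idea as the hypothetical finite_H_sample_bound above, but passing ln|H| = 2^n·ln 2 directly to avoid building the huge integer):

```python
import math

def bound_all_boolean_formulas(n, epsilon, delta):
    """m >= (1/epsilon) * (2**n * ln 2 + ln(1/delta)), i.e. the bound for |H| = 2**(2**n)."""
    ln_H = (2 ** n) * math.log(2)
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

print(bound_all_boolean_formulas(20, 0.05, 0.01))
# ≈ 1.5e7 samples for just n = 20 features, and the bound doubles with every extra feature
```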

The case of infinite hypothesis spaces
For the case of a finite H we obtained the threshold m ≥ (1/ε)(ln|H| + ln(1/δ)). Unfortunately, this formula is not usable when |H| is infinite.
In this case we will use the Vapnik–Chervonenkis dimension, but before that we have to introduce the concept of “shattering”:
Let X be an object space and H a hypothesis space over X (both may be infinite!). Let S be an arbitrary finite subset of X. We say that H shatters S if for any possible {+,–} labeling of S there exists an h ∈ H which partitions the points of S according to the labeling.
Remark 1: we try to say something about an infinite set by saying something about all of its finite subsets.
Remark 2: this measures the representation ability of H, because if H cannot shatter some finite set S, then by extending S we can define a concept over X which is not representable (and so not learnable) by H.
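
The shattering definition translates directly into a brute-force check on a finite point set. A minimal sketch (the function name shatters is ours); it is reused in the VC-dimension examples further below:

```python
def shatters(hypotheses, points):
    """Return True if the given hypotheses (each a {0,1}-valued function on points)
    realize every one of the 2^|points| possible labelings of `points`."""
    achievable = {tuple(h(p) for p in points) for h in hypotheses}
    return len(achievable) == 2 ** len(points)
```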

The Vapnik–Chervonenkis dimension
The Vapnik–Chervonenkis dimension of H is d if there exists an S with |S|=d that H can shatter, but H cannot shatter any S with |S|=d+1. (If H can shatter arbitrarily large finite sets S, then VCD=∞.)
Theorem: Let C be a concept class, and H a representation class for which VCD(H)=d. Let L be a learning algorithm that learns c ∈ C by getting a set S of training samples with |S|=m, and outputs a hypothesis h ∈ H which is consistent with S. The learning of C over H is PAC learning if m ≥ (c0/ε)·(d·ln(1/ε) + ln(1/δ)) (where c0 is a suitable constant).
Remark: In contrast to the finite case, the bound obtained here is tight (that is, m samples are not only sufficient, but in certain cases they are necessary as well).

The Vapnik–Chervonenkis dimension
Let’s compare the bounds obtained for the finite and infinite cases:
Finite case: m ≥ (1/ε)(ln|H| + ln(1/δ))
Infinite case: m ≥ (c0/ε)(d·ln(1/ε) + ln(1/δ))
The two formulas look quite similar, but the role of ln|H| is taken by the Vapnik–Chervonenkis dimension d in the infinite case.
Both formulas grow relatively slowly as a function of 1/ε and 1/δ, so in this sense these are not bad bounds…

Examples of VC-dimension
Finite intervals over the line: VCD=2
VCD≥2, as these two points can be shattered (= separated for all label configurations).
VCD<3, as no 3 points can be shattered.
Separating the two classes by lines on the plane: VCD=3 (in d-dimensional space: VCD=d+1)
VCD≥3, as these 3 points can be shattered (all labeling configurations should be tried!).
VCD<4, as no 4 points can be shattered (all point arrangements should be tried!).
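
The interval example can be verified mechanically with the shatters() sketch from above: on a finite point set only finitely many interval behaviors exist (each nonempty behavior is realized by an interval whose endpoints are points of the set), so it suffices to enumerate those:

```python
def interval_hypotheses(points):
    """All distinct behaviors of closed intervals [a, b] on a finite point set."""
    hyps = [lambda x: 0]  # the 'empty interval' containing no point
    for a in points:
        for b in points:
            if a <= b:
                hyps.append(lambda x, a=a, b=b: int(a <= x <= b))
    return hyps

pts2, pts3 = [1.0, 2.0], [1.0, 2.0, 3.0]
print(shatters(interval_hypotheses(pts2), pts2))   # True  -> VCD >= 2
print(shatters(interval_hypotheses(pts3), pts3))   # False -> intervals cannot shatter 3 points
```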

Examples of VC-dimension
Axis-aligned rectangles on the plane: VCD=4
VCD≥4, as these 4 points can be shattered.
VCD<5, as no 5 points can be shattered.
Convex polygons on the plane: VCD=2d+1 (d is the number of vertices; the book proves only one of the directions).
Conjunctions of literals over {0,1}^n: VCD=n (see Mitchell’s book; only one direction is proved).
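
The rectangle case can be checked the same way (again assuming the shatters() helper from above): every achievable labeling is realized by the bounding box of the positively labeled points, so it is enough to enumerate bounding boxes of subsets. The four "diamond" points below are the standard shatterable configuration.

```python
from itertools import combinations

def rectangle_hypotheses(points):
    """All distinct behaviors of axis-aligned rectangles on a finite point set:
    each achievable behavior is realized by the bounding box of some subset."""
    hyps = [lambda p: 0]  # the empty rectangle
    for r in range(1, len(points) + 1):
        for subset in combinations(points, r):
            xs = [x for x, _ in subset]
            ys = [y for _, y in subset]
            x1, x2, y1, y2 = min(xs), max(xs), min(ys), max(ys)
            hyps.append(lambda p, x1=x1, x2=x2, y1=y1, y2=y2:
                        int(x1 <= p[0] <= x2 and y1 <= p[1] <= y2))
    return hyps

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(shatters(rectangle_hypotheses(diamond), diamond))   # True -> VCD >= 4
```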