Computational Learning Theory


A theory of the learnable (Valiant ‘84) […] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn […] Learning machines must have all three of the following properties: (1) the machines can provably learn whole classes of concepts, and these classes can be characterized; (2) the classes of concepts are appropriate and nontrivial for general-purpose knowledge; (3) the computational process by which the machine builds the desired programs requires a “feasible” (i.e. polynomial) number of steps.

A theory of the learnable. We seek general laws that constrain inductive learning, relating:
- the probability of successful learning
- the number of training examples
- the complexity of the hypothesis space
- the accuracy to which the target concept is approximated
- the manner in which training examples are presented

Overview. Are there general laws that govern learning? Sample complexity: how many training examples are needed for a learner to converge (with high probability) to a successful hypothesis? Computational complexity: how much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis? Mistake bound: how many training examples will the learner misclassify before converging to a successful hypothesis? These questions will be answered within two analytical frameworks: the Probably Approximately Correct (PAC) framework and the Mistake Bound framework.

Overview (Cont’d) Rather than answering these questions for individual learners, we will answer them for broad classes of learners. In particular we will consider: The size or complexity of the hypothesis space considered by the learner. The accuracy to which the target concept must be approximated. The probability that the learner will output a successful hypothesis. The manner in which training examples are presented to the learner.

Introduction. Problem setting: inductively learning an unknown target function, given training examples and a hypothesis space. Focus on: How many training examples are sufficient? How many mistakes will the learner make before it succeeds?

Introduction (2). Desirable: quantitative bounds depending on
- the complexity of the hypothesis space,
- the accuracy of the approximation to the target,
- the probability of outputting a successful hypothesis,
- how the training examples are presented (the learner proposes instances, a teacher presents instances, or some random process produces instances).
Specifically, we study sample complexity, computational complexity, and mistake bounds, focusing on broad classes of algorithms rather than individual ones.

Problem Setting. Space of possible instances X (e.g. the set of all people) over which target functions may be defined. Assume that different instances in X may be encountered with different frequencies; model this as an unknown (stationary) probability distribution D that defines the probability of encountering each instance in X. Training examples are provided by drawing instances independently from X according to D, and they are noise-free. Each element c in the target function set C corresponds to a certain subset of X, i.e. c is a Boolean function (this restriction is just for the sake of simplicity). We use worst-case analysis.

Error of a Hypothesis. The training error of hypothesis h w.r.t. target function c and a training data set S of n samples is errorS(h) = (1/n)·|{x ∈ S : c(x) ≠ h(x)}|, the fraction of S that h misclassifies. The true error of h w.r.t. target function c and distribution D is errorD(h) = Pr_{x~D}[c(x) ≠ h(x)]. errorD(h) is not observable, so how probable is it that errorS(h) gives a misleading estimate of errorD(h)? Different from the problem setting in Ch. 5, where samples are drawn independently of h, here h depends on the training samples. Hypotheses are assumed to be discrete-valued.
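
To make the distinction concrete, here is a minimal Python sketch; the instance space, distribution, target and hypothesis below are made up for illustration (not taken from the slides). It computes errorD(h) exactly over a small instance space under a uniform D and compares it with errorS(h) on a random training sample.

```python
import random

# Assumed toy setup: X = {0,...,99}, D = uniform, target c(x) = 1 iff x >= 50,
# and a slightly-off hypothesis h(x) = 1 iff x >= 55.
X = list(range(100))

def c(x):
    return x >= 50

def h(x):
    return x >= 55

# True error: probability under D (uniform here) that h disagrees with c.
error_D = sum(c(x) != h(x) for x in X) / len(X)      # 5 disagreements / 100 = 0.05

# Training error: fraction of a small i.i.d. sample S that h misclassifies.
random.seed(0)
S = random.choices(X, k=20)
error_S = sum(c(x) != h(x) for x in S) / len(S)

print(f"error_D(h) = {error_D:.2f}, error_S(h) = {error_S:.2f}")
# error_S fluctuates around error_D from sample to sample, which is exactly the
# gap the slide asks about: how likely is error_S to mislead us about error_D?
```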

An Illustration of True Error

Theoretical Questions of Interest. Is it possible to identify classes of learning problems that are inherently difficult or easy, independent of the learning algorithm? Can one characterize the number of training examples necessary or sufficient to assure successful learning? How is the number of examples affected if the learner observes a random sample of training data, or if the learner is allowed to pose queries to the trainer? Can one characterize the number of mistakes that a learner will make before learning the target function? Can one characterize the inherent computational complexity of a class of learning algorithms?

Computational Learning Theory is a relatively recent field and an area of intense research. For some of the questions on the previous slide the partial answer is yes. We will generally focus on certain types of learning problems.

Inductive Learning of a Target Function. What we are given: a hypothesis space and training examples. What we want to know: How many training examples are sufficient to successfully learn the target function? How many mistakes will the learner make before succeeding?

Computational Learning Theory provides a theoretical analysis of learning: Is it possible to identify classes of learning problems that are inherently difficult or easy? Can we characterize the computational complexity of classes of learning problems, i.e. when a learning algorithm can be expected to succeed and when learning may be impossible? Can we characterize the number of training samples necessary or sufficient for successful learning? How is this number affected if we allow the learner to ask questions (active learning)? How many mistakes will the learner make before learning the target function?

Computational Learning Theory. Quantitative bounds can be set depending on the following attributes: the accuracy to which the target must be approximated, the probability that the learner will output a successful hypothesis, the size or complexity of the hypothesis space considered by the learner, and the manner in which training examples are presented to the learner.

Computational Learning Theory covers three general areas. Sample complexity: how many examples do we need to find a good hypothesis? Computational complexity: how much computational power do we need to find a good hypothesis? Mistake bound: how many mistakes will we make before finding a good hypothesis?

Sample Complexity: how many training examples are sufficient to learn the target concept?
Scenario 1: Active learning. The learner proposes instances as queries to the teacher; query (learner): instance x; answer (teacher): c(x).
Scenario 2: Passive learning from teacher-selected examples. The teacher (who knows c) provides a sequence of training examples {<xi, c(xi)>}; the teacher may or may not be helpful or optimal.
Scenario 3: Passive learning from teacher-annotated random examples. A random process (e.g., nature) proposes instances: instance x is generated randomly and the teacher provides c(x).

Models of Learning. Learner: who is doing the learning? (e.g. a computer with limited resources: finite memory, polynomial time, ...). Domain: what is being learnt? (e.g. the concept of a chair). Information source: examples (positive/negative, according to a certain distribution; selected how? with which features?), queries (“is this a chair?”), or experimentation (play with a new gadget to learn how it works); noisy or noise-free? Prior knowledge: e.g. “the concept to learn is a conjunction of features”. Performance criteria: how well has the concept been learned, and when are we done? Accuracy (error rate), efficiency.

Computational Learning Theory The PAC Learning Framework Finite Hypothesis Spaces Examples of PAC Learnable Concepts VC dimension & Infinite Hyp. Spaces The Mistake Bound Model What is machine learning?

Two Frameworks PAC (Probably Approximately Correct) Learning Framework: Identify classes of hypotheses that can and cannot be learned from a polynomial number of training examples Define a natural measure of complexity for hypothesis spaces that allows bounding the number of training examples needed Mistake Bound Framework

PAC Learning Probably Approximately Correct Learning Model Will restrict discussion to learning boolean-valued concepts in noise-free data.

Problem Setting: Instances and Concepts. X is the set of all possible instances over which target functions may be defined. C is the set of target concepts the learner is to learn. Each target concept c in C is a subset of X, i.e. a boolean function c: X → {0,1}, with c(x) = 1 if x is a positive example of the concept and c(x) = 0 otherwise.

Problem Setting: Distribution. Instances are generated at random using some probability distribution D. D may be any distribution, is generally not known to the learner, and is required to be stationary (it does not change over time). Training examples x are drawn at random from X according to D and presented with the target value c(x) to the learner.

Problem Setting: Hypotheses. Learner L considers a set of hypotheses H. After observing a sequence of training examples of the target concept c, L must output some hypothesis h from H, which is its estimate of c.

Example Problem (Classifying Executables) Three Classes (Malicious, Boring, Funny) Features a1 GUI present (yes/no) a2 Deletes files (yes/no) a3 Allocates memory (yes/no) a4 Creates new thread (yes/no) Distribution? Hypotheses?

[Example table: instances 1–10 with attribute values a1–a4 and class labels (B, F, M); only a few cells are filled in the transcript, e.g. instance 1: a1=Yes, a2=No, class B; instance 3: class F; instance 4: class M.]

True Error. Definition: The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.

Error of h with respect to c. [Diagram: instance space X showing the regions labeled positive by target concept c and by hypothesis h, with + and − instances; h errs on the region where c and h disagree.]

Key Points. The true error is defined over the entire instance space, not just the training data. The error depends strongly on the unknown probability distribution D. The error of h with respect to c is not directly observable to the learner L; it can only observe performance with respect to the training data (training error). Question: How probable is it that the observed training error for h gives a misleading estimate of the true error?

PAC Learnability. Goal: characterize classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and using a reasonable amount of computation. It is unreasonable to expect perfect learning where errorD(h) = 0: we would need to provide training examples corresponding to every possible instance, and with a random sample of training examples there is always a non-zero probability that the training examples will be misleading.

Weaken the Demand on the Learner. Hypothesis error (“Approximately”): we will not require a zero-error hypothesis, only that the error is bounded by some constant ε that can be made arbitrarily small; ε is the error parameter. Error on training data (“Probably”): we will not require that the learner succeed on every sequence of randomly drawn training examples, only that its probability of failure is bounded by a constant δ that can be made arbitrarily small; δ is the confidence parameter.

Probably Approximately Correct Learning (PAC Learning)

Cannot Learn Exact Concepts from Limited Data, Only Approximations. [Diagram: from limited positive/negative training data the learner produces a classifier whose positive/negative regions are partly right and partly wrong.]

Cannot Learn Even Approximate Concepts from Pathological Training Sets. [Diagram: with a pathological (unrepresentative) training set, the learned classifier's positive/negative regions are wrong.]

Probably approximately correct learning: a formal computational model which aims to shed light on the limits of what can be learned by a machine, by analysing the computational cost of learning algorithms.

What we want to learn. CONCEPT = a recognizing algorithm. LEARNING = a computational description of recognizing algorithms, starting from examples or incomplete specifications. That is: to determine uniformly good approximations of an unknown function from its values at some sample points (as in interpolation, pattern matching, and concept learning).

What’s new in p.a.c. learning? The accuracy of results and the running time of learning algorithms are explicitly quantified and related. A general problem, the use of resources (time, space, …) by computations, leads to COMPLEXITY THEORY. Example: sorting takes n·log n time (polynomial, feasible); Boolean satisfiability takes 2ⁿ time (exponential, intractable).

PAC Learnability. PAC refers to Probably Approximately Correct. It is desirable for errorD(h) to be zero; however, to be realistic, we weaken our demand in two ways: errorD(h) is only required to be bounded by a small number ε, and the learner is not required to succeed on every training sample; rather, its probability of failure is bounded by a constant δ. Hence we arrive at the idea of “Probably Approximately Correct”.

PAC Learning. The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept. In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) a system learn a concept with error at most ε.

The PAC Learning Framework. Definition: A class of concepts C is PAC learnable using a hypothesis class H if there exists a learning algorithm L such that for arbitrarily small δ and ε, for all concepts c in C, and for all distributions D over the input space, there is a probability of at least 1 − δ that the hypothesis h selected from space H by L is approximately correct (has true error less than ε). (Valiant 1984)

Definition of PAC-Learnability. Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε with 0 < ε < 1/2, and all δ with 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).

Requirements of the Definition. L must, with arbitrarily high probability (1 − δ), output a hypothesis having arbitrarily low error (ε). L’s learning must be efficient: it grows polynomially in terms of the strength of the output hypothesis guarantees (1/ε, 1/δ) and the inherent complexity of the instance space (n) and of the concept class C (size(c)).

Block Diagram of the PAC Learning Model. [Diagram: the control parameters ε and δ and the training sample are fed into learning algorithm L, which outputs hypothesis h.]

Examples of the second requirement. Consider the executables problem, where instances are conjunctions of boolean features, e.g. a1=yes ∧ a2=no ∧ a3=yes ∧ a4=no. Concepts are conjunctions of a subset of the features, e.g. a1=yes ∧ a3=yes ∧ a4=yes.

Using the Concept of PAC Learning in Practice. We often want to know how many training instances we need in order to achieve a certain level of accuracy with a specified probability. If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.

Sample Complexity for Finite Hypothesis Spaces

Sample Complexity for Finite Hypothesis Spaces. Start from a good class of learners: consistent learners, defined as learners that output a hypothesis which perfectly fits the training data, whenever possible. Recall: the version space VSH,D is defined to be the set of all hypotheses h ∈ H that correctly classify all training examples in D. Property: every consistent learner outputs a hypothesis belonging to the version space. Why say “whenever possible”?

Sample Complexity for Finite Hypothesis Spaces. Given any consistent learner, the number of examples sufficient to assure that any hypothesis it outputs will be probably (with probability 1 − δ) approximately (within error ε) correct is m ≥ (1/ε)(ln|H| + ln(1/δ)). If the learner is not consistent, m ≥ (1/(2ε²))(ln|H| + ln(1/δ)). Conjunctions of boolean literals are also PAC-learnable, with m ≥ (1/ε)(n·ln3 + ln(1/δ)). k-term DNF expressions are not PAC learnable, because even though they have polynomial sample complexity, their computational complexity is not polynomial. Surprisingly, however, k-term CNF is PAC learnable.
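
As a quick sanity check, both finite-|H| bounds above are easy to evaluate numerically; the sketch below simply plugs |H|, ε and δ into them (the helper function names are mine, not from the slides).

```python
import math

def m_consistent(H_size, eps, delta):
    """Examples sufficient for a consistent learner: (1/eps)(ln|H| + ln(1/delta))."""
    return math.ceil((1 / eps) * (math.log(H_size) + math.log(1 / delta)))

def m_agnostic(H_size, eps, delta):
    """Examples sufficient with no consistency assumption:
    (1/(2*eps^2))(ln|H| + ln(1/delta))."""
    return math.ceil((1 / (2 * eps ** 2)) * (math.log(H_size) + math.log(1 / delta)))

# Conjunctions of n boolean literals have |H| = 3^n, so ln|H| = n*ln(3):
n = 10
print(m_consistent(3 ** n, eps=0.1, delta=0.05))   # 140
print(m_agnostic(3 ** n, eps=0.1, delta=0.05))     # 700: 1/eps^2 instead of 1/eps
```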

Formal Definition of PAC-Learnable. Consider a concept class C defined over an instance space X containing instances of length n, and a learner L using a hypothesis space H. C is said to be PAC-learnable by L using H iff for all c ∈ C, all distributions D over X, all 0 < ε < 0.5 and all 0 < δ < 0.5, learner L, by sampling random examples from distribution D, will with probability at least 1 − δ output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n and size(c). Example: X: instances described by n binary features; C: conjunctive descriptions over these features; H: conjunctive descriptions over these features; L: most-specific conjunctive generalization algorithm (Find-S); size(c): the number of literals in c (i.e. the length of the conjunction).

ε-exhausted Def. VSH,D is said to be ε-exhausted w.r.t. c and D if for any h in VSH,D, errorD(h)<ε.
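
This definition can also be checked empirically on a toy problem. The sketch below uses an assumed setup that is not from the slides (X = {0,…,19} with uniform D, H = threshold concepts of the form x ≥ t, target t = 10): it draws m random examples, forms the version space, and estimates how often that version space fails to be ε-exhausted. With m chosen by the bound m ≥ (1/ε)(ln|H| + ln(1/δ)) from the sample-complexity slide above, the estimated failure rate should stay below δ.

```python
import math
import random

X = list(range(20))                     # toy instance space (assumed)
thresholds = list(range(21))            # H: h_t(x) = 1 iff x >= t
target_t = 10                           # target concept c = h_10 (assumed)

def h(t, x):
    return 1 if x >= t else 0

def true_error(t):
    """error_D(h_t) under the uniform distribution on X."""
    return sum(h(t, x) != h(target_t, x) for x in X) / len(X)

def not_exhausted_rate(m, eps, trials=2000):
    """Fraction of trials in which the version space is NOT eps-exhausted."""
    bad = 0
    for _ in range(trials):
        sample = [(x, h(target_t, x)) for x in random.choices(X, k=m)]
        version_space = [t for t in thresholds
                         if all(h(t, x) == y for x, y in sample)]
        if any(true_error(t) >= eps for t in version_space):
            bad += 1
    return bad / trials

eps, delta = 0.1, 0.05
m = math.ceil((1 / eps) * (math.log(len(thresholds)) + math.log(1 / delta)))
print(f"m = {m}, empirical failure rate = {not_exhausted_rate(m, eps):.3f}")
# The failure rate typically comes out far below delta = 0.05, as the bound guarantees.
```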

A PAC-Learnable Example. Consider the class C of conjunctions of boolean literals; a boolean literal is any boolean variable or its negation. Q: Is such a C PAC-learnable? A: Yes, by going through the following two steps: show that any consistent learner requires only a polynomial number of training examples to learn any element of C, and suggest a specific algorithm that uses polynomial time per training example.

(Cont’d) Step 1: Let H consist of conjunctions of literals based on n boolean variables. Starting from m ≥ (1/ε)(ln|H| + ln(1/δ)) and observing that |H| = 3ⁿ, the inequality becomes m ≥ (1/ε)(n·ln3 + ln(1/δ)). Step 2: The FIND-S algorithm satisfies the second requirement: for each new positive training example, it computes the intersection of the literals shared by the current hypothesis and the example, using time linear in n.

Sample Complexity of Conjunction Learning. Consider conjunctions over n boolean features. There are 3ⁿ of these, since each feature can appear positively, appear negatively, or not appear in a given conjunction. Therefore |H| = 3ⁿ, so a sufficient number of examples to learn a PAC concept is m ≥ (1/ε)(n·ln3 + ln(1/δ)). Concrete examples: δ=ε=0.05, n=10 gives 280 examples; δ=0.01, ε=0.05, n=10 gives 312 examples; δ=ε=0.01, n=10 gives 1,560 examples; δ=ε=0.01, n=50 gives 5,954 examples. The result holds for any consistent learner, including Find-S.

Sample Complexity of Learning Arbitrary Boolean Functions. Consider any boolean function over n boolean features, such as the hypothesis space of DNF formulas or decision trees. There are 2^(2ⁿ) of these, so a sufficient number of examples to learn a PAC concept is m ≥ (1/ε)(2ⁿ·ln2 + ln(1/δ)). Concrete examples: δ=ε=0.05, n=10 gives 14,256 examples; δ=ε=0.05, n=20 gives 14,536,410 examples; δ=ε=0.05, n=50 gives 1.561×10¹⁶ examples.
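
The concrete numbers in the last two slides follow directly from m ≥ (1/ε)(ln|H| + ln(1/δ)) with |H| = 3ⁿ and |H| = 2^(2ⁿ) respectively; the short sketch below reproduces them (the function names are my own).

```python
import math

def m_bound(ln_H, eps, delta):
    return math.ceil((1 / eps) * (ln_H + math.log(1 / delta)))

def conjunctions(n, eps, delta):        # |H| = 3^n      =>  ln|H| = n * ln 3
    return m_bound(n * math.log(3), eps, delta)

def any_boolean_fn(n, eps, delta):      # |H| = 2^(2^n)  =>  ln|H| = 2^n * ln 2
    return m_bound(2 ** n * math.log(2), eps, delta)

print(conjunctions(10, 0.05, 0.05))     # 280
print(conjunctions(10, 0.05, 0.01))     # 312
print(conjunctions(10, 0.01, 0.01))     # 1560
print(any_boolean_fn(10, 0.05, 0.05))   # 14256
print(any_boolean_fn(20, 0.05, 0.05))   # 14536410
```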

Agnostic Learning & Inconsistent Hypotheses. So far we have assumed that VSH,D is not empty; a simple way to guarantee this is to assume that c belongs to H. Agnostic learning setting: do not assume c ∈ H; the learner simply finds the hypothesis with minimum training error instead.

Sample Complexity for Infinite Hypothesis Spaces

Infinite Hypothesis Spaces. The preceding analysis was restricted to finite hypothesis spaces. Some infinite hypothesis spaces (such as those including real-valued thresholds or parameters) are more expressive than others; compare a rule allowing one threshold on a continuous feature (length < 3cm) with one allowing two thresholds (1cm < length < 3cm). We need some measure of the expressiveness of infinite hypothesis spaces. The Vapnik-Chervonenkis (VC) dimension provides just such a measure, denoted VC(H). Analogous to ln|H|, there are bounds for sample complexity using VC(H).

VC Dimension. An unbiased hypothesis space shatters the entire instance space. The larger the subset of X that can be shattered, the more expressive the hypothesis space is, i.e. the less biased. The Vapnik-Chervonenkis dimension VC(H) of a hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered, then VC(H) = ∞. If there exists at least one subset of X of size d that can be shattered, then VC(H) ≥ d; if no subset of size d can be shattered, then VC(H) < d. For single intervals on the real line, all sets of 2 instances can be shattered, but no set of 3 instances can, so VC(H) = 2. Since shattering m instances requires |H| ≥ 2^m, we have VC(H) ≤ log2|H|.

Shattering a Set of Instances. Def. A dichotomy of a set S is a partition of S into two disjoint subsets. Def. A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy. [Figure: 3 instances shattered.]

VC Dimension. We say that a set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples. The VC dimension of hypothesis space H over instance space X is the size of the largest finite subset of X that is shattered by H. If there exists a subset of size d that can be shattered, then VC(H) ≥ d; if no subset of size d can be shattered, then VC(H) < d. VC(half intervals) = 1 (no subset of size 2 can be shattered); VC(intervals) = 2 (no subset of size 3 can be shattered); VC(half-spaces in the plane) = 3 (no subset of size 4 can be shattered).

VC Dimension. Motivation: what if H cannot shatter X? Try finite subsets of X. Def. The VC dimension of a hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞. Roughly speaking, the VC dimension measures how many (training) points can be separated under all possible labelings using functions of a given class. Notice that for any finite H, VC(H) ≤ log2|H|.
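
For small hypothesis classes, shattering can be checked by brute force. The sketch below is my own illustration of the interval example mentioned in these slides: it tests whether a finite set of points on the real line can be shattered by single intervals, confirming that pairs of points can be shattered while no triple can, i.e. VC = 2 for this class.

```python
from itertools import combinations, product

def interval_realizes(points, labels):
    """Can some interval [a, b] label exactly the '1' points as positive?"""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:                 # all-negative labeling: pick an interval away from the points
        return True
    lo, hi = min(pos), max(pos)
    # The tightest candidate interval is [lo, hi]; it works iff it traps no negative point.
    return all(not (lo <= p <= hi) for p, y in zip(points, labels) if y == 0)

def shattered(points):
    return all(interval_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

pool = [0.5, 1.0, 2.0, 3.5, 4.0]
print(any(shattered(s) for s in combinations(pool, 2)))   # True: every pair is shattered
print(any(shattered(s) for s in combinations(pool, 3)))   # False: no triple is, so VC = 2
```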

Sample Complexity for Infinite Hypothesis Spaces II. Upper bound on sample complexity, using the VC dimension: m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε)). Lower bound on sample complexity, using the VC dimension: consider any concept class C such that VC(C) ≥ 2, any learner L, any 0 < ε < 1/8, and any 0 < δ < 1/100. Then there exists a distribution D and a target concept in C such that if L observes fewer examples than max[(1/ε)·log(1/δ), (VC(C) − 1)/(32ε)], then with probability at least δ, L outputs a hypothesis h having errorD(h) > ε.

An Example: Linear Decision Surfaces. Line case: X = the set of real numbers and H = the set of all open intervals; then VC(H) = 2. Plane case: X = the xy-plane and H = the set of all linear decision surfaces of the plane; then VC(H) = 3. General case: for n-dimensional real space, let H be the set of its linear decision surfaces; then VC(H) = n + 1.

Sample Complexity from VC Dimension. How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least 1 − δ? m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε)) (Blumer et al. 1989). Furthermore, it is possible to obtain a lower bound on sample complexity (i.e. a minimum number of required training examples).

Lower Bound on Sample Complexity Theorem 7.2 (Ehrenfeucht et al. 1989) Consider any concept class C s.t. VC(C)≥2, any learner L, and any 0<ε<1/8, and 0<δ<1/100. Then there exists a distribution D and target concept in C s.t. if L observes fewer examples than max[(1/ε)log(1/δ), (VC(C)-1)/(32ε)], then with probability at least δ, L outputs a hypo h having errorD(h)>ε.
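
Both VC-based bounds are simple to evaluate numerically. The sketch below plugs ε, δ and VC(H) into them, using VC(H) = n + 1 for linear decision surfaces in n dimensions from the example slide above; the function names are mine, and since the slides do not state the base of the logarithm in the lower bound, base 2 is assumed.

```python
import math

def m_upper(vc, eps, delta):
    """Sufficient m (Blumer et al. 1989): (1/eps)(4 log2(2/delta) + 8 VC(H) log2(13/eps))."""
    return math.ceil((1 / eps) * (4 * math.log2(2 / delta)
                                  + 8 * vc * math.log2(13 / eps)))

def m_lower(vc, eps, delta):
    """Necessary m (Ehrenfeucht et al. 1989): max[(1/eps) log(1/delta), (VC(C)-1)/(32 eps)];
    log base 2 assumed, since the slide leaves the base unspecified."""
    return math.ceil(max((1 / eps) * math.log2(1 / delta), (vc - 1) / (32 * eps)))

n = 10                                           # linear decision surfaces: VC(H) = n + 1
print(m_upper(n + 1, eps=0.1, delta=0.05))       # examples sufficient for PAC learning
print(m_lower(n + 1, eps=0.1, delta=0.005))      # examples necessary (eps < 1/8, delta < 1/100)
```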

VC Dimension for Neural Networks. Let G be a layered directed acyclic graph with n input nodes and s ≥ 2 internal nodes, each having at most r inputs. Let C be a concept class over R^r of VC dimension d, corresponding to the set of functions that can be described by each of the s internal nodes. Let CG be the G-composition of C, corresponding to the set of functions that can be represented by G. Then VC(CG) ≤ 2·d·s·log(e·s), where e is the base of the natural logarithm. This theorem can help us bound the VC dimension of a neural network and thus its sample complexity.

Mistake Bound Model

Mistake Bound Model. The learner receives a sequence of training examples (instance-based learning). Upon receiving each example x, the learner must predict the target value c(x) (online learning). How many mistakes will the learner make before it learns the target concept? E.g. learning to flag fraudulent credit card purchases.

Mistake Bound Model

When the majority of the hypotheses incorrectly classifies the new example, the VS will be reduced to at most half its current size. Given that the VS initially contains |H| hypotheses, the maximum number of mistakes possible before the VS contains just one member is log2|H|. The algorithm can even learn without any mistakes at all: when the majority is correct, it still removes the incorrect minority hypotheses.

We may also ask: what is the optimal mistake bound Opt(C)? It is the lowest worst-case mistake bound over all possible learning algorithms, and VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2|C|.

Introduction to “Mistake Bound”. Mistake bound: the total number of mistakes a learner makes before it converges to the correct hypothesis. Assume the learner receives a sequence of training examples; however, for each instance x, the learner must first predict c(x) before it receives the correct answer from the teacher. Application scenario: when the learning must be done on the fly, rather than during an off-line training stage.

The Mistake Bound Model of Learning The Mistake Bound framework is different from the PAC framework as it considers learners that receive a sequence of training examples and that predict, upon receiving each example, what its target value is. The question asked in this setting is: “How many mistakes will the learner make in its predictions before it learns the target concept?” This question is significant in practical settings where learning must be done while the system is in actual use.

Theorem 1. Online learning of conjunctive concepts can be done with at most n+1 prediction mistakes.

Find-S Algorithm: find a maximally specific hypothesis. Initialize h to the most specific hypothesis in H. For each positive training example x: for each attribute constraint ai in h, if it is satisfied by x then do nothing; otherwise replace ai by the next more general constraint that is satisfied by x. Output hypothesis h.
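
A minimal Python sketch of Find-S for conjunctions over boolean attributes; the encoding of a hypothesis as a list of constraints (None for "no value allowed", a required value, or '?' for "any value") is my own choice for illustration, not something fixed by the slides.

```python
def find_s(examples, n_attrs):
    """Find-S: maintain the maximally specific conjunctive hypothesis."""
    h = [None] * n_attrs                      # most specific hypothesis
    for x, positive in examples:
        if not positive:                      # Find-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:                  # first positive example seen
                h[i] = value
            elif h[i] != '?' and h[i] != value:
                h[i] = '?'                    # generalize: drop this constraint
    return h

# Hypothetical training data over 3 boolean attributes:
examples = [((True, False, True),  True),
            ((True, True,  True),  True),
            ((False, False, True), False)]
print(find_s(examples, 3))                    # [True, '?', True]
```

Each positive example is processed in time linear in the number of attributes, which is the polynomial-time-per-example property used earlier in the PAC-learnability argument for conjunctions.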

Mistake Bound for FIND-S. Assume the training data is noise-free and the target concept c is in the hypothesis space H, which consists of conjunctions of up to n boolean literals. Then in the worst case the learner makes n + 1 mistakes before it learns c. Note that a misclassification occurs only when the current hypothesis misclassifies a positive example as negative, and each such mistake removes at least one constraint from the hypothesis; the worst case arises when c is the function that assigns every instance the value “true”.

Mistake Bound for the Halving Algorithm. Halving algorithm = incrementally maintain the version space as each new instance arrives, and predict each new instance by a majority vote of the hypotheses in the VS. Q: What is the maximum number of mistakes that the Halving algorithm can make, for an arbitrary finite H, before it exactly learns the target concept c (assuming c is in H)? A: the largest integer no greater than log2|H|, since each mistake at least halves the version space (e.g. 9 → 4 → 2 → 1). How about the minimum number of mistakes? A: zero; it can learn without making any mistake.
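
A sketch of the Halving algorithm over an explicit finite hypothesis space (the threshold hypotheses and the data stream are an assumed toy setup, not from the slides). Every mistake eliminates at least half of the version space, so the mistake count stays within the log2|H| bound above.

```python
import math

def halving_predict(version_space, x):
    votes_for_1 = sum(hyp(x) for hyp in version_space)
    return 1 if 2 * votes_for_1 > len(version_space) else 0    # majority vote, ties -> 0

def halving_learn(version_space, stream):
    mistakes = 0
    for x, y in stream:                       # y = c(x), revealed after the prediction
        if halving_predict(version_space, x) != y:
            mistakes += 1
        # keep only the hypotheses consistent with the revealed label
        version_space = [hyp for hyp in version_space if hyp(x) == y]
    return mistakes, version_space

# Toy hypothesis space (assumed): thresholds on {0,...,15}, target "x >= 10".
H = [lambda x, t=t: int(x >= t) for t in range(17)]

def target(x):
    return int(x >= 10)

stream = [(x, target(x)) for x in range(16)]
mistakes, _ = halving_learn(H, stream)
print(mistakes, "<= floor(log2 |H|) =", math.floor(math.log2(len(H))))
```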

Optimal Mistake Bounds. For an arbitrary concept class C, assuming H = C, we are interested in the lowest worst-case mistake bound over all possible learning algorithms. Let MA(c) denote the maximum number of mistakes, over all possible training-example sequences, that learner A makes to exactly learn c. Def. MA(C) ≡ the maximum of MA(c) over all c ∈ C. Ex: MFind-S(C) = n + 1, MHalving(C) ≤ log2|C|.

Optimal Mistake Bounds (2). The optimal mistake bound for C, denoted Opt(C), is defined as the minimum of MA(C) over all learning algorithms A. Notice that Opt(C) ≤ MHalving(C) ≤ log2|C|. Furthermore, Littlestone (1987) shows that VC(C) ≤ Opt(C). When C equals the power set CP of any finite instance space X, all four quantities become equal to each other, namely |X|.

Optimal Mistake Bounds. Definition: Let C be an arbitrary nonempty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of MA(C): Opt(C) = min over learning algorithms A of MA(C). For any concept class C, the optimal mistake bound is bounded as follows: VC(C) ≤ Opt(C) ≤ log2(|C|).

Weighted-Majority Algorithm. It is a generalization of the Halving algorithm: it makes a prediction by taking a weighted vote among a pool of prediction algorithms (or hypotheses) and learns by altering the weights. It starts by assigning equal weight (= 1) to every prediction algorithm; whenever an algorithm misclassifies a training example, its weight is reduced (the Halving algorithm corresponds to reducing the weight to zero).

Procedure for Adjusting Weights. ai denotes the i-th prediction algorithm in the pool; wi denotes the weight of ai and is initialized to 1. For each training example <x, c(x)>: initialize q0 and q1 to 0; for each ai, if ai(x)=0 then q0 ← q0 + wi, else q1 ← q1 + wi; if q1 > q0, predict c(x) to be 1, else if q1 < q0, predict c(x) to be 0, else predict c(x) at random to be 1 or 0; finally, for each ai, if ai(x) ≠ c(x) (as given by the teacher), wi ← β·wi.

A Case Study: The Weighted-Majority Algorithm. ai denotes the i-th prediction algorithm in the pool A of algorithms; wi denotes the weight associated with ai. For all i, initialize wi ← 1. For each training example <x, c(x)>: initialize q0 and q1 to 0; for each prediction algorithm ai, if ai(x)=0 then q0 ← q0 + wi, and if ai(x)=1 then q1 ← q1 + wi; if q1 > q0 then predict c(x)=1, if q0 > q1 then predict c(x)=0, and if q0 = q1 then predict 0 or 1 at random for c(x); then for each prediction algorithm ai in A, if ai(x) ≠ c(x) then wi ← β·wi.
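
A direct Python transcription of the weight-update procedure above; the variable names follow the slides, while packaging it as a function that also counts the learner's mistakes is my own choice.

```python
import random

def weighted_majority(predictors, stream, beta=0.5):
    """Predict by weighted vote; multiply a predictor's weight by beta whenever it errs."""
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, c_x in stream:
        q0 = sum(w for a, w in zip(predictors, weights) if a(x) == 0)
        q1 = sum(w for a, w in zip(predictors, weights) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q0 > q1 else random.randint(0, 1)
        if prediction != c_x:
            mistakes += 1
        for i, a in enumerate(predictors):
            if a(x) != c_x:
                weights[i] *= beta            # beta = 0 would recover the Halving algorithm
    return mistakes, weights
```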

Comments on the “Adjusting Weights” Idea. The idea can be found in various problems such as pattern matching, where we might reduce the weights of less frequently used patterns in the learned library. The textbook claims that one benefit of the algorithm is that it is able to accommodate inconsistent training data; in the case of learning by query, however, we presume that the answer given by the teacher is always correct.

Relative Mistake Bound for the Algorithm (Theorem 7.3). Let D be the training sequence, A be any set of n prediction algorithms, and k be the minimum number of mistakes made by any algorithm in A on the training sequence D. Then the number of mistakes over D made by the Weighted-Majority algorithm using β = 0.5 is at most 2.4(k + log2 n). Proof idea: compare the final weight of the best prediction algorithm to the sum of the weights of all algorithms. Let aj be an algorithm with k mistakes; then its final weight is wj = 0.5^k. Now consider the sum W of the weights of all algorithms, and observe that for every mistake the Weighted-Majority algorithm makes, W is reduced to at most (3/4)W.

Relative Mistake Bound for the Weighted-Majority Algorithm. Let D be any sequence of training examples, let A be any set of n prediction algorithms, and let k be the minimum number of mistakes made by any algorithm in A for the training sequence D. Then the number of mistakes over D made by the Weighted-Majority algorithm using β = 1/2 is at most 2.4(k + log2 n). This theorem can be generalized to any 0 ≤ β < 1, where the bound becomes (k·log2(1/β) + log2 n) / log2(2/(1+β)).
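
The 2.4(k + log2 n) bound is easy to check on a toy run, reusing the weighted_majority sketch given after the case-study slide above. The pool of threshold predictors and the target below are assumed purely for illustration; no pool member is perfect, so the best one still makes k > 0 mistakes.

```python
import math
import random

# Assumed setup: deterministic threshold predictors on {0,...,31}, target "x >= 16";
# no predictor has threshold exactly 16, so even the best one errs occasionally.
pool = [lambda x, t=t: int(x >= t) for t in (4, 10, 14, 18, 25)]

def target(x):
    return int(x >= 16)

random.seed(1)
stream = [(x, target(x)) for x in random.sample(range(32), 32)]

wm_mistakes, _ = weighted_majority(pool, stream, beta=0.5)   # defined in the sketch above
k = min(sum(a(x) != y for x, y in stream) for a in pool)     # best single predictor's mistakes
bound = 2.4 * (k + math.log2(len(pool)))
print(wm_mistakes, "<=", round(bound, 1))                    # guaranteed by the theorem
```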