Theoretical Approaches to Machine Learning
Early work (e.g., Gold) ignored efficiency
  Only considers computability
  "Learning in the limit"
Later work considers tractable inductive learning
  With high probability, approximately learn
  Polynomial runtime, polynomial # of examples needed
  Results (usually) independent of the probability distribution of the examples
© Jude Shavlik 2006, David Page 2010, CS 760 – Machine Learning (UW-Madison)

Identification in the Limit
Definition: after some finite number of examples, the learner will have learned the correct concept (though it might not even know it!). "Correct" means it agrees with the target concept on the labels of all data.
Example: consider noise-free learning from the class {f | f(n) = a*n mod b}, where a and b are natural numbers
General technique ("innocent until proven guilty"): enumerate all possible answers; search for the simplest answer consistent with the training examples seen so far; sooner or later this will hit the solution
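A minimal sketch (not from the slides) of this enumerate-and-test idea for {f | f(n) = a*n mod b}: candidate (a, b) pairs are tried in order of increasing a + b as a stand-in for "simplest first", and the current guess is the first candidate consistent with everything seen so far. The data stream below is a made-up example.

```python
def enumerate_candidates():
    """Yield (a, b) pairs in order of increasing a + b (a 'simplest first' order)."""
    s = 2
    while True:
        for a in range(1, s):   # natural numbers a, b >= 1 (assumption)
            yield a, s - a
        s += 1

def limit_learner(examples):
    """Return the first (a, b) with f(n) = a*n % b consistent with all examples so far."""
    for a, b in enumerate_candidates():
        if all(a * n % b == fn for n, fn in examples):
            return a, b

# After each new example the learner revises its guess; once enough data has been
# seen, the guess stops changing -- identification in the limit.
data = []
for n, fn in [(1, 3), (2, 6), (3, 1), (4, 4)]:   # hypothetical stream from f(n) = 3n mod 8
    data.append((n, fn))
    print(n, fn, "-> current guess:", limit_learner(data))
```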

Some Results (Gold)
Computable languages (Turing machines) can be learned in the limit using inference by enumeration
If the data set is limited to positive examples only, then only finite languages can be learned in the limit

Solution for {f | f(n) = a*n mod b}
(Worked-example table with columns a, b, n, f(n); only the entries 9 and 17 survive in the transcript.)

The Mistake-Bound Model (Littlestone)
Framework:
  Teacher shows input I
  ML algorithm guesses output O
  Teacher shows correct answer
Can we upper bound the number of errors the learner will make?

The Mistake-Bound Model: Example
Learn a conjunct over N predicates and their negations
  Initial h = p1 ∧ ¬p1 ∧ … ∧ pN ∧ ¬pN
  For each positive example, remove the remaining terms that do not match
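A small illustration of this elimination learner (my sketch, with made-up example data): the hypothesis starts as the conjunction of all 2N literals, and every positive example removes the literals it violates.

```python
def learn_conjunction(stream, N):
    """Mistake-bound learner for conjunctions of literals over N boolean features.

    Each example is (x, label) where x is a tuple of N booleans.
    The hypothesis is a set of literals (i, sign): "feature i must equal sign".
    """
    h = {(i, s) for i in range(N) for s in (True, False)}   # p_i and not-p_i for every i
    mistakes = 0
    for x, label in stream:
        prediction = all(x[i] == s for i, s in h)   # conjunction true iff every literal holds
        if prediction != label:
            mistakes += 1
        if label:   # on every positive example, drop the literals the example violates
            h = {(i, s) for i, s in h if x[i] == s}
    return h, mistakes

# Hypothetical target: p0 AND not-p2 over N = 3 features.
examples = [((True, False, False), True),
            ((True, True, False), True),
            ((False, True, False), False),
            ((True, True, True), False)]
h, m = learn_conjunction(examples, 3)
print(sorted(h), "mistakes:", m)   # at most 1 + N mistakes in the worst case
```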

The Mistake-Bound Model
Worst case # of mistakes? 1 + N
  The first positive example will remove N terms from the initial h
  Each subsequent error on a positive example will remove at least one more term (the learner never makes a mistake on negative examples)

Equivalence Query Model (Angluin)
Framework:
  ML algorithm guesses a concept: is the target equivalent to this guess?
  Teacher either says "yes" or returns a counterexample (an example labeled differently by the target and the guess)
Can we upper bound the number of errors the learner will make?
Time to compute the next guess is bounded by poly(|data seen so far|)

Probably Approximately Correct (PAC) Learning
PAC learning (Valiant '84)
Given:
  X: domain of possible examples
  C: class of possible concepts used to label X
  c ∈ C: target concept
  ε, δ: correctness bounds

Probably Approximately Correct (PAC) Learning
For any target c in C and any distribution D on X,
Given at least N = poly(|c|, 1/ε, 1/δ) examples drawn randomly, independently from X,
Do: with probability 1 - δ, return an h in C whose accuracy is at least 1 - ε
  In other words: Prob[error(h, c) > ε] < δ
  In time polynomial in |data|
(Figure: target c and hypothesis h drawn as overlapping regions; the shaded regions, where they disagree, are where errors occur.)

Relationships Among Models of Tractable Learning
Poly mistake-bounded with poly update time = EQ-learning
EQ-learning implies PAC-learning
  Simulate the teacher by a poly-sized random sample; if all are labeled correctly, say "yes"; otherwise, return an incorrectly labeled example
  On each query, increase the sample size based on a Bonferroni correction

To Prove a Concept Class is PAC-learnable
1. Show it's EQ-learnable, OR
2. Show the following:
  There exists an efficient algorithm for the consistency problem (find a hypothesis consistent with a data set in time poly in the size of the data set)
  A poly-sized sample is sufficient to give us our accuracy guarantees

Useful Trick
A Maclaurin series, for -1 < x ≤ 1: ln(1+x) = x - x^2/2 + x^3/3 - x^4/4 + …
We only care about -1 < x < 0. Rewriting -x as ε, we can derive that for 0 < ε < 1: ln(1-ε) ≤ -ε.
Since e^x is increasing, exponentiating both sides of this last inequality gives: 1 - ε ≤ e^(-ε)
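Written out as a chain, the substitution step the slide compresses:

```latex
% Substituting x = -\varepsilon with 0 < \varepsilon < 1, every term after the first is negative:
\ln(1-\varepsilon) \;=\; -\varepsilon - \frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3} - \cdots
\;\le\; -\varepsilon
\quad\Longrightarrow\quad 1 - \varepsilon \;\le\; e^{-\varepsilon}.
```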

Assume EQ Query Bound (or Mistake Bound) M
On each equivalence query, draw (1/ε) ln(M/δ) examples
What is the probability that, on any of the at most M rounds, we accept a "bad" hypothesis?
  At most M (1-ε)^((1/ε) ln(M/δ)) ≤ M e^(-ln(M/δ)) = δ (using the useful trick)

If the Algorithm Doesn't Know M in Advance (recall |c|):
On the i-th equivalence query, draw (1/ε)(ln(1/δ) + i ln 2) examples
What is the probability that we accept a "bad" hypothesis at one query?
  (1-ε)^((1/ε)(ln(1/δ) + i ln 2)) ≤ e^(-(ln(1/δ) + i ln 2)) = δ/2^i (using the useful trick)
So the probability that we ever accept a "bad" hypothesis is at most Σ_i δ/2^i ≤ δ
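A small sketch of this sampling schedule (my illustration, with arbitrary parameter values): the i-th simulated equivalence query draws enough examples that the failure probability of that round is at most δ/2^i, so the total stays below δ.

```python
import math

def samples_for_query(i, eps, delta):
    """Number of i.i.d. examples to draw on the i-th simulated equivalence query."""
    return math.ceil((1.0 / eps) * (math.log(1.0 / delta) + i * math.log(2)))

def failure_bound(i, eps, delta):
    """Upper bound (1 - eps)^m_i on accepting a hypothesis with error > eps at query i."""
    return (1.0 - eps) ** samples_for_query(i, eps, delta)

eps, delta = 0.1, 0.05
for i in (1, 2, 3):
    print(i, samples_for_query(i, eps, delta), failure_bound(i, eps, delta), delta / 2**i)
total = sum(failure_bound(i, eps, delta) for i in range(1, 50))
print("total failure bound:", total, "<= delta =", delta)
```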

Using the First Method: k-DNF
Write down the disjunction of all conjunctions of at most k literals (features or negated features)
Any counterexample will be an actual negative (the hypothesis always contains every term of the target, so it never misses a true positive)
Repeat until correct:
  Given a counterexample, delete the disjuncts that cover it (are consistent with it)
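A brute-force sketch of this learner (my own illustration; the oracle and variable names are made up). Literals are (feature index, required value) pairs, the hypothesis is the set of all conjunctions of at most k literals, and each negative counterexample removes the conjunctions that cover it.

```python
from itertools import combinations, product

def all_terms(n, k):
    """All conjunctions of at most k literals over n boolean features,
    each term a frozenset of (index, value) pairs."""
    terms = set()
    for size in range(k + 1):
        for idxs in combinations(range(n), size):
            for vals in product((False, True), repeat=size):
                terms.add(frozenset(zip(idxs, vals)))
    return terms

def covers(term, x):
    return all(x[i] == v for i, v in term)

def predict(hypothesis, x):
    return any(covers(t, x) for t in hypothesis)

def eq_learn_kdnf(n, k, counterexample_oracle):
    """counterexample_oracle(predict_fn) returns a misclassified x, or None if equivalent."""
    hypothesis = all_terms(n, k)
    while True:
        x = counterexample_oracle(lambda z: predict(hypothesis, z))
        if x is None:
            return hypothesis
        # x must be an actual negative: drop every disjunct that covers it
        hypothesis = {t for t in hypothesis if not covers(t, x)}

# Hypothetical target over 3 features: x0 AND NOT x2 (expressible as 2-DNF).
def oracle(pred):
    for x in product((False, True), repeat=3):
        target = x[0] and not x[2]
        if pred(x) != target:
            return x
    return None

h = eq_learn_kdnf(3, 2, oracle)
print(len(h), "terms remain; sample prediction:", predict(h, (True, False, False)))
```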

Using the Second Method
If the hypothesis space is finite, can show a poly sample is sufficient
If the hypothesis space is parameterized by n, and grows only exponentially in n, can show a poly sample is sufficient

How Many Examples Needed to be PAC?
Consider finite hypothesis spaces
Let H_bad ≡ {h1, …, hz}, the set of hypotheses whose ("test set") error is > ε
Goal: eliminate all items in H_bad via (noise-free) training examples

How Many Examples Needed to be PAC?
How can an h look bad even though it is correct on all the training examples? If we never see any examples in the shaded regions (where h and c disagree)
We'll compute an N such that the odds of this are sufficiently low (recall, N = number of examples)
(Figure: overlapping regions for h and c, as on the earlier PAC slide.)

H_bad
Consider H1 ∈ H_bad and ex ∈ N
What is the probability that H1 is consistent with ex?
  Prob[consistent_A(ex, H1)] ≤ 1 - ε (since H1 is bad, its error rate is at least ε)

H_bad (cont.)
What is the probability that H1 is consistent with all N examples?
  Prob[consistent_B(N, H1)] ≤ (1 - ε)^|N| (by the i.i.d. assumption)

H_bad (cont.)
What is the probability that some member of H_bad is consistent with the examples in N?
  Prob[consistent_C(N, H_bad)] ≡ Prob[consistent_B(N, H1) ∨ … ∨ consistent_B(N, Hz)]
    ≤ |H_bad| × (1-ε)^|N|   // P(A ∨ B) ≤ P(A) + P(B) - P(A ∧ B); ignore the last term in the upper-bound calculation
    ≤ |H| × (1-ε)^|N|       // since H_bad ⊆ H

Solving for |N|
We have Prob[consistent_C(N, H_bad)] ≤ |H| × (1-ε)^|N| < δ
Recall that we want the probability of a bad concept surviving to be less than δ, our bound on learning a poor concept
Assume that if many consistent hypotheses survive, we get unlucky and choose a bad one (we're doing a worst-case analysis)

Solving for |N| (number of examples needed to be confident of getting a good model)
Solving: |N| > [ln(1/δ) + ln|H|] / -ln(1-ε)
Since ε ≤ -ln(1-ε) over [0, 1), we get |N| > [ln(1/δ) + ln|H|] / ε
(Aside: notice that this calculation assumed we could always find a hypothesis that fits the training data)
Notice we made NO assumptions about the probability distribution of the data (other than that it does not change)

Example: Number of Instances Needed
Assume:
  F = 100 binary features
  H = all (pure) conjuncts [3^F possibilities (∀i: use f_i, use ¬f_i, or ignore f_i), so ln|H| = F ln 3 ≈ F]
  ε = 0.01, δ = 0.01
N = [ln(1/δ) + ln|H|] / ε = 100 × [ln(100) + 100] ≈ 10^4
But how many real-world concepts are pure conjuncts with noise-free training data?
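A quick check of this arithmetic (a throwaway script, not part of the slides):

```python
import math

def pac_sample_bound(ln_H, eps, delta):
    """|N| > [ln(1/delta) + ln|H|] / eps for a finite hypothesis space."""
    return (math.log(1.0 / delta) + ln_H) / eps

F = 100                   # binary features
ln_H = F * math.log(3)    # |H| = 3^F pure conjuncts
eps = delta = 0.01
print(pac_sample_bound(ln_H, eps, delta))   # roughly 1.1e4, i.e. about 10^4 examples
```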

Two Senses of Complexity
Sample complexity (number of examples needed)
vs.
Time complexity (time needed to find an h ∈ H that is consistent with the training examples)

Complexity (cont.)
Some concepts require a polynomial number of examples but an exponential amount of time (in the worst case)
  E.g., training neural networks is NP-hard (recall BP is a "greedy" algorithm that finds a local min)

Dealing with Infinite Hypothesis Spaces
Can use the Vapnik-Chervonenkis ('71) dimension (VC-dim)
  Provides a measure of the capacity of a hypothesis space
VC-dim: given a hypothesis space H, the VC-dim is the size of the largest set of examples that can be completely fit by H, no matter how the examples are labeled

VC-dim Impact
If the number of examples << VC-dim, then memorizing the training set is trivial and generalization is likely to be poor
If the number of examples >> VC-dim, then the algorithm must generalize to do well on the training set and will likely do well in the future

Samples of VC-dim
Finite H: VC-dim ≤ log2|H|
  (If there are d examples, 2^d different labelings are possible, and we must have 2^d ≤ |H| if all such labeling functions are to be in H)
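A brute-force illustration of the definition (mine; exponential, so only for tiny cases, and the threshold class at the bottom is a made-up example): given a finite domain and a finite H represented as label functions, find the largest shattered subset.

```python
from itertools import combinations

def is_shattered(points, hypotheses):
    """True if the hypotheses realize all 2^|points| labelings of the points."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

def vc_dim(domain, hypotheses):
    """Largest d such that some d-subset of the domain is shattered (brute force)."""
    best = 0
    for d in range(1, len(domain) + 1):
        if any(is_shattered(S, hypotheses) for S in combinations(domain, d)):
            best = d
    return best

# Hypothetical class: threshold functions h_t(x) = (x >= t) on a small grid of points.
domain = [1, 2, 3, 4, 5]
H = [lambda x, t=t: x >= t for t in range(0, 7)]
print(vc_dim(domain, H))   # thresholds shatter any single point but no pair -> 1 (<= log2|H|)
```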

An Infinite Hypothesis Space with a Finite VC-dim
H is the set of lines in 2D
Can cover 1 example no matter how it is labeled
(Figure: a single point, labeled 1, separated by a line.)

Example 2 (cont.)
Can cover 2 examples no matter how they are labeled
(Figure: two points, labeled 1 and 2, separated by a line.)

Example 2 (cont.)
Can cover 3 examples no matter how they are labeled

Example 2 (cont.)
Cannot cover/separate four points if 1 and 4 are + but 2 and 3 are - (our old friend, exclusive-or)
Notice |H| = ∞ but VC-dim = 3
For N dimensions and (N-1)-dimensional hyperplanes, VC-dim = N + 1
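A quick separability check for this XOR-style labeling (my sketch, assuming SciPy is available): a labeling of points is linearly separable iff the margin constraints y_i(w·x_i + b) >= 1 are feasible, which we can test with a feasibility LP.

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility LP for y_i (w . x_i + b) >= 1; variables are (w1, w2, b)."""
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)   # w and b may be negative
    return res.success

square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(linearly_separable(square, [+1, -1, -1, +1]))   # XOR labeling: False
print(linearly_separable(square, [+1, +1, -1, -1]))   # a separable labeling: True
```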

More on "Shattering"
What about collinear points?
If ∃ some set of d examples that H can fully fit under ∀ labelings of these d examples, then VC(H) ≥ d

Some VC-Dim Theorems
Theorem: H is PAC-learnable iff its VC-dim is finite
Theorem: Sufficient to be PAC to have # of examples > (1/ε) max[4 ln(2/δ), 8 ln(13/ε) VC-dim(H)]
Theorem: Any PAC algorithm needs at least Ω((1/ε)[ln(1/δ) + VC-dim(H)]) examples
[No need to memorize these for the exam]
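The sufficient-sample theorem as a tiny helper function (just a transcription of the formula above into code; the example values are arbitrary):

```python
import math

def vc_sufficient_samples(eps, delta, vc):
    """Sufficient # of examples: (1/eps) * max(4 ln(2/delta), 8 * VC-dim * ln(13/eps))."""
    return (1.0 / eps) * max(4 * math.log(2 / delta), 8 * vc * math.log(13 / eps))

# E.g., lines in the plane (VC-dim 3) with eps = delta = 0.01:
print(round(vc_sufficient_samples(0.01, 0.01, 3)))   # on the order of 10^4 examples
```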

To Show a Concept is NOT PAC-learnable
Show the consistency problem is NP-hard (hardness assumes P ≠ NP), OR
Show the VC-dimension grows at a rate not bounded by any polynomial

Be Careful
It can be the case that consistency is hard for a concept class, but not for a larger class
  Consistency is NP-hard for k-term DNF
  Consistency is easy for DNF (whether DNF is PAC-learnable is still an open question)
More robust negative results are for PAC-prediction
  The hypothesis space is not constrained to equal the concept class
  Hardness results are based on cryptographic assumptions, such as assuming efficient factoring is impossible

Some Variations in Definitions
The original definition of PAC-learning required:
  Run-time to be polynomial
  The hypothesis to be within the concept class (otherwise, it is called PAC-prediction)
This version is now often called polynomial-time, proper PAC-learning
Membership queries: the learner can ask for the labels of specific examples

Some Results…
Can PAC-learn k-DNF (exponential in k, but k is now a constant)
Can NOT properly PAC-learn k-term DNF, but can PAC-learn it using k-CNF
Can NOT PAC-learn Boolean formulae unless we can crack RSA (i.e., factor quickly)
Can PAC-learn decision trees with the addition of membership queries
Unknown whether we can PAC-learn DNF or decision trees (the "holy grail" questions of COLT)

Some Other COLT Topics
COLT + clustering, + k-NN, + RL, + EBL (ch. 11 of Mitchell), + SVMs, + ILP, + ANNs, etc.
Average-case analysis (vs. worst case)
Learnability of natural languages (language innate?)
Learnability in parallel

Summary of COLT Strengths
Formalizes the learning task
Allows for imperfections (e.g., ε and δ in PAC)
Work on boosting (later) is an excellent case of ML theory influencing ML practice
Shows which concepts are intrinsically hard to learn (e.g., k-term DNF)

Summary of COLT Weaknesses
Most analyses are worst case
Use of "prior knowledge" not captured very well yet