Computational Learning Theory

Computational Learning Theory (R&N, Chapter 18.5)

Computational Learning Theory (Outline)
- Inductive Learning Protocol
- Error
- Probably Approximately Correct (PAC) Learning
- Consistency Filtering
- Sample Complexity (e.g., Conjunctions, Decision Lists)
- Issues with the Bound
- Other Models

What General Laws Constrain Inductive Learning?
- Sample complexity: how many training examples are sufficient to learn the target concept?
- Computational complexity: what resources are required to learn the target concept?
- We want a theory relating:
  - the training examples (quantity, quality, how they are presented)
  - the complexity of the hypothesis/concept space H
  - the accuracy of the approximation to the target concept (ε)
  - the probability of successful learning (δ)
- These results are only useful up to O(...) factors!

Protocol
Given:
- a set of examples X and a fixed (unknown) distribution D over X
- a set of hypotheses H
- a set of possible target concepts C
The learner observes a sample S = { ⟨xi, c(xi)⟩ }:
- instances xi drawn from distribution D
- labeled by the target concept c ∈ C
- (the learner does NOT know c(·) or D)
The learner outputs h ∈ H estimating c; h is evaluated by its performance on subsequent instances drawn from D.
For now: C = H (so c ∈ H), and noise-free data.
[Slide figure: a small table of labeled examples (Temp., Press., Colour, Sore-Throat → diseaseX) fed to the learner, which outputs a classifier.]

True Error of a Hypothesis
Definition: the true error of hypothesis h, with respect to target concept c and distribution D, is the probability that h will misclassify an instance drawn from D:
errD(h) = Pr_{x~D}[ c(x) ≠ h(x) ]
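Since D is unknown, errD(h) cannot be computed exactly; it can only be estimated by sampling. A minimal Python sketch, where drawing x from D is abstracted by a sampler function (estimate_error, sample_x, c, h are illustrative names, not from the slides):

import random

def estimate_error(h, c, sample_x, m=10_000):
    # Monte Carlo estimate of errD(h) = Pr_{x~D}[ c(x) != h(x) ]
    # using m fresh draws from the (otherwise unknown) distribution D.
    mistakes = 0
    for _ in range(m):
        x = sample_x()
        if h(x) != c(x):
            mistakes += 1
    return mistakes / m

# Illustrative use: D = uniform over {0,1}^3, target c = (x1 AND x2), hypothesis h = x1.
sample_x = lambda: tuple(random.randint(0, 1) for _ in range(3))
c = lambda x: x[0] and x[1]
h = lambda x: x[0]
print(estimate_error(h, c, sample_x))   # roughly 0.25: h errs exactly when x1=1, x2=0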

Probably Approximately Correct
Goal: a PAC-learner produces a hypothesis ĥ that is approximately correct, errD(ĥ) ≈ 0, with high probability: P( errD(ĥ) ≈ 0 ) ≈ 1.
Double "hedging": approximately, probably. Need both!

PAC-Learning
Setting: Learner L can draw a labeled instance ⟨x, c(x)⟩ in unit time, with x ∈ X drawn from distribution D and labeled by target concept c ∈ C.
Definition: Learner L PAC-learns class C (by H) if
1. for any target concept c ∈ C, any distribution D, and any ε, δ > 0, L returns h ∈ H such that, with probability ≥ 1 – δ, errD(h) < ε; and
2. L's run-time (and hence its sample complexity) is poly(|x|, size(c), 1/ε, 1/δ).
Sufficient conditions: (1) only poly(...) training instances are needed, which holds when |H| = 2^poly(...); (2) only poly time per instance.
Often C = H.

Simple Learning Algorithm: Consistency Filtering
1. Draw mH(ε, δ) random labeled examples Sm.
2. Remove every hypothesis that contradicts any ⟨x, y⟩ ∈ Sm.
3. Return any remaining (consistent) hypothesis.
Challenges:
- Q1: What sample size mH(ε, δ) is needed?
- Q2: How to decide whether h ∈ H is consistent with all of Sm ... efficiently?
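A minimal sketch of consistency filtering over an explicitly enumerated finite hypothesis space (function and variable names are illustrative; practical learners avoid enumerating H, which is exactly challenge Q2):

def consistency_filter(hypotheses, sample):
    # hypotheses: iterable of callables h(x) -> {0, 1}
    # sample:     list of labeled pairs (x, c(x)) drawn i.i.d. from D
    # Keep every hypothesis that agrees with every labeled example,
    # and return any one survivor (None if the sample eliminates them all).
    consistent = [h for h in hypotheses
                  if all(h(x) == y for x, y in sample)]
    return consistent[0] if consistent else None

Q1 (how large mH(ε, δ) must be) is answered by the bound derived below; Q2 is answered for conjunctions and decision lists by the specialized algorithms later in the deck.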

Boolean Functions (≡ Concepts)

Bad Hypotheses
Idea: Find m = mH(ε, δ) such that, after seeing m examples,
- every BAD hypothesis h in Hbad = { h ∈ H | errD(h) > ε } is ELIMINATED with high probability (≥ 1 – δ),
- leaving only the good hypotheses HGood = { h ∈ H | errD(h) ≤ ε };
- then pick ANY of the remaining good hypotheses.
That is, find m large enough that there is only a very small chance that a "bad" hypothesis is still consistent with m examples.

[Slide figure: the hypothesis space H split into an OK region (err ≤ ε) and a Bad region (err > ε).]

Sample Bounds – Derivation
Let h1 be an ε-bad hypothesis, i.e., err(h1) > ε.
⇒ h1 mislabels a random example with probability P( h1(x) ≠ c(x) ) > ε
⇒ h1 correctly labels a random example with probability ≤ (1 – ε)
As examples are drawn INDEPENDENTLY:
P( h1 correctly labels m examples ) ≤ (1 – ε)^m

Sample Bounds – Derivation II
Let h2 be another ε-bad hypothesis. What is the probability that either h1 or h2 survives m random examples?
P( h1 or h2 survives ) = P(h1 survives) + P(h2 survives) – P(h1 and h2 survive)
                       ≤ P(h1 survives) + P(h2 survives)
                       ≤ 2 (1 – ε)^m
With k ε-bad hypotheses {h1, …, hk}:
P( h1 or … or hk survives ) ≤ k (1 – ε)^m

Sample Bounds, con't
Let Hbad = { h ∈ H | err(h) > ε }. The probability that any h ∈ Hbad survives is
P( any hb in Hbad is consistent with m examples ) ≤ |Hbad| (1 – ε)^m ≤ |H| (1 – ε)^m
This is ≤ δ if |H| (1 – ε)^m ≤ δ
⇒ mH(ε, δ) is the "sample complexity" of hypothesis space H.
Fact: for 0 ≤ ε ≤ 1, (1 – ε) ≤ e^{-ε}.
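Combining the fact with the bound and solving for m gives the standard sample-complexity result for consistent learners over a finite H (a sketch, in LaTeX, of the step the slide leaves implicit):

\[
|H|(1-\epsilon)^m \;\le\; |H|\,e^{-\epsilon m} \;\le\; \delta
\qquad\text{whenever}\qquad
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right),
\]

so taking mH(ε, δ) = ⌈(1/ε)(ln|H| + ln(1/δ))⌉ examples suffices.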

Sample Complexity
- Hypothesis space (expressiveness): H
- Error rate of the resulting hypothesis: ε, i.e., errD,c(h) = P( h(x) ≠ c(x) ) ≤ ε
- Confidence of being ε-close: δ, i.e., P( errD,c(h) ≤ ε ) > 1 – δ
- Sample size: mH(ε, δ)
Any hypothesis consistent with mH(ε, δ) examples has error at most ε, with probability ≥ 1 – δ.

Boolean Functions: Conjunctions

Learning Conjunctions
Start from the conjunction of all 2n literals; each positive example eliminates the literals it falsifies (e.g., the hypothesis goes from "never true", to "true only for 101", to "true only for 10*").
- Just uses the positive examples!
- Finds the "smallest" hypothesis (true for as few instances as possible), so it makes no mistakes on the negative examples.
- Each step is efficient, O(n), and only poly(n, 1/ε, 1/δ) steps are needed ⇒ the algorithm is efficient!
- It does NOT explicitly build all 3^n conjunctions and then throw some out.
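A Python sketch of the conjunction learner the slide describes, starting from the most specific hypothesis and letting each positive example delete the literals it falsifies (the code and names are illustrative, not from the slides):

def learn_conjunction(n, positives):
    # Start from the most specific hypothesis: the conjunction of all 2n literals
    # (x_i and not-x_i for every i).  Each positive example deletes the literals
    # it falsifies; negative examples are never needed.
    literals = {(i, v) for i in range(n) for v in (0, 1)}   # (index, required value)
    for x in positives:
        literals = {(i, v) for (i, v) in literals if x[i] == v}

    def h(x):
        return all(x[i] == v for (i, v) in literals)
    return h, literals

# Illustrative run over n = 3 variables, target "x1 AND not-x3" (0-indexed: x[0]==1 and x[2]==0).
h, lits = learn_conjunction(3, [(1, 0, 0), (1, 1, 0)])
print(sorted(lits))                  # [(0, 1), (2, 0)]  i.e. x1 AND not-x3
print(h((1, 1, 0)), h((0, 1, 0)))    # True False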

PAC-Learning k-CNF
1-CNF ≡ Conjunctions. For k-CNF, the number of clauses of at most k literals is O(n^k) (roughly "n choose k" times the possible sign patterns), so treat each clause as a new variable and learn a conjunction over them.
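The reduction behind this slide, sketched in Python: map each instance to the truth values of all clauses of at most k literals, after which the learn_conjunction sketch above applies unchanged in the new O(n^k)-dimensional feature space (illustrative code, not from the slides):

from itertools import combinations, product

def clause_features(x, k):
    # Map an assignment x in {0,1}^n to the truth values of every clause
    # (disjunction) over at most k literals; a k-CNF formula over x is then
    # simply a conjunction over these new features.
    n = len(x)
    feats = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((0, 1), repeat=size):   # 1 = positive literal, 0 = negated
                feats.append(int(any((x[i] == 1) == bool(s)
                                     for i, s in zip(idxs, signs))))
    return tuple(feats)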

Decision Lists
When to go for a walk? Variables: rainy, warm, jacket, snowy.
"Don't go for a walk if it is rainy. Otherwise, go for a walk if it is warm, or if I have a jacket and it is snowy."
How many decision lists are there? There are 4n possible "rules", each of the form "literal → label", so at most (4n)! orderings: |HDL| ≤ (4n)! (actually ≤ n!·4^n).

Example of Learning a DL
1. When x1 = 0, the class is "B": form h = ⟨ x1=0 → B ⟩; eliminate i2, i4.
2. When x2 = 1, the class is "A": form h = ⟨ x1=0 → B; x2=1 → A ⟩; eliminate i3, i5.
3. When x4 = 1, the class is "A": form h = ⟨ x1=0 → B; x2=1 → A; x4=1 → A ⟩; eliminate i1.
4. Otherwise the class is always "B": form h = ⟨ x1=0 → B; x2=1 → A; x4=1 → A; true → B ⟩; eliminate the rest (i6).

PAC-Learning Decision Lists: the "Covering" Algorithm
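A Python sketch of the greedy covering algorithm for 1-decision lists, matching the worked example above (the code and its names are illustrative, not taken from the slides):

def learn_decision_list(n, sample):
    # sample: list of (x, y) with x a 0/1 tuple of length n and y in {0, 1}.
    # Repeatedly find a rule "x_i = v -> label" that covers at least one remaining
    # example and agrees with every example it covers; append the rule and
    # remove the covered examples.
    remaining = list(sample)
    rules = []
    while remaining:
        for i, v, label in ((i, v, lab) for i in range(n)
                            for v in (0, 1) for lab in (0, 1)):
            covered = [(x, y) for x, y in remaining if x[i] == v]
            if covered and all(y == label for _, y in covered):
                rules.append((i, v, label))
                remaining = [(x, y) for x, y in remaining if x[i] != v]
                break
        else:
            return None, None   # no consistent 1-decision list exists for this sample

    def h(x):
        for i, v, label in rules:
            if x[i] == v:
                return label
        return 0                # default label for the region no rule covers
    return h, rules

A final constant rule ("true → label") is usually written explicitly, as in the example above; this sketch folds it into h's fallback return value.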

Proof (PAC-Learn DL)
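The proof follows from the finite-H bound derived earlier; a sketch of the standard argument, using the slide's own count |HDL| ≤ n!·4^n:

\[
\ln|H_{DL}| \;\le\; \ln\!\big(n!\,4^n\big) \;=\; O(n\log n)
\quad\Longrightarrow\quad
m_{H_{DL}}(\epsilon,\delta) \;=\; O\!\left(\frac{1}{\epsilon}\Big(n\log n + \ln\tfrac{1}{\delta}\Big)\right),
\]

and since the covering algorithm outputs a consistent decision list in polynomial time, 1-decision lists are PAC-learnable.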

Why Learning May Succeed
Learner L produces a classifier h = L(S) that does well on the training data S. Why does that help?
Assumption: the distribution is "stationary": the distribution for testing equals the distribution for training.
1. If x appears a lot, then x probably occurs in the training data S; as h does well on S, h(x) is probably correct on x.
2. If an example x appears rarely ( P(x) ≈ 0 ), then h suffers only a small penalty for being wrong on it.

Comments on the Model
Simplifications:
1*. Assume c ∈ H, where H is known (e.g., lines, conjunctions, ...).
2*. Noise-free training data.
3. Only require approximate correctness: h is "ε-good": Px( h(x) ≠ c(x) ) < ε.
4. Allow the learner to (rarely) be completely off (if the examples are NOT representative, it cannot do well): P( hL is ε-good ) ≥ 1 – δ.
Complications:
1. The learner must be computationally efficient.
2. It must work over any instance distribution.

Comments: Sample Complexity
If a concept has k parameters ⟨v1, …, vk⟩, each taking one of B values, then |Hk| ≈ B^k, so mHk ≈ log(B^k)/ε ≈ k/ε.
The bound is too GENEROUS:
- It is based on a pre-defined C = {c1, …} = H. (Where did this come from???)
- It assumes c ∈ H and noise-free data; if err > 0, we need O( 1/ε² ... ) examples.

Why Is the Bound So Lousy?
- It assumes the error of every ε-bad hypothesis is ≈ ε. (Typically most bad hypotheses are really bad, so they get thrown out much sooner.)
- It uses P(A or B) ≤ P(A) + P(B). (If hypotheses are correlated, then when one is inconsistent the others are probably inconsistent too.)
- It assumes |Hbad| = |H| ... see VC dimension.
- It is worst-case: over all c ∈ C, over all distributions D over X, over all presentations of instances (drawn from D).
Improvements:
- "Distribution-specific" learning: known single distribution (ε-cover), Gaussian, ...
- Look at the samples! ⇒ Sequential PAC learning: if a hypothesis is ε-bad, it takes ≈ 1/ε examples to see evidence of that.

Fundamental Tradeoff in Machine Learning
A larger H is more likely to include (an approximation to) the target f, but it requires more examples to learn: with few examples, we cannot reliably find a good hypothesis in a large hypothesis space.
To learn effectively (ε, δ) from a small number of samples m, only consider H where |H| ≤ δ·e^{ε·m}.
Restricting the form of the Boolean functions reduces the size of the hypothesis space. E.g., for HC = conjunctions of literals, |HC| = 3^n, so only a polynomial number of examples is needed! Great if the target concept is in HC, but ...
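Plugging the conjunction space into the earlier bound makes the tradeoff concrete: with |HC| = 3^n,

\[
m \;\ge\; \frac{1}{\epsilon}\Big(n\ln 3 + \ln\tfrac{1}{\delta}\Big)
\]

examples suffice (linear in n), whereas the space of all Boolean functions over n variables has |H| = 2^{2^n}, which makes the same bound exponential in n.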

Issues
- Computational complexity
- Sampling issues: finite, countable, uncountable hypothesis spaces
- Realizable vs. agnostic; nested classes; VC dimension

Learning = Estimation + Optimization
1. Estimation: acquire the required relevant information by examining enough labeled samples.
2. Optimization: find a hypothesis h ∈ H consistent with those samples ... often the "smallest" such hypothesis.
Suppose H has 2^k hypotheses, each requiring k bits to write down: then log|H| ≈ |h| = k, so SAMPLE COMPLEXITY is not problematic. But the optimization often is ... intractable! E.g., finding a consistent 2-term DNF is NP-hard.
Perhaps find the best hypothesis in some F ⊇ H: 2-CNF ⊇ 2-term DNF ... an easier optimization problem!

Extensions to this Model
- Ockham algorithms: can PAC-learn H iff one can "compress" samples, i.e., there is an efficient consistency-finding algorithm.
- Data-efficient learners: gather samples sequentially, autonomously deciding when to stop and return a hypothesis.
- Exploiting other information: prior background theory, relevance.
- Degradation of training/testing information.

Other Learning Models
- Learning in the limit [recursion-theoretic]: exact identification, no resource constraints.
- On-line learning: after seeing each unlabeled instance, the learner returns a (proposed) label; then the correct label is provided (the learner is penalized if wrong). Q: Can the learner converge after making only k mistakes?
- Active learners: actively request useful information from the environment ("experiment").
- "Agnostic learning": what if the target f ∉ H? Want to find the CLOSEST hypothesis ... typically NP-hard.
- Bayesian approach: model averaging, ...

Computational Learning Theory (Summary)
- Inductive learning is possible, with caveats: error ε, confidence δ; it depends on the complexity of the hypothesis space.
- Probably Approximately Correct learning; consistency filtering; sample complexity (e.g., Conjunctions, Decision Lists).
- Many other meaningful models.

Terminology
- Labeled example: an example of the form ⟨x, f(x)⟩.
- Labeled sample: a set of pairs { ⟨xi, f(xi)⟩ }.
- Classifier: a discrete-valued function; the possible values f(x) ∈ {1, …, K} are called "classes" or "class labels".
- Concept: a Boolean function; x such that f(x) = 1 are called "positive examples", x such that f(x) = 0 are called "negative examples".
- Target function (target concept): the "true" function f generating the labels.
- Hypothesis: a proposed function h believed to be similar to f.
- Hypothesis space: the space of all hypotheses that can, in principle, be output by a learning algorithm.