Lecture 26 of 42: More Computational Learning Theory and Classification Rule Learning
CIS 732: Machine Learning and Pattern Recognition
Kansas State University, Department of Computing and Information Sciences
Friday, 16 March 2007
William H. Hsu, Department of Computing and Information Sciences, KSU
Readings: Sections , , Mitchell; Sections 10.1 – 10.2, Mitchell

Lecture Outline
Read , , Mitchell; Chapter 1, Kearns and Vazirani
Suggested Exercises: 7.2, Mitchell; 1.1, Kearns and Vazirani
PAC Learning (Continued)
– Examples and results: learning rectangles, normal forms, conjunctions
– What PAC analysis reveals about problem difficulty
– Turning PAC results into design choices
Occam's Razor: A Formal Inductive Bias
– Preference for shorter hypotheses
– More on Occam's Razor when we get to decision trees
Vapnik–Chervonenkis (VC) Dimension
– Objective: label any instance of (shatter) a set of points with a set of functions
– VC(H): a measure of the expressiveness of hypothesis space H
Mistake Bounds
– Estimating the number of mistakes made before convergence
– Optimal error bounds

PAC Learning: k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF
k-CNF (Conjunctive Normal Form) Concepts: Efficiently PAC-Learnable
– Conjunctions of any number of disjunctive clauses, each with at most k literals
– c = C1 ∧ C2 ∧ … ∧ Cm; Ci = l1 ∨ l2 ∨ … ∨ lk; ln(|k-CNF|) = ln(2^((2n)^k)) = Θ(n^k)
– Algorithm: reduce to learning monotone conjunctions over n^k pseudo-literals Ci
k-Clause-CNF
– c = C1 ∧ C2 ∧ … ∧ Ck; Ci = l1 ∨ l2 ∨ … ∨ lm; ln(|k-Clause-CNF|) = ln(3^(kn)) = Θ(kn)
– Efficiently PAC learnable? See below (k-Clause-CNF, k-Term-DNF are duals)
k-DNF (Disjunctive Normal Form)
– Disjunctions of any number of conjunctive terms, each with at most k literals
– c = T1 ∨ T2 ∨ … ∨ Tm; Ti = l1 ∧ l2 ∧ … ∧ lk
k-Term-DNF: "Not" Efficiently PAC-Learnable (Kind Of, Sort Of…)
– c = T1 ∨ T2 ∨ … ∨ Tk; Ti = l1 ∧ l2 ∧ … ∧ lm; ln(|k-Term-DNF|) = ln(k · 3^n) = Θ(n + ln k)
– Polynomial sample complexity, not computational complexity (unless RP = NP)
– Solution: don't use H = C! k-Term-DNF ⊆ k-CNF (so let H = k-CNF)
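
The reduction mentioned above can be made concrete by enumerating every disjunctive clause of at most k literals as one pseudo-literal and then learning a monotone conjunction over their values. A rough sketch (the clause encoding as (index, polarity) pairs is an assumption for illustration, not from the slides):

from itertools import combinations, product

def pseudo_literals(n, k):
    """Enumerate all disjunctive clauses over n Boolean variables with at most k literals.

    Each clause is a frozenset of (index, polarity) pairs; the count grows like O(n^k),
    matching the ln|k-CNF| = Theta(n^k) estimate on the slide."""
    clauses = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product([True, False], repeat=size):
                clauses.append(frozenset(zip(idxs, signs)))
    return clauses

def clause_value(clause, x):
    """Evaluate one clause (a disjunction) on a Boolean assignment x (tuple of bools)."""
    return any(x[i] == sign for i, sign in clause)

# Learning k-CNF then reduces to learning a monotone conjunction over these
# pseudo-literal values, e.g. with the Elimination algorithm on a later slide.
print(len(pseudo_literals(n=4, k=2)))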

PAC Learning: Rectangles
Assume the target concept is an axis-parallel (hyper)rectangle.
– Will we be able to learn the target concept?
– Can we come close?
[Figure: an axis-parallel rectangle in the X–Y plane]
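
One standard consistent learner for this concept class is the tightest-fit rectangle: return the smallest axis-parallel rectangle enclosing all positive examples. A sketch under that assumption (function names and data layout are illustrative, not the slide's):

def tightest_fit_rectangle(examples):
    """examples: iterable of ((x, y), label) pairs, label True for positives.

    Returns (x_min, x_max, y_min, y_max), the smallest axis-parallel rectangle
    containing every positive example; this hypothesis is consistent whenever
    the target concept really is an axis-parallel rectangle."""
    xs = [x for (x, _), pos in examples if pos]
    ys = [y for (_, y), pos in examples if pos]
    if not xs:                      # no positive examples: predict all-negative
        return None
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    if rect is None:
        return False
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max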

Consistent Learners
General Scheme for Learning
– Follows immediately from the definition of a consistent hypothesis
– Given: a sample D of m examples
– Find: some h ∈ H that is consistent with all m examples
– PAC: show that if m is large enough, a consistent hypothesis must be close enough to c
– Efficient PAC (and other COLT formalisms): show that the consistent hypothesis can be computed efficiently
Monotone Conjunctions
– Used an Elimination algorithm (compare: Find-S) to find a hypothesis h that is consistent with the training set (easy to compute); see the sketch below
– Showed that with sufficiently many examples (polynomial in the parameters), h is close to c
– Sample complexity gives an assurance of "convergence to criterion" for specified m, and a necessary condition (polynomial in n) for tractability
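
A minimal sketch of the Elimination idea for monotone conjunctions (compare Find-S); representing a hypothesis as a set of variable indices is an assumption made here for illustration:

def eliminate(positive_examples, n):
    """Learn a monotone conjunction over variables 0..n-1.

    Start with the conjunction of all n variables and drop any variable that is
    False in some positive example; the result is consistent with the sample."""
    hypothesis = set(range(n))            # conjunction x_0 AND x_1 AND ... AND x_{n-1}
    for x in positive_examples:           # x: tuple/list of n Booleans
        hypothesis -= {i for i in hypothesis if not x[i]}
    return hypothesis

def classify(hypothesis, x):
    return all(x[i] for i in hypothesis)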

Occam's Razor and PAC Learning [1]
Bad Hypothesis
– Want to bound: the probability that there exists a hypothesis h ∈ H that is consistent with m examples yet has error_D(h) > ε
– Claim: this probability is less than |H| (1 − ε)^m
Proof
– Let h be such a bad hypothesis
– The probability that h is consistent with one example of c is at most 1 − ε (since error_D(h) > ε)
– Because the m examples are drawn independently of each other, the probability that h is consistent with m examples of c is less than (1 − ε)^m
– The probability that some hypothesis in H is consistent with m examples of c is less than |H| (1 − ε)^m. Quod erat demonstrandum.

Occam's Razor and PAC Learning [2]
Goal
– We want this probability to be smaller than δ, that is: |H| (1 − ε)^m < δ, i.e., ln(|H|) + m ln(1 − ε) < ln(δ)
– With ln(1 − ε) ≤ −ε: m ≥ (1/ε) (ln |H| + ln(1/δ))
– This is the result from last time [Blumer et al., 1987; Haussler, 1988]
Occam's Razor
– "Entities should not be multiplied without necessity"
– So called because it indicates a preference towards a small H
– Why do we want small H?
  Generalization capability: explicit form of inductive bias
  Search capability: more efficient, compact
– To guarantee consistency, we need H ⊇ C – do we really want the smallest H possible?
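
The bound converts directly into a sample-size calculator; a minimal sketch, with an illustrative call for monotone conjunctions over n = 10 variables:

import math

def pac_sample_size(hypothesis_space_size, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (ln|H| + ln(1/delta)),
    the consistent-learner bound quoted above."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g. monotone conjunctions over n = 10 variables: |H| = 2^10
print(pac_sample_size(2 ** 10, epsilon=0.1, delta=0.05))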

VC Dimension: Framework
Infinite Hypothesis Space?
– Preceding analyses were restricted to finite hypothesis spaces
– Some infinite hypothesis spaces are more expressive than others, e.g.,
  rectangles vs. 17-sided convex polygons vs. general convex polygons;
  a linear threshold (LT) function vs. a conjunction of LT units
– Need a measure of the expressiveness of an infinite H other than its size
Vapnik–Chervonenkis Dimension: VC(H)
– Provides such a measure
– Analogous to |H|: there are bounds for sample complexity using VC(H)

VC Dimension: Shattering a Set of Instances
Dichotomies
– Recall: a partition of a set S is a collection of disjoint sets S_i whose union is S
– Definition: a dichotomy of a set S is a partition of S into two subsets S_1 and S_2
Shattering
– A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S, there exists a hypothesis in H consistent with this dichotomy
– Intuition: a rich set of functions shatters a larger instance space
The "Shattering Game" (An Adversarial Interpretation)
– Your client selects an S (an instance space X)
– You select an H
– Your adversary labels S (i.e., chooses a point c from concept space C = 2^X)
– You must then find some h ∈ H that "covers" (is consistent with) c
– If you can do this for any c your adversary comes up with, H shatters S

VC Dimension: Examples of Shattered Sets
[Figure: three instances shattered in the instance space X]
Intervals
– Left-bounded intervals on the real axis, [0, a) for a ∈ R, a ≥ 0: sets of 2 points cannot be shattered (given 2 points, one can label them so that no hypothesis will be consistent)
– Intervals on the real axis, [a, b] with a, b ∈ R and b > a: can shatter 1 or 2 points, not 3
– Half-spaces in the plane (non-collinear points): 1? 2? 3? 4?
[Figures: the intervals [0, a) and [a, b] drawn on the real line]
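
These claims can be checked by brute force: H shatters S iff every one of the 2^|S| labelings of S is realized by some hypothesis. A small sketch for intervals on the real line, restricting interval endpoints to the points themselves (which suffices for a finite S); names are illustrative:

from itertools import product

def shatters(points, hypotheses):
    """True iff every +/- labeling of points is reproduced exactly by some hypothesis
    (each hypothesis is a Boolean predicate on a point)."""
    for labeling in product([True, False], repeat=len(points)):
        if not any(all(h(p) == lab for p, lab in zip(points, labeling)) for h in hypotheses):
            return False
    return True

def intervals_over(points):
    """Closed intervals [a, b] with endpoints drawn from the given points,
    plus an always-False hypothesis for the all-negative labeling."""
    hs = [lambda x: False]
    for a in points:
        for b in points:
            if a <= b:
                hs.append(lambda x, a=a, b=b: a <= x <= b)
    return hs

print(shatters([1.0, 2.0], intervals_over([1.0, 2.0])))            # True: 2 points shattered
print(shatters([1.0, 2.0, 3.0], intervals_over([1.0, 2.0, 3.0])))  # False: +,-,+ is impossible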

VC Dimension: Definition and Relation to Inductive Bias
Vapnik–Chervonenkis Dimension
– The VC dimension VC(H) of hypothesis space H (defined over implicit instance space X) is the size of the largest finite subset of X shattered by H
– If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞
– Examples:
  VC(half-intervals in R) = 1 (no subset of size 2 can be shattered)
  VC(intervals in R) = 2 (no subset of size 3)
  VC(half-spaces in R^2) = 3 (no subset of size 4)
  VC(axis-parallel rectangles in R^2) = 4 (no subset of size 5)
Relation of VC(H) to the Inductive Bias of H
– An unbiased hypothesis space H shatters the entire instance space X
– i.e., H is able to induce every partition of the set X of all possible instances
– The larger the subset of X that can be shattered, the more expressive the hypothesis space is, i.e., the less biased

VC Dimension: Relation to Sample Complexity
VC(H) as a Measure of Expressiveness
– Prescribes an Occam algorithm for infinite hypothesis spaces
– Given: a sample D of m examples
  Find some h ∈ H that is consistent with all m examples
  If m > (1/ε) (8 VC(H) lg(13/ε) + 4 lg(2/δ)), then with probability at least (1 − δ), h has true error less than ε
Significance
– If m is polynomial, we have a PAC learning algorithm
– To be efficient, we need to produce the hypothesis h efficiently
Note
– |H| ≥ 2^m is required to shatter m examples
– Therefore VC(H) ≤ lg(|H|)
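
The VC-based bound can be evaluated the same way as the |H|-based bound earlier; a small sketch (the function name and the rectangle example are illustrative assumptions):

import math

def vc_sample_size(vc_dim, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (8*VC(H)*lg(13/epsilon) + 4*lg(2/delta)),
    matching the bound quoted on this slide."""
    lg = math.log2
    return math.ceil((8 * vc_dim * lg(13.0 / epsilon) + 4 * lg(2.0 / delta)) / epsilon)

# axis-parallel rectangles in R^2 have VC dimension 4
print(vc_sample_size(vc_dim=4, epsilon=0.1, delta=0.05))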

Mistake Bounds: Rationale and Framework
So Far: How Many Examples Needed To Learn?
Another Measure of Difficulty: How Many Mistakes Before Convergence?
Similar Setting to PAC Learning Environment
– Instances drawn at random from X according to distribution D
– Learner must classify each instance before receiving the correct classification from the teacher
– Can we bound the number of mistakes the learner makes before converging?
– Rationale: suppose (for example) that c = fraudulent credit card transactions

Mistake Bounds: Find-S
Scenario for Analyzing Mistake Bounds
– Suppose H = conjunctions of Boolean literals
– Find-S:
  Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln
  For each positive training instance x: remove from h any literal that is not satisfied by x
  Output hypothesis h
How Many Mistakes before Converging to the Correct h?
– Once a literal is removed, it is never put back (monotonic relaxation of h)
– No false positives (started with the most restrictive h): count false negatives
– The first example will remove n candidate literals (those that don't match x1's values)
– Worst case: every remaining literal is also removed (incurring 1 mistake each)
– For the concept ∀x . c(x) = 1 (a.k.a. "true"), Find-S makes n + 1 mistakes
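
A sketch of the online Find-S scenario with mistake counting; the encoding of literals as (index, polarity) pairs is an assumption for illustration:

def find_s_online(stream, n):
    """Online Find-S over conjunctions of Boolean literals.

    The hypothesis is a set of (index, polarity) literals currently conjoined; it
    starts with all 2n literals, so it predicts negative until the first positive
    example. Returns the mistake count; the worst case is n + 1, as argued above."""
    h = {(i, b) for i in range(n) for b in (True, False)}
    mistakes = 0
    for x, label in stream:                       # x: tuple of n Booleans
        prediction = all(x[i] == b for i, b in h)
        if prediction != label:
            mistakes += 1
        if label:                                 # only positives change h (no false positives)
            h = {(i, b) for i, b in h if x[i] == b}
    return mistakes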

Mistake Bounds: Halving Algorithm
Scenario for Analyzing Mistake Bounds
– Halving Algorithm: learn the concept using a version space
  e.g., the Candidate-Elimination algorithm (or List-Then-Eliminate)
– Need to specify the performance element (how predictions are made)
  Classify new instances by majority vote of version space members
How Many Mistakes before Converging to the Correct h?
– … in the worst case?
  Can make a mistake when the majority of hypotheses in VS_{H,D} are wrong
  But then we can remove at least half of the candidates
  Worst-case number of mistakes: ⌊lg |H|⌋
– … in the best case?
  Can get away with no mistakes! (If we are lucky and the majority vote is always right, VS_{H,D} still shrinks)
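
A sketch of the Halving algorithm over an explicitly enumerated finite hypothesis space; representing hypotheses as Python predicates is an assumption made here for concreteness:

def halving(stream, hypotheses):
    """Predict by majority vote of the current version space, then eliminate every
    hypothesis that disagrees with the revealed label.

    Each mistake removes at least half of the version space, so the number of
    mistakes is at most floor(log2 |H|)."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(1 for h in version_space if h(x))
        prediction = votes * 2 >= len(version_space)   # majority vote (ties -> positive)
        if prediction != label:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == label]
    return mistakes, version_space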

Optimal Mistake Bounds
Upper Mistake Bound for a Particular Learning Algorithm
– Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C
  (maximum over c ∈ C and all possible training sequences D)
Minimax Definition
– Let C be an arbitrary non-empty concept class
– The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):
  Opt(C) ≡ min_A M_A(C)
– Known relationship: VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ lg(|C|)

COLT Conclusions
PAC Framework
– Provides a reasonable model for theoretically analyzing the effectiveness of learning algorithms
– Prescribes things to do: enrich the hypothesis space (search for a less restrictive H); make H more flexible (e.g., hierarchical); incorporate knowledge
Sample Complexity and Computational Complexity
– Sample complexity for any consistent learner using H can be determined from measures of H's expressiveness (|H|, VC(H), etc.)
– If the sample complexity is tractable, then the computational complexity of finding a consistent h governs the complexity of the problem
– Sample complexity bounds are not tight! (But they separate learnable classes from non-learnable classes)
– Computational complexity results exhibit cases where information-theoretic learning is feasible, but finding a good h is intractable
COLT: Framework for Concrete Analysis of the Complexity of L
– Dependent on various assumptions (e.g., the x ∈ X contain relevant variables)

Lecture Outline
Readings: Sections , Mitchell; Section 21.4, Russell and Norvig
Suggested Exercises: 10.1, 10.2, Mitchell
Sequential Covering Algorithms
– Learning single rules by search
– Beam search
– Alternative covering methods
– Learning rule sets
First-Order Rules
– Learning single first-order rules
– FOIL: learning first-order rule sets

Learning Disjunctive Sets of Rules
Method 1: Rule Extraction from Trees
– Learn a decision tree
– Convert it to rules: one rule per root-to-leaf path
– Recall: can post-prune rules (drop preconditions to improve validation-set accuracy)
Method 2: Sequential Covering
– Idea: greedily (sequentially) find rules that apply to (cover) instances in D
– Algorithm:
  Learn one rule with high accuracy, any coverage
  Remove positive examples (of the target attribute) covered by this rule
  Repeat

Sequential Covering: Algorithm
Algorithm Sequential-Covering (Target-Attribute, Attributes, D, Threshold)
– Learned-Rules ← {}
– New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
– WHILE Performance (New-Rule, D) > Threshold DO
  – Learned-Rules += New-Rule // add new rule to set
  – D.Remove-Covered-By (New-Rule) // remove examples covered by New-Rule
  – New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
– Sort-By-Performance (Learned-Rules, Target-Attribute, D)
– RETURN Learned-Rules
What Does Sequential-Covering Do?
– Learns one rule, New-Rule
– Takes out every example in D to which New-Rule applies (every covered example)
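
As a rough illustration, the covering loop above can be transcribed into Python; Learn-One-Rule, Performance, and the covers test are caller-supplied placeholders whose names are assumed here, not part of the original slides:

def sequential_covering(target_attribute, attributes, examples,
                        threshold, learn_one_rule, performance, covers):
    """Greedy covering loop from the slide: learn one high-accuracy rule,
    remove the examples it covers, and repeat while performance stays above threshold.

    learn_one_rule, performance, and covers are caller-supplied functions."""
    learned_rules = []
    data = list(examples)
    new_rule = learn_one_rule(target_attribute, attributes, data)
    while performance(new_rule, data) > threshold:
        learned_rules.append(new_rule)
        data = [ex for ex in data if not covers(new_rule, ex)]   # drop covered examples
        new_rule = learn_one_rule(target_attribute, attributes, data)
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules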

Learn-One-Rule: (Beam) Search for Preconditions
IF {} THEN Play-Tennis = Yes
– IF {Humidity = Normal} THEN Play-Tennis = Yes
– IF {Humidity = High} THEN Play-Tennis = No
– IF {Wind = Strong} THEN Play-Tennis = No
– IF {Wind = Light} THEN Play-Tennis = Yes
– …
Specializations of IF {Humidity = Normal} THEN Play-Tennis = Yes:
– IF {Humidity = Normal, Outlook = Sunny} THEN Play-Tennis = Yes
– IF {Humidity = Normal, Wind = Strong} THEN Play-Tennis = Yes
– IF {Humidity = Normal, Wind = Light} THEN Play-Tennis = Yes
– IF {Humidity = Normal, Outlook = Rain} THEN Play-Tennis = Yes
– …

Learn-One-Rule: Algorithm
Algorithm Sequential-Covering (Target-Attribute, Attributes, D)
– Pos ← D.Positive-Examples()
– Neg ← D.Negative-Examples()
– WHILE NOT Pos.Empty() DO // learn a new rule
  – New-Rule ← Learn-One-Rule (Target-Attribute, Attributes, D)
  – Learned-Rules.Add-Rule (New-Rule)
  – Pos.Remove-Covered-By (New-Rule)
– RETURN (Learned-Rules)
Algorithm Learn-One-Rule (Target-Attribute, Attributes, D)
– New-Rule ← most general rule possible
– New-Rule-Neg ← Neg
– WHILE NOT New-Rule-Neg.Empty() DO // specialize New-Rule
  1. Candidate-Literals ← Generate-Candidates() // NB: rank by Performance()
  2. Best-Literal ← argmax over L ∈ Candidate-Literals of Performance (Specialize-Rule (New-Rule, L), Target-Attribute, D) // all possible new constraints
  3. New-Rule.Add-Precondition (Best-Literal) // add the best one
  4. New-Rule-Neg ← New-Rule-Neg.Filter-By (New-Rule)
– RETURN (New-Rule)
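
Below is a sketch of a greedy (beam width 1) Learn-One-Rule for attribute–value rules, assuming examples are dictionaries of attribute values; the performance measure (accuracy over covered examples) and all names are illustrative assumptions rather than the slide's exact procedure:

def learn_one_rule(target_attribute, attributes, examples, target_value=True):
    """Greedily specialize an initially empty precondition by adding, at each step,
    the (attribute = value) test that maximizes accuracy on the examples it covers.

    examples: list of dicts mapping attribute names (and target_attribute) to values."""
    preconditions = {}

    def covers(ex):
        return all(ex[a] == v for a, v in preconditions.items())

    def performance(candidate):
        attr, val = candidate
        covered = [ex for ex in examples if covers(ex) and ex[attr] == val]
        if not covered:
            return 0.0
        return sum(ex[target_attribute] == target_value for ex in covered) / len(covered)

    # specialize while some covered example is still negative
    while any(covers(ex) and ex[target_attribute] != target_value for ex in examples):
        candidates = [(a, v) for a in attributes if a not in preconditions
                      for v in {ex[a] for ex in examples}]
        if not candidates:
            break
        best_attr, best_val = max(candidates, key=performance)
        preconditions[best_attr] = best_val
    return preconditions, target_value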

Terminology
PAC Learning: Example Concepts
– Monotone conjunctions
– k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF
– Axis-parallel (hyper)rectangles
– Intervals and semi-intervals
Occam's Razor: A Formal Inductive Bias
– Occam's Razor: ceteris paribus (all other things being equal), prefer shorter hypotheses (in machine learning, prefer the shortest consistent hypothesis)
– Occam algorithm: a learning algorithm that prefers short hypotheses
Vapnik–Chervonenkis (VC) Dimension
– Shattering
– VC(H)
Mistake Bounds
– M_A(C) for A ∈ {Find-S, Halving}
– Optimal mistake bound Opt(H)

Summary Points
COLT: Framework for Analyzing Learning Environments
– Sample complexity of C (what is m?)
– Computational complexity of L
– Required expressive power of H
– Error and confidence bounds (PAC: 0 < ε < 1/2, 0 < δ < 1/2)
What PAC Prescribes
– Whether to try to learn C with a known H
– Whether to try to reformulate H (apply a change of representation)
Vapnik–Chervonenkis (VC) Dimension
– A formal measure of the complexity of H (besides |H|)
– Based on X and a worst-case labeling game
Mistake Bounds
– How many mistakes could L incur?
– Another way to measure the cost of learning
Next Week: Decision Trees