Decision Tree Learning R&N: Chap. 18, Sect. 18.1–3.


Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Tree: test A? If A is False, return False; if A is True, test B? If B is False, return True; if B is True, test C? Return True if C is True, False otherwise.]

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
x is a mushroom
CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED, D = FUNNEL-CAP, E = BULKY

Training Set

[Table: 13 training examples (Ex. #1–13) over the Boolean attributes A, B, C, D, E, with the target label CONCEPT; examples 6, 7, 8, 9, 10, 13 are positive and examples 1, 2, 3, 4, 5, 11, 12 are negative.]

Possible Decision Tree

[Tree: a larger decision tree consistent with the training set; it tests D at the root, then E and A on the D = True branch, and C, B, E, A on the D = False branch.]

The tree on the previous slide computes
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))
whereas the much smaller tree that tests A, then B, then C computes
CONCEPT ⇔ A ∧ (¬B ∨ C)

Possible Decision Tree

Both trees above are consistent with the training examples, but the second is far smaller.
KIS bias → build the smallest decision tree.
Finding the smallest consistent tree is a computationally intractable problem → use a greedy algorithm.

Getting Started: Top-Down Induction of a Decision Tree

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Without testing any observable predicate, we could report that CONCEPT is False (majority rule), with an estimated probability of error P(E) = 6/13.

Assuming that we will include only one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the number of misclassified examples in the training set)? → Greedy algorithm

Assume It’s A

If A is True: positive examples 6, 7, 8, 9, 10, 13; negative examples 11, 12.
If A is False: negative examples 1, 2, 3, 4, 5.

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise.
→ The number of misclassified examples from the training set is 2.

Assume It’s B

If B is True: positive examples 9, 10; negative examples 2, 3, 11, 12.
If B is False: positive examples 6, 7, 8, 13; negative examples 1, 4, 5.

If we test only B, we will report that CONCEPT is False if B is True and True otherwise.
→ The number of misclassified examples from the training set is 5.

Assume It’s C

If C is True: positive examples 6, 8, 9, 10, 13; negative examples 1, 3, 4.
If C is False: positive example 7; negative examples 2, 5, 11, 12.

If we test only C, we will report that CONCEPT is True if C is True and False otherwise.
→ The number of misclassified examples from the training set is 4.

Assume It’s D

If D is True: positive examples 7, 10, 13; negative examples 3, 5.
If D is False: positive examples 6, 8, 9; negative examples 1, 2, 4, 11, 12.

If we test only D, we will report that CONCEPT is True if D is True and False otherwise.
→ The number of misclassified examples from the training set is 5.

Assume It’s E

If E is True: positive examples 8, 9, 10, 13; negative examples 1, 3, 5, 12.
If E is False: positive examples 6, 7; negative examples 2, 4, 11.

If we test only E, we will report that CONCEPT is False, independent of the outcome.
→ The number of misclassified examples from the training set is 6.

So, the best predicate to test is A.
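The error counts above can be computed mechanically. A minimal Python sketch (the function name and the dictionary representation of examples are illustrative assumptions, not from the slides):

def single_test_errors(examples, attr):
    """Misclassifications made by testing only attr and predicting the
    majority CONCEPT label on each branch (ties broken arbitrarily)."""
    errors = 0
    for value in (True, False):
        branch = [e["CONCEPT"] for e in examples if e[attr] == value]
        if branch:
            errors += min(branch.count(True), branch.count(False))
    return errors

# On the 13-example training set above this yields 2 for A, 5 for B,
# 4 for C, 5 for D and 6 for E, so the greedy algorithm picks A first.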

Choice of Second Predicate

The A = False branch is labeled False. On the A = True branch, test C:
If C is True: positive examples 6, 8, 9, 10, 13 → report True.
If C is False: positive example 7; negative examples 11, 12 → report False.

→ The number of misclassified examples from the training set is 1.

Choice of Third Predicate

On the branch A = True, C = False (examples 7, 11, 12), test B:
If B is True: negative examples 11, 12 → report False.
If B is False: positive example 7 → report True.

Final Tree

[Tree: test A; if A is False, return False; if A is True, test C; if C is True, return True; if C is False, test B and return False if B is True, True if B is False.]

The learned tree computes CONCEPT ⇔ A ∧ (C ∨ ¬B), which is equivalent to the target concept CONCEPT ⇔ A ∧ (¬B ∨ C).
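Written out as nested conditionals, the final tree is simply (a hypothetical rendering for illustration, not part of the original slides):

def learned_concept(a: bool, b: bool, c: bool) -> bool:
    """Final tree: test A, then C, then B."""
    if not a:
        return False        # A = False branch
    if c:
        return True         # A = True, C = True
    return not b            # A = True, C = False: True iff B is False

# e.g. learned_concept(True, False, False) == True, so example 7 is now classified correctly.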

Top-Down Induction of a DT

DTL(Δ, Predicates)
1. If all examples in Δ are positive, then return True
2. If all examples in Δ are negative, then return False
3. If Predicates is empty, then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
– root is A,
– left branch is DTL(Δ⁺A, Predicates − {A}),
– right branch is DTL(Δ⁻A, Predicates − {A})
where Δ⁺A (resp. Δ⁻A) is the subset of examples in Δ that satisfy (resp. do not satisfy) A.

Step 3 is reached when the training set is noisy; instead of returning failure, the algorithm may return the majority class.
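A runnable sketch of this procedure, reusing the error-minimizing test from the earlier sketch (the names and the dictionary representation of examples are illustrative assumptions):

from collections import Counter

def single_test_errors(examples, attr):
    """Misclassifications of a single test on attr with majority-rule leaves."""
    errors = 0
    for value in (True, False):
        branch = [e["CONCEPT"] for e in examples if e[attr] == value]
        if branch:
            errors += min(branch.count(True), branch.count(False))
    return errors

def majority(examples):
    return Counter(e["CONCEPT"] for e in examples).most_common(1)[0][0]

def dtl(examples, predicates):
    labels = {e["CONCEPT"] for e in examples}
    if labels == {True}:
        return True
    if labels == {False}:
        return False
    if not predicates:
        return majority(examples)      # noise: majority rule instead of failure
    a = min(predicates, key=lambda p: single_test_errors(examples, p))
    rest = [p for p in predicates if p != a]
    branches = {}
    for value in (True, False):
        subset = [e for e in examples if e[a] == value]
        branches[value] = dtl(subset, rest) if subset else majority(examples)
    return (a, branches[True], branches[False])   # (test, subtree for A true, subtree for A false)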

Comments
– Widely used algorithm
– Greedy
– Robust to noise (incorrect examples)
– Not incremental

Using Information Theory

Rather than minimizing the probability of error, many existing learning procedures minimize the expected number of questions needed to decide whether an object x satisfies CONCEPT. This minimization is based on a measure of the “quantity of information” contained in the truth value of an observable predicate. See R&N p.
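For reference, the usual information-theoretic criterion (information gain) can be sketched as follows; the function names are illustrative, and the example at the end uses the class counts from the earlier slides:

from math import log2

def entropy(pos, neg):
    """Information content, in bits, of a node with pos/neg example counts."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

def information_gain(parent, splits):
    """parent = (pos, neg) at the node; splits = [(pos, neg), ...] per branch."""
    total = sum(p + n for p, n in splits)
    remainder = sum(((p + n) / total) * entropy(p, n) for p, n in splits)
    return entropy(*parent) - remainder

# Root of the mushroom training set: 6 positive, 7 negative examples.
# Testing A splits it into (6 pos, 2 neg) and (0 pos, 5 neg):
# information_gain((6, 7), [(6, 2), (0, 5)])  ->  about 0.50 bits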

Miscellaneous Issues

Assessing performance:
– Training set and test set
– Learning curve
[Figure: typical learning curve, showing % correct on the test set (up to 100) as a function of the size of the training set.]
Some concepts are unrealizable within a machine’s capacity.

Miscellaneous Issues

Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.
– Tree pruning: terminate the recursion when the number of errors or the information gain is small.
– The resulting decision tree + majority rule may not classify all examples in the training set correctly.

Miscellaneous Issues

Assessing performance:
– Training set and test set
– Learning curve
Overfitting:
– Tree pruning
Incorrect examples
Missing data
Multi-valued and continuous attributes

Continuous Attributes

Continuous attributes can be converted into Boolean (logical) ones via thresholds:
– X ⇒ X < a
When considering splitting on X, pick the threshold a that minimizes the number of errors.
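A simple way to pick that threshold, using the same majority-rule error count as before (a sketch; the names are illustrative assumptions):

def best_threshold(values, labels):
    """Choose a for the test x < a that minimizes training errors.
    values: list of floats; labels: parallel list of CONCEPT booleans."""
    xs = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]  # midpoints between distinct values
    best = (None, len(values) + 1)
    for a in candidates:
        below = [lab for x, lab in zip(values, labels) if x < a]
        above = [lab for x, lab in zip(values, labels) if x >= a]
        errors = (min(below.count(True), below.count(False)) +
                  min(above.count(True), above.count(False)))
        if errors < best[1]:
            best = (a, errors)
    return best   # (threshold, number of misclassified training examples)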

Learnable Concepts

Some simple concepts cannot be represented compactly in decision trees:
– Parity(x) = X1 xor X2 xor … xor Xn
– Majority(x) = 1 if most of the Xi’s are 1, 0 otherwise
These require trees of size exponential in the number of attributes, and an exponential number of examples to learn exactly.
The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.

Applications of Decision Trees
– Medical diagnosis / drug design
– Evaluation of geological systems for assessing gas and oil basins
– Early detection of problems (e.g., jamming) during oil drilling operations
– Automatic generation of rules in expert systems

Human-Readability

Decision trees also have the advantage of being easily understood by humans. Such interpretability is a legal requirement in many areas:
– Loans & mortgages
– Health insurance
– Welfare

Ensemble Learning (Boosting) R&N 18.4

Idea

It may be difficult to search for a single hypothesis that explains the data.
Instead, construct multiple hypotheses (an ensemble) and combine their predictions.
“Can a set of weak learners construct a single strong learner?” – Michael Kearns, 1988

Motivation

Take 5 classifiers, each with 60% accuracy.
On a new example, run them all and pick the prediction by majority voting.
If their errors are independent, the majority vote is correct whenever at least 3 of the 5 classifiers are correct, i.e., about 68% of the time, which is better than any single classifier; the advantage grows with more (or more accurate) classifiers.
– (In reality the errors will not be independent, but we hope they will be mostly uncorrelated.)
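The 68% figure is the binomial majority-vote calculation; a quick check (an illustrative script, not from the slides):

from math import comb

def majority_vote_accuracy(n, p):
    """P(a majority of n independent classifiers is correct), each correct w.p. p."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(needed, n + 1))

print(majority_vote_accuracy(5, 0.6))    # ~0.68
print(majority_vote_accuracy(25, 0.6))   # ~0.85: the benefit grows with ensemble size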

Boosting

Weighted training set:
[Table: the same 13 mushroom examples, each now carrying a weight w1, …, w13 alongside attributes A–E and CONCEPT.]

Boosting
– Start with equal weights wi = 1/N
– Use learner 1 to generate hypothesis h1
– Adjust the weights to give higher importance to the examples misclassified by h1
– Use learner 2 to generate hypothesis h2
– …
– Weight the hypotheses according to their performance, and return the weighted majority
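A compact sketch of this loop in the style of AdaBoost (named later in these slides), with single-attribute decision stumps as the weak learner; all names and the dictionary representation of examples are illustrative assumptions:

from math import log

def learn_stump(examples, weights, attrs):
    """Return (predict_fn, weighted_error) for the best single-attribute test,
    labelling each branch with its weighted majority class."""
    best = None
    for attr in attrs:
        branch_label = {}
        for value in (True, False):
            w_pos = sum(w for w, e in zip(weights, examples)
                        if e[attr] == value and e["CONCEPT"])
            w_neg = sum(w for w, e in zip(weights, examples)
                        if e[attr] == value and not e["CONCEPT"])
            branch_label[value] = w_pos >= w_neg
        predict = lambda x, bl=branch_label, a=attr: bl[x[a]]
        error = sum(w for w, e in zip(weights, examples)
                    if predict(e) != e["CONCEPT"])
        if best is None or error < best[1]:
            best = (predict, error)
    return best

def adaboost(examples, attrs, rounds):
    n = len(examples)
    weights = [1.0 / n] * n
    hypotheses = []                          # list of (predict_fn, vote_weight)
    for _ in range(rounds):
        h, eps = learn_stump(examples, weights, attrs)
        if eps == 0 or eps >= 0.5:           # perfect or useless weak hypothesis
            break
        beta = eps / (1 - eps)
        # down-weight correctly classified examples, then renormalize
        weights = [w * (beta if h(e) == e["CONCEPT"] else 1.0)
                   for w, e in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
        hypotheses.append((h, log(1 / beta)))
    return hypotheses

def ensemble_predict(hypotheses, x):
    vote = sum(a if h(x) else -a for h, a in hypotheses)
    return vote > 0

Note that this sketch picks the lowest-weighted-error stump at each round, whereas the walk-through on the next slides simply works through the attributes in the fixed order C, A, E, D, B.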

Mushroom Example

“Decision stumps”: single-attribute decision trees.
[Table: the 13 training examples, each with initial weight 1/13.]

Mushroom Example

Pick C first; the stump learned is CONCEPT = C.
[Table: all weights still 1/13; this stump misclassifies examples 1, 3, 4, and 7.]

Mushroom Example

Update the weights.
[Table: the misclassified examples 1, 3, 4, and 7 now have weight 0.125; the correctly classified examples have weight 0.056.]
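These numbers match the standard AdaBoost update; the update rule itself is an assumption here, since the slides do not spell it out:

# The stump CONCEPT = C misclassifies 4 of the 13 equally weighted examples.
eps = 4 / 13
beta = eps / (1 - eps)                    # 4/9: multiplier for correctly classified examples
raw = [(1 / 13) * beta] * 9 + [1 / 13] * 4
z = sum(raw)                              # normalization constant = 8/13
print((1 / 13) * beta / z, (1 / 13) / z)  # 0.0555... and 0.125, as in the table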

Mushroom Example

Next try A; the stump learned is CONCEPT = A.
[Table: weights unchanged from the previous update: 0.125 for examples 1, 3, 4, 7 and 0.056 for the rest.]

Mushroom Example

Update the weights.
[Table: after the second update, the visible entries show weight ≈ 0.07 for examples 1, 3, 4, 7 and ≈ 0.03 for examples 2, 5, 6, 8, 9.]

Mushroom Example

Next try E; the stump learned is CONCEPT = ¬E.
[Table: weights as after the previous update.]

Mushroom Example

Update the weights…
[Table: weights after the stump on E.]

Mushroom Example

Final classifier, using the stumps in the order C, A, E, D, B:
– the weight of each hypothesis is determined by its overall error
– weighted-majority weights: A = 2.1, ¬B = 0.9, C = 0.8, D = 1.4, ¬E = …
– … % accuracy on the training set

Boosting Strategies
– The weighting strategy above is the popular AdaBoost algorithm (see R&N, p. 667).
– There are many other strategies.
– Typically, as the number of hypotheses increases, accuracy increases as well.
– Does this conflict with Occam’s razor?

Next Time Neural networks HW6 due