CpSc 810: Machine Learning Decision Tree Learning.

CpSc 810: Machine Learning Decision Tree Learning

2 Copyright Notice Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!

3 Review Concept learning as search through H. The S and G boundaries characterize the learner's uncertainty. A learner that makes no a priori assumptions about the identity of the target concept has no rational basis for classifying any unseen instance. Inductive leaps are possible only if the learner is biased.

4 Decision Tree Representation for PlayTennis [Figure: a decision tree for the concept PlayTennis. The root tests Outlook (Sunny, Overcast, Rain); the Sunny branch tests Humidity (High, Normal) and the Rain branch tests Wind (Strong, Weak).]

5 Decision Trees Decision tree representation: each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node assigns a classification. A decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances.
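For example, the PlayTennis tree on the previous slide, with the usual leaf labels from Mitchell's example (Yes under Overcast, under Sunny with Normal humidity, and under Rain with Weak wind), corresponds to the expression:

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)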

6 Appropriate Problems for Decision Tree Learning Instances are represented by discrete attribute-value pairs (though the basic algorithm has been extended to real-valued attributes as well). The target function has discrete output values (possibly more than two classes). Disjunctive hypothesis descriptions may be required. The training data may contain errors. The training data may contain missing attribute values.

7 ID3: The Basic Decision Tree Learning Algorithm Top-down induction of decision trees.
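A minimal Python sketch of this top-down procedure (an illustration, not the original pseudocode): examples are assumed to be dicts mapping attribute names to values, and the information_gain helper it calls is sketched under the Information Gain slide below.

from collections import Counter

def id3(examples, attributes, target):
    """Top-down induction: pick the best attribute, split on it, and recurse."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                       # all examples agree: leaf node
        return labels[0]
    if not attributes:                              # nothing left to test: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree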

8 Which attribute is best?

9 Choose the attribute that minimizes the disorder in the subtree rooted at a given node. Disorder and information are related as follows: the more disorderly a set, the more information is required to correctly guess an element of that set.

10 Information Theory Information: What is the best strategy for guessing a number from a finite set of possible numbers, i.e., how many questions do you need to ask in order to know the answer (we are looking for the minimal number of questions)? Answer: log2(|S|) questions, where S is the set of numbers and |S| its cardinality. E.g., Q1: is it smaller than 5? Q2: is it smaller than 2? ...

11 Information Theory Information theory: an optimal-length code assigns -log2(p) bits to a message having probability p. So the expected number of bits to encode + or - of a random member of S is -p+ log2(p+) - p- log2(p-).

12 Entropy S is a set of training examples; p+ is the proportion of positive examples in S; p- is the proportion of negative examples in S. Entropy measures the impurity of S: Entropy(S) = -p+ log2(p+) - p- log2(p-). Entropy(S) is the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
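A small Python sketch of this definition (written for any number of classes, which reduces to the two-class formula above):

import math

def entropy(labels):
    """Entropy, in bits, of a collection of class labels."""
    n = len(labels)
    probabilities = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probabilities)

# The running example from the following slides: 9 positive, 5 negative examples.
print(round(entropy(["+"] * 9 + ["-"] * 5), 3))   # -> 0.94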

13 Information Gain Gain(S, A) = expected reduction in entropy due to sorting S on attribute A: Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S for which A has value v.
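A sketch of this formula in Python, reusing the entropy helper above; as before, each example is assumed to be a dict keyed by attribute name.

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from partitioning `examples` on `attribute`."""
    n = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset_labels = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset_labels) / n * entropy(subset_labels)
    return entropy([ex[target] for ex in examples]) - remainder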

14 Training Examples Entropy?
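The counts on the surrounding slides (S: [9+, 5-], Humidity High [3+, 4-], Normal [6+, 1-], and so on) all correspond to the standard 14-example PlayTennis training set from Mitchell (1997), reconstructed here in the dict form the sketches above expect:

# Reconstruction of Mitchell's PlayTennis training set (Table 3.2).
_rows = [
    ("D1",  "Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("D2",  "Sunny",    "Hot",  "High",   "Strong", "No"),
    ("D3",  "Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("D4",  "Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("D5",  "Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("D6",  "Rain",     "Cool", "Normal", "Strong", "No"),
    ("D7",  "Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("D8",  "Sunny",    "Mild", "High",   "Weak",   "No"),
    ("D9",  "Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("D10", "Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("D11", "Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("D12", "Overcast", "Mild", "High",   "Strong", "Yes"),
    ("D13", "Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("D14", "Rain",     "Mild", "High",   "Strong", "No"),
]
columns = ["Day", "Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
play_tennis_examples = [dict(zip(columns, row)) for row in _rows]

# Entropy of the full set (9 positive, 5 negative examples), as computed on the next slide.
print(round(entropy([ex["PlayTennis"] for ex in play_tennis_examples]), 3))   # -> 0.94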

15 Which attribute is the best? S: [9+, 5-], E = 0.94. Splitting on Humidity: High → [3+, 4-], E = 0.985; Normal → [6+, 1-], E = 0.592. Gain(S, Humidity) = 0.94 - (7/14)*0.985 - (7/14)*0.592 = 0.151

16 Which attribute is the best? S: [9+, 5-], E = 0.94. Splitting on Wind: Weak → [6+, 2-], E = 0.811; Strong → [3+, 3-], E = 1.0. Gain(S, Wind) = 0.94 - (8/14)*0.811 - (6/14)*1.0 = 0.048

17 Which attribute is the best? S: [9+, 5-], E = 0.94. Splitting on Temperature: Hot → [2+, 2-], E = 1.0; Mild → [4+, 2-], E = 0.918; Cool → [3+, 1-], E = 0.811. Gain(S, Temperature) = 0.94 - (4/14)*1.0 - (6/14)*0.918 - (4/14)*0.811 = 0.029

18 Which attribute is the best? S: [9+, 5-], E = 0.94. Splitting on Outlook: Sunny → [2+, 3-], E = 0.970; Overcast → [4+, 0-], E = 0; Rain → [3+, 2-], E = 0.970. Gain(S, Outlook) = 0.94 - (5/14)*0.970 - (4/14)*0.0 - (5/14)*0.970 = 0.246

19 Which attribute is the best? Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029, Gain(S, Outlook) = 0.246. Outlook has the highest information gain, so it is selected as the root attribute.

20 Selecting the Next Attribute S: [9+, 5-], E = 0.94, split on Outlook: Sunny → [2+, 3-], E = 0.970; Overcast → [4+, 0-], E = 0; Rain → [3+, 2-], E = 0.970. Which attribute should be tested at the Sunny branch?

21 Selecting the Next Attribute For the Sunny branch (S_sunny: [2+, 3-], E = 0.97): Gain(S_sunny, Humidity) = 0.97 - (3/5)*0.0 - (2/5)*0.0 = 0.97; Gain(S_sunny, Temperature) = 0.97 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.57; Gain(S_sunny, Wind) = 0.97 - (2/5)*1.0 - (3/5)*0.918 = 0.019. Humidity is selected.

22 Hypothesis Space Search in ID3 Hypothesis space: the set of possible decision trees. This hypothesis space is the complete space of finite discrete-valued functions, so the target function is surely in there. Search method: simple-to-complex hill-climbing. ID3 outputs only a single hypothesis (unlike the candidate-elimination method), so it is unable to explicitly represent all consistent hypotheses. No backtracking, so it can get stuck in local minima.

23 Hypothesis Space Search in ID3 Evaluation function: the information gain measure. Batch learning: ID3 uses all training examples at each step to make statistically based decisions (unlike the candidate-elimination method, which makes decisions incrementally), so the search is less sensitive to errors in individual training examples.

24 Inductive Bias in ID3 Note that H is the power set of the instance space X. Is ID3 therefore unbiased? Not really: it prefers short trees, and trees with high-information-gain attributes near the root. Restriction bias vs. preference bias: ID3 searches a complete hypothesis space, but it searches that space incompletely (a preference, or search, bias). Candidate-elimination searches an incomplete hypothesis space, but it searches that space completely (a restriction, or language, bias).

25 Why Prefer Short Hypotheses? Occam's razor: prefer the simplest hypothesis that fits the data [William of Occam (philosopher), circa 1320]. Scientists seem to do this: e.g., physicists prefer simple explanations for the motion of planets over more complex ones. Argument: since there are fewer short hypotheses than long ones, it is less likely that one will find a short hypothesis that coincidentally fits the training data. Problem with this argument: it can be made about many other constraints, so why is the "short description" constraint more relevant than others? Moreover, two learners that use different representations will arrive at different hypotheses. Nevertheless, Occam's razor has been shown experimentally to be a successful strategy.

26 Overfitting Consider the error of hypothesis h over the training data, error_train(h), and over the entire distribution D of data, error_D(h). Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that error_train(h) < error_train(h') and error_D(h) > error_D(h').

27 Overfitting in Decision Trees Consider adding a noisy training example to the PlayTennis data, e.g. D15 = <Sunny, Hot, Normal, Strong, PlayTennis = No>. What effect does it have on the earlier tree?

28 Overfitting in Decision Trees S: [9+, 5-], E = 0.94, split on Outlook: Sunny → [2+, 3-]; Overcast → [4+, 0-]; Rain → [3+, 2-]. Under Sunny, Humidity splits High → {D1, D2, D8} [3-] and Normal → {D9, D11} [2+]. The new negative example falls in the Normal branch, which becomes {D9, D11, D15} [2+, 1-], so the tree must be grown further to fit it.

29 Overfitting in Decision Trees [Figure: accuracy on the training data keeps increasing as the tree grows, while accuracy on independent test data peaks and then declines.]

30 Avoiding Overfitting How can we avoid overfitting? Either stop growing the tree when a data split is not statistically significant, or grow the full tree and then post-prune it. How to select the "best" tree: measure performance over the training data (applying a statistical test to estimate whether pruning is likely to improve performance); measure performance over a separate validation data set; or use Minimum Description Length (MDL): minimize size(tree) + size(misclassifications(tree)).

31 Reduced-Error Pruning Split data into a training set and a validation set. Do until further pruning is harmful: 1. Evaluate the impact on the validation set of pruning each possible node (plus those below it). 2. Greedily remove the one that most improves validation set accuracy. This produces the smallest version of the most accurate tree. Drawback: what if data is limited?

32 Effect of Reduced-Error Pruning

33 Rule Post-Pruning Most frequently used method (e.g., in C4.5): convert the tree to an equivalent set of rules (one rule per root-to-leaf path), prune each rule independently by removing any preconditions whose removal improves its estimated accuracy, then sort the pruned rules by estimated accuracy and apply them in that order when classifying new instances.

34 Converting a Tree to Rules IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No. IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes.

35 Why Convert to Rules? Converting to rules allows distinguishing among the different contexts in which a decision node is used. It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves. It also improves readability.

36 Incorporating Continuous-Valued Attributes For a continuous-valued attribute A, create a new boolean attribute A_c that is true if A < c and false otherwise. Select the threshold c that produces the greatest information gain; candidate thresholds are the midpoints between adjacent values where the class label changes. Example: two possible thresholds, c1 = (48+60)/2 = 54 and c2 = (80+90)/2 = 85. Which one is better? (homework)
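A sketch of where those candidates come from, assuming the temperature/label sequence of Mitchell's textbook example (40, 48, 60, 72, 80, 90 with PlayTennis = No, No, Yes, Yes, Yes, No), which is what yields c1 = 54 and c2 = 85; one would then compute the information gain of the boolean attribute A_c for each candidate and keep the best.

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
            if l1 != l2]

temperatures = [40, 48, 60, 72, 80, 90]             # assumed values from the textbook example
labels       = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperatures, labels))   # -> [54.0, 85.0]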

37 Alternative Measures for Selecting Attributes: GainRatio Problem: the information gain measure favors attributes with many values (example: add a Date attribute to our example). One solution is GainRatio: GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = -Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|) and S_i is the subset of S for which A has value v_i.
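A sketch reusing the earlier helpers; note that SplitInformation(S, A) is simply the entropy of the attribute's own value distribution, so the entropy function above computes it directly.

def gain_ratio(examples, attribute, target):
    """Information gain normalized by the entropy of the attribute's value distribution."""
    split_information = entropy([ex[attribute] for ex in examples])
    return information_gain(examples, attribute, target) / split_information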

38 Attributes with Differing Costs In some learning tasks the instance attributes may have associated costs. How do we learn a consistent tree with low expected cost? Approaches replace the information gain criterion: Tan and Schlimmer (1990) use Gain^2(S, A) / Cost(A); Nunez (1988) uses (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost.
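As straightforward Python transcriptions of the two measures above (the caller supplies the information gain and the attribute's cost; the function names are ours):

def tan_schlimmer_score(gain, cost):
    """Tan and Schlimmer (1990): trade gain squared against attribute cost."""
    return gain ** 2 / cost

def nunez_score(gain, cost, w):
    """Nunez (1988): w in [0, 1] determines how strongly cost is weighted."""
    return (2 ** gain - 1) / (cost + 1) ** w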

39 Training Examples with Missing Attribute Values If some examples are missing values of attribute A, use the training examples anyway. Either assign the most common value of A among the other examples with the same target value, or assign a probability p_i to each possible value v_i of A and then assign a fraction p_i of the example to each descendant in the tree. Classify new examples in the same fashion.

40 Summary Points Decision tree learning provides a practical method for concept learning and for learning other discrete-valued functions. ID3 searches a complete hypothesis space, so the target function is always present in the hypothesis space. The inductive bias implicit in ID3 includes a preference for smaller trees. Overfitting the training data is an important issue in decision tree learning. A large variety of extensions to the basic ID3 algorithm have been developed.