CS 9633 Machine Learning Decision Tree Learning

Presentation transcript:

CS 9633 Machine Learning: Decision Tree Learning
References:
- Machine Learning, Tom Mitchell, 1997, Chapter 3
- Artificial Intelligence: A Modern Approach, Russell and Norvig, Second Edition, 2003
- C4.5: Programs for Machine Learning, J. Ross Quinlan, 1993
Computer Science Department, CS 9633 KDD

Decision Tree Learning
- A method for approximating discrete-valued target functions.
- The learned function is represented as a decision tree.
- Learned trees can also be translated into sets of if-then rules.

Decision Tree Representation
- Instances are classified by sorting them down the tree, from the root to a leaf.
- At each node, a decision is made based on a test of a single attribute of the instance.
- The classification is the label associated with the leaf node that is reached.

[Figure: the PlayTennis decision tree. The root tests Outlook with branches Sunny, Overcast, and Rain; the Sunny branch leads to a Humidity test (High, Normal), the Overcast branch to a Yes leaf, and the Rain branch to a Wind test (Strong, Weak). The instance <Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Weak> is sorted down the Sunny branch.]

Representation
- A decision tree represents a disjunction of conjunctions of constraints on attribute values.
- Each path from the root to a leaf is a conjunction of attribute tests.
- The tree as a whole is a disjunction of these conjunctions.

Appropriate Problems
- Instances are represented by attribute-value pairs.
- The target function has discrete output values.
- Disjunctive descriptions may be required.
- The training data may contain errors.
- The training data may contain missing attribute values.

Basic Learning Algorithm
- Top-down greedy search through the space of possible decision trees.
- Exemplified by ID3 and its successor C4.5.
- At each stage, decide which attribute should be tested at a node.
- Candidate attributes are evaluated using a statistical test.
- No backtracking.

ID3(Examples, Target_attribute, Attributes)
  Create a Root node for the tree
  If all Examples are positive, return the single-node tree Root with label +
  If all Examples are negative, return the single-node tree Root with label –
  If Attributes is empty, return the single-node tree Root with label = most common value of Target_attribute in Examples
  Otherwise:
    A ← the attribute in Attributes that best classifies Examples
    The decision attribute for Root ← A
    For each possible value vi of A:
      Add a new tree branch below Root corresponding to the test A = vi
      Let Examples_vi be the subset of Examples that have value vi for A
      If Examples_vi is empty:
        Below this new branch add a leaf node with label = most common value of Target_attribute in Examples
      Else:
        Below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes – {A})
  Return Root
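A minimal Python sketch of this recursion, for illustration only: the helper names (entropy, info_gain, most_common, id3) and the nested-dict tree representation are choices made for this sketch, not part of the slides, and the sketch branches only on attribute values that actually occur in the examples, so the empty-branch case in the pseudocode never arises.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels (works for any number of classes).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, target, attr):
    # Expected reduction in entropy from partitioning examples on attr.
    labels = [ex[target] for ex in examples]
    gain = entropy(labels)
    for v in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def most_common(examples, target):
    # Most frequent value of the target attribute among the examples.
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def id3(examples, target, attributes):
    # Returns a nested dict {attribute: {value: subtree-or-class-label}}.
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:        # all examples share one class
        return labels[0]
    if not attributes:               # no attributes left to test
        return most_common(examples, target)
    best = max(attributes, key=lambda a: info_gain(examples, target, a))
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, target, remaining)
    return tree

Examples here are plain dicts, e.g. {'Outlook': 'Sunny', ..., 'PlayTennis': 'No'}, and a call would look like id3(examples, 'PlayTennis', ['Outlook', 'Temp', 'Humidity', 'Wind']).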

Selecting the “Best” Attribute
- We need a good quantitative measure: information gain.
- A statistical property that measures how well an attribute separates the training examples according to the target classification.
- Based on the entropy measure.

Entropy Measures Homogeneity
- Entropy characterizes the impurity of an arbitrary collection of examples.
- For a two-class problem (positive and negative): given a collection S containing positive and negative examples, the entropy of S relative to this boolean classification is
  Entropy(S) = -p+ log2 p+ - p- log2 p-
  where p+ and p- are the proportions of positive and negative examples in S (with 0 log 0 taken to be 0).

Examples
- Suppose S contains 4 positive and 60 negative examples: Entropy(4+, 60-) ≈ 0.337
- Suppose S contains 32 positive and 32 negative examples: Entropy(32+, 32-) = 1.0
- Suppose S contains 64 positive and 0 negative examples: Entropy(64+, 0-) = 0
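As a sanity check, the entropy helper from the sketch above reproduces these numbers (the helper is an assumption of that sketch, not something defined on the slides):

print(round(entropy(['+'] * 4 + ['-'] * 60), 3))    # 0.337
print(entropy(['+'] * 32 + ['-'] * 32))             # 1.0
print(entropy(['+'] * 64))                          # 0.0 (may print as -0.0)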

General Case
If the target attribute can take on c different values, the entropy of S relative to this c-class classification is
  Entropy(S) = sum over i = 1..c of  -p_i log2 p_i
where p_i is the proportion of S belonging to class i.

From Entropy to Information Gain
Information gain measures the expected reduction in entropy caused by partitioning the examples according to an attribute A:
  Gain(S, A) = Entropy(S) - sum over v in Values(A) of  (|S_v| / |S|) Entropy(S_v)
where Values(A) is the set of possible values of A and S_v is the subset of S for which A has value v.

[Training data table: fifteen customers (Abel through Othello) described by the attributes Debt, Income, and Marital Status, with target attribute Risk taking the values Good, Doubtful, and Poor. Most cell values did not survive the transcript; the surviving fragments are Abel - High, Married, Good; Ben - Low, Doubtful; Candy - Medium, Unmarried, Poor.]

[Figure: candidate splits of the training set S, whose class distribution is [(Good, 4), (Doubtful, 5), (Poor, 6)] and whose entropy is E ≈ 1.57. One panel splits on Marital Status (Married, Unmarried), one on Debt (Low, Medium, High), and one on Income (Low, Medium, High); the information gain of each split is compared.]
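The root entropy E for the class distribution shown here (4 Good, 5 Doubtful, 6 Poor) can be recomputed with the entropy helper assumed in the earlier sketch:

labels = ['Good'] * 4 + ['Doubtful'] * 5 + ['Poor'] * 6
print(round(entropy(labels), 3))    # 1.566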

Hypothesis Space Search
- Hypothesis space: the set of possible decision trees.
- Simple-to-complex hill-climbing search.
- The evaluation function for hill climbing is information gain.

Capabilities and Limitations
- The hypothesis space is the complete space of finite discrete-valued functions relative to the available attributes.
- Only a single hypothesis is maintained.
- No backtracking in the pure form of ID3.
- All training examples are used at each step: decisions are based on statistics over the whole training set, which makes learning less susceptible to noise.

Inductive Bias
- ID3's inductive bias is a search (preference) bias rather than a hypothesis (restriction) bias:
- Shorter trees are preferred over longer ones.
- Trees that place attributes with the highest information gain closest to the root are preferred.

Why Prefer Short Hypotheses?
- Occam's razor: prefer the simplest hypothesis that fits the data.
- Is it justified?
  - It is commonly used in science.
  - There are fewer small hypotheses than large ones, so a small hypothesis that fits the data is less likely to do so by coincidence.
  - But some classes of large hypotheses are also rare.
  - The description length, and hence the size, of a hypothesis depends on the internal representation chosen.
  - There is also an evolutionary argument.

Overfitting
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.

Avoiding Overfitting
- Stop growing the tree early, before it reaches the point where it perfectly classifies the training data.
- Allow the tree to overfit the data, and then post-prune the tree.

Criteria for Choosing the Final Tree Size
- Use a separate set of examples (a validation set) to evaluate the utility of post-pruning.
- Use all available data for training, but apply a statistical test to estimate whether expanding (or pruning) a node is likely to produce an improvement (a chi-square test was used by Quinlan at first, later abandoned in favor of post-pruning).
- Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized (the Minimum Description Length principle).

Two Types of Pruning
- Reduced-error pruning
- Rule post-pruning

Reduced-Error Pruning
- Decision nodes are pruned from the final tree.
- Pruning a node consists of removing the subtree rooted at the node, making it a leaf node, and assigning it the most common classification of the training examples associated with the node.
- Nodes are removed only if the resulting pruned tree performs no worse than the original tree over the validation set.
- Pruning continues until further pruning is harmful.
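A rough sketch of reduced-error pruning over the nested-dict trees produced by the id3 sketch above. It is a simplified bottom-up variant: each subtree is judged only on the validation examples that reach it, and classify, rep_prune, and the routing logic are all assumptions made for this sketch, not the exact procedure from the slides.

def classify(tree, example, default):
    # Walk the nested-dict tree until a class label (leaf) is reached.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        branches = tree[attr]
        value = example.get(attr)
        if value not in branches:
            return default
        tree = branches[value]
    return tree

def rep_prune(tree, train, val, target):
    # Prune bottom-up; train supplies the majority label for a candidate leaf,
    # val decides whether replacing the subtree by that leaf hurts accuracy.
    if not isinstance(tree, dict) or not train or not val:
        return tree
    attr = next(iter(tree))
    for v, sub in list(tree[attr].items()):
        tree[attr][v] = rep_prune(sub,
                                  [ex for ex in train if ex.get(attr) == v],
                                  [ex for ex in val if ex.get(attr) == v],
                                  target)
    majority = most_common(train, target)   # most common class of the training examples at this node
    leaf_correct = sum(majority == ex[target] for ex in val)
    tree_correct = sum(classify(tree, ex, majority) == ex[target] for ex in val)
    return majority if leaf_correct >= tree_correct else tree

Pruning when the accuracies tie (>=) mirrors the preference for smaller trees.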

Rule Post-Pruning
1. Infer the decision tree from the training set, allowing overfitting.
2. Convert the tree into an equivalent set of rules (one rule per root-to-leaf path).
3. Prune each rule by removing any preconditions whose removal improves its estimated accuracy.
4. Sort the pruned rules by estimated accuracy and consider them in this order when classifying new instances.

Example: If (Outlook = Sunny) ∧ (Humidity = High) Then (PlayTennis = No)
[Figure: the PlayTennis decision tree, with Outlook branches Sunny, Overcast (Yes), and Rain; the Sunny branch tests Humidity (High → No, Normal → Yes) and the Rain branch tests Wind (Strong → No, Weak → Yes). The path Outlook = Sunny, Humidity = High corresponds to the rule above.]

Why Convert the Decision Tree to Rules Before Pruning?
- It allows distinguishing among the different contexts in which a decision node is used.
- It removes the distinction between attribute tests that occur near the root and those that occur near the leaves.
- It enhances readability.

Continuous-Valued Attributes
- For a continuous attribute A, create a new boolean attribute A_c that tests whether the value of A is less than a threshold c (A < c).
- How do we select a value for the threshold c?
[Table: Temp values 40, 48, 60, 72, 80, 90 with their corresponding PlayTennis labels (No/Yes); the individual labels were lost in the transcript.]

Identification of c
1. Sort the instances by the value of the continuous attribute.
2. Find the boundaries where the target classification changes.
3. Generate candidate thresholds midway between the boundary values.
4. Evaluate the information gain of each candidate threshold.
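A small sketch of steps 1-3: sort the values, find the class boundaries, and emit midpoint thresholds. The candidate_thresholds helper is invented here, and the labels below are illustrative stand-ins (the labels in the slide's table did not survive the transcript); each resulting threshold c defines a boolean attribute A < c whose information gain can then be evaluated, e.g. with the info_gain sketch above.

def candidate_thresholds(values, labels):
    # Midpoints between adjacent sorted values where the class label changes.
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temps  = [40, 48, 60, 72, 80, 90]
labels = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']    # assumed labels, for illustration
print(candidate_thresholds(temps, labels))          # [54.0, 85.0]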

Alternative Measures for Selecting Attributes
- Information gain has a natural bias toward attributes with many values (such as the Customer ID in the earlier table).
- This can result in selecting an attribute that fits the training data very well but does not generalize.
- Many alternative measures have been used, e.g. the gain ratio (Quinlan 1986).

Missing Attribute Values
- Suppose we have the instance <x1, c(x1)> at a node n (among other instances).
- We want to compute the gain for a split on attribute A, but the value A(x1) is missing.
- What should we do?

Two Simple Approaches
- Assign the missing attribute the most common value of A among the examples at node n.
- Assign the missing attribute the most common value of A among the examples at node n that have the same classification c(x1).
Example at node n, splitting on attribute A: <blue, …, yes>, <red, …, no>, <blue, …, yes>, <?, …, no>

A More Complex Procedure
- Assign a probability to each possible value of A, based on the observed frequencies of A's values among the examples at node n.
- In the previous example the probabilities would be 0.33 for red and 0.67 for blue.
- Distribute fractional instances down the corresponding branches of the tree and use these fractional counts to compute information gain.
- The same fractional values can also be used when classifying new instances with missing attribute values.
- This is the method used by Quinlan in C4.5.
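A toy illustration of the fractional scheme, using the four instances from the previous slide; the variable names and the weight bookkeeping are inventions of this sketch:

from collections import Counter

known = ['blue', 'red', 'blue']            # observed values of A at node n
counts = Counter(known)
probs = {v: c / len(known) for v, c in counts.items()}
print(probs)                               # {'blue': 0.67, 'red': 0.33} (approximately)

# The instance with A missing is sent down every branch with the matching weight;
# these fractional counts are then used in place of whole counts when computing
# information gain (and, per the slide, when classifying new instances).
weights = {v: p * 1.0 for v, p in probs.items()}   # one instance of weight 1.0
print(weights)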

Attributes with Different Costs
- This situation often occurs in diagnostic settings (e.g. medical tests with different costs).
- Introduce a cost term into the attribute selection measure.
- Approaches:
  - Divide Gain by the cost of the attribute.
  - Tan and Schlimmer: Gain^2(S, A) / Cost(A)
  - Nunez: (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the relative importance of cost versus gain.
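The two cost-sensitive measures on this slide, written out as small functions (the function names and the example numbers are made up for this sketch; gain would come from an ordinary information-gain computation such as the info_gain sketch earlier):

def tan_schlimmer(gain, cost):
    # Gain^2(S, A) / Cost(A)
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    # (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]
    return (2 ** gain - 1) / (cost + 1) ** w

print(tan_schlimmer(0.25, 2.0))   # 0.03125
print(nunez(0.25, 2.0, w=1.0))    # about 0.063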