CHAPTER 9: Decision Trees

Based on Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © 2004 The MIT Press (V1.1); slides corrected and new slides added by Ch. Eick.

Tree Uses Nodes, and Leaves

Divide and Conquer
Internal decision nodes:
- Univariate: uses a single attribute, xi
  - Numeric xi: binary split, xi > wm
  - Discrete xi: n-way split for the n possible values
- Multivariate: uses all attributes, x
Leaves:
- Classification: class labels, or class proportions
- Regression: a numeric value; the average r of the examples at the leaf, or a local fit
Learning is greedy: find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993), without backtracking (decisions taken are final).
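
To make the greedy divide-and-conquer procedure above concrete, here is a minimal Python sketch of recursive univariate induction with binary splits on numeric attributes; all names (entropy, best_split, grow, predict) are illustrative assumptions, not code from the slides or the book.

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy (in bits) of the class distribution of a list of labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_split(X, y):
    """Exhaustive search over attributes and cut points for the binary split
    x[j] > t that minimizes the weighted entropy of the two branches."""
    best = None  # (weighted impurity, attribute index j, threshold t)
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [yi for x, yi in zip(X, y) if x[j] <= t]
            right = [yi for x, yi in zip(X, y) if x[j] > t]
            if not left or not right:
                continue
            imp = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or imp < best[0]:
                best = (imp, j, t)
    return best

def grow(X, y):
    """Greedy divide and conquer: choose the best split, recurse, never backtrack."""
    split = best_split(X, y)
    if len(set(y)) == 1 or split is None:                 # pure node, or nothing to split on
        return ("leaf", Counter(y).most_common(1)[0][0])  # majority class label
    _, j, t = split
    L = [(x, yi) for x, yi in zip(X, y) if x[j] <= t]
    R = [(x, yi) for x, yi in zip(X, y) if x[j] > t]
    return ("node", j, t,
            grow([x for x, _ in L], [yi for _, yi in L]),
            grow([x for x, _ in R], [yi for _, yi in R]))

def predict(tree, x):
    """Follow the decision nodes down to a leaf and return its class label."""
    if tree[0] == "leaf":
        return tree[1]
    _, j, t, left, right = tree
    return predict(left if x[j] <= t else right, x)

# Tiny usage example with two numeric attributes and two classes.
X = [[1.0, 5.0], [2.0, 4.0], [6.0, 1.0], [7.0, 2.0]]
y = ["a", "a", "b", "b"]
tree = grow(X, y)
print(predict(tree, [1.5, 4.5]), predict(tree, [6.5, 1.5]))  # -> a b
```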

Side Discussion: “Greedy Algorithms”
- Fast, and therefore attractive for solving NP-hard and other problems of high complexity.
- Later decisions are made in the context of decisions selected earlier, which dramatically reduces the size of the search space.
- They do not backtrack: if they make a bad decision (based on local criteria), they never revise it.
- They are not guaranteed to find the optimal solution, and can sometimes be deceived into finding really bad solutions.
- In spite of the above, a lot of successful and popular algorithms in computer science are greedy algorithms. Greedy algorithms are particularly popular in AI and operations research.
- Popular greedy algorithms: decision tree induction, …

Classification Trees (ID3, CART, C4.5)
For node m: Nm instances reach m, and Nim of them belong to class Ci.
Node m is pure if pim is 0 or 1, where pim = Nim / Nm.
The measure of impurity is entropy.
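
The formulas on the original slide are images in the deck; a reconstruction in the book's notation (assumed from Alpaydın, Ch. 9, with K classes): the estimated class probability at node m and the entropy impurity measure.

```latex
% Estimated class probability at node m, and the entropy impurity of node m
% (reconstructed; notation assumed from Alpaydin, Ch. 9).
\[
\hat{P}(C_i \mid \mathbf{x}, m) \equiv p_m^i = \frac{N_m^i}{N_m},
\qquad
\mathcal{I}_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i
\]
```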

Best Split (skip)
If node m is pure, generate a leaf and stop; otherwise split and continue recursively.
Impurity after the split: Nmj of the Nm instances take branch j, and Nimj of them belong to Ci.
Find the variable and split that minimize impurity (among all variables, and among all split positions for numeric variables).
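
Likewise, the post-split impurity formula from the slide image, reconstructed under the same assumed notation (N_mj instances take branch j, N^i_mj of them belong to class C_i):

```latex
% Impurity after an n-way split of node m (reconstructed in the same notation).
\[
p_{mj}^i = \frac{N_{mj}^i}{N_{mj}},
\qquad
\mathcal{I}'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i
\]
```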

Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- How to split the records: how to specify the attribute test condition, and how to determine the best split?
- When to stop splitting?

How to determine the Best Split?
Before splitting: 10 records of class 0 and 10 records of class 1.
Before: E(1/2,1/2)
After (for a three-way candidate split): 4/20*E(1/4,3/4) + 8/20*E(1,0) + 8/20*E(1/8,7/8)
Gain: Before - After. Pick the test that has the highest gain!
Remark: E stands for Gini, entropy (H), impurity 1 - max_c P(c), or gain ratio.

Entropy and Gain Computations
Assume we have m classes in our classification problem. A test S subdivides the examples D = (p1,…,pm) into n subsets D1 = (p11,…,p1m), …, Dn = (pn1,…,pnm). The quality of S is evaluated using Gain(D,S) (ID3) or Gain_Ratio(D,S) (C4.5/C5.0):
H(D = (p1,…,pm)) = Σ_{i=1..m} pi * log2(1/pi)   (the entropy function)
Gain(D,S) = H(D) - Σ_{i=1..n} (|Di|/|D|) * H(Di)
Gain_Ratio(D,S) = Gain(D,S) / H(|D1|/|D|, …, |Dn|/|D|)
Remarks: |D| denotes the number of elements in set D. D = (p1,…,pm) implies that p1 + … + pm = 1 and indicates that, of the |D| examples, p1*|D| belong to the first class, p2*|D| belong to the second class, …, and pm*|D| belong to the m-th (last) class. H(0,1) = H(1,0) = 0; H(1/2,1/2) = 1; H(1/4,1/4,1/4,1/4) = 2; H(1/p,…,1/p) = log2(p). C5.0 selects the test S with the highest value of Gain_Ratio(D,S), whereas ID3 picks the test S for the examples in set D with the highest value of Gain(D,S).
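
A small Python sketch of these definitions (the function names H, dist, gain, and gain_ratio are mine, not from the slides); the usage at the end reproduces the Gain and gain-ratio penalty of the city test on the example slide that follows.

```python
import math

def H(*p):
    """Entropy H(p1,...,pm) in bits; terms with pi = 0 contribute nothing."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def dist(labels):
    """Class distribution (p1,...,pm) of a list of class labels."""
    n = len(labels)
    return [labels.count(c) / n for c in sorted(set(labels))]

def gain(D, subsets):
    """Gain(D,S) = H(D) - sum_j (|Dj|/|D|) * H(Dj), for a test S that
    splits the labelled examples D into the given subsets."""
    n = len(D)
    return H(*dist(D)) - sum(len(Dj) / n * H(*dist(Dj)) for Dj in subsets)

def gain_ratio(D, subsets):
    """Gain_Ratio(D,S) = Gain(D,S) / H(|D1|/|D|, ..., |Dn|/|D|)."""
    n = len(D)
    return gain(D, subsets) / H(*[len(Dj) / n for Dj in subsets])

# The 'city' test from the example slide below: 6 examples, classes '+' and '-'.
D_sf, D_la = ["+", "+", "+"], ["+", "-", "-"]
D = D_sf + D_la                               # overall distribution (2/3, 1/3)
print(round(gain(D, [D_sf, D_la]), 2))        # 0.46 (the slide rounds to 0.45)
print(round(gain_ratio(D, [D_sf, D_la]), 2))  # penalty H(1/2,1/2) = 1, so 0.46 again
```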

Information Gain vs. Gain Ratio
D = (2/3,1/3), i.e. six examples, four of the first class and two of the second.

city=:  city=sf -> D1(1,0),  city=la -> D2(1/3,2/3)
Gain(D,city=) = H(1/3,2/3) - 1/2 H(1,0) - 1/2 H(1/3,2/3) = 0.45
G_Ratio_pen(city=) = H(1/2,1/2) = 1

car=:  car=van -> D1(0,1),  car=merc -> D2(2/3,1/3),  car=taurus -> D3(1,0)
Gain(D,car=) = H(1/3,2/3) - 1/6 H(0,1) - 1/2 H(2/3,1/3) - 1/3 H(1,0) = 0.45
G_Ratio_pen(car=) = H(1/2,1/3,1/6) = 1.45

age=:  age=22 -> D1(1,0),  age=25 -> D2(0,1),  age=27 -> D3(1,0),  age=35 -> D4(1,0),  age=40 -> D5(1,0),  age=50 -> D6(0,1)
Gain(D,age=) = H(1/3,2/3) - 6*1/6 H(0,1) = 0.90
G_Ratio_pen(age=) = log2(6) = 2.58

Result with information gain (I_Gain): age > car = city
Result with gain ratio (I_Gain_Ratio): city > age > car

Splitting Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute
  - Static: discretize once at the beginning
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, or supervised clustering
- Binary decision: (A < v) or (A >= v)
  - Consider all possible splits and find the best cut v
  - Can be more compute-intensive
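
A sketch of the binary-decision approach for a single numeric attribute (names are illustrative): sort the values once, sweep candidate cut points between consecutive distinct values while updating class counts incrementally, and keep the cut with the lowest weighted entropy.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy (bits) of a Counter of class frequencies."""
    n = sum(counts.values())
    return sum(c / n * math.log2(n / c) for c in counts.values() if c)

def best_cut(values, labels):
    """Best threshold v for the binary test (A < v) vs (A >= v), found by a
    single sweep over the sorted values with incrementally updated counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    n, best = len(labels), (float("inf"), None)
    for k, i in enumerate(order[:-1]):
        left[labels[i]] += 1
        right[labels[i]] -= 1
        # Only cut between two distinct attribute values.
        if values[i] == values[order[k + 1]]:
            continue
        v = (values[i] + values[order[k + 1]]) / 2  # midpoint as the cut point
        imp = (k + 1) / n * entropy(left) + (n - k - 1) / n * entropy(right)
        best = min(best, (imp, v))
    return best[1], best[0]

# Tiny example (made-up data): the best cut separates the small from the large values.
print(best_cut([1.0, 2.0, 2.5, 7.0, 8.0], ["a", "a", "a", "b", "b"]))  # -> (4.75, 0.0)
```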

Stopping Criteria for Tree Induction
Grow the entire tree:
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have the same attribute values
Pre-pruning (do not grow the complete tree):
- Stop when only x examples are left
- … other pre-pruning strategies

How to Address Overfitting in Decision Trees
The most popular approach: post-pruning
- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If the generalization error improves after trimming, replace the sub-tree by a leaf node
- The class label of the leaf node is determined from the majority class of the instances in the sub-tree
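
A minimal sketch of one concrete post-pruning scheme, reduced-error pruning on a held-out validation set; it reuses the tree representation from the induction sketch above and illustrates the idea, not the exact procedure of C4.5/C5.0. As a simplification, the candidate leaf's label is taken from the validation examples reaching the node rather than from the sub-tree's training instances.

```python
from collections import Counter

def predict(tree, x):
    """Classify x with a tree of the form ("leaf", label) / ("node", j, t, L, R)."""
    if tree[0] == "leaf":
        return tree[1]
    _, j, t, left, right = tree
    return predict(left if x[j] <= t else right, x)

def errors(tree, X, y):
    """Number of validation examples the (sub)tree misclassifies."""
    return sum(predict(tree, x) != yi for x, yi in zip(X, y))

def prune(tree, X, y):
    """Bottom-up: replace a sub-tree by a single leaf whenever that does not
    increase the error on the validation examples (X, y) reaching this node."""
    if tree[0] == "leaf" or not y:   # nothing to prune, or no data reaches the node
        return tree
    _, j, t, left, right = tree
    L = [(x, yi) for x, yi in zip(X, y) if x[j] <= t]
    R = [(x, yi) for x, yi in zip(X, y) if x[j] > t]
    left = prune(left, [x for x, _ in L], [yi for _, yi in L])
    right = prune(right, [x for x, _ in R], [yi for _, yi in R])
    kept = ("node", j, t, left, right)
    # Candidate leaf: labelled with the majority class of the examples seen here
    # (a simplification of the slide's "majority class of the sub-tree's instances").
    leaf = ("leaf", Counter(y).most_common(1)[0][0])
    return leaf if errors(leaf, X, y) <= errors(kept, X, y) else kept
```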

Advantages of Decision Tree Based Classification
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Okay for noisy data
- Can handle both continuous and symbolic attributes
- Accuracy is comparable to other classification techniques for many simple data sets; decent average performance over many datasets
- Something of a standard: if you want to show that your “new” classification technique really “improves the world”, compare its performance against decision trees (e.g. C5.0) using 10-fold cross-validation
- Does not need distance functions; only the order of attribute values matters for classification: 0.1, 0.2, 0.3 and 0.331, 0.332, 0.333 are the same for a decision tree learner
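
As an illustration of the baseline comparison described above, a hedged scikit-learn snippet (scikit-learn provides CART-style trees rather than C5.0) comparing a decision tree against another classifier with 10-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validated accuracy of the decision tree baseline...
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
# ...versus a "new" technique to be evaluated (kNN here, just as a stand-in).
knn_acc = cross_val_score(KNeighborsClassifier(), X, y, cv=10)

print(f"decision tree: {tree_acc.mean():.3f}  kNN: {knn_acc.mean():.3f}")
```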

Disadvantages of Decision Tree Based Classification
- Relies on rectangular approximations that might not be good for some datasets
- Selecting good learning-algorithm parameters (e.g. the degree of pruning) is non-trivial
- Ensemble techniques, support vector machines, and kNN might obtain higher accuracies for a specific dataset
- More recently, forests (ensembles of decision trees) have gained some popularity

Error in Regression Trees
Examples: (1, 1.37), (2, -1.45), (2.1, -1.35), (2.2, -1.25), (7, 2.06), (8, 1.66)
Error before split: 1/6 * (2*0^2 + 2*0.01 + 2*0.04)
In general: the split that minimizes the squared error is chosen!
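
A small Python sketch (helper names mine) that evaluates every binary cut on x for the example points above and reports the cut with the smallest total squared error; it illustrates the "minimizes squared error" rule rather than reproducing the slide's particular arithmetic.

```python
points = [(1, 1.37), (2, -1.45), (2.1, -1.35), (2.2, -1.25), (7, 2.06), (8, 1.66)]

def sse(ys):
    """Sum of squared errors of ys around their mean (0 for an empty leaf)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_regression_cut(pts):
    """Try cutting between consecutive x values; return (total SSE, cut)."""
    xs = sorted(x for x, _ in pts)
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:]) if a != b]
    def total(c):
        return sse([y for x, y in pts if x < c]) + sse([y for x, y in pts if x >= c])
    return min((total(c), c) for c in cuts)

err, cut = best_regression_cut(points)
print(cut, err)  # the best first cut separates the two right-most points: x >= 4.6
```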

Regression Trees
Error at node m:
After splitting:
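
The two error formulas were images on the original slide; a reconstruction following the book's notation (assumed), where b_m(x^t) indicates whether example x^t reaches node m and g_m is the constant prediction (the average r) at node m:

```latex
% Error at node m: mean squared error around the leaf mean g_m
% (b_m(x^t) = 1 if x^t reaches node m, 0 otherwise).
\[
E_m = \frac{1}{N_m} \sum_t \left(r^t - g_m\right)^2 b_m(\mathbf{x}^t),
\qquad
g_m = \frac{\sum_t b_m(\mathbf{x}^t)\, r^t}{\sum_t b_m(\mathbf{x}^t)}
\]
% Error after splitting node m into branches j, each with its own mean g_{mj}:
\[
E'_m = \frac{1}{N_m} \sum_j \sum_t \left(r^t - g_{mj}\right)^2 b_{mj}(\mathbf{x}^t)
\]
```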

Model Selection in Trees

Summary: Regression Trees
Regression trees work like decision trees, except that:
- Leaves carry numbers which predict the output variable, instead of class labels
- Instead of entropy/purity, the squared prediction error serves as the performance measure; tests that reduce the squared prediction error the most are selected as node tests. Test conditions have the form y > threshold, where y is an independent (input) variable of the prediction problem.
- Instead of using the majority class, leaf labels are generated by averaging the values of the dependent variable over the examples that are associated with the particular leaf node
- Like decision trees, regression trees employ rectangular tessellations, with a fixed number associated with each rectangle; that number is the output for any input which lies inside the rectangle
- In contrast to ordinary regression, which performs curve fitting, regression trees use averaging with respect to the output variable when making predictions
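
A brief scikit-learn illustration of the piecewise-constant, leaf-averaging behaviour summarized above, fitted to the six example points used earlier (a sketch, not part of the original slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# The six (x, r) examples from the regression-tree error slide.
X = np.array([[1], [2], [2.1], [2.2], [7], [8]])
r = np.array([1.37, -1.45, -1.35, -1.25, 2.06, 1.66])

# A shallow regression tree: each leaf predicts the average r of its examples.
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r)
print(reg.predict([[1.5], [2.15], [7.5]]))  # piecewise-constant predictions
```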

Multivariate Trees
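
The multivariate-tree slide is a figure in the original deck; the key idea, reconstructed in the book's notation (assumed), is that an internal node may test a linear combination of all attributes rather than a single attribute:

```latex
% Linear multivariate (oblique) split at internal node m:
% branch on a weighted sum of all attributes instead of a single attribute.
\[
f_m(\mathbf{x}) : \; \mathbf{w}_m^{T}\mathbf{x} + w_{m0} > 0
\]
```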