CART: Classification and Regression Trees. Presented by: Pavla Smetanova, Lütfiye Arslan, Stefan Lhachimi. Based on the book "Classification and Regression Trees" by L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984).

Outline
1. INTRODUCTION: What is CART? An example. Terminology. Strengths.
2. METHOD: 3 steps in CART: tree building, pruning, the final tree.

What is CART? A non-parametric technique using the methodology of tree building. It classifies objects or predicts outcomes by selecting, from a large number of variables, the most important ones for determining the outcome variable. CART analysis is a form of binary recursive partitioning.

An example from clinical research: development of a reliable clinical decision rule to classify new patients into risk categories. 19 measurements (age, blood pressure, etc.) are taken from each heart-attack patient during the first 24 hours after admission to San Diego Hospital. The goal: identify high-risk patients.

Classification of patients into high-risk (G) and not-high-risk (F) groups:
Is the minimum systolic blood pressure over the initial 24 hours > 91?
  no  -> G (high risk)
  yes -> Is age > 62.5?
           no  -> F (not high risk)
           yes -> Is sinus tachycardia present?
                    yes -> G
                    no  -> F
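
The tree can be read as a nested decision rule. A minimal sketch in Python, assuming the branch directions reconstructed above (the function name and example values are illustrative, not taken from the original data):

    def classify_patient(min_systolic_bp, age, sinus_tachycardia):
        # Toy version of the heart-attack decision rule sketched above;
        # branch directions follow the reconstruction on this slide.
        if min_systolic_bp <= 91:
            return "G"                       # high risk
        if age <= 62.5:
            return "F"                       # not high risk
        return "G" if sinus_tachycardia else "F"

    # Example: an older patient with stable blood pressure and tachycardia.
    print(classify_patient(min_systolic_bp=110, age=70, sinus_tachycardia=True))  # -> G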

Terminology
The classification problem: a systematic way of predicting the class of an object based on measurements.
C = {1,...,J}: the set of classes.
x: measurement vector.
d(x): a classifying function assigning every x to one of the classes 1,...,J.

Terminology
s: a split.
Learning sample L: measurement data on N cases observed in the past, together with their actual classification.
R*(d): true misclassification rate, R*(d) = P(d(X) ≠ Y), Y ∈ C.

Strengths
No distributional assumptions are required.
No assumption of homogeneity.
The explanatory variables can be a mixture of categorical, interval, and continuous variables.
Especially good for high-dimensional and large data sets.
Produces useful results using only a few important variables.

Strengths
Sophisticated methods for dealing with missing values.
Unaffected by outliers, collinearity, heteroscedasticity.
Not difficult to interpret.
An important weakness: not based on a probabilistic model, so no confidence intervals.

Dealing with missing values. CART does not drop cases with missing measurement values. Surrogate splits: define a measure of similarity between any two splits s, s' of a node t. If the best split of t is s on variable x_m, find the split s' on the other variables that is most similar to s; call it the best surrogate of s. Then find the 2nd best, and so on. If a case has x_m missing, it is routed using the surrogates.
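
A hedged sketch of the surrogate-split idea (a simplification of the book's similarity measure; all names are illustrative): score a candidate split s' by how often it sends cases to the same side as the primary split s, and use the best-scoring surrogate whenever the primary variable is missing.

    import numpy as np

    def split_agreement(x_primary, thr_primary, x_other, thr_other):
        # Fraction of cases sent to the same side by the two splits;
        # a simple stand-in for the similarity between splits s and s'.
        left_primary = x_primary <= thr_primary
        left_other = x_other <= thr_other
        return np.mean(left_primary == left_other)

    # Toy data: x1 is the primary split variable, x2 a candidate surrogate.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.5, size=200)   # correlated with x1

    # Find the threshold on x2 that best mimics the primary split "x1 <= 0".
    candidate_thrs = np.unique(x2)
    best_thr = max(candidate_thrs, key=lambda t: split_agreement(x1, 0.0, x2, t))
    print("best surrogate threshold on x2:", best_thr)

    # If x1 is missing for a case, route it with the surrogate split instead.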

3 Steps in CART
1. Tree building
2. Pruning
3. Optimal tree selection
If the dependent variable is categorical, a classification tree is used; if it is continuous, a regression tree. Remark: until the regression part, we discuss only classification trees.

Example tree (figure): a tree diagram whose legend marks the root node, the terminal nodes, and the non-terminal nodes.

Tree Building Process
What is a tree? The collection of repeated splits of subsets of X into two descendant subsets. Formally, a finite non-empty set T of positive integers and two functions left(·) and right(·) from T to T which satisfy:
(i) for each t ∈ T, either left(t) = right(t) = 0, or left(t) > t and right(t) > t;
(ii) for each t ∈ T, other than the smallest integer in T, there is exactly one s ∈ T such that either t = left(s) or t = right(s).

Terminology of trees
Root of T: the minimum element of the tree.
s is the parent of t (and t a child of s) if t = left(s) or t = right(s).
T*: the set of terminal nodes, i.e., nodes with left(t) = right(t) = 0.
T − T*: the non-terminal nodes.
A node s is an ancestor of t if s = parent(t), or s = parent(parent(t)), or ...

A node t is a descendant of s if s is an ancestor of t. A branch T_t of T with root node t ∈ T consists of the node t and all descendants of t in T. The main problem of tree building: how to use the data L to determine the splits, the terminal nodes, and the assignment of terminal nodes to classes.

Steps of tree building
1. Start by splitting a variable at all of its split points; at each split point the sample splits into two binary nodes.
2. Select the best split of the variable in terms of the reduction in impurity (heterogeneity).
3. Repeat steps 1-2 for all variables at the root node.

4. Rank all of the best splits and select the variable that achieves the highest purity at the root.
5. Assign classes to the nodes according to a rule that minimizes misclassification costs.
6. Repeat steps 1-5 for each non-terminal node t.
7. Grow a very large tree T_max until all terminal nodes are either small, or pure, or contain identical measurement vectors.
8. Prune and choose the final tree using cross-validation (see the sketch below).
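
As a practical illustration of steps 1-8 (not part of the original slides), scikit-learn's DecisionTreeClassifier follows the same recipe of greedy splitting followed by pruning, although its implementation differs from the book's CART in details such as surrogate splits. The dataset and parameter values below are purely illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Grow a large tree T_max: keep splitting until nodes are pure or small.
    tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
    tree.fit(X_train, y_train)

    print("terminal nodes:", tree.get_n_leaves())
    print("test accuracy :", tree.score(X_test, y_test))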

1-2: Construction of the classifier
Goal: find a split s that divides L into subsets that are as pure as possible. The goodness-of-split criterion is the decrease in impurity
Δi(s,t) = i(t) − p_L i(t_L) − p_R i(t_R),
where i(t) is the node impurity and p_L, p_R are the proportions of cases sent to the left and right child nodes.

To extract the best split, choose the split s* which satisfies
Δi(s*,t) = max_s Δi(s,t).
Repeat the same procedure (optimizing at each step) until a node t is reached at which no significant decrease in impurity is possible; declare it a terminal node.
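
A minimal sketch of this split search, using the Gini index as the node impurity i(t) (one common choice; the slides do not fix a particular impurity function), with illustrative names throughout:

    import numpy as np

    def gini(y):
        # Gini impurity i(t) of the labels in a node.
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def impurity_decrease(x, y, thr):
        # Delta i(s,t) = i(t) - p_L*i(t_L) - p_R*i(t_R) for the split "x <= thr".
        left = x <= thr
        p_left = left.mean()
        if p_left in (0.0, 1.0):            # degenerate split
            return 0.0
        return gini(y) - p_left * gini(y[left]) - (1 - p_left) * gini(y[~left])

    def best_split(X, y):
        # Search all variables and all split points for s* maximizing Delta i.
        best = (None, None, 0.0)            # (variable index, threshold, decrease)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                dec = impurity_decrease(X[:, j], y, thr)
                if dec > best[2]:
                    best = (j, thr, dec)
        return best

    # Toy example with two variables; the second carries the class signal.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    y = (X[:, 1] > 0).astype(int)
    print(best_split(X, y))                 # expect variable 1, threshold near 0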

5: Estimating accuracy
Concept of R*(d): construct d using L. Draw another sample from the same population as L, observe the correct classifications, and find the predicted classifications using d(x). The proportion misclassified by d is the value of R*(d).

3 internal estimates of R*(d)
1. The resubstitution estimate (least accurate):
R(d) = (1/N) Σ_n I(d(x_n) ≠ j_n).
2. The test-sample estimate (for large sample sizes):
R_ts(d) = (1/N_2) Σ_{(x_n, j_n) ∈ L_2} I(d(x_n) ≠ j_n).
3. Cross-validation (preferred for smaller samples):
R_ts(d^(v)) = (1/N_v) Σ_{(x_n, j_n) ∈ L_v} I(d^(v)(x_n) ≠ j_n),
R_CV(d) = (1/V) Σ_v R_ts(d^(v)).
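
The three estimates amount to measuring the error rate on different data: the learning sample itself, a held-out test sample, or cross-validation folds. A brief sketch with scikit-learn (dataset and settings are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

    d = DecisionTreeClassifier(random_state=0).fit(X_learn, y_learn)

    resub = 1 - d.score(X_learn, y_learn)                        # R(d): resubstitution
    test = 1 - d.score(X_test, y_test)                           # R_ts(d): test sample
    cv = 1 - cross_val_score(DecisionTreeClassifier(random_state=0),
                             X_learn, y_learn, cv=10).mean()     # R_CV(d): 10-fold CV
    print(resub, test, cv)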

7: Before pruning
Instead of finding appropriate stopping rules, grow a very large tree T_max and then prune it back toward the root; use R*(T) to select the optimal tree among the pruned subtrees. To grow a sufficiently large initial tree T_max, specify a minimum node size N_min and keep splitting until each terminal node is either pure or satisfies N(t) ≤ N_min. Generally N_min has been set at 5, occasionally at 1.

Definition: pruning a branch T_t from a tree T consists of cutting off all of T_t except its root node t, i.e., deleting all descendants of t in T. T − T_t is the pruned tree. (Figure: a tree T, a branch T_2, and the pruned tree T − T_2.)

Minimal cost-complexity pruning
For any subtree T of T_max, define its complexity |T| as the number of terminal nodes in T. Let α ≥ 0 be a real number called the complexity parameter: a measure of how much additional accuracy a split must add to the entire tree to warrant the additional complexity. The cost-complexity measure R_α(T) is a linear combination of the cost of the tree and its complexity:
R_α(T) = R(T) + α|T|.

For each value of α, find the subtree T(α) which minimizes R_α(T), i.e., R_α(T(α)) = min_T R_α(T). For α = 0 we have T_max; as α increases the tree becomes smaller, reducing down to the root node at the extreme. The result is a finite sequence of subtrees T_1, T_2, T_3, ..., T_K with progressively fewer terminal nodes.
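
scikit-learn exposes this α-sequence directly through cost_complexity_pruning_path; a brief sketch on an illustrative dataset:

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # The sequence of alphas and the corresponding pruned subtrees.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    for alpha in path.ccp_alphas:
        t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
        print(f"alpha={alpha:.5f}  terminal nodes={t.get_n_leaves()}")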

Optimal tree selection
Task: find the correct complexity parameter α so that the information in L is fit, but not overfit. This normally requires an independent set of data; if none is available, use cross-validation to pick the subtree with the lowest estimated misclassification rate.

Cross-validation
L is randomly divided into V subsets L_1, ..., L_V. For every v = 1, ..., V, apply the procedure using L − L_v as the learning sample and let d^(v)(x) be the resulting classifier. A test-sample estimate for R*(d^(v)) is
R_ts(d^(v)) = (1/N_v) Σ_{(x_n, j_n) ∈ L_v} I(d^(v)(x_n) ≠ j_n),
where N_v is the number of cases in L_v.
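
Combining the pruning sequence with V-fold cross-validation picks the value of α (and hence the subtree) with the lowest estimated misclassification rate; a sketch under the same illustrative setup as above:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

    # R_CV for each alpha: average misclassification rate over V = 10 folds.
    def cv_error(alpha):
        scores = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                                 X, y, cv=10)
        return 1 - scores.mean()

    best_alpha = min(alphas, key=cv_error)
    print("selected alpha:", best_alpha)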

Regression trees
The basic idea is the same as for classification. The regression estimator in each node is the mean of the response values of the cases falling in that node.

Split a region R into R_1 and R_2 such that the sum of squared residuals of the estimator is minimized. The resulting mean squared error R*(d) = E(Y − d(X))² is the counterpart of the true misclassification rate in classification trees.
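
A minimal sketch of the regression split for a single variable: predict the node mean and choose the split point minimizing the summed squared residuals of the two children (function and variable names are illustrative):

    import numpy as np

    def sse(y):
        # Sum of squared residuals around the node mean (the node's regression estimator).
        return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

    def best_regression_split(x, y):
        # Split R into R1 = {x <= thr} and R2 = {x > thr} minimizing SSE(R1) + SSE(R2).
        best_thr, best_cost = None, np.inf
        for thr in np.unique(x)[:-1]:       # last value would leave R2 empty
            left = x <= thr
            cost = sse(y[left]) + sse(y[~left])
            if cost < best_cost:
                best_thr, best_cost = thr, cost
        return best_thr, best_cost

    # Toy example: a step function in x is recovered by a single split.
    x = np.linspace(0, 1, 50)
    y = np.where(x > 0.4, 3.0, 1.0) + np.random.default_rng(2).normal(scale=0.1, size=50)
    print(best_regression_split(x, y))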

Comments
CART is used mostly in clinical research, air pollution studies, criminal justice, molecular structures, ...
It is more accurate than linear regression on nonlinear problems, and it looks at the data from a different viewpoint.