Tree-Based Methods (V&R 9.1) Demeke Kasaw, Andreas Nguyen, Mariana Alvaro STAT 6601 Project

Overview of Tree-Based Methods
What are they? How do they work? Examples follow.
Tree pictorials are common and give a simple way to depict relationships in data. Tree-based methods use this pictorial form to represent relationships between random variables.

Trees can be used for both classification and regression.
[Slide shows two example trees: a classification tree for presence/absence of surgery complications, splitting on treatment start date (Start >= 8.5 months, Start >= 14.5), patient age (Age < 12 yrs), and sex; and a regression tree for time to the next eruption, splitting on the length of the last eruption (< 3.0 min, < 4.1 min).]
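To make the distinction concrete, here is a minimal sketch fitting one classification tree and one regression tree with rpart. It uses built-in R datasets as stand-ins (the kyphosis data shipped with rpart and the Old Faithful data), not necessarily the slide's own datasets, and the object names cfit and rfit are ours.

library(rpart)

# Classification tree: categorical response
# (presence/absence of kyphosis, a post-surgery complication)
cfit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Regression tree: numeric response
# (waiting time to the next Old Faithful eruption vs. length of the last one)
rfit <- rpart(waiting ~ eruptions, data = faithful)

par(mfrow = c(1, 2))
plot(cfit, margin = 0.1); text(cfit, use.n = T)
plot(rfit, margin = 0.1); text(rfit)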

General Computation Issues and Unique Solutions
Over-fitting: when do we stop splitting? Stop generating new nodes when subsequent splits give only a small improvement.
Evaluate the quality of the prediction: prune the tree so as to select the simplest adequately accurate solution.
Methods:
– Cross-validation: apply the tree computed from one set of observations (the learning sample) to a completely independent set of observations (the testing sample).
– V-fold cross-validation: repeat the analysis with different randomly drawn samples from the data, and use the tree that shows the best average accuracy for cross-validated predicted classifications or predicted values.
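As a minimal sketch of the learning-sample/testing-sample idea, applied to the iris data used later in these slides (the 100-case split and the object names are our own, for illustration):

library(rpart)
data(iris)

set.seed(1)                                # reproducible split
train <- sample(nrow(iris), 100)           # learning sample of 100 flowers
fit <- rpart(Species ~ ., data = iris[train, ])

# Apply the tree to the completely independent testing sample
# and estimate the misclassification rate
pred <- predict(fit, iris[-train, ], type = "class")
mean(pred != iris$Species[-train])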

Computational Details
Specify the criteria for predictive accuracy:
– Minimum costs: lowest misclassification rate.
– Case weights.
Selecting splits:
– Define a measure of impurity for a node. A node is "pure" if it contains observations of a single class.
Determine when to stop splitting:
– When all nodes are pure or contain no more than n cases.
– Or when all nodes contain no more than a specified fraction of objects.
Selecting the "right-size" tree:
– Test-sample cross-validation.
– V-fold cross-validation.
– Tree selection after pruning: if several trees have costs close to the minimum, select the smallest-sized (least complex) one.
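A hedged sketch of how these choices map onto rpart's controls (the parameter values shown are illustrative, not a recommendation from the slides):

library(rpart)
data(iris)

# Stopping rules: do not attempt to split nodes with fewer than
# minsplit cases, and skip splits that improve the fit by less than cp;
# xval requests 10-fold cross-validation for later tree selection
fit <- rpart(Species ~ ., data = iris,
             control = rpart.control(minsplit = 20, cp = 0.01, xval = 10))

# Selecting the "right-size" tree: inspect the cross-validated error
# of each subtree, then prune back to the chosen complexity
printcp(fit)
fit.pruned <- prune(fit, cp = 0.02)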

Computational Formulas
Estimation of accuracy in classification trees – the resubstitution estimate:
R(d) = (1/N) * Σ_n X( d(x_n) ≠ j_n ),
where d(x) is the classifier, j_n is the observed class of case n, and X(·) is the indicator function: X = 1 if its argument is true and X = 0 if it is false. R(d) is therefore the proportion of learning-sample cases the tree misclassifies.
Estimation of accuracy in regression trees – the resubstitution estimate:
R(d) = (1/N) * Σ_n ( y_n − d(x_n) )²,
the mean squared error of the tree's predictions on the learning sample.
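These estimates are easy to compute directly in R. A small sketch, using the iris data from the later example purely to illustrate the formulas (the model choices are arbitrary):

library(rpart)
data(iris)

# Classification tree: resubstitution estimate = proportion of the
# training cases that the fitted tree misclassifies
cfit <- rpart(Species ~ ., data = iris)
mean(predict(cfit, iris, type = "class") != iris$Species)

# Regression tree: resubstitution estimate = mean squared error
# of the fitted values on the training data
rfit <- rpart(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
              data = iris)
mean((iris$Petal.Length - predict(rfit))^2)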

Computational Formulas
Estimation of node impurity.
Gini index:
i(t) = 1 − Σ_j p(j|t)²,
where p(j|t) is the probability of category j at node t. The index reaches zero when only one class is present at a node.
Entropy (information):
i(t) = − Σ_j p(j|t) log p(j|t),
which is likewise zero for a pure node.
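A short sketch of both impurity measures as plain R functions (gini_index and entropy are our own helper names, not part of rpart):

# Impurity of a node given the vector of class counts it contains
gini_index <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)                  # zero when a single class is present
}

entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

gini_index(c(50, 0, 0))         # pure node: impurity 0
gini_index(c(0, 49, 5))         # mixed node: impurity > 0
entropy(c(0, 49, 5))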

Classification Tree Example: What species are these flowers?
[Slide shows the three iris species (setosa, versicolor, virginica) with sepal length, sepal width, petal length, and petal width labeled, alongside a small tree sketch.]

Iris Classification Data
The iris dataset relates species to petal and sepal dimensions, reported in centimeters. It was originally used by R. A. Fisher and E. Anderson for a discriminant analysis example. The data are pre-packaged in R's dataset library and are also available on DASYL.
[Slide shows a few sample rows with columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species (versicolor, virginica, setosa); the numeric values are not reproduced in this transcript.]

Iris Classification Method and Code

library(rpart)   # Load tree fitting package
data(iris)       # Load iris data

# Let x = tree object fitting Species vs. all other
# variables in iris with 10-fold cross-validation
x = rpart(Species ~ ., iris, xval = 10)

# Plot tree diagram with uniform spacing,
# diagonal branches, a 10% margin, and a title
plot(x, uniform = T, branch = 0, margin = 0.1,
     main = "Classification Tree\nIris Species by Petal and Sepal Length")

# Add labels to tree with final counts,
# fancy shapes, and blue text color
text(x, use.n = T, fancy = T, col = "blue")

Results
[Figure: classification tree "Iris Species by Petal and Sepal Length". The first split is on Petal.Length < 2.45, giving a pure setosa node (50/0/0); the remaining cases split on Petal.Width < 1.75 into a versicolor node (0/49/5) and a virginica node (0/1/45).]

The tree-based approach is much simpler than the alternative.
[Slide shows linear discriminant analysis output for comparison: a cross-validated confusion table for setosa, versicolor, and virginica (N = 150, 147 correct) and the linear discriminant function coefficients for each group (constant, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width); the coefficient values are only partly legible in this transcript.]
Identify this flower: Sepal Length 6, Sepal Width 3.4, Petal Length 4.5, Petal Width 1.6.
With the discriminant functions, a score is computed for each group (setosa ≈ 41, versicolor ≈ 80, virginica ≈ 75); since versicolor has the highest score, we classify this flower as Iris versicolor.
With the classification tree above, the same answer follows from just two comparisons: Petal Length >= 2.45 and Petal Width < 1.75 lead to the versicolor node (0/49/5).
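For comparison, the same flower can be classified with the tree from the earlier code slide in a single call (x is the rpart object fitted there; the measurements below are the ones given on this slide):

# Classify the flower from this slide with the fitted tree
new_flower <- data.frame(Sepal.Length = 6, Sepal.Width = 3.4,
                         Petal.Length = 4.5, Petal.Width = 1.6)
predict(x, new_flower, type = "class")
# Petal.Length >= 2.45 and Petal.Width < 1.75, so the tree answers versicolor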

Regression Tree Example
Software used: R, rpart package.
Goal: apply the regression tree method to the CPU data and predict the response variable, performance (perf).

CPU Data
CPU performance of 209 different processors (the cpus data frame). The columns are name, syct (system speed), mmin and mmax (memory, KB), cach (cache, KB), chmin and chmax (channels), and perf (performance benchmark).
[Slide shows the first six rows of the data, ADVISOR and AMDAHL machines; the numeric values are not reproduced in this transcript.]

R Code

library(MASS); library(rpart)
data(cpus); attach(cpus)

# Fit regression tree to data
cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)

# Print and plot complexity parameter (cp) table
printcp(cpus.rp); plotcp(cpus.rp)

# Prune and display tree
cpus.rp <- prune(cpus.rp, cp = 0.0055)
plot(cpus.rp, uniform = T, main = "Regression Tree")
text(cpus.rp, digits = 3)

# Plot residuals vs. predicted values
plot(predict(cpus.rp), resid(cpus.rp)); abline(h = 0)
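A small follow-up, assuming the objects above: predictions come out on the log scale, so they need exponentiating before they can be compared with the published benchmark (row 1 is an arbitrary machine chosen only for illustration):

# Back-transform a prediction from the log scale used in the model
exp(predict(cpus.rp, cpus[1, 2:8]))
cpus$perf[1]                     # observed benchmark, for comparison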

Determine the Best Complexity Parameter (cp) Value for the Model
[Slide shows the printcp() output, a table with columns CP, nsplit, rel error (1 − R²), xerror (cross-validated error), and xstd (its standard deviation), together with the plotcp() plot of cross-validated relative error against cp and tree size; the numeric entries are not reproduced in this transcript.]
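Rather than reading the value off the plot, the cp can also be picked programmatically from the fit's cptable. A hedged sketch, applied to the rpart() fit before the prune() call (cp.tab, best.cp, and cpus.pruned are our own names):

# Choose the cp whose subtree has the smallest cross-validated error
cp.tab  <- cpus.rp$cptable
best.cp <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]
cpus.pruned <- prune(cpus.rp, cp = best.cp)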

Regression Tree
[Slide shows the fitted tree before pruning, with 16 splits on cach, mmax, chmax, syct, and chmin (first split cach < 27), and the tree after pruning, reduced to 10 splits on cach, mmax, syct, and chmin.]

How well does it fit?
[Figure: plot of residuals against predicted values for the pruned tree, produced by the last line of the R code above.]
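A numerical companion to the residual plot, assuming the cpus.rp object from the code slide: the proportion of variance in log(perf) explained by the pruned tree on the training data (a resubstitution figure, so it is optimistic):

y <- log(cpus$perf)
1 - sum(resid(cpus.rp)^2) / sum((y - mean(y))^2)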

Summary: Advantages of C&RT
Simplicity of results:
– The interpretation of results summarized in a tree is very simple.
– This simplicity is useful for rapid classification of new observations: it is much easier to evaluate just one or two logical conditions.
Tree methods are nonparametric and nonlinear:
– There is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear or follow some specific non-linear link function.

References
Venables, W. N. and Ripley, B. D. (2002), Modern Applied Statistics with S, 4th edition, Springer.
StatSoft (2003), "Classification and Regression Trees", StatSoft Electronic Textbook, retrieved 11/8/2004.
Fisher, R. A. (1936), "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II.

Using Trees in R (the 30-second version)
1) Load the rpart library: library(rpart)
2) For classification trees, make sure the response is of type factor. If you don't know how to do this, look up help(as.factor) or consult a general R reference: y = as.factor(y)
3) Fit the tree model: f = rpart(y ~ x1 + x2 + ..., data = ..., cp = 0.001). If using an unattached data frame, you must specify data; if using global variables, data = can be omitted. A good starting point for cp, which controls the complexity of the tree, is given.
4) Plot and check the model: plot(f, uniform=T, margin=0.1); text(f, use.n=T); plotcp(f); printcp(f). Look at the xerrors in the summary and choose the smallest number of splits that achieves the smallest xerror. Consider the trade-off between model fit and complexity (i.e. overfitting). Based on your judgement, repeat step 3 with the cp value of your choice.
5) Predict results: predict(f, newdata, type="class"), where newdata is a data frame with the independent variables.
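Putting the five steps together, a minimal end-to-end sketch on the iris data (object names and the pruning cp of 0.01 are our own choices):

library(rpart)                                   # 1) load rpart
data(iris)

iris$Species <- as.factor(iris$Species)          # 2) response as a factor
                                                 #    (already one for iris)

f <- rpart(Species ~ ., data = iris, cp = 0.001) # 3) fit the tree

plot(f, uniform = T, margin = 0.1)               # 4) plot and check
text(f, use.n = T)
plotcp(f); printcp(f)
f <- prune(f, cp = 0.01)                         #    prune at the chosen cp

newdata <- iris[1:3, 1:4]                        # 5) predict for new cases
predict(f, newdata, type = "class")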