Classification Part 4: Tree-Based Methods

Classification Part 4: Tree-Based Methods BMTRY 726 4/18/14

So Far Linear Classifiers -Logistic regression: the most commonly used -LDA: uses more information (prior probabilities) and is therefore more efficient, but its output is less interpretable -Weaknesses: -modeling issues when p is large relative to n -inflexible decision boundaries

So Far Neural Networks -Able to handle large p by reducing the number of inputs, X, to a smaller set of derived features, Z, used to model Y -Applying (non-linear) functions to linear combinations of the X’s and Z’s results in very flexible decision boundaries -Weaknesses: -Limited means of identifying important features -Tend to over-fit the data

Tree-Based Methods Non-parametric and semi-parametric methods Models can be represented graphically, making them very easy to interpret We will consider two decision tree methods: -Classification and Regression Trees (CART) -Logic Regression

Tree-Based Methods Classification and Regression Trees (CART) -Used for both classification and regression -Features can be categorical or continuous -Binary splits selected for all features -Features may occur more than once in a model Logic Regression -Also used for classification and regression -Features must be categorical for these models

CART Main Idea: divide the feature space into a disjoint set of rectangles and fit a simple model in each one -For regression, the predicted output is just the mean of the training data in that region, i.e. a constant is fit in each region -For classification, the predicted class is just the most frequent class in the training data in that region -Class probabilities are estimated by the class frequencies in the training data in that region
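As a concrete illustration of this idea (my own sketch, not from the slides), suppose we already know which rectangle each training observation falls into; the fitted value in each region is then just a mean (regression) or a majority vote with class frequencies (classification). All names and numbers below are hypothetical.

```python
import numpy as np

# Hypothetical training data: region_id[i] tells us which rectangle
# observation i falls into after the feature space has been partitioned.
region_id = np.array([0, 0, 1, 1, 1, 2, 2])
y_cont    = np.array([1.2, 0.8, 3.1, 2.9, 3.0, 5.5, 5.9])   # continuous outcome
y_class   = np.array([0,   0,   1,   1,   0,   1,   1])     # class labels

# Regression: the predicted output in each region is the mean of the training y's there
region_means = {r: y_cont[region_id == r].mean() for r in np.unique(region_id)}

# Classification: the predicted class is the most frequent class in the region,
# and class probabilities are the observed frequencies in that region
region_probs = {}
for r in np.unique(region_id):
    vals, counts = np.unique(y_class[region_id == r], return_counts=True)
    region_probs[r] = dict(zip(vals.tolist(), (counts / counts.sum()).tolist()))

print(region_means)   # {0: 1.0, 1: 3.0, 2: 5.7}
print(region_probs)   # {0: {0: 1.0}, 1: {0: 0.33..., 1: 0.66...}, 2: {1: 1.0}}
```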

Feature Space as Rectangular Regions [Figure: a two-dimensional feature space partitioned by the splits X1 < t1, X2 < t2, X1 < t3 and X2 < t4 into five rectangular regions R1-R5]

CART Trees CART trees: graphical representations of a CART model Features in a CART tree: -Nodes/partitions: binary splits in the data based on one feature (e.g. X1 < t1) -Branches: paths leading down from a split -Go left if X satisfies the rule defined at the split -Go right if X does not meet the criterion -Terminal nodes: the rectangular regions where splitting ends [Figure: tree with splits X1 < t1, X2 < t2, X1 < t3, X2 < t4 and terminal nodes R1-R5]

CART Tree Predictions Predictions from a tree are made from the top down For a new observation with feature vector X, follow the path through the tree that is true for the observed vector until reaching a terminal node The prediction is the one made at that terminal node -In the case of classification, it is the class with the largest probability at that node [Figure: tree with splits X1 < t1, X2 < t2, X1 < t3, X2 < t4 and terminal nodes R1-R5]

CART Tree Predictions For example, say our feature vector is X = (6, -1.5) Start at the top: which direction do we go? [Figure: the same tree with thresholds filled in - splits X1 < 4, X2 < -1, X1 < 5.5, X2 < -0.2 and terminal nodes R1-R5]
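A minimal sketch of how this traversal could look in code (my own illustration). The exact nesting of the splits is not spelled out in the transcript, so the tree below assumes the usual layout of this two-feature example (root X1 < 4; left child X2 < -1; right child X1 < 5.5, whose right child splits on X2 < -0.2), with the labels R1-R5 standing in for whatever class or mean sits in each terminal node.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One binary split (internal node) or a terminal region (leaf)."""
    feature: Optional[int] = None      # index into x, e.g. 0 for X1, 1 for X2
    threshold: Optional[float] = None  # go left if x[feature] < threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[str] = None   # set only for terminal nodes

def predict(node: Node, x) -> str:
    """Follow the splits from the root until a terminal node is reached."""
    while node.prediction is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.prediction

# Assumed nesting of the splits from the slide (X1 < 4 at the root, etc.)
tree = Node(feature=0, threshold=4.0,
            left=Node(feature=1, threshold=-1.0,
                      left=Node(prediction="R1"), right=Node(prediction="R2")),
            right=Node(feature=0, threshold=5.5,
                       left=Node(prediction="R3"),
                       right=Node(feature=1, threshold=-0.2,
                                  left=Node(prediction="R4"),
                                  right=Node(prediction="R5"))))

print(predict(tree, (6, -1.5)))  # X1=6 is not < 4 and not < 5.5; X2=-1.5 < -0.2 -> "R4"
```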

Building a CART Tree Given a set of features X and associated outcomes Y how do we fit a tree? CART uses a Recursive Partitioning (aka Greedy Search) Algorithm to select the “best” feature at each split. So let’s look at the recursive partitioning algorithm.

Building a CART Tree Recursive Partitioning (aka Greedy Search) Algorithm (1) Consider all possible splits for each feature/predictor -Think of a split as a cutpoint (when X is binary this is always ½) (2) Select the predictor, and the split on that predictor, that provides the best separation of the data (3) Divide the data into 2 subsets based on the selected split (4) For each subset, consider all possible splits for each feature (5) Select the predictor/split that provides the best separation for each subset of the data (6) Repeat this process until some desired level of data purity is reached - this is a terminal node!
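A rough sketch of steps (1)-(6) in code (my own illustration, not code from the lecture), using the misclassification rate as the impurity measure and midpoints between observed values as the candidate cutpoints:

```python
import numpy as np

def misclassification_rate(y):
    """Impurity of a node: fraction of observations not in the majority class."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def best_split(X, y, impurity=misclassification_rate):
    """Steps (1)-(2): try every feature and every midpoint cutpoint, keep the split
    giving the lowest weighted impurity of the two resulting subsets."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        vals = np.unique(X[:, j])
        for cut in (vals[:-1] + vals[1:]) / 2:          # candidate cutpoints
            left, right = y[X[:, j] < cut], y[X[:, j] >= cut]
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if score < best[2]:
                best = (j, cut, score)
    return best

def grow(X, y, impurity=misclassification_rate):
    """Steps (3)-(6): split, recurse on each subset, stop when a node is pure."""
    if impurity(y) == 0.0:
        return {"predict": int(np.bincount(y).argmax())}   # terminal node
    j, cut, _ = best_split(X, y, impurity)
    go_left = X[:, j] < cut
    return {"feature": j, "cut": cut,
            "left":  grow(X[go_left], y[go_left], impurity),
            "right": grow(X[~go_left], y[~go_left], impurity)}

X = np.array([[1.0, 0.3], [2.0, 0.1], [3.0, 0.7], [4.0, 0.9]])
y = np.array([0, 0, 1, 1])
print(grow(X, y))   # a single split on feature 0 at 2.5 separates the two classes
```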

Impurity of a Node Need a measure of the impurity of a node to help decide how to split a node, or which node to split The measure should be at a maximum when a node is equally divided amongst all classes The impurity should be zero if the node is all one class Consider the proportion of observations of class k in node m: p_mk = (number of class-k observations in node m) / N_m, where N_m is the number of observations in node m

Measures of Impurity Possible measures of impurity: (1) Misclassification Rate: 1 - max_k p_mk -Situations can occur where no split improves the misclassification rate -The misclassification rate can be equal for two splits when one option is clearly better for the next step (2) Deviance/Cross-entropy: -Σ_k p_mk log p_mk (3) Gini Index: Σ_k p_mk (1 - p_mk)
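For reference, here are the three measures written as small functions of the class proportions p_mk in a node (a sketch of my own); all three are zero for a pure node and largest at an even split:

```python
import numpy as np

def misclassification(p):
    """1 - max_k p_k : fraction of the node not in its majority class."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p.max()

def cross_entropy(p):
    """-sum_k p_k log p_k (deviance); classes with zero proportion contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def gini(p):
    """sum_k p_k (1 - p_k)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

for p in [(0.5, 0.5), (0.9, 0.1), (1.0, 0.0)]:
    print(p, misclassification(p), round(cross_entropy(p), 3), round(gini(p), 3))
# (0.5, 0.5): all three are at their maximum (0.5, 0.693, 0.5)
# (1.0, 0.0): all three equal zero -- a pure node
```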

Tree Impurity Define the impurity of a tree to be the sum over all terminal nodes of the impurity of the node multiplied by the proportion of cases that reach that node of the tree Example: impurity of a tree with one single node, with classes A and B each having 400 cases, using the Gini Index: -Proportion of each class = 0.5 -Therefore Gini Index = (0.5)(1-0.5) + (0.5)(1-0.5) = 0.5
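A quick numerical check of the slide's example, extended to the weighted sum over terminal nodes for a hypothetical two-leaf tree (the 300/100 counts below are made up):

```python
def gini(counts):
    """Gini index of a node from its class counts."""
    total = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)

# Single root node holding 400 A's and 400 B's -- matches the slide's 0.5
print(gini([400, 400]))                      # 0.5

def tree_impurity(leaves):
    """Weighted sum over terminal nodes: node impurity * proportion of cases reaching it."""
    n = sum(sum(counts) for counts in leaves)
    return sum(gini(counts) * sum(counts) / n for counts in leaves)

# Hypothetical split of the 800 cases into two purer leaves
print(tree_impurity([[300, 100], [100, 300]]))   # 0.375 -- lower than 0.5
```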

Simple Example [Table: training observations with features X1 and X2 and a two-class outcome; the classes are shown as red and blue points on the original slides and are not fully recoverable from this transcript]

Simple Example (cont.) [Figure: the data and the tree after the first split, X2 < -1.0, with a terminal node predicting Blue]

Simple Example (cont.) [Figure: the tree after adding the split X1 < 3.9, with a terminal node predicting Red]

Simple Example (cont.) [Figure: the tree after adding the split X1 < 5.45, with a terminal node predicting Blue]

Simple Example (cont.) [Figure: the completed tree with splits X2 < -1.0, X1 < 3.9, X1 < 5.45 and X2 < -0.25, and terminal nodes predicting Blue or Red]

Pruning the Tree Over-fitting: CART trees grown to maximum purity often over-fit the data The training error rate will always drop (or at least not increase) with every split This does not mean the error rate on test data will improve The best method of arriving at a suitable size for the tree is to grow an overly complex one and then prune it back Pruning (in classification trees) is based on misclassification
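An illustration of this point (mine, not the lecture's) using scikit-learn's CART implementation: as the tree is allowed more leaves, the training error can only go down, while the cross-validated error typically does not.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

for leaves in [2, 4, 8, 16, 32, 64]:
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0).fit(X, y)
    train_acc = tree.score(X, y)                        # accuracy on the training data
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()   # 5-fold cross-validated accuracy
    print(f"{leaves:3d} leaves  train error {1-train_acc:.3f}  CV error {1-cv_acc:.3f}")
# Training error is non-increasing in tree size; CV error typically bottoms out at a
# moderate size and then creeps back up -- the over-fitting the slide warns about.
```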

Pruning the Tree Pruning parameters include (to name a few): (1) the minimum number of observations in a node to allow a split (2) the minimum number of observations in a terminal node (3) the maximum depth of any node in the tree Use cross-validation to determine the appropriate tuning parameters Possible rules: -minimize the cross-validation relative error (xerror) -use the "1-SE rule", which chooses the largest value of Cp whose xerror is within one standard error of the minimum
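The Cp/xerror/1-SE workflow on the slide comes from R's rpart output. A rough scikit-learn analogue of the same grow-then-prune idea (my own sketch, using its cost-complexity parameter ccp_alpha in place of Cp) might look like:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=1)

# Grow an overly complex tree first, then get the sequence of cost-complexity
# penalties (alphas) at which pruning removes successive branches.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]          # drop the last alpha (prunes all the way to the root)

# Cross-validate each pruned tree; keep the mean and standard error of the CV error.
cv_err, cv_se = [], []
for a in alphas:
    scores = cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=1),
                             X, y, cv=10)
    cv_err.append(1 - scores.mean())
    cv_se.append(scores.std() / np.sqrt(len(scores)))

cv_err, cv_se = np.array(cv_err), np.array(cv_se)
best = cv_err.argmin()
# "1-SE rule": the most heavily pruned tree (largest alpha) whose CV error is
# within one standard error of the minimum (alphas increase along the path).
one_se = np.max(np.where(cv_err <= cv_err[best] + cv_se[best])[0])
print(f"min-CV alpha = {alphas[best]:.4f}, 1-SE alpha = {alphas[one_se]:.4f}")
```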

Logic Regression Logic regression is an alternative tree-based method It requires exclusively binary features, although the response can be binary or continuous Logic regression produces models that represent Boolean (i.e. logical) combinations of binary predictors -"AND" = Xi ∧ Xj -"OR" = Xi ∨ Xj -"NOT" = !Xj This allows logic regression to model complex interactions among binary predictors
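In code, these Boolean combinations are just elementwise logical operations on binary feature vectors; a tiny sketch with made-up data:

```python
import numpy as np

# Hypothetical binary predictors (e.g. presence/absence of three markers)
X1 = np.array([1, 1, 0, 1, 0], dtype=bool)
X4 = np.array([1, 0, 0, 1, 1], dtype=bool)
X5 = np.array([0, 1, 1, 0, 0], dtype=bool)

# Boolean combinations of the kind a logic regression tree encodes
and_term = X1 & X4           # "X1 AND X4"
or_term  = X1 | ~X5          # "X1 OR (NOT X5)"
print(and_term.astype(int))  # [1 0 0 1 0]
print(or_term.astype(int))   # [1 1 0 1 1]
```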

Logic Regression Like CART, logic regression is a non-parametric/semi-parametric, data-driven method Like CART, it is able to model continuous and categorical responses Logic regression can also be used for survival outcomes (though I've never found an instance where someone actually did!) Unlike CART, logic regression requires that all features are binary when added to the model -Features in a CART model are also treated as binary, BUT CART identifies the "best" split for a given feature to improve purity in the data Logic regression also uses a different fitting algorithm to identify the "best" model -Simulated Annealing vs. Recursive Partitioning

Comparing Tree Structures CART -nodes: define splits on features in the tree -branches: define relationships between features in the tree -terminal nodes: defined by a subgroup of the data (the probability of each class in that node defines the final prediction) -predictions: made by starting at the top of the tree and following the appropriate splits until a terminal node is reached Logic Regression -knots (nodes): the "AND"/"OR" operators in the tree -leaves (terminal nodes): define the features (i.e. the X's) used in the tree -predictions: the tree can be broken down into a set of logical relationships among features; if an observation matches one of the feature combinations defined by the tree, its class is predicted to be 1

Logic Regression Branches in a CART tree can be thought of as "AND" combinations of the features at each node in the branch The only "OR" operator in a CART tree occurs at the first split Inclusion of "OR", "AND", and "NOT" operators in logic regression trees provides greater flexibility than is seen in CART models For data that include predominantly binary predictors (e.g. single nucleotide polymorphisms, yes/no survey data, etc.), logic regression is a better option than CART Consider an example…

CART vs. Logic Regression Trees Assume the features are a set of binary predictors and the following combination predicts that an individual has disease (versus being healthy): (X1 AND X4) OR (X1 AND !X5) [Figure: the equivalent CART tree (splits X1 < 0.5, X5 < 0.5 and X4 < 0.5 leading to leaves predicting C = 0 or C = 1) alongside the logic regression tree (an OR at the root over two ANDs: X1 AND X4, and X1 AND !X5)]

Logic Regression Trees Not Unique Logic regression trees are not necessarily unique. We can represent our same model with two seemingly different trees… (X1 AND X4) OR (X1 AND !X5) [Figure: Logic Regression Tree 1 - an OR over (X1 AND X4) and (X1 AND !X5); Logic Regression Tree 2 - the equivalent factored form X1 AND (X4 OR !X5)]
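A small check (my own; the function names are hypothetical) that the two trees really encode the same Boolean rule, by evaluating both forms over every combination of the three binary predictors:

```python
from itertools import product

def diseased(x1, x4, x5):
    """The slide's rule: (X1 AND X4) OR (X1 AND !X5)."""
    return (x1 and x4) or (x1 and not x5)

def diseased_factored(x1, x4, x5):
    """The equivalent factored form: X1 AND (X4 OR !X5)."""
    return x1 and (x4 or not x5)

# Exhaustively check that the two Boolean trees predict the same class
for x1, x4, x5 in product([0, 1], repeat=3):
    assert bool(diseased(x1, x4, x5)) == bool(diseased_factored(x1, x4, x5))
    print(x1, x4, x5, "-> disease" if diseased(x1, x4, x5) else "-> healthy")
```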

Cancer Example of Logic Regression -Logic regression is a statistical method capable of discovering logical combinations of predictors that describe a binary response -We will use the model of urinary bladder cancer posed by Mitra et al. as an example of how a logic regression model might look -Note: a particular logic tree is not necessarily a unique representation of the model [Figure: logic model for stage P1 cancer and the equivalent logic tree, built from the markers H-ras, FGFR3, 9q-, !p21, Rb and !p53 combined with AND/OR operators] Mitra, A. P. et al., 2006

Cancer Example of Logic Regression [Figure: logic model for stage P1 cancer and the equivalent logic tree] -Note that all three interaction terms in the model include loss of p53 -If we fail to capture this information when collecting data, it may be difficult to correctly identify relationships among the other predictors that are observed Mitra AP et al. J Clin Oncol 2006; 24:5552-5564

Fitting a Logic Regression Model CART considers one feature at a time Logic regression considers all logical combinations of predictors (up to a fixed size) As a result, the search space can be very large Recursive partitioning could be used for logic regression; however, since it only considers one predictor at a time, it may fail to identify optimal combinations of predictors Simulated annealing is a stochastic search algorithm that is a viable alternative to recursive partitioning
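One way to get a feel for how large that search space is (my own illustration): even counting only the distinct Boolean functions of k binary predictors (and ignoring the many different trees that encode the same function) gives 2^(2^k) possibilities:

```python
# Number of distinct Boolean functions of k binary predictors: 2^(2^k)
for k in range(1, 6):
    print(k, 2 ** (2 ** k))
# 1 -> 4, 2 -> 16, 3 -> 256, 4 -> 65536, 5 -> 4294967296: the space blows up very quickly
```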

Simulated Annealing The original idea for the simulated annealing (SA) algorithm came from metallurgy -With metal mixtures, strength, malleability, etc. are controlled by the temperature at which the metals anneal SA borrows from this idea to effectively search a large feature space without getting "stuck" at a local max/min SA uses a combination of a random walk and a Markov chain to conduct its search The rate of annealing is controlled by: -the number of random steps the algorithm takes -the cooling schedule (which includes the starting and ending annealing "temperature")

Simulated Annealing (0) Select a measure of model fit (scoring function) (1) Select a random starting point in the search space (i.e. one possible combination of features) (2) Randomly select one of six possible moves to update the current model -Alternate a leaf; alternate an operator; grow a branch; prune a branch; split a leaf; delete a leaf (3) Compare the new model to the old model using the measure of fit (4) If the new model is better, accept it unconditionally; if the new model is worse, accept it with some probability -This probability is related to the current annealing "temperature" (5) Repeat steps (2)-(4) for some set number of iterations (number of steps)
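A generic sketch of the accept/reject core of simulated annealing (my own illustration, not the logic regression package's implementation); the model representation, the move generator, and the scoring function are stand-ins to be supplied by the user:

```python
import math
import random

def simulated_annealing(initial_model, random_move, score, n_steps=10000,
                        t_start=1.0, t_end=0.001, seed=0):
    """Generic simulated annealing loop.

    random_move(model) proposes a neighbouring model (e.g. alternate a leaf,
    grow/prune a branch); score(model) is the loss to be minimised.
    """
    rng = random.Random(seed)
    current, current_score = initial_model, score(initial_model)
    for step in range(n_steps):
        # Geometric cooling schedule from t_start down to t_end
        temp = t_start * (t_end / t_start) ** (step / (n_steps - 1))
        candidate = random_move(current)
        cand_score = score(candidate)
        # Always accept improvements; accept worse models with probability
        # exp(-(increase in loss) / temperature)
        if cand_score <= current_score or \
           rng.random() < math.exp((current_score - cand_score) / temp):
            current, current_score = candidate, cand_score
    return current, current_score

# Toy usage: "models" are integers, the move nudges them, the loss is (m - 42)^2
best, loss = simulated_annealing(
    initial_model=0,
    random_move=lambda m: m + random.choice([-1, 1]),
    score=lambda m: (m - 42) ** 2)
print(best, loss)   # should end at (or very near) 42
```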

Six Allowable Moves Ruczinski I, Kooperberg C, LeBlanc M (2003). Logic Regression. Journal of Computational and Graphical Statistics, 12(3), 475-511.

Simulated Annealing Measures of model fit: Misclassification (classification model) -Simulated annealing does not have the same issues with misclassification that recursive partitioning does Deviance/cross-entropy (logistic classification model) Sum-of-squares error (regression model) Hazard function (survival) User-specified model fit/loss function -You could write your own

Other Fitting Considerations Choosing annealing parameters (start/end temperatures, number of iterations) -Determine an appropriate set of annealing parameters prior to running the model -Starting temperature: select it so that 90-95% of "worse" models are accepted -Ending temperature: select it so that fewer than 5% of worse models are accepted -The number of iterations is a matter of choice - more iterations = longer run time Choosing the maximum number of leaves/trees -Good to determine the appropriate number of leaves/trees before running the final model -Can be done by cross-validation (select the model size with the smallest CV error)

Some Final Notes The Good: Tree-based methods are very flexible for classification and regression The models can be presented graphically for easier interpretation -clinicians like them; they tend to think this way anyway CART is very useful in settings where features are a mix of continuous and binary Logic regression is useful for modeling data with all binary features (e.g. SNP data)

Some Final Notes The Bad (and sometimes ugly): They do have a tendency to over-fit the data -not uncommon among machine learning methods They are referred to as weak learners -small changes in the data result in very different models There has been extensive research into improving the performance of these methods, which is the topic of our next class.