Data Analytics CMIS Short Course part II Day 1 Part 2: Trees Sam Buttrey December 2015

Regression Trees
The usual set-up: numeric responses y1, …, yn; predictors Xi for each yi
– These might be numeric, categorical, logical…
Start with all the y's in one node; measure the "impurity" of that node (the extent to which the y's are spread out)
The objective is to divide the observations into sub-nodes of high purity (that is, with similar y's)

Impurity Measure
Any measure of impurity should be 0 when all y's are the same, and > 0 otherwise
Natural choice for continuous y's: compute the prediction y-hat (the node mean) and take impurity as D = Σi (yi – y-hat)²
– RSS (the deviance for a Normal model) is preferable to the SD; impurity should scale with sample size
– This is "just right" if the y's are Normal and unweighted (we care about everything equally)
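As a concrete illustration, here is a minimal R sketch of this impurity measure; the function and variable names are illustrative, not from the course code.

  ## Deviance (RSS) impurity of one node with numeric response y
  node_impurity <- function(y) {
    sum((y - mean(y))^2)        # D = sum_i (y_i - y-hat)^2, y-hat = node mean
  }

  y <- c(4.2, 5.1, 5.0, 6.3, 4.8)
  node_impurity(y)              # 0 only if all y's are equal; grows with spread and with n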

Reducing Impurity
R implementations: tree() and rpart(); both measure impurity by RSS by default
Now consider each X column in turn
1. If Xj is numeric, divide the y's into two pieces, one where Xj ≤ a and one where Xj > a (a "split")
– Try every a; there are at most n – 1 of these
– E.g. Alcohol < 4; Alcohol < 4.5, etc.; Price < 3; Price < 3.5, etc.; Sodium < 10, …

Reducing Impurity, cont'd
In the left and right "child" nodes, compute separate means yL-hat, yR-hat and separate deviances DL and DR
The decrease in deviance (i.e. increase in purity) for this split is D – (DL + DR)
Our goal is to find the split for which this decrease is largest, or equivalently for which DL + DR is smallest (i.e. for which the two resulting nodes are purest)
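The following rough sketch shows the split search for one numeric predictor; rpart() and tree() do this internally and more efficiently, so the names and the fake data here are purely illustrative.

  ## Try every cutpoint of x, compute D_L + D_R, keep the best
  best_split <- function(x, y) {
    dev  <- function(v) sum((v - mean(v))^2)
    cuts <- head(sort(unique(x)), -1)          # at most n - 1 candidate splits
    sums <- sapply(cuts, function(a) dev(y[x <= a]) + dev(y[x > a]))
    list(cut = cuts[which.min(sums)],          # split with the largest decrease D - (D_L + D_R)
         child_dev = min(sums))
  }

  set.seed(1)
  x <- runif(35, 3, 6)                            # fake "alcohol" values
  y <- 100 + 30 * (x > 4.5) + rnorm(35, sd = 5)   # fake "calories"
  best_split(x, y)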

Beer Example
"Root" impurity: D = sum of (calories – mean calories)² over all n = 35 beers. Candidate first splits, with n and deviance D in each child:

  Split          Left child            Right child           DL + DR
  Alc ≤ 4        n = 3,  D = 1241      n = 32, D = 8995      10236
  Alc ≤ 4.5      n = 10, D = 7308      n = 25, D = 1584      8892
  Alc ≤ 4.9      n = 27, D = ?         n = 8,  D = 929       ?
  Price ≤ 2.75   n = 21, D = ?         n = 14, D = 3647      ?
  Cost ≤ 0.46    n = 21, D = ?         n = 14, D = 3647      ?
  Sod ≤ 10       n = 11, D = 9965      n = 24, D = 9431      19396

Best among these: the split with the smallest DL + DR.

Reducing Impurity
2. If Xj is categorical, split the data into two pieces, one with one subset of the categories and one with the rest (a "split")
– For k categories, there are 2^(k–1) – 1 of these (see the quick check after this slide)
– E.g. divide men from women, (sub/air) from (surface/supply), (old/young) from (med.)
Measure the decrease in deviance exactly as before
Select the best split among all possibilities
– Subject to rules on minimum node size, etc.
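A quick check of that count; nothing here depends on the course data.

  ## Number of two-way splits of a k-level factor: 2^(k-1) - 1
  k <- 2:6
  data.frame(k = k, n_splits = 2^(k - 1) - 1)
  #   k n_splits
  # 1 2        1
  # 2 3        3
  # 3 4        7
  # 4 5       15
  # 5 6       31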

The Recursion
Now split the two "child" nodes (say, #2 and #3) separately
#2's split will normally be different from #3's; it could be on the same variable used at the root, but usually won't be
Split #2 into 4 and 5, and #3 into 6 and 7, so as to decrease the impurity as much as possible; then split the resulting children
– Node q's children are numbered 2q and 2q + 1

Some Practical Considerations
1. When do we stop splitting?
– Clearly too much splitting leads to over-fitting
– Stay tuned for this one
2. It feels bad to create child nodes with only one observation – maybe even with fewer than 10
3. Adding deviances implies we think they're on the same scale – that is, we assume homoscedasticity when we use RSS as our impurity measure

Prediction
For a new case, find the terminal node it falls into (based on its X's)
Predict the average of the y's there
– SDs are harder to come by!
Diagnostics: residuals, fitted values, within-node variance vs. mean
Instead of fitting a plane in the X space, we're dividing it up into oblong pieces and assuming constant y in each piece

R Example
R has the tree and rpart libraries for trees
Example 1: beer (a short sketch follows this slide)
– There's a tie for the best split
plot() plus text() draws pictures
– Or use rpart.plot() from that library
In this case, the linear model is better…
– Unless you include both Price and Cost…
– The tree model is unaffected by "multi-collinearity"
Trees are easy to understand and it's easy to make nice pictures, but there are almost no theoretical results
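A sketch of the commands behind this example; the data-frame and column names (beer, calories, alcohol, price, cost, sodium) are assumptions about the course data set, not verified.

  library(rpart)
  library(rpart.plot)                         # for rpart.plot()

  beer.rp <- rpart(calories ~ alcohol + price + cost + sodium, data = beer)
  beer.rp                                     # printed splits mirror the hand search above

  plot(beer.rp); text(beer.rp)                # base-graphics picture
  rpart.plot(beer.rp)                         # nicer picture from the rpart.plot package

  beer.lm <- lm(calories ~ alcohol + price, data = beer)   # linear model for comparison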

Tree Example (1985 CPS wage)
Training set (n = 427), test set (n = 107)
Step 1: produce a tree for Wage
– Notice the heteroscedasticity, as reported by meanvar()
– We hope for a roughly flat picture, indicating constant variance across leaves
– In this case let's take the log of Wage
– Let the tree stop splitting according to its defaults; we will discuss these shortly!
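A hedged sketch of this step using the tree package; the data frame wage and its Wage column are assumed to match the course data.

  library(tree)

  wage.tr <- tree(Wage ~ ., data = wage)
  meanvar(wage.tr)                 # within-leaf variance vs. mean; a rising
                                   # trend suggests heteroscedasticity

  wage$LW <- log(wage$Wage)        # switch to the log scale
  wage.log.tr <- tree(LW ~ . - Wage, data = wage)
  meanvar(wage.log.tr)             # hope for a roughly flat picture here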

R Interlude
Wage tree using the log of wage
Compare performance to several linear models
In this example the tree is not as strong as a reasonable linear model…
…but sometimes it's better…
…and stay tuned for ensembles of models

Tree Benefits
Trees are interpretable and easy to understand
They extend naturally to classification, including the case with more than two classes
Insensitive to monotonic transformations of the X variables (unlike the linear model)
– This reduces the impact of outliers
Interactions are included automatically
Smart missing-value handling
– Both when building the tree and when predicting

Tree Benefits and Drawbacks
(Almost) entirely automatic model generation
– The model is local, rather than global like regression
No problem when columns overlap (e.g. beer) or when # columns > # rows
On the other hand…
Abrupt decision boundaries look weird
Some problems in life are approximately linear, at least after lots of analyst input
No inference – just test-set performance

Bias vs Variance
[Diagram: the space of all relationships, with nested regions for "Linear" and "Linear Plus" models; points mark the True Model, the Best Linear Model, the Best LM with transformations, interactions…, and the Best Tree.]

Bias vs Variance
[Same diagram as above, now with the "Bias" label added.]

Stopping Rules
Defaults (spelled out in the sketch after this slide):
– Do not split a node with fewer than 20 observations
– Do not create a node with fewer than 7
– R² must increase by 0.01 at each step (this value is the complexity parameter cp)
– Maximum depth = 30 (!)
The plan is to intentionally grow a tree that's too big, then "prune" it back to the optimal size using…
Cross-validation!
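Written out as code, these are rpart's documented control parameters; the data-frame name in the comment is illustrative.

  library(rpart)

  ctrl <- rpart.control(minsplit  = 20,   # don't try to split a node with fewer than 20 obs
                        minbucket = 7,    # don't create a leaf with fewer than 7 obs
                        cp        = 0.01, # a split must improve the (scaled) fit by at least 0.01
                        maxdepth  = 30)   # maximum tree depth
  ## pass via e.g. rpart(y ~ ., data = dat, control = ctrl)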

Cross-validation in Rpart
Cross-validation is done automatically
– The results are in the cp table of the rpart object, with columns CP, nsplit, rel error, xerror, and xstd
"rel error" is the impurity as a fraction of the impurity at the root – it always goes down for larger trees
"xerror" is the cross-validated error – it often goes down and then back up
Find the minimum xerror value…
…and use the corresponding cp (rounding upwards)

Pruning Recap
In the wage example:
  wage.rp <- rpart(LW ~ ., data = wage)
  plotcp(wage.rp)            # show the minimum
  prune(wage.rp, cp = .02)
The "optimal" tree size is random – it depends on the cross-validation
Or prune by the one-SE rule: select the smallest tree whose xerror < (min xerror + 1 × the corresponding SE)
– Less variable?
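A sketch of both pruning rules, using the cp table stored in the fitted object (wage.rp as above); the intermediate object names are illustrative.

  printcp(wage.rp)                                  # the cp table, printed
  cp.tab <- wage.rp$cptable

  best     <- which.min(cp.tab[, "xerror"])         # minimum-xerror rule
  wage.min <- prune(wage.rp, cp = cp.tab[best, "CP"])

  thresh   <- cp.tab[best, "xerror"] + cp.tab[best, "xstd"]   # one-SE rule
  one.se   <- which(cp.tab[, "xerror"] <= thresh)[1]          # smallest such tree
  wage.1se <- prune(wage.rp, cp = cp.tab[one.se, "CP"])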

Rpart Features
The rpart.control() function sets things like minimum leaf size and minimum within-leaf deviance
For y's like Poisson counts, it computes the Poisson deviance
Methods also exist for exponential data (e.g. component lifetimes) with censoring
Case weights can be applied
Most importantly, rpart can also handle binary or multinomial categorical y's

Missing Value Handling
Building: surrogate splits
– Splits that "look like" the best split, to be used at prediction time if the "best" split's variable has NAs
– Cases with missing values are deleted, but…
– (tree) na.tree.replace() for categoricals
Predicting: use the average y at the stopping point
We really want to avoid deleting such cases, because in real life lots of observations are missing at least a little data

Classification
Two of the big problems in statistics:
– Regression: estimate E(yi | Xi) when y is continuous (numeric)
– Classification: predict yi | Xi when y is categorical
Example 1: two classes ("0" and "1")
One method: logistic regression
– Choose a prediction threshold c, say 0.5
– If the predicted p > c, classify the object into class 1; otherwise classify it into class 0
For more than 2 categories, the logit model can be extended
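A minimal sketch of that classifier, assuming a data frame dat with a 0/1 response y (both names illustrative):

  fit    <- glm(y ~ ., data = dat, family = binomial)
  p.hat  <- predict(fit, newdata = dat, type = "response")  # estimated P(y = 1 | X)
  y.pred <- ifelse(p.hat > 0.5, 1, 0)                       # threshold c = 0.5
  table(observed = dat$y, predicted = y.pred)               # confusion matrix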

Classification
Object: produce a rule that classifies new observations accurately
– Or that assigns a good probability estimate
– …or at least rank-orders observations well
Measures of quality: area under the ROC curve, misclassification rate, deviance, or something else
Issues: rare (or common) events are hard to classify, since the "null model" is so good
– E.g. "No airplane will ever crash"

Other Classifiers

Classification Tree
Same machinery as the regression tree
Still need to find the optimal size of tree
– As before, with plotcp()/prune()
Example: Fisher iris data (3 classes); predict species based on the measurements
– plot() + text(), rpart.plot() as before
Example: spam data
– Logistic regression is tricky to build here
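A sketch of the iris example; the iris data frame ships with R, and the spam step would look the same with a different data frame.

  library(rpart)
  library(rpart.plot)

  iris.rp <- rpart(Species ~ ., data = iris, method = "class")

  plot(iris.rp); text(iris.rp)                  # base-graphics tree
  rpart.plot(iris.rp)                           # prettier picture
  plotcp(iris.rp)                               # choose a size, then prune() as before

  predict(iris.rp, head(iris))                  # class proportions per observation
  predict(iris.rp, head(iris), type = "class")  # most probable class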

More on classification trees
predict() works with a classification tree, just as with regression ones
Follow each observation to its leaf; a plurality "vote" determines the classification
By default, predict() produces the vector of class proportions in each observation's chosen node
Use type = "class" to choose the most probable class
As always, there is more to a good model than just the raw misclassification rate

Splitting Rules
There are several kinds of splits for classification trees in R
It's not a good idea to split so as to minimize the misclassification rate
Example: a node with 151/200 = 75% "Yes" splits into children with 52/100 = 52% "Yes" and 99/100 = 99% "Yes"
– Misclassification rate before the split: 49/200 = 0.245
– Misclassification rate after the split: (1 + 48)/(100 + 100) = 0.245 – no change

Misclass Rate
Minimizing the misclassification rate deprives us of splits that produce more homogeneous children, if all children continue to have the same majority class (1 or 0)…
…so the misclassification rate is a particularly weak criterion for very rare or very common events
In our example, the deviance starts at
  –2 [151 log(151/200) + 49 log(49/200)] = 222
and goes to
  –2 {[99 log(99/100) + 1 log(1/100)] + [52 log(52/100) + 48 log(48/100)]} = 150
A decrease of about 72!
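The arithmetic can be checked directly; the helper name is illustrative.

  ## -2 * log-likelihood of a two-class node with n.yes "Yes" out of n
  dev2 <- function(n.yes, n) {
    n.no <- n - n.yes
    -2 * (n.yes * log(n.yes / n) + n.no * log(n.no / n))
  }
  dev2(151, 200)                 # root deviance, about 222
  dev2(99, 100) + dev2(52, 100)  # after the split, about 150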

Other Choices of Criterion
Gini index within node t: 1 – Σj p(j)²
– Smallest when one p is 1 and the others are 0
– The Gini value for a split is the weighted average of the two child-node Gini indices
Information (also called "entropy"): E(t) = –Σj p(j) log p(j)
Compare to the deviance, which is D(t) ∝ –Σj nj log p(j) – not much difference
These values are not displayed in the rpart printout
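A small sketch comparing the node-impurity criteria on a vector of class proportions p (names illustrative):

  gini    <- function(p) 1 - sum(p^2)
  entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

  p.pure  <- c(1, 0, 0)
  p.mixed <- c(1/3, 1/3, 1/3)
  gini(p.pure);    gini(p.mixed)     # 0 vs. 2/3
  entropy(p.pure); entropy(p.mixed)  # 0 vs. log(3)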

Multinomial Example
Exactly the same setup when there are multiple classes
– Whereas logistic regression gets a lot more complicated
– Another natural choice here: neural networks?
Example: optical digit data
– No obvious stopping rule here (?)