1
Regression Tree Ensembles Sergey Bakin
2
Problem Formulation §Training data set of N data points (x_i, y_i), i = 1,…,N. §x is a P-dimensional vector of predictor variables: the design points can be fixed or sampled independently from a common distribution. §y is a numeric response variable. §Problem: estimate the regression function E(y|x) = F(x), which can be a very complex function.
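A minimal NumPy sketch of this setup, assuming a synthetic target function; the sizes of N and P and the noise level are illustrative and do not come from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

N, P = 500, 10                       # illustrative sizes, not from the slides
X = rng.uniform(-1, 1, size=(N, P))  # predictors sampled from a common distribution

def F(x):
    """A hypothetical 'true' regression function E(y|x)."""
    return np.sin(3 * x[:, 0]) + x[:, 1] ** 2

y = F(X) + rng.normal(scale=0.3, size=N)  # numeric response = signal + noise
```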
3
Ensembles of Models §Before the 1990s: a multitude of techniques was developed to tackle regression problems §1990s: new idea - use a collection of "basic" models (an ensemble) §Substantial improvements in accuracy compared with any single "basic" model §Examples: Bagging, Boosting, Random Forests
4
Key Ingredients of Ensembles §Type of "basic" model used in the ensemble (RT, K-NN, NN) §The way the basic models are built (data sub-sampling schemes, injection of randomness) §The way the basic models are combined §Optional postprocessing (tuning) of the resulting ensemble
5
Random Forests (RF) §Developed by Leo Breiman, Department of Statistics, University of California, Berkeley, in the late 1990s. §RF is resistant to overfitting §RF is capable of handling a large number of predictors
6
Key Features of RF §A randomised Regression Tree is the basic model §Each tree is grown on a bootstrap sample §The ensemble (forest) is formed by averaging the predictions from the individual trees
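One way to realise these features in practice is scikit-learn's RandomForestRegressor; the slides do not name any particular implementation, so the parameter values below are only illustrative (X and y are from the earlier data sketch):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,   # number of randomised trees in the forest
    max_features=3,     # M: predictors tried at each split (illustrative value)
    bootstrap=True,     # each tree is grown on a bootstrap sample
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)              # X, y as in the earlier data sketch
pred = rf.predict(X[:5])  # forest prediction = average of the trees' predictions
```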
7
Regression Trees §Performs recursive binary division of the data: start with the Root node (all points) and split it into 2 parts (Left node and Right node) §The split attempts to separate data points with high y_i's from data points with low y_i's as much as possible §A split is based on a single predictor and a split point §To find the best splitter, all possible splitters and split points are tried §Splitting is then repeated for the child nodes.
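A naive sketch of this exhaustive split search, scoring candidates by the CART-style reduction in sum of squared deviations from the node mean; the helper name best_split is an assumption, not from the slides:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the best (predictor, split point) pair.

    Quality of a split = reduction in sum of squared deviations from the
    node mean (the usual regression-tree criterion).
    """
    n, p = X.shape
    parent_sse = np.sum((y - y.mean()) ** 2)
    best = (None, None, 0.0)  # (predictor index, split point, improvement)
    for j in range(p):
        order = np.argsort(X[:, j])
        xj, yj = X[order, j], y[order]
        for i in range(1, n):
            if xj[i] == xj[i - 1]:
                continue  # no valid split point between equal values
            left, right = yj[:i], yj[i:]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            improve = parent_sse - sse
            if improve > best[2]:
                best = (j, (xj[i] + xj[i - 1]) / 2, improve)
    return best
```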
9
RT Competitor List
Primary splits:
  x2  < -110.6631 to the right, improve=734.0907, (0 missing)
  x6  < 107.5704  to the left,  improve=728.0376, (0 missing)
  x51 < 101.4707  to the left,  improve=720.1280, (0 missing)
  x30 < -113.879  to the right, improve=716.6580, (0 missing)
  x67 < -93.76226 to the right, improve=715.6400, (0 missing)
  x78 < 93.27373  to the left,  improve=715.6400, (0 missing)
  x62 < 93.99937  to the left,  improve=715.6400, (0 missing)
  x44 < 96.059    to the left,  improve=715.6400, (0 missing)
  x25 < -85.65475 to the right, improve=685.0943, (0 missing)
  x21 < -118.4764 to the right, improve=685.0736, (0 missing)
  x82 < 119.6532  to the left,  improve=685.0736, (0 missing)
  x79 < -81.00349 to the right, improve=675.7913, (0 missing)
  x18 < -70.78995 to the right, improve=663.0757, (0 missing)
11
Predictions from a Tree model §A prediction from a tree is obtained by "dropping" x down the tree until it reaches a terminal node. §The predicted value is the average of the response values of the training data points in that terminal node. §Example: if x1 >= -110.66 and x82 >= 118.65 then Prediction = 0.61
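A sketch of the "drop x down the tree" idea, using a hypothetical Node structure that is not part of the original slides; the numbers in the usage lines simply echo the slide's example values:

```python
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the splitting predictor
        self.threshold = threshold  # split point
        self.left = left            # subtree for x[feature] < threshold
        self.right = right          # subtree for x[feature] >= threshold
        self.value = value          # mean response of training points (terminal nodes only)

def predict_one(node, x):
    """Drop a single point x down the tree until a terminal node is reached."""
    while node.value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value

# Tiny illustrative tree: terminal nodes carry the mean response of their points.
leaf_low, leaf_high = Node(value=-0.27), Node(value=0.61)
root = Node(feature=0, threshold=-110.66, left=leaf_low, right=leaf_high)
print(predict_one(root, [-50.0]))   # -50.0 >= -110.66, so the right leaf: 0.61
```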
12
Pruning of CART trees §Prediction Error (PE) = Variance + Bias² §PE vs tree size has a U-shape: very large and very small trees are both bad §Trees are grown until the terminal nodes become small... §...and then pruned back §Use holdout data to estimate the PE of the trees §Select the tree that has the smallest PE.
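A sketch of the holdout idea with scikit-learn; tree size is varied via max_leaf_nodes rather than CART's cost-complexity pruning, and the candidate sizes are arbitrary, so treat this as an illustration of the U-shaped error curve rather than the original procedure:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

best_size, best_pe = None, float("inf")
for leaves in [2, 4, 8, 16, 32, 64, 128]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(X_tr, y_tr)
    pe = mean_squared_error(y_ho, tree.predict(X_ho))  # holdout estimate of PE
    if pe < best_pe:
        best_size, best_pe = leaves, pe

print("selected tree size (number of terminal nodes):", best_size)
```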
13
Randomised Regression Trees I §Each tree is grown on a bootstrap sample: N data points are sampled with replacement §Each such sample contains about 63% of the original data points; some records occur multiple times §Each tree is built on its own bootstrap sample, so the trees are likely to differ
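A small NumPy sketch of bootstrap sampling, checking that roughly 63% (about 1 - 1/e) of the original points appear in a sample:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
idx = rng.integers(0, N, size=N)        # sample N indices with replacement
unique_frac = np.unique(idx).size / N   # fraction of distinct original points drawn
print(f"{unique_frac:.2f}")             # ~0.63, i.e. about 1 - 1/e
```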
14
Randomised Regression Trees II §At each split, only M randomly selected predictors are allowed to compete as potential splitters, e.g. 10 out of 100. §A new group of eligible splitters is selected at random at each step. §At each step the selected splitter is likely to be somewhat suboptimal §Every predictor gets a chance to compete as a splitter: important predictors are very likely to be used as splitters eventually
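A sketch of the per-split randomisation, reusing the hypothetical best_split helper from the earlier regression-tree sketch; the function name random_split is an assumption, not from the slides:

```python
import numpy as np

def random_split(X, y, M, rng):
    """Pick the best split among M randomly chosen predictors."""
    candidates = rng.choice(X.shape[1], size=M, replace=False)  # fresh draw at every split
    best = (None, None, 0.0)  # (predictor index, split point, improvement)
    for j in candidates:
        _, point, improve = best_split(X[:, [j]], y)  # exhaustive search on one predictor
        if improve > best[2]:
            best = (j, point, improve)
    return best
```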
15
Competitor List for Randomised RT
Primary splits:
  x6  < 107.5704  to the left,  improve=728.0376, (0 missing)
  x78 < 93.27373  to the left,  improve=715.6400, (0 missing)
  x62 < 93.99937  to the left,  improve=715.6400, (0 missing)
  x79 < -81.00349 to the right, improve=675.7913, (0 missing)
  x80 < 63.85983  to the left,  improve=654.7728, (0 missing)
  x24 < 59.5085   to the left,  improve=648.3837, (0 missing)
  x90 < -59.35043 to the right, improve=646.8825, (0 missing)
  x75 < -52.43783 to the right, improve=639.5996, (0 missing)
  x68 < 50.18278  to the left,  improve=631.1139, (0 missing)
  Y   < -33.42134 to the right, improve=606.9931, (0 missing)
  x34 < 132.8378  to the left,  improve=555.2047, (0 missing)
16
Randomised Regression Trees III §M = 1: the splitter is selected at random, but the split point is not. §M = P: the original deterministic CART algorithm §Trees are deliberately not pruned.
17
Combining the Trees §Each tree represents a regression model that fits the training data very closely: a low-bias, high-variance model. §The idea behind RF: take the predictions from a large number of highly variable trees and average them. §The result is a low-bias, low-variance model
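The averaging step itself is simple; a minimal sketch, assuming trees is any list of fitted models with a predict method:

```python
import numpy as np

def forest_predict(trees, X):
    """Forest prediction = average of the individual trees' predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```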
19
Correlation vs Strength §Another decomposition for the PE of Random Forests: PE(RF) ≤ ρ̄ · PE(Tree) §ρ̄: the average correlation between any two trees in the forest §PE(Tree): prediction error (strength) of a single tree. §M = 1: low correlation, low strength §M = P: high correlation, high strength
20
RF as K-NN regression model I §RF induces a proximity measure in the predictor space: P(x1, x2) = proportion of trees in which x1 and x2 landed in the same terminal node. §Prediction at point x: a weighted average of the training responses, ŷ(x) = Σ_i w_i(x)·y_i, where the weight w_i(x) is non-zero only if x_i shares a terminal node with x in at least one tree. §So only a fraction of the data points actually contributes to the prediction. §This strongly resembles the formula used for K-NN predictions
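A sketch of the induced proximities using scikit-learn's apply method, which reports the terminal node that each point reaches in each tree; rf, X and y are from the earlier sketches, and the final weighted average is only an approximation of the forest prediction because it ignores differing terminal-node sizes across trees:

```python
import numpy as np

leaves_train = rf.apply(X)     # shape (N, n_trees): terminal-node id per point per tree
leaf_x = rf.apply(X[:1])       # terminal nodes reached by a single query point x

# proximity P(x, x_i) = proportion of trees in which x and x_i share a terminal node
prox = (leaves_train == leaf_x).mean(axis=1)

# proximity-weighted average of the training responses (approximate forest prediction)
y_hat = np.sum(prox * y) / np.sum(prox)
```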
21
RF as K-NN regression model II §Lin, Y. and Jeon, Y. (2002). Random Forests and Adaptive Nearest Neighbors. Technical Report 1055, Department of Statistics, University of Wisconsin. §Breiman, L. (2004). Consistency for a Simple Model of Random Forests. Technical Report 670, Statistics Department, University of California, Berkeley.
22
It was shown that: §Randomisation does reduce the variance component §The optimal M is independent of the sample size §RF does behave as an adaptive K-nearest-neighbour model: the shape and size of the neighbourhood adapt to the local behaviour of the target regression function.
23
Case Study: Postcode Ranking in Motor Insurance §575 postcodes in NSW §For each postcode: number of claims as well as “exposure” - number of policies in a postcode §Problem: Ranking of postcodes for pricing purposes
24
Approach §Each postcode is represented by the (x, y) coordinates of its centroid. §Model the expected claim frequency as a function of (x, y). §The target surface is likely to be highly irregular. §Add the coordinates of the postcodes along 100 randomly generated directions to allow greater flexibility.
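A sketch of the random-direction augmentation; only the counts of 575 postcodes and 100 directions come from the slides, and the centroid coordinates below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.uniform(size=(575, 2))   # placeholder (x, y) centroids, one per postcode

angles = rng.uniform(0, np.pi, size=100)                          # 100 random directions
directions = np.column_stack([np.cos(angles), np.sin(angles)])    # unit vectors, shape (100, 2)

projections = centroids @ directions.T          # coordinate of each postcode along each direction
features = np.hstack([centroids, projections])  # 2 + 100 = 102 predictors per postcode
```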
25
Tuning RF: M
26
Tuning RF: Size of Node
32
Things not covered §Clustering, missing value imputation and outlier detection §Identification of important variables §OOB testing of RF models §Postprocessing of RF models