Regression Tree Ensembles
Sergey Bakin
Problem Formulation
§ Training data set of N data points (x_i, y_i), i = 1, …, N.
§ x_i is a P-dimensional vector of predictor variables: the x_i can be fixed design points or sampled independently from the same distribution.
§ y_i is a numeric response variable.
§ Problem: estimate the regression function E(y|x) = F(x), which can be a very complex function.
Ensembles of Models
§ Before the 1990s: a multitude of techniques was developed to tackle regression problems.
§ 1990s: a new idea - use a collection of "basic" models (an ensemble).
§ Substantial improvements in accuracy compared with any single "basic" model.
§ Examples: Bagging, Boosting, Random Forests.
Key Ingredients of Ensembles
§ Type of "basic" model used in the ensemble (RT, K-NN, NN).
§ The way basic models are built (data sub-sampling schemes, injection of randomness).
§ The way basic models are combined.
§ Possible postprocessing (tuning) of the resulting ensemble (optional).
Random Forests (RF)
§ Developed by Leo Breiman, Department of Statistics, University of California, Berkeley, in the late 1990s.
§ RF is resistant to overfitting.
§ RF is capable of handling a large number of predictors.
Key Features of RF
§ A randomised regression tree is the basic model.
§ Each tree is grown on a bootstrap sample.
§ The ensemble (forest) is formed by averaging the predictions of the individual trees.
Regression Trees
§ A regression tree performs recursive binary division of the data: start with the root node (all points) and split it into two parts (left node and right node).
§ A split attempts to separate data points with high y_i's from data points with low y_i's as much as possible.
§ A split is based on a single predictor and a split point.
§ To find the best splitter, all possible splitters and split points are tried (see the sketch below).
§ Splitting is then repeated for the children.
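The exhaustive split search can be written as a short loop. A minimal sketch, assuming the usual CART sum-of-squares (variance-reduction) criterion; X is an N x P predictor matrix and y the response vector (both names are assumptions), and no production-level shortcuts are included:

    # Find the single best (predictor, split point) pair for one node.
    best_split <- function(X, y) {
      sse <- function(v) sum((v - mean(v))^2)   # within-node sum of squares
      parent_sse <- sse(y)
      best <- list(var = NA, point = NA, gain = -Inf)
      for (j in seq_len(ncol(X))) {
        for (s in sort(unique(X[, j]))[-1]) {   # candidate split points
          left  <- y[X[, j] <  s]
          right <- y[X[, j] >= s]
          gain  <- parent_sse - (sse(left) + sse(right))
          if (gain > best$gain) best <- list(var = j, point = s, gain = gain)
        }
      }
      best
    }

The "improve" values in the competitor lists below correspond to this kind of gain, reported for the best split on each competing predictor.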
RT Competitor List
Primary splits (numeric split points and improvement values omitted):
  x2  < … to the right, improve = …, (0 missing)
  x6  < … to the left,  improve = …, (0 missing)
  x51 < … to the left,  improve = …, (0 missing)
  x30 < … to the right, improve = …, (0 missing)
  x67 < … to the right, improve = …, (0 missing)
  x78 < … to the left,  improve = …, (0 missing)
  x62 < … to the left,  improve = …, (0 missing)
  x44 < … to the left,  improve = …, (0 missing)
  x25 < … to the right, improve = …, (0 missing)
  x21 < … to the right, improve = …, (0 missing)
  x82 < … to the left,  improve = …, (0 missing)
  x79 < … to the right, improve = …, (0 missing)
  x18 < … to the right, improve = …, (0 missing)
Predictions from a Tree Model
§ A prediction from a tree is obtained by "dropping" x down the tree until it reaches a terminal node.
§ The predicted value is the average of the response values of the training data points in that terminal node.
§ Example: if x_1 >= … and x_82 >= … then Prediction = 0.61.
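A minimal sketch of "dropping" a point down a tree. The tree here is a hypothetical nested list (the split points 0.5 and 1.7 and the leaf values other than 0.61 are made up, since the originals are not shown above):

    tree <- list(var = 1, point = 0.5,
                 left  = list(pred = 0.23),
                 right = list(var = 82, point = 1.7,
                              left  = list(pred = 0.48),
                              right = list(pred = 0.61)))

    # Follow the splits until a terminal node (a node carrying $pred) is reached.
    predict_tree <- function(node, x) {
      if (!is.null(node$pred)) return(node$pred)
      if (x[node$var] < node$point) predict_tree(node$left, x)
      else                          predict_tree(node$right, x)
    }

With this toy tree, any x satisfying x[1] >= 0.5 and x[82] >= 1.7 returns 0.61, mirroring the example above.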
Pruning of CART Trees
§ Prediction Error (PE) = Variance + Bias².
§ PE vs tree size has a U-shape: very large and very small trees are both bad.
§ Trees are grown until the terminal nodes become small...
§ ... and then pruned back.
§ Holdout data are used to estimate the PE of the candidate trees.
§ The tree with the smallest PE is selected.
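For reference, the pointwise decomposition behind "Variance + Bias²" is the standard one; it also contains an irreducible noise term that no model choice can remove:

    \mathrm{PE}(x) \;=\; \mathbb{E}\big[(y - \hat F(x))^2\big]
    \;=\; \underbrace{\sigma^2}_{\text{noise}}
    \;+\; \underbrace{\big(\mathbb{E}\,\hat F(x) - F(x)\big)^2}_{\text{Bias}^2}
    \;+\; \underbrace{\operatorname{Var}\big(\hat F(x)\big)}_{\text{Variance}}

Small trees sit at the high-bias end of this trade-off, very large trees at the high-variance end, which is what produces the U-shaped PE curve.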
Randomised Regression Trees I
§ Each tree is grown on a bootstrap sample: N data points are sampled with replacement.
§ Each such sample contains ~63% of the original data points, and some records occur multiple times.
§ Each tree is built on its own bootstrap sample, so the trees are likely to differ.
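A two-line illustration of the ~63% figure (the sample size 10000 is arbitrary):

    set.seed(1)
    N   <- 10000
    idx <- sample(N, N, replace = TRUE)   # one bootstrap sample of row indices
    length(unique(idx)) / N               # about 0.632: fraction of distinct original points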
Randomised Regression Trees II
§ At each split, only M randomly selected predictors are allowed to compete as potential splitters, e.g. 10 out of 100.
§ A new group of eligible splitters is selected at random at each step.
§ At each step, the splitter selected is therefore likely to be somewhat suboptimal.
§ Every predictor gets a chance to compete as a splitter, so important predictors are very likely to be used as splitters eventually (see the sketch below).
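A one-line illustration of the per-split predictor subsampling (the values of P and M are assumptions):

    P <- 100; M <- 10
    eligible <- sample(P, M)   # drawn afresh before every split
    # the split search from the earlier sketch is then restricted to X[, eligible]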
Competitor List for a Randomised RT
Primary splits (numeric split points and improvement values omitted):
  x6  < … to the left,  improve = …, (0 missing)
  x78 < … to the left,  improve = …, (0 missing)
  x62 < … to the left,  improve = …, (0 missing)
  x79 < … to the right, improve = …, (0 missing)
  x80 < … to the left,  improve = …, (0 missing)
  x24 < … to the left,  improve = …, (0 missing)
  x90 < … to the right, improve = …, (0 missing)
  x75 < … to the right, improve = …, (0 missing)
  x68 < … to the left,  improve = …, (0 missing)
  Y   < … to the right, improve = …, (0 missing)
  x34 < … to the left,  improve = …, (0 missing)
Randomised Regression Trees III
§ M = 1: the splitter is selected at random, but the split point is not.
§ M = P: the original deterministic CART algorithm.
§ Trees are deliberately not pruned.
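Putting the pieces together with the randomForest package. This is an illustrative call only; the data objects X, y, Xnew and all parameter values are assumptions, not the settings used in the case study later:

    library(randomForest)

    # mtry is the M discussed above; a small nodesize means trees are grown deep
    # and left unpruned; each tree uses its own bootstrap sample by default.
    rf   <- randomForest(x = X, y = y, ntree = 500, mtry = 10, nodesize = 5)
    pred <- predict(rf, newdata = Xnew)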
Combining the Trees
§ Each tree represents a regression model that fits the training data very closely: a low-bias, high-variance model.
§ The idea behind RF: take the predictions from a large number of highly variable trees and average them (see the sketch below).
§ The result is a low-bias, low-variance model.
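The averaging can be made explicit with the forest fitted above (predict.all = TRUE is a documented option of the randomForest package; rf and Xnew are the assumed objects from the previous sketch):

    all_preds <- predict(rf, newdata = Xnew, predict.all = TRUE)
    ens_pred  <- rowMeans(all_preds$individual)   # average of the per-tree predictions
    # this matches the aggregated forest prediction in all_preds$aggregate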
Correlation vs Strength
§ An upper bound on the PE of a Random Forest: PE(RF) ≤ ρ̄ · PE(Tree).
§ ρ̄: the average correlation between any two trees in the forest.
§ PE(Tree): the prediction error (strength) of a single tree.
§ M = 1: low correlation, low strength.
§ M = P: high correlation, high strength.
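The role of ρ̄ can also be seen from the standard identity for the variance of an average of B identically distributed tree predictions T_b(x) with common variance σ² and average pairwise correlation ρ̄:

    \operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
    \;=\; \bar\rho\,\sigma^2 \;+\; \frac{1-\bar\rho}{B}\,\sigma^2

As B grows the second term vanishes, so the achievable variance reduction is limited by ρ̄: decreasing M decorrelates the trees (smaller ρ̄) at the price of weaker individual trees, which is the trade-off behind choosing M.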
RF as a K-NN Regression Model I
§ RF induces a proximity measure in the predictor space: P(x_1, x_2) = proportion of trees in which x_1 and x_2 land in the same terminal node.
§ Prediction at a point x is the proximity-weighted average of the training responses: F̂(x) = Σ_i P(x, x_i)·y_i / Σ_i P(x, x_i).
§ Only a fraction of the data points actually contributes to the prediction.
§ This strongly resembles the formula used for K-NN predictions.
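A sketch of the proximity-weighted view. The proximity = TRUE option and the $proximity component are part of the randomForest package; using them directly for prediction, as below, is only an illustration of the weighting formula, not the package's own prediction rule:

    rf_prox <- randomForest(x = X, y = y, ntree = 500, proximity = TRUE)
    Pmat    <- rf_prox$proximity   # N x N proportions of shared terminal nodes

    # Proximity-weighted prediction for training point i.
    f_hat <- function(i) sum(Pmat[i, ] * y) / sum(Pmat[i, ])

Because most proximities are exactly zero, only a small neighbourhood of points carries weight in f_hat, which is the K-NN analogy made concrete.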
RF as a K-NN Regression Model II
§ Lin, Y. and Jeon, Y. Random Forests and Adaptive Nearest Neighbours. Technical Report 1055, Department of Statistics, University of Wisconsin.
§ Breiman, L. Consistency for a Simple Model of Random Forests. Technical Report 670, Department of Statistics, University of California, Berkeley, 2004.
It was shown that:
§ Randomisation does reduce the variance component.
§ The optimal M is independent of the sample size.
§ RF does behave as an adaptive K-nearest-neighbour model: the shape and size of the neighbourhood are adapted to the local behaviour of the target regression function.
Case Study: Postcode Ranking in Motor Insurance
§ 575 postcodes in NSW.
§ For each postcode: the number of claims as well as the "exposure" - the number of policies in the postcode.
§ Problem: ranking of postcodes for pricing purposes.
Approach
§ Each postcode is represented by the (x, y) coordinates of its centroid.
§ The expected claim frequency is modelled as a function of (x, y).
§ The target surface is likely to be highly irregular.
§ Coordinates of the postcodes along 100 randomly generated directions are added as extra predictors to allow greater flexibility (see the sketch below).
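A sketch of the random-direction features. The object names centroid_x, centroid_y and the way the angles are generated are assumptions; the idea is simply to project each centroid onto 100 random unit vectors so that axis-aligned tree splits can cut the map along arbitrary directions:

    set.seed(2)
    theta <- runif(100, 0, pi)
    dirs  <- rbind(cos(theta), sin(theta))          # 2 x 100 matrix of unit directions
    XY    <- cbind(x = centroid_x, y = centroid_y)  # 575 x 2 centroid coordinates
    Xaug  <- cbind(XY, XY %*% dirs)                 # 575 x 102 predictor matrix for the RF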
Tuning RF: M
Tuning RF: Size of Node
Things Not Covered
§ Clustering, missing-value imputation and outlier detection.
§ Identification of important variables.
§ OOB testing of RF models.
§ Postprocessing of RF models.