
Regression Tree Ensembles Sergey Bakin

Problem Formulation
§ Training data set of N data points (x_i, y_i), i = 1, …, N.
§ x is the vector of predictor variables (P-dimensional): the x_i can be fixed design points or sampled independently from the same distribution.
§ y is a numeric response variable.
§ Problem: estimate the regression function E(y|x) = F(x), which can be a very complex function.
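A minimal R sketch of this setup, purely for illustration (the simulated response, the predictor names x1...x100 and the data frames train and test are assumptions reused by the later sketches, not the actual data):

    # Simulate N points with P numeric predictors and a numeric response.
    # The choice of F(x) below is arbitrary and only serves the examples.
    set.seed(42)
    N <- 500; P <- 100
    x <- matrix(rnorm(N * P), N, P, dimnames = list(NULL, paste0("x", 1:P)))
    y <- x[, 2] - 2 * x[, 6] + rnorm(N)
    train <- data.frame(y, x)
    test  <- train[sample(N, 50), ]   # stands in for new data in later sketches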

Ensembles of Models
§ Before the 1990s: a multitude of techniques was developed to tackle regression problems.
§ 1990s: a new idea appears: use a collection of "basic" models (an ensemble).
§ Ensembles give substantial improvements in accuracy compared with any single "basic" model.
§ Examples: Bagging, Boosting, Random Forests.

Key Ingredients of Ensembles
§ The type of "basic" model used in the ensemble (regression trees, K-NN, neural networks).
§ The way the basic models are built (data sub-sampling schemes, injection of randomness).
§ The way the basic models are combined.
§ Possible post-processing (tuning) of the resulting ensemble (optional).

Random Forests (RF)
§ Developed by Leo Breiman, Department of Statistics, University of California, Berkeley, in the late 1990s.
§ RF is resistant to overfitting.
§ RF is capable of handling a large number of predictors.

Key Features of RF
§ The basic model is a randomised regression tree.
§ Each tree is grown on a bootstrap sample.
§ The ensemble (forest) is formed by averaging the predictions from the individual trees.
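A minimal sketch of these ingredients using the randomForest package in R (Breiman and Cutler's algorithm); the data frames and the parameter values follow the illustrative setup above and are not tuned:

    library(randomForest)
    rf <- randomForest(y ~ ., data = train,
                       ntree = 500,    # number of bootstrap trees in the forest
                       mtry  = 10,     # M: predictors tried at each split
                       nodesize = 5)   # minimum size of terminal nodes
    pred <- predict(rf, newdata = test)  # average of the individual tree predictions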

Regression Trees
§ A regression tree performs recursive binary division of the data: start with the root node (all points) and split it into two parts (a left node and a right node).
§ Each split attempts to separate data points with high y_i's from data points with low y_i's as much as possible.
§ A split is based on a single predictor and a split point.
§ To find the best splitter, all possible splitting variables and split points are tried (see the sketch below).
§ Splitting is then repeated for the child nodes.
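The sketch below shows one way to code the exhaustive split search for a single node; it is a simplified illustration (x is the matrix of predictors in the node, y the corresponding responses), not the actual CART implementation:

    best_split <- function(x, y) {
      best <- list(improve = -Inf)
      for (j in seq_len(ncol(x))) {             # every candidate predictor...
        for (s in sort(unique(x[, j]))[-1]) {   # ...and every candidate split point
          left  <- y[x[, j] <  s]
          right <- y[x[, j] >= s]
          # improvement = reduction in residual sum of squares due to the split
          improve <- sum((y - mean(y))^2) -
                     sum((left - mean(left))^2) - sum((right - mean(right))^2)
          if (improve > best$improve) best <- list(var = j, point = s, improve = improve)
        }
      }
      best   # best splitting variable, split point and achieved improvement
    }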

RT Competitor List

Primary splits:
    x2  < to the right, improve= , (0 missing)
    x6  < to the left,  improve= , (0 missing)
    x51 < to the left,  improve= , (0 missing)
    x30 < to the right, improve= , (0 missing)
    x67 < to the right, improve= , (0 missing)
    x78 < to the left,  improve= , (0 missing)
    x62 < to the left,  improve= , (0 missing)
    x44 < to the left,  improve= , (0 missing)
    x25 < to the right, improve= , (0 missing)
    x21 < to the right, improve= , (0 missing)
    x82 < to the left,  improve= , (0 missing)
    x79 < to the right, improve= , (0 missing)
    x18 < to the right, improve= , (0 missing)

Predictions from a Tree model
§ A prediction from a tree is obtained by "dropping" x down the tree until it reaches a terminal node.
§ The predicted value is the average of the response values of the training data points in that terminal node.
§ Example: if x_1 >= … and x_82 >= …, then Prediction = 0.61.
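In R, predict() on a fitted rpart regression tree does exactly this: it drops each new point down the tree and returns the terminal-node mean (a sketch on the illustrative data above):

    library(rpart)
    tree <- rpart(y ~ ., data = train, method = "anova")  # regression tree
    predict(tree, newdata = test)   # terminal-node mean for each new point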

Pruning of CART trees
§ Prediction Error (PE) = Variance + Bias^2.
§ PE versus tree size has a U-shape: very large and very small trees are both bad.
§ Trees are grown until the terminal nodes become small...
§ ...and then pruned back.
§ Holdout data are used to estimate the PE of the candidate trees.
§ The tree with the smallest PE is selected.
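A sketch of the grow-then-prune recipe with rpart; note that rpart estimates PE by cross-validation rather than a separate holdout sample, and the control values below are illustrative:

    library(rpart)
    big <- rpart(y ~ ., data = train, method = "anova",
                 control = rpart.control(cp = 0, minsplit = 5, xval = 10))
    # pick the complexity with the smallest estimated prediction error...
    best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
    pruned  <- prune(big, cp = best_cp)   # ...and prune the large tree back to it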

Randomised Regression Trees I
§ Each tree is grown on a bootstrap sample: N data points are sampled with replacement.
§ Each such sample contains roughly 63% of the original data points; some records occur multiple times.
§ Each tree is built on its own bootstrap sample, so the trees are likely to differ.
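A sketch of a single bootstrap sample on the illustrative data; the ~63% figure comes from 1 - (1 - 1/N)^N, which approaches 1 - 1/e ≈ 0.632 for large N:

    N    <- nrow(train)
    boot <- sample(N, size = N, replace = TRUE)   # N draws with replacement
    length(unique(boot)) / N                      # typically close to 0.632
    tree_b <- rpart(y ~ ., data = train[boot, ], method = "anova")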

Randomised Regression Trees II
§ At each split, only M randomly selected predictors are allowed to compete as potential splitters, e.g. 10 out of 100.
§ A new group of eligible splitters is selected at random at each step.
§ At each step the chosen splitter is therefore likely to be somewhat suboptimal.
§ Every predictor gets a chance to compete as a splitter, so important predictors are very likely to be used as splitters eventually.
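A sketch of the per-split randomisation, reusing the best_split() helper from the earlier sketch: a fresh subset of M predictors is drawn for every split and the search is restricted to those columns (the helper and the default M = 10 are assumptions, not the actual implementation):

    random_split <- function(x, y, M = 10) {
      eligible <- sample(ncol(x), M)               # e.g. 10 columns out of 100
      split <- best_split(x[, eligible, drop = FALSE], y)
      split$var <- eligible[split$var]             # map back to an original column
      split
    }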

Competitor List for Randomised RT

Primary splits:
    x6  < to the left,  improve= , (0 missing)
    x78 < to the left,  improve= , (0 missing)
    x62 < to the left,  improve= , (0 missing)
    x79 < to the right, improve= , (0 missing)
    x80 < to the left,  improve= , (0 missing)
    x24 < to the left,  improve= , (0 missing)
    x90 < to the right, improve= , (0 missing)
    x75 < to the right, improve= , (0 missing)
    x68 < to the left,  improve= , (0 missing)
    Y   < to the right, improve= , (0 missing)
    x34 < to the left,  improve= , (0 missing)

Randomised Regression Trees III
§ M = 1: the splitting variable is selected at random, but the split point is not.
§ M = P: the original deterministic CART algorithm.
§ The trees are deliberately not pruned.

Combining the Trees
§ Each tree represents a regression model that fits the training data very closely: a low-bias, high-variance model.
§ The idea behind RF: take the predictions from a large number of highly variable trees and average them (see the sketch below).
§ The result is a low-bias, low-variance model.
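A sketch of the averaging step: grow B unpruned trees on bootstrap samples and average their predictions. For brevity the per-split predictor randomisation is omitted here, so this is plain bagging of trees rather than the full RF algorithm:

    B <- 500
    preds <- sapply(seq_len(B), function(b) {
      boot   <- sample(nrow(train), replace = TRUE)
      tree_b <- rpart(y ~ ., data = train[boot, ], method = "anova",
                      control = rpart.control(cp = 0, minsplit = 5))  # grown, not pruned
      predict(tree_b, newdata = test)
    })
    forest_pred <- rowMeans(preds)   # low-bias, low-variance ensemble prediction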

Correlation vs Strength
§ Another decomposition for the PE of Random Forests: PE(RF) ≤ ρ(BT)·PE(Tree).
§ ρ(BT): the correlation between any two trees in the forest.
§ PE(Tree): the prediction error (strength) of a single tree.
§ M = 1: low correlation, low strength.
§ M = P: high correlation, high strength.

RF as K-NN regression model I
§ RF induces a proximity measure in the predictor space: P(x_1, x_2) = proportion of trees in which x_1 and x_2 landed in the same terminal node.
§ Prediction at a point x: a weighted average of the training responses y_i, with weights given by the proximities P(x, x_i) (see the sketch below).
§ Only a fraction of the data points actually contributes to the prediction.
§ This strongly resembles the formula used for K-NN predictions.
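In the randomForest package the proximity matrix can be obtained directly from a fitted forest (a sketch on the illustrative data; proximity = TRUE stores an N-by-N matrix and can be memory-hungry for large N):

    library(randomForest)
    rf <- randomForest(y ~ ., data = train, proximity = TRUE)
    prox <- rf$proximity   # prox[i, j] = share of trees in which rows i and j
                           # land in the same terminal node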

RF as K-NN regression model II
§ Lin, Y. and Jeon, Y. Random Forests and Adaptive Nearest Neighbours. Technical Report 1055, Department of Statistics, University of Wisconsin.
§ Breiman, L. Consistency for a Simple Model of Random Forests. Technical Report 670, Statistics Department, University of California at Berkeley, 2004.

It was shown that:
§ Randomisation does reduce the variance component.
§ The optimal M is independent of the sample size.
§ RF does behave as an adaptive K-nearest-neighbour model: the shape and size of the neighbourhood are adapted to the local behaviour of the target regression function.

Case Study: Postcode Ranking in Motor Insurance
§ 575 postcodes in NSW.
§ For each postcode: the number of claims as well as the "exposure" (the number of policies in the postcode).
§ Problem: ranking of the postcodes for pricing purposes.

Approach
§ Each postcode is represented by the (x, y) coordinates of its centroid.
§ Model the expected claim frequency as a function of (x, y).
§ The target surface is likely to be highly irregular.
§ Add the coordinates of the postcodes along 100 randomly generated directions to allow greater flexibility (see the sketch below).
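A sketch of the random-directions augmentation under assumed names (post is a hypothetical data frame holding the centroid coordinates in columns x and y): project the centroids onto 100 random directions in the plane and add the projections as extra predictors.

    set.seed(1)
    angles <- runif(100, 0, pi)                # 100 random directions in the plane
    proj   <- sapply(angles, function(a) post$x * cos(a) + post$y * sin(a))
    colnames(proj) <- paste0("dir", seq_along(angles))
    post_aug <- cbind(post, proj)              # centroid coordinates + projections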

Tuning RF: M

Tuning RF: Size of Node

Things not covered
§ Clustering, missing-value imputation and outlier detection
§ Identification of important variables
§ OOB testing of RF models
§ Post-processing of RF models