Training of Boosted Decision Trees. Helge Voss (MPI–K, Heidelberg). MVA Workshop, CERN, July 10, 2009.


Slide 2: Growing a Decision Tree
- Start with the full training sample at the root node.
- Split the training sample at each node into two, using a cut on the variable that gives the best separation gain (see the separation-gain sketch after this slide).
- Continue splitting until:
  - the minimum number of events per node is reached,
  - the maximum number of nodes is reached,
  - the maximum depth is reached, or
  - a split would not give the minimum separation gain.
- Leaf nodes classify events as signal (S) or background (B) according to the majority of training events in the leaf, or return an S/B probability.
- Why no multiple branches (splits) per node? It fragments the data too quickly; moreover, multiple splits per node are equivalent to a series of binary node splits.
- What about multivariate splits? They are time consuming, and other methods are better adapted to such correlations.
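Below is a minimal sketch (not the TMVA implementation) of the cut search described above: for one variable it scans all candidate thresholds and keeps the one with the largest separation gain, here measured with the Gini index. The function names and the arrays `values` / `is_signal` are illustrative assumptions.

```python
# Sketch: find the single cut on one variable that maximises the separation gain,
# using the Gini index p*(1-p) as node impurity.
import numpy as np

def gini(n_s, n_b):
    """Gini impurity of a node with n_s signal and n_b background events."""
    n = n_s + n_b
    if n == 0:
        return 0.0
    p = n_s / n
    return p * (1.0 - p)

def best_cut(values, is_signal):
    """Scan all candidate cut positions on one variable and return the cut with
    the largest separation gain: parent impurity minus the event-weighted
    impurity of the two daughter nodes."""
    order = np.argsort(values)
    v, s = values[order], is_signal[order].astype(float)
    n_s_tot, n_b_tot = s.sum(), (1 - s).sum()
    parent = gini(n_s_tot, n_b_tot)
    best_gain, best_threshold = 0.0, None
    n_s_left = n_b_left = 0.0
    for i in range(len(v) - 1):
        n_s_left += s[i]
        n_b_left += 1 - s[i]
        if v[i] == v[i + 1]:          # cannot place a cut between identical values
            continue
        n_left, n_right = i + 1, len(v) - i - 1
        left = gini(n_s_left, n_b_left)
        right = gini(n_s_tot - n_s_left, n_b_tot - n_b_left)
        gain = parent - (n_left * left + n_right * right) / len(v)
        if gain > best_gain:
            best_gain, best_threshold = gain, 0.5 * (v[i] + v[i + 1])
    return best_threshold, best_gain
```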

Slide 3: Decision Tree Pruning
- One can continue node splitting until all leaf nodes are essentially pure (on the training sample); obviously, that is overtraining.
- Two possibilities:
  - Stop growing earlier: generally not a good idea, since an apparently useless split might open up subsequent useful splits.
  - Grow the tree to the end and "cut back" nodes that seem statistically dominated: pruning.
- Example: cost-complexity pruning (a small sketch follows below). Assign to every sub-tree T the cost
    C(T, α) = (loss of T on the training sample) + α × (number of leaf nodes of T),
  where the first term is the loss function and α is the regularisation (cost) parameter. For a given α, find the sub-tree T with minimal C(T, α); prune up to the value of α that does not show overtraining on the test sample.
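As a small, hedged illustration of the cost-complexity criterion, the sketch below evaluates C(T, α) for a toy tree represented as nested dicts; the representation and field names are assumptions made for illustration, not TMVA internals.

```python
# Sketch: cost-complexity criterion C(T, alpha) for a toy tree.
# A leaf is {"loss": ...}; an internal node is {"left": ..., "right": ...}.
def cost_complexity(tree, alpha):
    """Return C(T, alpha) = total leaf loss + alpha * (number of leaves)."""
    def walk(node):
        if "loss" in node:                      # leaf: return (loss, n_leaves)
            return node["loss"], 1
        l_loss, l_n = walk(node["left"])
        r_loss, r_n = walk(node["right"])
        return l_loss + r_loss, l_n + r_n
    loss, n_leaves = walk(tree)
    return loss + alpha * n_leaves

# Pruning idea: among the sub-trees obtained by collapsing internal nodes,
# keep the one with minimal C(T, alpha); increase alpha until the performance
# on the test sample starts to degrade (overtraining sets in).
```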

Slide 4: Decision Tree Pruning
[Figure: a "real life" example of an optimally pruned decision tree, shown before and after pruning.]
- Pruning algorithms are developed for and applied to individual trees.
- An optimally pruned single tree is not necessarily optimal in a forest!

Slide 5: Boosting
[Diagram: the training sample is used to train classifier C(0)(x); the sample is then re-weighted and used to train C(1)(x), re-weighted again for C(2)(x) and C(3)(x), and so on up to C(m)(x).]

Slide 6: Adaptive Boosting (AdaBoost)
[Diagram: the same boosting chain as on the previous slide, training C(0)(x) on the training sample, re-weighting, training C(1)(x), and so on up to C(m)(x).]
- AdaBoost re-weights events misclassified by the previous classifier by the factor (1 − err)/err, where err is that classifier's misclassification rate on the weighted training sample.
- AdaBoost also weights the classifiers themselves using the error rate of the individual classifier: the weight of classifier C(i) in the combined response is α(i) = ln[(1 − err(i))/err(i)] (see the sketch below).
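The following sketch ties the two formulas together in a toy AdaBoost loop over decision stumps; `train_stump` and `stump_predict` are assumed placeholders (for instance the cut search sketched earlier), and this is not the TMVA implementation.

```python
# Sketch: AdaBoost on decision stumps with labels y in {+1, -1}.
import numpy as np

def adaboost(x, y, n_trees, train_stump, stump_predict):
    """Return a list of (stump, ln((1-err)/err)) pairs."""
    w = np.full(len(y), 1.0 / len(y))          # start with uniform event weights
    ensemble = []
    for _ in range(n_trees):
        stump = train_stump(x, y, w)           # train on the weighted sample
        pred = stump_predict(stump, x)         # predictions in {+1, -1}
        miss = pred != y
        err = w[miss].sum() / w.sum()          # weighted misclassification rate
        if err == 0.0 or err >= 0.5:
            break                              # weak learner no longer useful
        alpha = (1.0 - err) / err
        w[miss] *= alpha                       # boost the misclassified events
        w /= w.sum()                           # renormalise the event weights
        ensemble.append((stump, np.log(alpha)))
    return ensemble

def boosted_response(ensemble, x, stump_predict):
    """Weighted-majority response of the boosted ensemble."""
    return sum(a * stump_predict(stump, x) for stump, a in ensemble)
```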

Slide 7: AdaBoost: A Simple Demonstration
- The example (somewhat artificial, but nice for demonstration): a data file with three "bumps".
- Weak classifier: one single simple "cut", i.e. a decision tree stump (a single node sending events with var(i) > x to one leaf and events with var(i) <= x to the other); toy stump snippets follow below.
- Two reasonable cuts:
  a) Var0 > 0.5: ε_signal = 66%, ε_bkg ≈ 0%, i.e. 16.5% of all events misclassified; or
  b) Var0 < −0.5: ε_signal = 33%, ε_bkg ≈ 0%, i.e. 33% of all events misclassified.
- The training of a single decision tree stump will find cut a).
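For concreteness, the two cuts can be written as toy stump classifiers (signal = +1, background = −1); `var0` is an assumed NumPy array of Var0 values, since the data file itself is not reproduced here.

```python
import numpy as np

def stump_a(var0):            # cut a): Var0 > 0.5 tagged as signal
    return np.where(var0 > 0.5, +1, -1)

def stump_b(var0):            # cut b): Var0 < -0.5 tagged as signal
    return np.where(var0 < -0.5, +1, -1)
```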

Slide 8: AdaBoost: A Simple Demonstration (continued)
- The first "tree", choosing cut a), gives an error fraction err = 0.165.
- Before building the next "tree", the wrongly classified training events are re-weighted by (1 − err)/err ≈ 5, so the next "tree" sees essentially a re-weighted data sample and hence chooses cut b): Var0 < −0.5.
- The combined classifier Tree1 + Tree2: the (weighted) average of the two trees' responses to a test event separates signal from background as well as one would expect from the most powerful classifier.

Slide 9: AdaBoost: A Simple Demonstration (continued)
[Figure: classifier response with only 1 tree "stump" versus with 2 tree "stumps" combined with AdaBoost.]

Slide 10: Bagging and Randomised Trees
Classifier combinations other than AdaBoost or Gradient Boost:
- Bagging: combine trees grown from "bootstrap" samples, i.e. re-sample the training data with replacement (see the sketch below).
- Randomised Trees (Random Forest: trademark of L. Breiman and A. Cutler): combine trees grown with
  - only a random subset of the training data, and
  - at each node, only a random subset of the variables considered for the split;
  - NO pruning!
- These combined classifiers work surprisingly well, are very stable, and are almost perfect "out of the box" classifiers.
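A minimal bagging sketch, assuming generic `train_tree` / `tree_predict` placeholders rather than any specific library API:

```python
# Sketch: grow trees on bootstrap re-samples of the training data and average
# their responses.
import numpy as np

def bagging(x, y, n_trees, train_tree, rng=None):
    rng = rng or np.random.default_rng()
    n = len(y)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)       # bootstrap: n events drawn with replacement
        forest.append(train_tree(x[idx], y[idx]))
    return forest

def bagged_response(forest, x, tree_predict):
    """Average the individual tree responses (majority vote for hard labels)."""
    return np.mean([tree_predict(tree, x) for tree in forest], axis=0)

# A randomised-tree / Random Forest variant would additionally restrict each
# node split to a random subset of the variables and skip pruning entirely.
```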

Slide 11: AdaBoost vs. Other Combined Classifiers
- Sometimes people present "boosting" as nothing more than "smearing", done to make the decision trees more stable with respect to statistical fluctuations in the training sample.
- Clever "boosting", however, can do more than, for example, Random Forests or Bagging: in this example, pure statistical fluctuations are not enough to enhance the 2nd peak sufficiently.
- However: a "fully grown decision tree" is much more than a "weak classifier", so the "stabilisation" aspect is more important.
- Surprisingly: using smaller trees (weaker classifiers) in AdaBoost and other clever boosting algorithms (e.g. Gradient Boost) often seems to give significantly better overall performance!

Slide 12: Boosting at Work
[Figure panels comparing the classifier response for different settings: Gradient Boost with max. tree depth = 2; AdaBoost with max. tree depth = 2 and min. # events in leaf = 40; AdaBoost with max. tree depth = 2 and min. # events in leaf = 400; AdaBoost with max. tree depth = 3 and min. # events in leaf = 40.]
- For some reason "pruning" does not work as expected; restricting the growth of the tree (maximum depth, minimum number of events per leaf) seems better (see the sketch below).
- This "feature" is particularly present with heavily weighted events!
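As a hedged illustration of "restricting the growth of the tree", the settings listed above can be expressed with scikit-learn as a stand-in; the study itself used TMVA, whose option names differ, and the number of trees chosen here is illustrative.

```python
# Sketch: shallow trees with a minimum number of events per leaf, instead of pruning.
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Gradient Boost with max. tree depth = 2
grad_boost = GradientBoostingClassifier(max_depth=2, n_estimators=400)

# AdaBoost on stumps of depth 2 with at least 40 events per leaf
# (the keyword is `base_estimator` instead of `estimator` in older scikit-learn versions)
ada_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2, min_samples_leaf=40),
    n_estimators=400,
)
```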

Slide 13: Summary
- It is vital (even though decision trees are typically seen as good "out of the box" methods) to "play" with all tuning parameters to get the best result.
- Pruning has been developed for "unboosted" decision trees; obviously, with pruned trees in a boosted ensemble things look slightly different, and the "optimal" pruning technique seems to be different as well. This is still not finally understood.