Boosted Decision Trees, a Powerful Event Classifier

Boosted Decision Trees, a Powerful Event Classifier Byron Roe, Hai-Jun Yang, Ji Zhu (University of Michigan) Byron Roe

Outline
What is Boosting?
Comparisons of ANN and Boosting for the MiniBooNE experiment
Comparisons of Boosting and Other Classifiers
Some tested modifications to Boosting, and miscellaneous
Byron Roe

Training and Testing Events Both ANN and boosting algorithms use a set of known events to train the algorithm. It would be biased to use the same set to estimate the accuracy of the selection; the algorithm has been trained for this specific sample. A new set, the testing set of events, is used to test the algorithm. All results quoted here are for the testing set. Byron Roe
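A minimal sketch of this idea in Python (not the collaboration's actual code; the array names and split fraction are illustrative):

```python
import numpy as np

def split_train_test(x, y, train_fraction=0.5, seed=0):
    """Randomly divide the known (simulated) events into a training set and
    an independent testing set; all performance figures are quoted on the
    testing set to avoid the bias described above."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(train_fraction * len(y))
    train, test = order[:n_train], order[n_train:]
    return x[train], y[train], x[test], y[test]
```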

Boosted Decision Trees What is a decision tree? What is “boosting the decision trees”? Two algorithms for boosting. Byron Roe

Decision Tree Background/Signal Go through all PID variables and find the best variable and value at which to split the events. For each of the two subsets, repeat the process. Proceeding in this way, a tree is built. Ending nodes are called leaves. Byron Roe

Select Signal and Background Leaves Assume an equal weight of signal and background training events. If more than ½ of the weight of a leaf corresponds to signal, it is a signal leaf; otherwise it is a background leaf. Signal events on a background leaf or background events on a signal leaf are misclassified. Byron Roe

Criterion for “Best” Split Purity, P, is the fraction of the weight of a leaf due to signal events. Gini = (Σi Wi) P(1 − P), summed over the events on the leaf. Note that Gini is 0 for all signal or all background. The criterion is to minimize Gini_left + Gini_right of the two children from a parent node. Byron Roe

Criterion for Next Branch to Split Pick the branch that maximizes the change in Gini. Criterion = Gini_parent − Gini_right-child − Gini_left-child Byron Roe
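A short Python sketch of the node-splitting step, assuming the Gini definition on the previous slide; the data layout and function names are illustrative, not the authors' implementation:

```python
import numpy as np

def gini(weights, is_signal):
    """Gini = (sum of weights) * P * (1 - P), with P the signal purity.
    It is zero for a pure signal or pure background node."""
    total = weights.sum()
    if total == 0.0:
        return 0.0
    purity = weights[is_signal].sum() / total
    return total * purity * (1.0 - purity)

def best_split(x, weights, is_signal):
    """Scan every PID variable and candidate cut value, keeping the split
    that maximizes Criterion = Gini_parent - Gini_left - Gini_right."""
    parent = gini(weights, is_signal)
    best_var, best_cut, best_gain = None, None, 0.0
    for var in range(x.shape[1]):
        for cut in np.unique(x[:, var]):
            left = x[:, var] < cut
            gain = (parent
                    - gini(weights[left], is_signal[left])
                    - gini(weights[~left], is_signal[~left]))
            if gain > best_gain:
                best_var, best_cut, best_gain = var, cut, gain
    return best_var, best_cut, best_gain
```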

Decision Trees This is a decision tree. Decision trees have been known for some time, but they are often unstable; a small change in the training sample can produce a large difference. Byron Roe

Boosting the Decision Tree Give the training events misclassified under this procedure a higher weight. Continuing in this way, build perhaps 1000 trees and take a weighted average of the results (+1 if signal leaf, −1 if background leaf). Byron Roe

Two Commonly used Algorithms for changing weights 1. AdaBoost 2. Epsilon boost (shrinkage) Byron Roe

Definitions x_i = set of particle ID variables for event i. y_i = 1 if event i is signal, −1 if background. T_m(x_i) = 1 if event i lands on a signal leaf of tree m, −1 if it lands on a background leaf. Byron Roe

AdaBoost Define err_m = (weight of misclassified events) / (total weight). Compute α_m = β ln((1 − err_m)/err_m), with β ≈ 0.5 here. Increase the weight of each misclassified event by the factor e^{α_m}. Byron Roe

Scoring events with AdaBoost Renormalize the weights so they sum to 1. Score an event by summing over trees: T(x) = Σ_m α_m T_m(x). Byron Roe
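Putting the last few slides together, a hedged Python sketch of AdaBoost training and scoring; `build_tree(x, w)` is a stand-in for the tree-growing step sketched earlier and is assumed to return a classifier giving the ±1 leaf assignment:

```python
import numpy as np

def adaboost(x, y, build_tree, n_trees=1000, beta=0.5):
    """AdaBoost sketch.  y is +1 (signal) or -1 (background); build_tree(x, w)
    is a hypothetical helper returning a classifier T with T(x) in {+1, -1}."""
    w = np.ones(len(y)) / len(y)
    trees, alphas = [], []
    for m in range(n_trees):
        tree = build_tree(x, w)
        wrong = tree(x) != y                        # misclassified events
        err = w[wrong].sum() / w.sum()              # err_m = weight wrong / total weight
        err = min(max(err, 1e-12), 1 - 1e-12)       # guard against perfect/failed trees
        alpha = beta * np.log((1.0 - err) / err)    # alpha_m = beta * ln((1-err_m)/err_m)
        w[wrong] *= np.exp(alpha)                   # boost misclassified weights
        w /= w.sum()                                # renormalize
        trees.append(tree)
        alphas.append(alpha)

    def score(x_new):
        """Score = sum over trees of alpha_m * T_m(x)."""
        return sum(a * t(x_new) for a, t in zip(alphas, trees))

    return score
```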

Epsilon Boost (shrinkage) After tree m, change the weight of misclassified events by a small factor set by ε, typically ~0.01 (0.03 here). For misclassified events the weight is multiplied by e^{2ε} (see the formula below). Renormalize weights. Score by summing over trees. Byron Roe
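The update formula itself appeared as an image in the original slide; a reconstruction in LaTeX, consistent with the worked example near the end of this talk (where the factor exp(2 × 0.01) = 1.02 appears):

```latex
% Epsilon boost (shrinkage): after tree m, each misclassified event i
% has its weight multiplied by a small fixed factor, epsilon ~ 0.01-0.03.
w_i \;\rightarrow\; w_i \, e^{2\epsilon}
\qquad \text{for misclassified events } i .
```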

Unweighted and Weighted Misclassified Event Rate vs. Number of Trees Byron Roe

Comparison of methods Epsilon boost changes the weights a little at a time. Let y = 1 for signal, −1 for background, and F = the score summed over trees. AdaBoost can be shown to optimize each change of weights: the expectation of exp(−yF) is minimized, and the optimum is F = ½ log of the odds that y = 1 given x. Byron Roe
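The statement on this slide is the standard result from Friedman, Hastie, and Tibshirani (cited in the references); a one-line derivation, minimizing the expected exponential loss pointwise in x:

```latex
% Setting the derivative of E[e^{-yF(x)} | x] with respect to F(x) to zero:
\frac{\partial}{\partial F(x)} \, E\!\left[ e^{-yF(x)} \mid x \right]
  = -P(y{=}1 \mid x)\, e^{-F(x)} + P(y{=}{-}1 \mid x)\, e^{F(x)} = 0
\;\;\Longrightarrow\;\;
F(x) = \tfrac{1}{2} \ln \frac{P(y{=}1 \mid x)}{P(y{=}{-}1 \mid x)} .
```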

The MiniBooNE Collaboration Byron Roe

A 40-foot-diameter tank of mineral oil, surrounded by about 1280 photomultipliers. Both Cherenkov and scintillation light are detected. Geometrical shape and timing distinguish events. Byron Roe

Tests of Boosting Parameters
45 leaves seemed to work well for our application.
1000 trees was sufficient (or more than sufficient).
AdaBoost with β about 0.5 and Epsilon Boost with ε about 0.03 worked well, although small changes made little difference.
For other applications these numbers may need adjustment.
For MiniBooNE, around 100 variables are needed for best results; too many variables degrade performance.
Relative ratio = const. × (fraction of background kept) / (fraction of signal kept). Smaller is better!
Byron Roe

Effects of Number of Leaves and Number of Trees Smaller is better! R = const. × (fraction of background kept) / (fraction of signal kept). Byron Roe

Number of feature variables in boosting In recent trials we have used 182 variables, and boosting worked well. However, by looking at the frequency with which each variable was used as a splitting variable, it was possible to reduce the number to 86 without loss of sensitivity. Several methods for choosing variables were tried, but this one worked as well as any (a sketch follows below). After selecting by frequency of use as a splitting variable, some further improvement may be obtained by looking at the correlations between variables. Byron Roe
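A sketch of the frequency-based weeding described above; the `split_variables` attribute (the list of variables used at the nodes of a trained tree) is a hypothetical interface, not part of the released code:

```python
from collections import Counter

def split_variable_frequency(trees):
    """Count how often each PID variable is chosen as a splitting variable
    across all trees; variables that are rarely (or never) chosen are
    candidates for removal from the feature list."""
    counts = Counter()
    for tree in trees:
        counts.update(tree.split_variables)   # hypothetical attribute
    return counts.most_common()               # [(variable, n_uses), ...]
```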

Effect of Number of PID Variables Byron Roe

Comparison of Boosting and ANN The relative ratio here is ANN background kept / Boosting background kept. Greater than one implies boosting wins! A. All types of background events; red is 21 and black is 52 training variables. B. Background is π0 events; red is 22 and black is 52 training variables. (Horizontal axis: percent of ν_e CCQE signal kept.) Byron Roe

Numerical Results from sfitter (a second reconstruction program) There was an extensive attempt to find the best variables for ANN and for boosting, starting from about 3000 candidates. Training was against π0 and related backgrounds, with 22 ANN variables and 50 boosting variables. For the region near 50% of signal kept, the ratio of ANN to boosting background was about 1.2. Byron Roe

Robustness For either boosting or ANN, it is important to know how robust the method is, i.e., whether small changes in the model produce large changes in output. In MiniBooNE this is handled by generating many sets of events with parameters varied by about 1 sigma and checking the differences. This check is not complete, but so far the selections look quite robust for boosting. Byron Roe

How did the sensitivities change with a new optical model? In Nov. 2004 a new, much-changed optical model of the detector was introduced for making MC events. Both rfitter and sfitter needed to be changed to optimize fits for this model. Using the SAME feature variables as for the old model: for both rfitter and sfitter, the boosting results were about the same; for sfitter, the ANN results became about a factor of 2 worse. Byron Roe

For ANN For ANN one needs to set the temperature, hidden layer size, learning rate, and so on; there are lots of parameters to tune. For ANN, if one a. multiplies a variable by a constant, var(17) → 2 × var(17), b. switches two variables, var(17) ↔ var(18), or c. puts a variable in twice, the result is very likely to change. Byron Roe

For Boosting There are only a few parameters, and once set they have been stable for all calculations within our experiment. If a variable is replaced by y = f(x) with f monotonic (x1 > x2 implies y1 > y2), the results are identical, since the splits depend only on the ordering of values. Putting variables in twice or changing the order of variables has no effect. Byron Roe

Tests of Boosting Variants None was clearly better than AdaBoost or Epsilon Boost. Byron Roe

Can Convergence Speed be Improved? Removing correlations between variables helps. Random Forest ideas help when combined with boosting (using a random fraction [1/2] of the training events per tree, drawn with replacement, and a random fraction of the PID variables per node; all PID variables were used per node for the test here). Softening the step-function scoring also helps: y = 2 × purity − 1; score = sign(y) × sqrt(|y|) (see the sketch below). Byron Roe
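The softened leaf scoring is simple enough to write down directly; a small Python sketch (purity here is the signal fraction of the leaf weight, as defined earlier):

```python
import numpy as np

def leaf_score_step(purity):
    """Step-function scoring: +1 for a signal leaf, -1 for a background leaf."""
    return 1.0 if purity > 0.5 else -1.0

def leaf_score_smooth(purity):
    """Softened scoring from this slide: y = 2*purity - 1,
    score = sign(y) * sqrt(|y|)."""
    y = 2.0 * purity - 1.0
    return np.sign(y) * np.sqrt(abs(y))
```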

Smooth Scoring and Step Function Byron Roe

Performance of AdaBoost with Step Function and Smooth Function Byron Roe

Post-Fitting Post-fitting is an attempt to reweight the trees when summing tree scores, after all the trees are made. Two attempts produced only a very modest (a few percent), if any, gain. Byron Roe

Conclusions
Boosting is very robust. Given a sufficient number of leaves and trees, AdaBoost or Epsilon Boost reaches an optimum level, which was not bettered by any variant tried.
Boosting was better than ANN in our tests by a factor of 1.2-1.8.
There are ways (such as the smooth scoring function) to increase convergence speed in some cases.
Post-fitting makes only a small improvement.
Several techniques can be used for weeding variables; examining the frequency with which a given variable is used works reasonably well.
Downloads in FORTRAN or C++ available at: http://www.gallatin.physics.lsa.umich.edu/~roe/
Byron Roe

References
R.E. Schapire, “The strength of weak learnability,” Machine Learning 5 (2), 197-227 (1990). First suggested the boosting approach, with 3 trees taking a majority vote.
Y. Freund, “Boosting a weak learning algorithm by majority,” Information and Computation 121 (2), 256-285 (1995). Introduced using many trees.
Y. Freund and R.E. Schapire, “Experiments with a new boosting algorithm,” Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, pp. 148-156 (1996). Introduced AdaBoost.
J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: a statistical view of boosting,” Annals of Statistics 28 (2), 337-407 (2000). Showed that AdaBoost can be viewed as successive approximations to a maximum likelihood solution.
T. Hastie, R. Tibshirani, and J. Friedman, “The Elements of Statistical Learning,” Springer (2001). Good reference for decision trees and boosting.
B.P. Roe et al., “Boosted decision trees as an alternative to artificial neural networks for particle identification,” NIM A543, pp. 577-584 (2005).
Hai-Jun Yang, Byron P. Roe, and Ji Zhu, “Studies of Boosted Decision Trees for MiniBooNE Particle Identification,” physics/0508045, submitted to NIM, July 2005.
Byron Roe

Example AdaBoost: Suppose the weighted error rate is 40%, i.e., err = 0.4, and β = 1/2. Then α = (1/2) ln((1 − 0.4)/0.4) = 0.203, and the weight of a misclassified event is multiplied by exp(0.203) = 1.225. Epsilon boost: the weight of wrong events is increased by exp(2 × 0.01) = 1.02. Byron Roe

AdaBoost Optimization Byron Roe

AdaBoost Fitting is Monotone Byron Roe

The MiniBooNE Experiment Byron Roe

Comparison of 21 (or 22) vs 52 variables for Boosting The vertical axis is the ratio of background kept for 21 (22) variables to that kept for 52 variables, both for boosting. Red is for a cocktail training sample and black is for a π0 training sample. Error bars are MC statistical errors only. Byron Roe

Artificial Neural Networks Used to classify events, for example into “signal” and “noise/background”. Suppose you have a set of “feature variables” obtained from the kinematic variables of the event. Byron Roe

Neural Network Structure Combine the features in a non-linear way into a “hidden layer” and then into a “final layer”. Use a training set to find the best weights w_ik to distinguish signal and background. Byron Roe
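The network equations were shown as a figure; a common form of what such a diagram represents (this is an assumption about the missing figure, with a sigmoid activation g):

```latex
% Hidden units combine the input features x_k with weights w_{ik};
% a second set of weights v_i combines the hidden units into the output.
h_i = g\!\Big( \sum_k w_{ik}\, x_k \Big), \qquad
o   = g\!\Big( \sum_i v_i\, h_i \Big), \qquad
g(u) = \frac{1}{1 + e^{-u}} .
```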

Feedforward Neural Network--I Byron Roe

Feedforward Neural Network--II Byron Roe

Determining the weights Suppose we want signal events to give output = 1 and background events to give output = 0. The mean square error is computed over N_p training events with desired outputs o_i (either 0 or 1) and ANN results t_i. Byron Roe
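The error formula itself was an image; reconstructed here from the definitions on this slide:

```latex
% Mean square error over N_p training events, with desired outputs o_i
% (0 or 1) and ANN outputs t_i:
E = \frac{1}{N_p} \sum_{i=1}^{N_p} \left( t_i - o_i \right)^2 .
```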

Back Propagation to Determine Weights Byron Roe

AdaBoost vs Epsilon Boost and differing tree sizes A. Background kept for 8 leaves / background kept for 45 leaves; red is AdaBoost, black is Epsilon Boost. B. Background kept for AdaBoost / background kept for Epsilon Boost, with N_leaves = 45. Byron Roe

AdaBoost Output for Training and Test Samples Byron Roe