Lecture 17. Boosting. CS 109A/AC 209A/STAT 121A Data Science: Harvard University, Fall 2016. Instructors: P. Protopapas, K. Rader, W. Pan.

Announcements: Inconsistencies with 1 or 2 TFs' grading. Some of you got higher grades than you should have, some lower.

Outline: Boosting; miscellaneous other issues; examples.

Boosting: Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Recall bagging: generate multiple trees from bootstrapped data and average them. Bagging results in identically distributed (i.d.) trees, not i.i.d. trees, because the trees are correlated. Random forests produce i.i.d. (or at least more independent) trees by randomly selecting a subset of predictors at each split.
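As a quick recap, a minimal scikit-learn sketch of the bagging vs. random forest distinction above (the synthetic data and hyperparameters are illustrative assumptions, not values from the lecture):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Toy data; names and settings are illustrative only.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# Bagging: each tree is fit to a bootstrap sample but sees all predictors,
# so the trees are highly correlated (identically distributed, not independent).
bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)

# Random forest: each split considers only a random subset of predictors,
# which decorrelates the trees.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)
```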

Boosting: “Boosting is one of the most powerful learning ideas introduced in the last twenty years.” Hastie, Tibshirani and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009).

Boosting: Boosting works very differently.
1. Boosting does not involve bootstrap sampling.
2. Trees are grown sequentially: each tree is grown using information from previously grown trees.
3. Like bagging, boosting involves combining a large number of decision trees f^1, ..., f^B.

Sequential fitting (regression): Given the current model, we fit a decision tree to the residuals from that model; the response variable is now the residuals rather than Y. We then add this new decision tree into the fitted function in order to update the residuals. The learning rate has to be controlled.

Boosting for regression:
1. Set f(x) = 0 and r_i = y_i for all i in the training set.
2. For b = 1, 2, ..., B, repeat:
   a. Fit a tree f^b with d splits (d + 1 terminal nodes) to the training data (X, r).
   b. Update the fit by adding in a shrunken version of the new tree: f(x) ← f(x) + λ f^b(x).
   c. Update the residuals: r_i ← r_i − λ f^b(x_i).
3. Output the boosted model: f̂(x) = Σ_{b=1}^{B} λ f^b(x).
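A minimal Python sketch of this algorithm, using scikit-learn's DecisionTreeRegressor as the base learner (the function and variable names are illustrative; this follows the three steps above literally rather than calling a library boosting implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_regression_trees(X, y, B=1000, lam=0.01, d=1):
    """Fit B shrunken regression trees to successive residuals (steps 1-2 above)."""
    r = y.astype(float).copy()            # step 1: r_i = y_i (and f(x) = 0 implicitly)
    trees = []
    for b in range(B):                    # step 2
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # d splits -> d + 1 terminal nodes
        tree.fit(X, r)                    # 2a: fit a small tree to the current residuals
        r -= lam * tree.predict(X)        # 2b-2c: add shrunken tree to the fit, update residuals
        trees.append(tree)
    return trees

def predict_boosted(trees, X, lam=0.01):
    """Step 3: the boosted model is the sum of the shrunken trees."""
    return lam * sum(tree.predict(X) for tree in trees)
```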

Boosting tuning parameters:
- The number of trees B. RF and bagging do not overfit as B increases, but boosting can overfit! Choose B by cross-validation.
- The shrinkage parameter λ, a small positive number. Typical values are 0.01 or 0.001, but the right choice depends on the problem. λ controls the learning rate.
- The number of splits d in each tree, which controls the complexity of the boosted ensemble. Stumpy trees (d = 1) often work well.
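Choosing these parameters by cross-validation might look like the sketch below, using scikit-learn's GradientBoostingRegressor on synthetic data (the grids are illustrative assumptions, not recommendations from the lecture). Here n_estimators plays the role of B, learning_rate plays λ, and max_depth limits tree complexity (max_depth = 1 gives stumps):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [100, 500, 1000],     # B: more trees can eventually overfit
    "learning_rate": [0.001, 0.01, 0.1],  # lambda: the shrinkage / learning rate
    "max_depth": [1, 2, 3],               # tree complexity (1 = stumps)
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```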

Boosting for classification Challenge question for HW7

Different flavors: ID3, or Iterative Dichotomiser 3, was the first of three decision tree implementations developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106). It handles only categorical predictors and does no pruning. C4.5 is Quinlan's next iteration. The new features (versus ID3) are: (i) it accepts both continuous and discrete features; (ii) it handles incomplete data points; (iii) it addresses the over-fitting problem by a (very clever) bottom-up technique usually known as "pruning"; and (iv) different weights can be applied to the features that comprise the training data. Used in Orange (http://orange.biolab.si/).

Different flavors: C5.0's most significant new feature is a scheme for deriving rule sets. After a tree is grown, the splitting rules that define the terminal nodes can sometimes be simplified: that is, one or more conditions can be dropped without changing the subset of observations that fall in the node. CART, or Classification And Regression Trees, is often used as a generic acronym for the term decision tree, though it apparently has a more specific meaning. In sum, the CART implementation is very similar to C4.5. Used in sklearn.

Missing data: What if predictor values are missing?
- Remove those examples => depletion of the training set.
- Impute the values, either with the mean, with kNN, or from the marginal or joint distributions.
- Trees have a nicer way of doing this: for categorical predictors, create a "missing" category; otherwise, use surrogate predictors.
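For example, mean imputation and a "missing" category take only a couple of lines with pandas (the data frame below is a made-up toy example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 72_000, 61_000],   # numeric predictor with a missing value
    "region": ["north", None, "south", "north"],  # categorical predictor with a missing value
})

# Numeric predictor: impute with the mean (kNN or distribution-based imputation are alternatives).
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical predictor: treat "missing" as its own category.
df["region"] = df["region"].fillna("missing")
```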

Missing data: A surrogate is a substitute for the primary splitter. The ideal surrogate splits the data in exactly the same way as the primary split; in other words, we are looking for something else in the data that can do the same work the primary splitter accomplished. First, the tree is split to find the primary splitter. Then surrogates are searched for, even when there is no missing data.

Missing data: The primary splitter may never have been missing in the training data; however, it may be missing in future data. When it is missing, the surrogates can take over and do the work that the primary splitter accomplished during the initial building of the tree. CART attempts to find at least five surrogates for every node. In addition, surrogates reveal common patterns among the predictors in the data set.
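The surrogate search can be illustrated with a small toy function: given the primary split, scan the other predictors for the single-variable split that best agrees with it. This is only a sketch of the idea, not CART's actual implementation (which ranks and stores several surrogates per node):

```python
import numpy as np

def best_surrogate(X, primary_col, primary_threshold):
    """Return (column, threshold, agreement) of the split that best mimics the primary split."""
    primary_left = X[:, primary_col] <= primary_threshold  # side assigned by the primary splitter
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        if j == primary_col:
            continue
        for t in np.unique(X[:, j]):
            agreement = np.mean((X[:, j] <= t) == primary_left)
            agreement = max(agreement, 1.0 - agreement)  # a flipped split is also a valid surrogate
            if agreement > best[2]:
                best = (j, t, agreement)
    return best

# Example: find a surrogate for the primary split "feature 0 <= 0.5" on toy data.
X = np.random.RandomState(0).normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * X[:, 3]   # feature 3 is correlated with feature 0, so it makes a good surrogate
print(best_surrogate(X, primary_col=0, primary_threshold=0.5))
```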

Further reading:
- Pattern Recognition and Machine Learning, Christopher M. Bishop.
- The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
- Gradient boosting: https://en.wikipedia.org/wiki/Gradient_boosting
- Boosting: https://en.wikipedia.org/wiki/Boosting_(machine_learning)
- Ensemble learning: https://en.wikipedia.org/wiki/Ensemble_learning

Random best practices

Cross Validation