Additive Logistic Regression: a Statistical View of Boosting


Additive Logistic Regression: a Statistical View of Boosting. J. Friedman, T. Hastie, & R. Tibshirani, The Annals of Statistics (2000).

Outline Introduction A brief history of boosting Additive Models AdaBoost – an Additive logistic regression model Simulation studies

Discrete AdaBoost
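The algorithm listing itself appears only as an image in the original slides; the following is a minimal NumPy sketch of Discrete AdaBoost with a weighted decision stump as the weak learner (the stump fitter and all names here are illustrative, not from the paper):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, sign)
    with the lowest weighted misclassification error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]                              # (feature, threshold, sign)

def discrete_adaboost(X, y, M=50):
    """Discrete AdaBoost sketch, y in {-1, +1}. Returns (stump, c_m) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform initial weights
    ensemble = []
    for _ in range(M):
        j, thr, sign = fit_stump(X, y, w)
        f = sign * np.where(X[:, j] > thr, 1, -1)
        err = np.clip(np.sum(w * (f != y)), 1e-10, 1 - 1e-10)
        c = np.log((1 - err) / err)              # classifier weight
        w *= np.exp(c * (f != y))                # up-weight misclassified points
        w /= w.sum()
        ensemble.append(((j, thr, sign), c))
    return ensemble

def predict(ensemble, X):
    """Weighted vote: sign of the sum of c_m * f_m(x)."""
    F = np.zeros(len(X))
    for (j, thr, sign), c in ensemble:
        F += c * sign * np.where(X[:, j] > thr, 1, -1)
    return np.sign(F)
```

Each round re-weights the data so that misclassified points receive exponentially more weight, and the final classifier is the weighted vote sign of the sum of c_m f_m(x).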

Performance of Discrete AdaBoost

Re-sampling in AdaBoost. Connection with bagging: bagging is a variance-reduction technique. Is boosting also a variance-reduction technique? Boosting performs comparably well when a weighted tree-growing algorithm is used rather than weighted resampling, removing the randomization component. Stumps have low variance but high bias. Boosting is capable of both bias and variance reduction.

Real AdaBoost
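For comparison, a sketch of Real AdaBoost under the same conventions; `fit_prob` stands for any weighted class-probability estimator (an assumed interface, e.g. weighted class proportions in the leaves of a small tree):

```python
import numpy as np

def real_adaboost(X, y, fit_prob, M=50, eps=1e-6):
    """Real AdaBoost sketch, y in {-1, +1}.
    fit_prob(X, y, w) is any weighted class-probability learner returning
    a function p(X) = estimated P(y=1|x) -- an assumed interface."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    half_logits = []
    for _ in range(M):
        p = fit_prob(X, y, w)                     # weighted probability estimate
        def f(Xq, p=p):                           # f_m(x) = 1/2 log p/(1-p)
            pm = np.clip(p(Xq), eps, 1 - eps)
            return 0.5 * np.log(pm / (1 - pm))
        half_logits.append(f)
        w *= np.exp(-y * f(X))                    # w_i <- w_i * exp(-y_i f_m(x_i))
        w /= w.sum()
    return lambda Xq: np.sign(sum(f(Xq) for f in half_logits))
```

The contribution of each stage is half the logit of the weighted probability estimate, rather than a scaled +/-1 vote as in Discrete AdaBoost.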

Statistical Interpretation of AdaBoost. Fitting an additive model by minimizing squared-error loss in a forward stagewise manner: at the m-th stage, fix $F_{m-1}(x) = \sum_{k=1}^{m-1} c_k f_k(x)$ and minimize the squared error $E\,[\,y - F_{m-1}(x) - c_m f_m(x)\,]^2$ to obtain $c_m f_m(x)$. AdaBoost in fact fits an additive model using a criterion similar to, but not the same as, the binomial log-likelihood, i.e. a better loss function for classification than squared error.

A brief history of boosting. The first simple boosting procedure was developed in the PAC-learning framework ("The Strength of Weak Learnability", Schapire). An initial classifier h1 is learned on the first N training points; h2 is then learned on a new sample of N points, half of which are misclassified by h1; h3 is learned on N points on which h1 and h2 disagree. The boosted classifier is hB = Majority Vote(h1, h2, h3).

Additive Models: additive regression models, extended additive models, classification problems.

Additive Regression Models. Modeling the mean $E[y \mid x] = F(x)$ with the additive model $F(x) = \sum_{j=1}^{p} f_j(x_j)$: there is a separate function $f_j(x_j)$ for each of the p input variables $x_j$. The backfitting algorithm is a modular "Gauss-Seidel" algorithm for fitting additive models, with update $f_j(x_j) \leftarrow E\,[\,y - \sum_{k \ne j} f_k(x_k) \mid x_j\,]$. Backfitting cycles are repeated until convergence; backfitting converges to the minimizer of $E\,[\,y - F(x)\,]^2$ under fairly general conditions.
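A compact sketch of the backfitting cycle, with `smoother` standing in for any univariate smoother (an assumed interface); the running-mean smoother below is only an illustrative choice:

```python
import numpy as np

def backfit(X, y, smoother, n_cycles=20):
    """Backfitting for an additive model y ~ alpha + sum_j f_j(x_j).
    smoother(x, r) returns fitted values of r at the points x."""
    n, p = X.shape
    alpha = y.mean()
    fj = np.zeros((n, p))                       # current fitted component values
    for _ in range(n_cycles):
        for j in range(p):
            partial_resid = y - alpha - fj.sum(axis=1) + fj[:, j]
            fj[:, j] = smoother(X[:, j], partial_resid)
            fj[:, j] -= fj[:, j].mean()         # center each component
    return alpha, fj

def running_mean_smoother(x, r, k=30):
    """Illustrative univariate smoother: mean of r over the k nearest x-values."""
    order = np.argsort(x)
    fitted = np.empty_like(r, dtype=float)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k // 2), min(len(x), rank + k // 2 + 1)
        fitted[i] = r[order[lo:hi]].mean()
    return fitted
```

Any smoother that returns fitted values at the training points (running mean, smoothing spline, local regression) can be plugged in without changing the cycle.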

Extended Additive Models (1). Additive models whose elements are functions of potentially all of the input features x: if we set $f_m(x) = \beta_m b(x; \gamma_m)$, a simple parameterized function of x, the generalized backfitting algorithm updates $\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta,\gamma} E\,[\,y - \sum_{k \ne m} \beta_k b(x; \gamma_k) - \beta\, b(x; \gamma)\,]^2$. The greedy forward stepwise approach instead updates $\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta,\gamma} E\,[\,y - F_{m-1}(x) - \beta\, b(x; \gamma)\,]^2$, where $F_{m-1}(x) = \sum_{k=1}^{m-1} \beta_k b(x; \gamma_k)$.

Extended Additive Models (2). An algorithm for fitting a single weak learner $\beta\, b(x; \gamma)$ to data can be applied repeatedly in the forward stepwise procedure; this can be viewed as a procedure for boosting a weak learner to form a powerful committee (see the sketch below).
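A hedged sketch of the greedy forward stagewise procedure under squared-error loss; `fit_base` stands for any weak regression learner fitted to the current residuals (an assumed interface):

```python
import numpy as np

def forward_stagewise(X, y, fit_base, M=100):
    """Greedy forward stagewise fitting of F(x) = sum_m beta_m b(x; gamma_m):
    each stage fits the base learner to the current residuals and adds it.
    fit_base(X, r) returns a prediction function."""
    F = np.zeros(len(y))
    members = []
    for _ in range(M):
        resid = y - F                       # residuals of the current fit
        b = fit_base(X, resid)              # weak learner fitted to residuals
        F += b(X)
        members.append(b)
    return lambda Xq: sum(b(Xq) for b in members)
```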

Classification problems. Additive logistic regression models the log-odds additively, $\log\frac{P(y=1 \mid x)}{P(y=-1 \mid x)} = \sum_{j=1}^{p} f_j(x_j) = F(x)$; inverting gives $P(y=1 \mid x) = \frac{e^{F(x)}}{1 + e^{F(x)}}$. These models are usually fit by maximizing the binomial log-likelihood.

AdaBoost – an Additive Logistic Regression Model. AdaBoost can be interpreted as a stage-wise estimation procedure for fitting an additive logistic regression model: it optimizes an exponential criterion which, to second order, is equivalent to the binomial log-likelihood criterion. The paper then proposes a more standard likelihood-based boosting procedure (LogitBoost).

An Exponential Criterion (1). Minimizing the criterion $J(F) = E\,[\,e^{-yF(x)}\,]$: the function F(x) that minimizes J(F) is the symmetric logistic transform of $P(y=1 \mid x)$, namely $F^{*}(x) = \frac{1}{2}\log\frac{P(y=1 \mid x)}{P(y=-1 \mid x)}$. This can be proved by setting the derivative to zero.
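Written out (the slide's formula images are not in the transcript), the pointwise argument is:

```latex
\begin{align*}
E\!\left[e^{-yF(x)} \mid x\right]
  &= P(y{=}1 \mid x)\, e^{-F(x)} + P(y{=}{-}1 \mid x)\, e^{F(x)},\\
\frac{\partial}{\partial F(x)}\, E\!\left[e^{-yF(x)} \mid x\right]
  &= -P(y{=}1 \mid x)\, e^{-F(x)} + P(y{=}{-}1 \mid x)\, e^{F(x)} \;=\; 0\\
\Longrightarrow\quad
F^{*}(x) &= \tfrac{1}{2}\,\log\frac{P(y{=}1 \mid x)}{P(y{=}{-}1 \mid x)} .
\end{align*}
```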

An Exponential Criterion (2). The usual logistic model takes $P(y=1 \mid x) = \frac{e^{F(x)}}{1+e^{F(x)}}$, while the minimizer above corresponds to $P(y=1 \mid x) = \frac{e^{2F(x)}}{1+e^{2F(x)}}$, i.e. the two differ only by a factor of 2 in F. Result: the Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing $E\,[\,e^{-yF(x)}\,]$.

Derivation. Expanding the criterion to second order about $f(x)=0$ (using $y^2 = f(x)^2 = 1$): $J(F + cf) = E\,[\,e^{-yF(x)}\,e^{-ycf(x)}\,] \approx E\,[\,e^{-yF(x)}\,(1 - ycf(x) + c^2/2)\,]$ (a), where the weights are $w(x,y) = e^{-yF(x)}$. For c > 0, minimizing (a) over $f(x) \in \{-1, 1\}$ is equivalent to maximizing $E_w[\,y f(x)\,]$.

Continued… The solution is $f(x) = 1$ if $E_w[y \mid x] = P_w(y=1 \mid x) - P_w(y=-1 \mid x) > 0$ and $f(x) = -1$ otherwise. Note that $-E_w[\,y f(x)\,] = \frac{1}{2}E_w[\,(y - f(x))^2\,] - 1$ (again using $y^2 = f^2 = 1$), so minimizing a quadratic approximation to the criterion leads to a weighted least-squares choice of f(x). Minimizing $J(F + cf)$ over c then gives $c = \frac{1}{2}\log\frac{1 - \mathrm{err}}{\mathrm{err}}$, where $\mathrm{err} = E_w[\,1_{[y \ne f(x)]}\,]$.

Update for F(x). Since $F(x) \leftarrow F(x) + c f(x)$, the weights update as $w(x,y) \leftarrow w(x,y)\,e^{-c y f(x)}$, and because $-y f(x) = 2 \cdot 1_{[y \ne f(x)]} - 1$, this is equivalent to $w(x,y) \leftarrow w(x,y)\,e^{2c\,1_{[y \ne f(x)]}}\,e^{-c}$ (the factor $e^{-c}$ is common to all weights). The function and weight updates are therefore of identical form to those used in Discrete AdaBoost.

Corollary After each update to the weights, the weighted misclassification error of the most recent weak learner is 50% Weights are updated to make the new weighted problem maximally difficult for the next weak learner

Derivation. The Real AdaBoost algorithm fits an additive logistic regression model by stage-wise and approximate optimization of $J(F) = E\,[\,e^{-yF(x)}\,]$: conditionally on x, $E\,[\,e^{-y(F(x)+f(x))} \mid x\,] = e^{-f(x)}\,E\,[\,e^{-yF(x)}\,1_{[y=1]} \mid x\,] + e^{f(x)}\,E\,[\,e^{-yF(x)}\,1_{[y=-1]} \mid x\,]$. Dividing through by $E\,[\,e^{-yF(x)} \mid x\,]$ and setting the derivative with respect to f(x) to zero gives $f(x) = \frac{1}{2}\log\frac{P_w(y=1 \mid x)}{P_w(y=-1 \mid x)}$, with weights $w(x,y) = e^{-yF(x)}$.

Corollary At the optimal F(x), the weighted conditional mean of y is 0.

Why $E\,[\,e^{-yF(x)}\,]$? The population minimizers of $E\,[\,e^{-yF(x)}\,]$ and of the negative binomial log-likelihood $E\,[\,\log(1 + e^{-2yF(x)})\,]$ coincide.

Losses as Approximations to Misclassification Error
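The slide shows the paper's figure comparing these losses as functions of the margin yF(x); written out they are:

```latex
\begin{align*}
\text{misclassification:} &\quad \mathbf{1}\!\left[\,yF(x) < 0\,\right],\\
\text{exponential:}       &\quad e^{-yF(x)},\\
\text{binomial deviance:} &\quad \log\!\bigl(1 + e^{-2yF(x)}\bigr),\\
\text{squared error:}     &\quad \bigl(y - F(x)\bigr)^{2} = \bigl(1 - yF(x)\bigr)^{2}.
\end{align*}
```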

Direct optimization of the binomial log-likelihood: fitting additive logistic regression models by stage-wise optimization of the Bernoulli log-likelihood (LogitBoost).

Derivation of LogitBoost (1). With $y^{*} = (y+1)/2 \in \{0,1\}$ and $p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}$, the expected log-likelihood is $E\,[\,2y^{*}F(x) - \log(1 + e^{2F(x)})\,]$, and the Newton update is $F(x) \leftarrow F(x) + \frac{1}{2}\,\frac{E\,[\,y^{*} - p(x) \mid x\,]}{E\,[\,p(x)(1 - p(x)) \mid x\,]}$, where the numerator is the conditional score and the denominator the (negated) conditional Hessian.

Derivation of LogitBoost (2). Equivalently, the Newton update f(x) solves the weighted least-squares approximation to the log-likelihood, $\min_{f} E_w[\,(z - f(x))^2\,]$, with working response $z = \frac{y^{*} - p(x)}{p(x)(1 - p(x))}$ and weights $w(x,y) = p(x)(1 - p(x))$; the update is $F(x) \leftarrow F(x) + \frac{1}{2} f(x)$.
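A minimal sketch of the resulting two-class LogitBoost loop; `fit_wls` is any weighted least-squares base learner (an assumed interface), and the clipping of the working response is the usual numerical safeguard:

```python
import numpy as np

def logitboost(X, y, fit_wls, M=50, zmax=4.0):
    """Two-class LogitBoost sketch, y in {-1, +1}.
    fit_wls(X, z, w) is a weighted least-squares base learner
    (e.g. a regression stump) returning a prediction function."""
    ystar = (y + 1) / 2                            # recode to {0, 1}
    F = np.zeros(len(y))
    members = []
    for _ in range(M):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))         # p(x) = e^F / (e^F + e^-F)
        w = np.clip(p * (1 - p), 1e-8, None)       # Newton weights
        z = np.clip((ystar - p) / w, -zmax, zmax)  # working response, clipped
        f = fit_wls(X, z, w)                       # weighted least-squares fit
        F += 0.5 * f(X)                            # half Newton step
        members.append(f)
    return lambda Xq: np.sign(sum(0.5 * f(Xq) for f in members))
```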

Optimizing $E\,[\,e^{-yF(x)}\,]$ by Newton stepping: the paper proposes the "Gentle AdaBoost" procedure, which instead takes adaptive Newton steps on the exponential criterion, much like the LogitBoost algorithm just described.

Derivation. The Gentle AdaBoost algorithm uses Newton steps for minimizing $E\,[\,e^{-yF(x)}\,]$; the Newton update is $F(x) \leftarrow F(x) + E_w[\,y \mid x\,]$, i.e. $f_m(x) = E_w[y \mid x] = P_w(y=1 \mid x) - P_w(y=-1 \mid x)$, with weights $w(x,y) = e^{-yF(x)}$.
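A corresponding sketch of Gentle AdaBoost; again `fit_wls` is an assumed weighted least-squares interface whose fit approximates $E_w[y \mid x]$:

```python
import numpy as np

def gentle_adaboost(X, y, fit_wls, M=50):
    """Gentle AdaBoost sketch, y in {-1, +1}: Newton steps on E[e^{-yF(x)}].
    fit_wls(X, y, w) returns a prediction function f with f(x) ~ E_w[y|x]."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    members = []
    for _ in range(M):
        f = fit_wls(X, y, w)          # f_m(x) approximates E_w[y | x]
        members.append(f)
        w *= np.exp(-y * f(X))        # w_i <- w_i * exp(-y_i f_m(x_i))
        w /= w.sum()
    return lambda Xq: np.sign(sum(f(Xq) for f in members))
```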

Comparison with Real AdaBoost. The update in Gentle AdaBoost is $f_m(x) = P_w(y=1 \mid x) - P_w(y=-1 \mid x)$, whereas the update in Real AdaBoost is $f_m(x) = \frac{1}{2}\log\frac{P_w(y=1 \mid x)}{P_w(y=-1 \mid x)}$. Log-ratios can be numerically unstable, leading to very large updates in pure regions. Empirical evidence suggests that this more conservative algorithm has performance similar to both the Real AdaBoost and LogitBoost algorithms.

Simulation Studies. Four boosting methods are compared here: DAB (Discrete AdaBoost), RAB (Real AdaBoost), LB (LogitBoost), and GAB (Gentle AdaBoost).

Data Generation. All of the simulated examples involve fairly complex decision boundaries. Ten input features are randomly drawn from a 10-dimensional standard normal distribution, with approximately 1000 training observations in each of the three classes (C1, C2, C3) and 10,000 observations in the test set; results are averaged over 10 such independently drawn training/test set combinations.
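As a hedged illustration only (the exact class definitions appeared as a formula image), the nested-spheres setting of the first example might be generated along these lines; the chi-square quantile thresholds below are an assumption chosen to make the classes roughly equal-sized:

```python
import numpy as np
from scipy.stats import chi2

def make_nested_spheres(n_per_class=1000, p=10, seed=0):
    """Hedged sketch of the first simulation setting: ten standard normal
    features, three classes defined by which band sum_j x_j^2 falls into
    (band edges taken here as the 1/3 and 2/3 quantiles of chi-square_p,
    an assumption, so the classes are roughly equal-sized)."""
    rng = np.random.default_rng(seed)
    n = 3 * n_per_class
    X = rng.standard_normal((n, p))
    r2 = (X ** 2).sum(axis=1)
    edges = chi2.ppf([1 / 3, 2 / 3], df=p)
    y = np.digitize(r2, edges)      # 0, 1, 2: inner sphere, middle shell, outer shell
    return X, y
```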

Additive Decision Boundary (1)

Additive Decision Boundary (2)

Additive Decision Boundary (3)

Boosting trees with 8 terminal nodes: the error decreases more rapidly than for stumps, and the overall performance of DAB is much improved.

Analysis. The optimal decision boundary for the above examples is also additive in the original features, with components proportional to $x_j^2$. For RAB, GAB, and LB the error rate using the bigger trees is in fact 33% higher than that for stumps at 800 iterations, even though the former is four times more complex. For non-additive decision boundaries, boosting stumps would be less advantageous than using larger trees.

Non-additive Decision Boundaries. Higher-order basis functions provide the possibility of more accurately estimating decision boundaries with high-order interactions. Data generation: two classes, 5000 training observations drawn from a 10-dimensional standard normal distribution, with class labels randomly assigned to each observation according to a log-odds function involving interactions among the features.

Non-additive Decision Boundaries (2). None of these boosting methods performs well with stumps on this problem, the best error rate being 0.35.

Non-additive Decision Boundaries (3) Small differentiation between RAB and GAB

Analysis. Boosting stumps can sometimes be superior to using larger trees when the decision boundary can be closely approximated by a function that is additive in the original predictor features. Such complicated non-additive boundaries are not likely to occur often in practice.

Some Experiments with Real-World Data. Datasets from the UC Irvine machine learning archive plus a popular simulated dataset. The real-data examples fail to demonstrate performance differences between the various boosting methods.

Additive Logistic Trees. ANOVA decomposition: $F(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots$. Allowing the base classifier to produce higher-order interactions can reduce the accuracy of the final boosted model; higher-order interactions are produced by deeper trees. The maximum tree depth therefore becomes a "meta-parameter" of the procedure, to be estimated by some model selection technique such as cross-validation.

Additive Logistic Trees (2). Trees are grown until a maximum number M of terminal nodes is reached. "Additive logistic trees" (ALT) is the combination of such truncated best-first trees with boosting. Another advantage of low-order approximations is model visualization.

Weight Trimming. Training observations with weight $w_i$ less than a threshold are removed from the sample passed to the base learner at that iteration; observations deleted at a particular iteration may therefore re-enter at later iterations. Weight trimming sometimes gives LogitBoost an advantage: its weights $p(x_i)(1 - p(x_i))$ measure nearness to the currently estimated decision boundary, whereas for the other three procedures the weight $e^{-y_i F(x_i)}$ is monotone in $-y_i F(x_i)$, so the subsample passed to the base learner can be highly unbalanced (see the sketch below).
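A hedged sketch of one way to implement the trimming step inside any of the boosting loops above; the exact thresholding rule used in the paper may differ in detail:

```python
import numpy as np

def trim_weights(w, alpha=0.1):
    """Weight-trimming sketch: keep the observations carrying the top
    (1 - alpha) fraction of the total weight at this iteration; the dropped
    ones keep their weights and may re-enter later. Returns a boolean mask
    of observations to pass to the base learner."""
    order = np.argsort(w)                     # ascending
    cum = np.cumsum(w[order]) / w.sum()
    keep_sorted = cum > alpha                 # drop the smallest weights summing to <= alpha
    keep = np.zeros_like(w, dtype=bool)
    keep[order] = keep_sorted
    return keep

# Usage inside a boosting loop (sketch):
# mask = trim_weights(w, alpha=0.05)
# f = fit_wls(X[mask], y[mask], w[mask])
```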

The test error for the letter-recognition problem. Black: using all the training data; red dashed: using the subset selected by weight thresholding.

Further Generalizations of Boosting. The Newton step can be replaced by a gradient step, slowing down the fitting procedure and reducing susceptibility to overfitting. Any smooth loss function can be used.

Concluding Remarks. Bagging and randomized trees are "variance"-reducing techniques, while boosting appears to be mainly a "bias"-reducing procedure. Boosting seems resistant to overfitting: as the LogitBoost iterations proceed, the overall impact of the changes introduced by $f_m(x)$ decreases; the stage-wise nature of the boosting algorithms does not allow the full collection of parameters to be jointly fit, and thus has far lower variance than the full parameterization might suggest; and classifiers are hurt less by overfitting than other function estimators.