An Introduction to Boosting Yoav Freund Banter Inc.

Plan of talk: Generative vs. non-generative modeling; Boosting; Alternating decision trees; Boosting and over-fitting; Applications.

Toy example: a computer receives a telephone call, measures the pitch of the caller's voice, and decides whether the caller is male or female.

Generative modeling: fit a probability model to each class. [Figure: probability as a function of voice pitch, with one distribution (mean1, var1) for one class and another (mean2, var2) for the other.]

Discriminative approach: choose the decision boundary directly. [Figure: number of mistakes as a function of the threshold on voice pitch.]

Ill-behaved data. [Figure: probability and number of mistakes as functions of voice pitch, with the fitted class means mean1 and mean2.]

Traditional statistics vs. machine learning. [Figure: statistics maps data to an estimated world state, and decision theory maps that estimate to actions; machine learning maps data directly to predictions and actions.]

Comparison of methodologies:

  Model                 Generative              Discriminative
  Goal                  Probability estimates   Classification rule
  Performance measure   Likelihood              Misclassification rate
  Mismatch problems     Outliers                Misclassifications

A weak learner: it receives a weighted training set (x1,y1,w1), (x2,y2,w2), …, (xn,yn,wn), where the xi are feature vectors, the yi are binary labels, and the wi are non-negative weights summing to 1, and it outputs a weak rule h mapping instances to labels. The weak requirement: the weighted error of h must be slightly better than random guessing, i.e. at most 1/2 minus some advantage γ > 0.

The boosting process: the weak learner is called repeatedly. In round 1 it sees the training set with uniform weights (x1,y1,1/n), …, (xn,yn,1/n) and returns h1; in each later round t it sees a reweighted set (x1,y1,w1), …, (xn,yn,wn) and returns ht. After T rounds the final rule is Sign[α_1 h_1(x) + α_2 h_2(x) + … + α_T h_T(x)].

Adaboost. Binary labels y = -1, +1. margin(x,y) = y · Σ_t α_t h_t(x). P(x,y) = (1/Z) exp(-margin(x,y)). Given h_t, we choose α_t to minimize Σ_(x,y) exp(-margin(x,y)).
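To make the update concrete, here is a minimal AdaBoost sketch in Python. The closed-form choice α_t = ½ ln((1 - ε_t)/ε_t) is the standard minimizer of the exponential loss for a given h_t; the weak_learner callable and all other names are illustrative assumptions, not code from the talk.

```python
# A minimal AdaBoost sketch. Assumes binary labels y in {-1, +1} and a
# weak_learner(X, y, w) callable returning a hypothesis h with h(x) in {-1, +1}.
import numpy as np

def adaboost(X, y, weak_learner, T):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # round 1: uniform weights
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)                # weak rule for the current weights
        pred = np.array([h(x) for x in X])
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error of h
        alpha = 0.5 * np.log((1 - eps) / eps)    # minimizes the exponential loss
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes, down-weight correct
        w /= w.sum()                             # keep the weights summing to 1
        hypotheses.append(h)
        alphas.append(alpha)

    def final_rule(x):
        # Final rule: Sign[alpha_1 h_1(x) + ... + alpha_T h_T(x)]
        return np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses)))

    return final_rule
```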

Main property of Adaboost: if the advantages of the weak rules over random guessing are γ_1, γ_2, …, γ_T, then the in-sample error of the final rule (w.r.t. the initial weights) is at most Π_t √(1 - 4γ_t²) ≤ exp(-2 Σ_t γ_t²).
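As a quick sanity check on the bound, with hypothetical numbers (a constant advantage γ_t = 0.1 over T = 200 rounds) both forms can be evaluated directly:

```python
import numpy as np

gammas = np.full(200, 0.1)                     # hypothetical advantages over random guessing
print(np.prod(np.sqrt(1 - 4 * gammas ** 2)))   # product form of the bound, ~0.017
print(np.exp(-2 * np.sum(gammas ** 2)))        # looser exponential form, ~0.018
```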

A demo

Adaboost as gradient descent. Discriminator class: a linear discriminator in the space of "weak hypotheses". Original goal: find the hyperplane with the smallest number of mistakes – known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space). Computational method: use the exponential loss as a surrogate and perform gradient descent.

Margins view. [Figure: examples projected onto the weight vector w, separating correct predictions from mistakes, and the cumulative number of examples as a function of margin.]

Adaboost et al. [Figure: loss as a function of margin for Adaboost, Logitboost, and Brownboost, compared with the 0-1 loss; mistakes correspond to negative margins, correct predictions to positive margins.]

One coordinate at a time. Adaboost performs gradient descent on the exponential loss, adding one coordinate ("weak learner") at each iteration. Weak learning in binary classification = slightly better than random guessing; weak learning in regression – unclear. Adaboost uses example weights to communicate the gradient direction to the weak learner, which solves a computational problem.
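As an aside on how the weights encode the gradient, here is a small Python sketch (the function name and the sample margins are my own illustration): the weight handed to the weak learner for each example is proportional to exp(-margin), which is exactly the per-example derivative of the total exponential loss.

```python
import numpy as np

def example_weights(margins):
    """Weights AdaBoost hands to the weak learner, given the current margins
    y_i * sum_t alpha_t h_t(x_i): proportional to exp(-margin), i.e. to the
    per-example gradient of the total exponential loss sum_i exp(-margin_i)."""
    w = np.exp(-np.asarray(margins, dtype=float))
    return w / w.sum()

# Mistakes (negative margins) and barely-correct examples dominate the next round.
print(example_weights([2.0, 0.5, -1.0]))   # ~[0.04, 0.18, 0.79]
```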

What is a good weak learner? The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label, yet small enough to allow exhaustive search for the minimal weighted training error and small enough to avoid over-fitting. It should be possible to calculate the predicted label very efficiently. Rules can be "specialists" – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
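A decision stump – a threshold test on a single feature – is a common weak learner that fits these criteria, since the minimal weighted training error can be found by exhaustive search. A sketch under that assumption (names and structure are my own):

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustive search over (feature, threshold, direction) for the decision
    stump with minimal weighted training error. X: (n, d) array, y in {-1, +1},
    w: non-negative weights summing to 1."""
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)                       # (feature, threshold, sign, error)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] > thr, s, -s)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, s, err)
    j, thr, s, _ = best
    return lambda x: s if x[j] > thr else -s         # the weak rule h(x) in {-1, +1}
```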

Alternating Trees Joint work with Llew Mason

Decision Trees. [Figure: a decision tree that tests X>3 at the root and Y>5 below, with leaves labeled +1 and -1, and the corresponding axis-parallel partition of the (X, Y) plane.]

Decision tree as a sum. [Figure: the same tree rewritten as a sum of real-valued scores (e.g. -0.2 at the root) attached to the X> and Y> decisions; the predicted label is the sign of the total score.]

An alternating decision tree. [Figure: the tree above extended with an additional splitter node (Y<1, contributing 0.0 on one branch); prediction nodes hold real values, and the final prediction is the sign of the sum of the values along all paths consistent with the example.]
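To illustrate how such a tree produces a prediction, here is a simplified sketch in which an alternating decision tree is represented as a root value plus a set of splitter nodes, each with a precondition, a test, and two real-valued contributions; the data structure and the toy numbers are my own illustration, not the ADT implementation from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SplitterNode:
    precondition: Callable[[Dict[str, float]], bool]  # path condition that must hold
    test: Callable[[Dict[str, float]], bool]          # e.g. x["Y"] > 5
    value_if_true: float                              # prediction-node value if test holds
    value_if_false: float                             # prediction-node value otherwise

def adt_score(x: Dict[str, float], root_value: float, nodes: List[SplitterNode]) -> float:
    score = root_value
    for node in nodes:
        if node.precondition(x):                      # only nodes on consistent paths contribute
            score += node.value_if_true if node.test(x) else node.value_if_false
    return score                                      # predicted label = sign(score)

# Toy tree in the spirit of the figure; all numbers are made up.
nodes = [
    SplitterNode(lambda x: True, lambda x: x["Y"] > 5, +0.4, -0.3),
    SplitterNode(lambda x: True, lambda x: x["X"] > 3, +0.2, -0.1),
]
print(adt_score({"X": 4.0, "Y": 2.0}, root_value=-0.2, nodes=nodes))   # -0.3 -> predict -1
```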

Example: medical diagnostics. The Cleve (Cleveland heart-disease) dataset from the UC Irvine database: heart-disease diagnostics (+1 = healthy, -1 = sick), 13 features from tests (real-valued and discrete), 303 instances.

ADtree for the Cleveland heart-disease diagnostics problem. [Figure]

Cross-validated accuracy:

  Learning algorithm   Number of splits   Average test error   Test error variance
  ADtree               6                  17.0%                0.6%
  C5.0                 …                  …                    0.5%
  C5.0 + boosting      …                  …                    0.5%
  Boost Stumps         …                  …                    0.8%

Curious phenomenon: boosting decision trees. [Figure: error curves for boosted decision trees; the combined classifier uses about 2,000,000 parameters yet the test error does not degrade.]

Explanation using margins. [Figure: the 0-1 loss as a function of margin.]

Explanation using margins. [Figure: the 0-1 loss as a function of margin and the margin distribution after boosting – no examples with small margins!!]

Experimental Evidence

Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998). For any convex combination of weak rules and any threshold θ > 0, the probability of a mistake is bounded by the fraction of training examples with margin at most θ plus a term that grows with the VC dimension of the weak rules and shrinks with the size of the training sample – with no dependence on the number of weak rules that are combined!!!
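Spelled out (up to constants and logarithmic factors; this is a common rendering of the bound, not a verbatim copy of the slide):

```latex
% Margin bound (Schapire, Freund, Bartlett & Lee 1998), stated schematically:
% for a convex combination f(x) = \sum_t a_t h_t(x), a_t \ge 0, \sum_t a_t = 1,
% and any margin threshold \theta > 0,
\[
  \Pr_{(x,y)\sim D}\bigl[\, y f(x) \le 0 \,\bigr]
  \;\le\;
  \Pr_{(x,y)\sim S}\bigl[\, y f(x) \le \theta \,\bigr]
  \;+\;
  \tilde{O}\!\left(\sqrt{\tfrac{d}{m\,\theta^{2}}}\right),
\]
% where S is the training sample of size m and d is the VC dimension of the
% weak rules; the bound has no dependence on the number of rules combined.
```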

Suggested optimization problem. [Figure: the trade-off in the bound as a function of the margin threshold θ.]

Idea of Proof

Applications of Boosting: academic research, applied research, commercial deployment.

Academic research: % test error rates

  Database    Other           Boosting   Error reduction
  Cleveland   27.2 (DT)       16.5       39%
  Promoters   22.0 (DT)       11.8       46%
  Letter      13.8 (DT)       3.5        74%
  Reuters 4   5.8, 6.0, …     …          ~60%
  Reuters 8   11.3, 12.1, …   …          ~40%

Applied research: "AT&T, How may I help you?" Classify voice requests: voice -> text -> category. Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time. (Schapire, Singer, Gorin 98)

Examples:
"Yes I'd like to place a collect call long distance please" → collect
"Operator I need to make a call but I need to bill it to my office" → third party
"Yes I'd like to place a call on my master card please" → calling card
"I just called a number in Sioux City and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill" → billing credit

Weak rules generated by "boostexter". [Figure: example weak rules, each testing whether a particular word occurs in the utterance and contributing scores toward categories such as calling card, collect call, and third party, with different scores depending on whether the word occurs or not.]
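In BoosTexter these weak rules are "specialists": each tests whether one word occurs in the utterance and outputs real-valued scores over the categories, with one set of scores when the word occurs and another when it does not. A sketch with invented words and scores (not rules from the deployed system):

```python
def word_rule(word, scores_if_present, scores_if_absent):
    """A BoosTexter-style specialist rule: per-category scores that depend on
    whether `word` occurs in the utterance. All scores here are invented."""
    def h(utterance):
        present = word in utterance.lower().split()
        return scores_if_present if present else scores_if_absent
    return h

# Hypothetical rule: the word "card" votes for 'calling card'.
rule = word_rule(
    "card",
    {"calling card": +1.2, "collect": -0.4, "third party": -0.3},
    {"calling card": -0.6, "collect": +0.1, "third party": +0.1},
)
print(rule("yes I'd like to place a call on my master card please"))
```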

Results: 7844 training examples (hand transcribed), 1000 test examples (hand / machine transcribed). Accuracy with 20% rejected – machine transcribed: 75%; hand transcribed: 90%.
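"Accuracy with 20% rejected" means the classifier abstains on the 20% of test utterances it is least confident about and is scored only on the rest. A sketch of that computation, assuming confidence is measured by the magnitude of the classifier's score (the function and variable names are illustrative):

```python
import numpy as np

def accuracy_with_rejection(scores, correct, reject_fraction=0.2):
    """Abstain on the least-confident fraction of examples (smallest |score|)
    and report accuracy on the remainder."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    n_keep = int(round(len(scores) * (1 - reject_fraction)))
    keep = np.argsort(-np.abs(scores))[:n_keep]      # keep the most confident examples
    return correct[keep].mean()

# e.g. accuracy_with_rejection(test_scores, test_predictions == test_labels)
```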

Commercial deployment: distinguish business/residence customers using statistics from call-detail records. Alternating decision trees – similar to boosting decision trees but more flexible; combines very simple rules; can over-fit, so cross-validation is used to stop. (Freund, Mason, Rogers, Pregibon, Cortes 2000)

Massive datasets: 260M calls / day, 230M telephone numbers, label unknown for ~30%. Hancock: software for computing statistical signatures. 100K randomly selected training examples (~10K is enough); training takes about 2 hours. The generated classifier has to be both accurate and efficient.

Alternating tree for "bizocity" (the business-vs.-residence score). [Figure]

Alternating Tree (Detail)

Precision/recall graphs. [Figure: accuracy as a function of score threshold.]

Business impact: increased coverage from 44% to 56% at ~94% accuracy. Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

Summary: Boosting is a computational method for learning accurate classifiers. Its resistance to over-fitting is explained by margins; the underlying explanation is large "neighborhoods" of good classifiers. Boosting has been applied successfully to a variety of classification problems.

Come talk with me!