1
Boosting Approach to ML
Perceptron, Margins, Kernels. Maria-Florina Balcan, 03/18/2015
2
Recap from last time: Boosting
A general method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor. Adaboost is one of the top 10 ML algorithms: it works amazingly well in practice and is backed up by solid foundations.
3
Adaboost (Adaptive Boosting)
Input: S = {(x_1, y_1), …, (x_m, y_m)}; x_i ∈ X, y_i ∈ Y = {−1, 1}; weak learning algorithm A (e.g., Naïve Bayes, decision stumps).
For t = 1, 2, …, T: construct a distribution D_t on {x_1, …, x_m}; run A on D_t, producing h_t : X → {−1, 1}.
Output the final hypothesis H_final(x) = sign(∑_{t=1}^T α_t h_t(x)).
Constructing D_t: D_1 is uniform on {x_1, …, x_m} [i.e., D_1(i) = 1/m]. Given D_t and h_t, set
D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t} if y_i = h_t(x_i),
D_{t+1}(i) = (D_t(i)/Z_t) · e^{α_t} if y_i ≠ h_t(x_i),
i.e., D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)}, where Z_t is the normalizer and α_t = (1/2) ln((1 − ε_t)/ε_t) > 0.
D_{t+1} puts half of its weight on examples x_i where h_t is incorrect and half on examples where h_t is correct.
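To make the procedure above concrete, here is a minimal Python sketch of Adaboost, assuming one-dimensional-threshold decision stumps as the weak learner A and NumPy arrays X (m×d) and y (±1 labels); the stump search and all function names are illustrative choices, not part of the slides.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner A: pick the decision stump with lowest weighted error under D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] >= thr, s, -s)
                err = np.sum(D[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, s)
    return best  # (weighted error eps_t, feature, threshold, sign)

def adaboost(X, y, T):
    """Run T rounds of Adaboost; return the list of (alpha_t, stump) tuples."""
    m = len(y)
    D = np.ones(m) / m                                  # D_1 uniform: D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, j, thr, s = train_stump(X, y, D)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)            # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)           # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        pred = np.where(X[:, j] >= thr, s, -s)          # h_t on the training set
        D = D * np.exp(-alpha * y * pred)               # D_{t+1}(i) ~ D_t(i) e^{-alpha_t y_i h_t(x_i)}
        D = D / D.sum()                                 # divide by the normalizer Z_t
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign(sum_t alpha_t h_t(x))."""
    votes = sum(a * np.where(X[:, j] >= thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(votes)
```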
4
Nice Features of Adaboost
Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps).
Very fast (a single pass through the data each round) and simple to code, with no parameters to tune.
Grounded in rich theory.
5
Analyzing Training Error
Theorem: Let ε_t = 1/2 − γ_t (the error of h_t over D_t). Then
err_S(H_final) ≤ exp(−2 ∑_t γ_t²).
So, if γ_t ≥ γ > 0 for all t, then err_S(H_final) ≤ exp(−2 γ² T).
The training error drops exponentially in T!
To get err_S(H_final) ≤ ε, we need only T = O((1/γ²) log(1/ε)) rounds.
Adaboost is adaptive: it does not need to know γ or T a priori, and it can exploit γ_t ≫ γ.
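Filling in the algebra behind the last claim (a one-line derivation, using the bound just stated):

```latex
\mathrm{err}_S(H_{\mathrm{final}}) \le e^{-2\gamma^2 T} \le \epsilon
\;\Longleftrightarrow\;
T \ge \frac{1}{2\gamma^2}\ln\frac{1}{\epsilon},
\qquad \text{so } T = O\!\left(\tfrac{1}{\gamma^2}\log\tfrac{1}{\epsilon}\right) \text{ rounds suffice.}
```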
6
Generalization Guarantees
Theorem: err_S(H_final) ≤ exp(−2 ∑_t γ_t²), where ε_t = 1/2 − γ_t.
How about generalization guarantees?
Original analysis [Freund & Schapire '97]: let H be the space of weak hypotheses and d = VCdim(H). H_final is a weighted vote, so the hypothesis class is
G = {all functions of the form sign(∑_{t=1}^T α_t h_t(x))}.
Theorem [Freund & Schapire '97]: for all g ∈ G, err(g) ≤ err_S(g) + Õ(√(Td/m)), where T = # of rounds.
Key reason: VCdim(G) = Õ(dT), plus typical VC bounds.
7
Generalization Guarantees
Theorem [Freund & Schapire '97]: for all g ∈ G, err(g) ≤ err_S(g) + Õ(√(Td/m)), where d = VCdim(H) and T = # of rounds. Here err(g) is the generalization error, err_S(g) is the training error, and Õ(√(Td/m)) is the complexity term.
8
Generalization Guarantees
Experiments showed that the test error of the generated classifier usually does not increase as its size becomes very large. Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!
9
Generalization Guarantees
Experiments showed that the test error of the generated classifier usually does not increase as its size becomes very large, and that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance. These results seem to contradict the FS'97 bound, ∀ g ∈ G, err(g) ≤ err_S(g) + Õ(√(Td/m)), and Occam's razor (in order to achieve good test error the classifier should be as simple as possible)!
10
How can we explain the experiments?
R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods". Key idea: training error does not tell the whole story; we also need to consider the classification confidence!
11
Boosting didnβt seem to overfitβ¦(!)
…because it turned out to be increasing the margin of the classifier. [Figure: training and test error curves for boosting vs. the test error of the base classifier (weak learner), and the margin distribution graph; plots from [SFBL98].]
12
Classification Margin
H is the space of weak hypotheses. The convex hull of H is
co(H) = { f = ∑_{t=1}^T α_t h_t : α_t ≥ 0, ∑_{t=1}^T α_t = 1, h_t ∈ H }.
Let f ∈ co(H), f = ∑_{t=1}^T α_t h_t, α_t ≥ 0, ∑_{t=1}^T α_t = 1. The majority-vote rule H_f given by f (i.e., H_f(x) = sign(f(x))) predicts wrongly on example (x, y) iff y f(x) ≤ 0.
Definition: the margin of H_f (or of f) on example (x, y) is y f(x).
y f(x) = y ∑_{t=1}^T α_t h_t(x) = ∑_{t=1}^T y α_t h_t(x) = ∑_{t: y = h_t(x)} α_t − ∑_{t: y ≠ h_t(x)} α_t.
The margin is positive iff y = H_f(x). View y f(x) = ±|f(x)| as the strength or confidence of the vote: values near 0 mean low confidence, values near +1 mean high confidence and correct, values near −1 mean high confidence but incorrect.
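A small Python sketch of this definition, assuming the weak hypotheses are ±1-valued functions and the vote weights are already normalized to sum to 1 (the stumps and names below are made up for illustration):

```python
def vote_margin(hypotheses, alphas, x, y):
    """Margin y*f(x) of the normalized vote f = sum_t alpha_t h_t on example (x, y)."""
    f_x = sum(a * h(x) for a, h in zip(alphas, hypotheses))
    return y * f_x   # in [-1, 1]: > 0 iff the vote is correct, |.| = confidence

# Example with three hypothetical weak hypotheses on a 1-d input:
hs = [lambda x: 1 if x > 0 else -1,
      lambda x: 1 if x > 2 else -1,
      lambda x: 1 if x > -1 else -1]
alphas = [0.5, 0.25, 0.25]                     # alpha_t >= 0, sum = 1
print(vote_margin(hs, alphas, x=1.0, y=+1))    # 0.5: correct vote, moderate confidence
```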
13
Boosting and Margins
Theorem: Let VCdim(H) = d. Then with probability ≥ 1 − δ, for all f ∈ co(H) and all θ > 0,
Pr_D[y f(x) ≤ 0] ≤ Pr_S[y f(x) ≤ θ] + O( √( (1/m) ( d ln²(m/d) / θ² + ln(1/δ) ) ) ).
Note: the bound does not depend on T (the number of rounds of boosting); it depends only on the complexity of the weak hypothesis space and the margin!
14
Boosting and Margins
Theorem: Let VCdim(H) = d. Then with probability ≥ 1 − δ, for all f ∈ co(H) and all θ > 0,
Pr_D[y f(x) ≤ 0] ≤ Pr_S[y f(x) ≤ θ] + O( √( (1/m) ( d ln²(m/d) / θ² + ln(1/δ) ) ) ).
If all training examples have large margins, then we can approximate the final classifier by a much smaller classifier. We can use this to prove that a better margin implies a smaller test error, regardless of the number of weak classifiers. One can also prove that boosting tends to increase the margin of training examples by concentrating on those of smallest margin. Although the final classifier is getting larger, margins are likely to be increasing, so the final classifier is actually getting closer to a simpler classifier, driving down test error.
16
Boosting, Adaboost Summary
Shift in mindset: the goal is now just to find classifiers a bit better than random guessing. Backed up by solid foundations. Adaboost and its variations work well in practice with many kinds of data (one of the top 10 ML algorithms). More about classic applications in Recitation. Relevant for the big-data age: it quickly focuses on the "core difficulties", so it is well-suited to distributed settings where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12]. Issues: noise. Weak learners…
17
Interestingly, the usefulness of margins has been recognized in Machine Learning since the late 50's.
The Perceptron [Rosenblatt '57] is analyzed via the geometric (aka L_2, L_2) margin. The original guarantee is in the online learning scenario.
18
The Perceptron Algorithm
Online Learning Model
Margin Analysis
Kernels
19
The Online Learning Model
Examples arrive sequentially; we need to make a prediction, and afterwards we observe the outcome. Phase i, for i = 1, 2, …: the online algorithm receives example x_i, makes a prediction h(x_i), and then observes the true label c*(x_i). This is the mistake-bound model: analysis-wise we make no distributional assumptions, and the goal is to minimize the number of mistakes.
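A minimal sketch of this protocol in Python; the predict/update learner interface below is an assumption for illustration, not something specified on the slide.

```python
def run_online(examples, learner, target):
    """Mistake-bound protocol: predict on x_i, observe c*(x_i), count mistakes."""
    mistakes = 0
    for x in examples:
        y_hat = learner.predict(x)     # prediction h(x_i)
        y = target(x)                  # observe the true label c*(x_i)
        if y_hat != y:
            mistakes += 1
        learner.update(x, y)           # the learner may revise its hypothesis
    return mistakes
```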
20
The Online Learning Model. Motivation
- Email classification (the distribution of both spam and regular mail changes over time, but the target function stays fixed: last year's spam still looks like spam).
- Recommendation systems: recommending movies, etc.
- Predicting whether a user will be interested in a new news article or not.
- Ad placement in a new market.
21
Linear Separators
Instance space X = R^d. Hypothesis class: linear decision surfaces in R^d, h(x) = w · x + w_0; if h(x) ≥ 0, label x as +, otherwise label it as −.
Claim: WLOG w_0 = 0.
Proof: we can simulate a non-zero threshold with a dummy input feature x_0 that is always set to 1:
x = (x_1, …, x_d) ↦ x̃ = (x_1, …, x_d, 1), and
w · x + w_0 ≥ 0 iff (w_1, …, w_d, w_0) · x̃ ≥ 0, where w = (w_1, …, w_d).
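A tiny illustration of this reduction in Python/NumPy (the particular w, w_0, and x below are made-up values):

```python
import numpy as np

w, w0 = np.array([2.0, -1.0]), 0.5      # hypothetical separator with a threshold
x = np.array([1.0, 3.0])                # hypothetical example

x_aug = np.append(x, 1.0)               # dummy feature x_0 = 1
w_aug = np.append(w, w0)                # fold the threshold into the weight vector

# The two decision rules agree on every x.
assert (w @ x + w0 >= 0) == (w_aug @ x_aug >= 0)
```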
22
Linear Separators: Perceptron Algorithm
Set t = 1, start with the all-zero vector w_1. Given example x, predict positive iff w_t · x ≥ 0. On a mistake, update as follows:
Mistake on positive: w_{t+1} ← w_t + x.
Mistake on negative: w_{t+1} ← w_t − x.
Note: w_t is a weighted sum of the incorrectly classified examples,
w_t = a_{i_1} x_{i_1} + … + a_{i_k} x_{i_k}, so w_t · x = a_{i_1} (x_{i_1} · x) + … + a_{i_k} (x_{i_k} · x).
This will be important when we talk about kernels.
23
Perceptron Algorithm: Example
[Figure] Training examples: (−1,2) labeled −, (1,0) labeled +, (1,1) labeled +, (−1,0) labeled −, (−1,−2) labeled −, (1,−1) labeled +; the algorithm makes mistakes (marked X in the figure) on (−1,2), (1,1), and (−1,−2).
Algorithm: set t = 1, start with the all-zeroes weight vector w_1. Given example x, predict positive iff w_t · x ≥ 0. On a mistake, update as follows: mistake on positive, w_{t+1} ← w_t + x; mistake on negative, w_{t+1} ← w_t − x.
Trace: w_1 = (0,0); w_2 = w_1 − (−1,2) = (1,−2); w_3 = w_2 + (1,1) = (2,−1); w_4 = w_3 − (−1,−2) = (3,1).
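Here is a minimal Python sketch of the algorithm that reproduces the trace above, assuming the examples are presented in the order (−1,2), (1,1), (−1,−2), then the remaining points (the presentation order is not stated on the slide, so this ordering is an assumption):

```python
import numpy as np

def perceptron(stream):
    """Online Perceptron: predict sign(w.x) (treating 0 as +), update on mistakes."""
    w = np.zeros(2)                         # w_1 = (0, 0)
    for x, y in stream:                     # y in {+1, -1}
        pred = 1 if w @ x >= 0 else -1
        if pred != y:                       # mistake
            w = w + y * x                   # +x on a positive mistake, -x on a negative one
            print("mistake on", x, "-> w =", w)
    return w

# Assumed presentation order; labels taken from the figure.
stream = [(np.array([-1.0, 2.0]), -1), (np.array([1.0, 1.0]), +1),
          (np.array([-1.0, -2.0]), -1), (np.array([1.0, 0.0]), +1),
          (np.array([-1.0, 0.0]), -1), (np.array([1.0, -1.0]), +1)]
perceptron(stream)   # prints w = (1,-2), (2,-1), (3,1), matching w_2, w_3, w_4
```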
24
Geometric Margin
Definition: The margin of an example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative of that distance if x is on the wrong side).
[Figure: the margins of a positive example x_1 and a negative example x_2 relative to the separator w.]
25
Geometric Margin
Definition: The margin of an example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative if on the wrong side).
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.
[Figure: separator w with margin γ_w between the closest + and − points.]
26
Geometric Margin
Definition: The margin of an example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative if on the wrong side).
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.
Definition: The margin γ of a set of examples S is the maximum of γ_w over all linear separators w.
[Figure: the maximum-margin separator w with margin γ.]
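A short Python sketch of these definitions, assuming labeled data (x_i, y_i) with y_i ∈ {+1, −1} and a separator through the origin (the helper name and toy data are illustrative):

```python
import numpy as np

def margin_of_set(X, y, w):
    """gamma_w: smallest signed distance y_i * (w . x_i) / ||w|| over the set."""
    return np.min(y * (X @ w) / np.linalg.norm(w))

# Toy data: the example points from the Perceptron slide.
X = np.array([[-1, 2], [1, 0], [1, 1], [-1, 0], [-1, -2], [1, -1]], dtype=float)
y = np.array([-1, 1, 1, -1, -1, 1])
print(margin_of_set(X, y, np.array([3.0, 1.0])))   # margin of this set w.r.t. w = (3, 1)
```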
27
Perceptron: Mistake Bound
Theorem: If the data has margin γ and all points lie inside a ball of radius R, then the Perceptron makes at most (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.)
[Figure: max-margin separator w*, margin γ, and the ball of radius R containing the data.]
28
Perceptron Algorithm: Analysis
Theorem: If the data has margin γ and all points lie inside a ball of radius R, then the Perceptron makes at most (R/γ)² mistakes.
Update rule: mistake on positive, w_{t+1} ← w_t + x; mistake on negative, w_{t+1} ← w_t − x.
Proof idea: analyze w_t · w* and ‖w_t‖, where w* is the maximum-margin separator with ‖w*‖ = 1.
Claim 1: w_{t+1} · w* ≥ w_t · w* + γ (because y(x · w*) ≥ γ for every example x).
Claim 2: ‖w_{t+1}‖² ≤ ‖w_t‖² + R² (by the Pythagorean theorem: on a mistake, y(w_t · x) ≤ 0, so ‖w_t + yx‖² ≤ ‖w_t‖² + ‖x‖² ≤ ‖w_t‖² + R²).
After M mistakes: w_{M+1} · w* ≥ γM (by Claim 1) and ‖w_{M+1}‖ ≤ R√M (by Claim 2); also w_{M+1} · w* ≤ ‖w_{M+1}‖ (since w* is unit length).
So γM ≤ R√M, hence M ≤ (R/γ)².
[Figure: the update from w_t to w_{t+1} by adding x, and the target separator w*.]
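A small numeric sanity check of this bound in Python, cycling the Perceptron over randomly generated linearly separable data with a known separator; the data generator and the cycling loop are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 2.0]) / np.sqrt(5.0)          # unit-length target separator
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ w_star) > 0.1]                       # keep only points with margin > 0.1
y = np.sign(X @ w_star)

gamma = np.min(y * (X @ w_star))                      # margin of the set w.r.t. w*
R = np.max(np.linalg.norm(X, axis=1))                 # radius of the enclosing ball

w, mistakes, converged = np.zeros(2), 0, False
while not converged:                                  # cycle until a full clean pass
    converged = True
    for x, label in zip(X, y):
        if (1 if w @ x >= 0 else -1) != label:        # mistake
            w, mistakes, converged = w + label * x, mistakes + 1, False

print(mistakes, "<=", (R / gamma) ** 2)               # the theorem guarantees this holds
```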
29
Perceptron Extensions
Can use it to find a consistent separator (by cycling through the data). One can convert the mistake-bound guarantee into a distributional guarantee too (for the case where the x_i's come from a fixed distribution). Can be adapted to the case where there is no perfect separator, as long as the so-called hinge loss (i.e., the total distance the points would need to be moved to be classified correctly with a large margin) is small. Can be kernelized to handle non-linear decision boundaries, as sketched below!
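Since w_t is a weighted sum of previously misclassified examples, predictions only need dot products, which can be replaced by a kernel. A minimal Python sketch of this dual ("kernelized") Perceptron; the RBF kernel choice and helper names are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) -- one example kernel choice."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_perceptron(stream, K):
    """Dual Perceptron: store the mistakes; predict sign(sum_i y_i K(x_i, x))."""
    mistakes = []                                   # list of (x_i, y_i) on which we erred
    for x, y in stream:
        score = sum(yi * K(xi, x) for xi, yi in mistakes)
        pred = 1 if score >= 0 else -1
        if pred != y:
            mistakes.append((x, y))                 # dual form of w <- w + y x
    return mistakes

def kernel_predict(mistakes, K, x):
    return 1 if sum(yi * K(xi, x) for xi, yi in mistakes) >= 0 else -1
```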
30
Perceptron Discussion
Simple online algorithm for learning linear separators with a nice guarantee that depends only on the geometric (aka πΏ 2 , πΏ 2 ) margin. It can be kernelized to handle non-linear decision boundaries --- see next class! Simple, but very useful in applications like Branch prediction; it also has interesting extensions to structured prediction.