Slide 1: Online Learning with a Memory Harness using the Forgetron
Shai Shalev-Shwartz, joint work with Ofer Dekel and Yoram Singer
The Hebrew University of Jerusalem, Israel
Large Scale Kernel Machines workshop, NIPS '05, Whistler
Slide 2: Overview
- Online learning with kernels
- Goal: a strict limit on the number of "support vectors"
- The Forgetron algorithm
- Analysis
- Experiments
Slide 3: Kernel-Based Perceptron for Online Learning
Current classifier: f_t(x) = Σ_{i ∈ I} y_i K(x_i, x)
On each round the online learner receives x_t, predicts sign(f_t(x_t)), and then observes the correct label y_t. The indices of the examples on which mistakes were made form the current active set, e.g. I = {1, 3}; the number of mistakes so far is M.
Slide 4: Kernel-Based Perceptron for Online Learning (continued)
On a mistake, the current example is added to the active set, e.g. I = {1, 3} becomes I = {1, 3, 4}, and prediction continues with the updated classifier.
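The kernel Perceptron of Slides 3-4 can be sketched in a few lines (a minimal sketch; the function and variable names are mine, not from the slides):

```python
# Kernel Perceptron: keep only the examples on which we erred (the
# active set I) and predict with f_t(x) = sum_{i in I} y_i K(x_i, x).

def kernel_perceptron(stream, K):
    """stream yields (x_t, y_t) pairs with y_t in {-1, +1}."""
    active = []        # list of (x_i, y_i): the active set I
    mistakes = 0
    for x_t, y_t in stream:
        f_t = sum(y_i * K(x_i, x_t) for x_i, y_i in active)
        y_hat = 1 if f_t >= 0 else -1
        if y_hat != y_t:               # mistake: add x_t to the active set
            active.append((x_t, y_t))
            mistakes += 1
    return active, mistakes

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))
```

Any positive-definite kernel can be passed for `K`; the linear kernel above is only for illustration.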
Slide 5: Learning on a Budget
- |I| = the number of mistakes made until round t
- Memory- and time-inefficient: |I| may grow unboundedly
- Goal: construct a kernel-based online algorithm for which |I| ≤ B on every round t, which still performs "well" and comes with a performance guarantee
Slide 6: Mistake Bound for the Perceptron
Let {(x_1, y_1), …, (x_T, y_T)} be a sequence of examples and K a kernel such that K(x_t, x_t) ≤ 1. Let g be a fixed competitor classifier in the RKHS, and define ℓ_t(g) = max(0, 1 - y_t g(x_t)). Then the number of mistakes M satisfies
M ≤ ||g||² + 2 Σ_t ℓ_t(g)
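The standard argument behind this bound can be sketched as follows (my reconstruction, consistent with the progress measure used later in the talk):

```latex
% On a mistake round, f' = f_t + y_t K(x_t, \cdot); since y_t f_t(x_t) \le 0
% and K(x_t, x_t) \le 1, the Perceptron step makes progress
\|f_t - g\|^2 - \|f' - g\|^2
  = 2 y_t g(x_t) - 2 y_t f_t(x_t) - K(x_t, x_t)
  \ge 2\bigl(1 - \ell_t(g)\bigr) - 1
  = 1 - 2\,\ell_t(g).
% Summing over the M mistake rounds and telescoping (f_1 = 0):
M - 2\sum_t \ell_t(g) \;\le\; \|f_1 - g\|^2 = \|g\|^2
\quad\Longrightarrow\quad
M \;\le\; \|g\|^2 + 2\sum_t \ell_t(g).
```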
Slide 7: Previous Work
- Crammer, Kandola & Singer (2003)
- Kivinen, Smola & Williamson (2004)
- Weston, Bordes & Bottou (2005)
Previous online budget algorithms do not provide a mistake bound. Is our goal attainable?
Slide 8: Mission Impossible
- Input space: {e_1, …, e_{B+1}}
- Linear kernel: K(e_i, e_j) = e_i · e_j = δ_{ij}
- Budget constraint: |I| ≤ B. Therefore there always exists a j such that Σ_{i ∈ I} α_i K(e_i, e_j) = 0, so an adversary can make us err on every round
- But the competitor g = Σ_i e_i never errs!
- The (unbudgeted) Perceptron makes only B + 1 mistakes
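The counter-example can be checked numerically (a sketch; the value of B and the adversarial choice of the uncovered index are mine):

```python
# Any budget-B active set over {e_1, ..., e_{B+1}} with the linear kernel
# must assign score 0 to some e_j, so an adversary can force a mistake there,
# while the competitor g = sum_i e_i scores +1 on every e_j.

B = 5
dim = B + 1
basis = [[1.0 if k == j else 0.0 for k in range(dim)] for j in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

active = list(range(B))          # any B of the B+1 indices
missing = next(j for j in range(dim) if j not in active)

score = sum(dot(basis[i], basis[missing]) for i in active)
g_score = dot([1.0] * dim, basis[missing])   # competitor g = sum_i e_i

print(score, g_score)   # 0.0 on the uncovered e_j; g scores 1.0
```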
Slide 9: Redefine the Goal
- We must restrict the competitor g somehow; one way is to restrict ||g||
- The counter-example implies that we cannot compete with ||g|| ≥ (B+1)^{1/2}
- Main result: the Forgetron algorithm can compete with any classifier g whose norm is bounded as ||g|| ≤ U, where U grows like ((B+1)/log(B+1))^{1/2} up to a constant
Slide 10: The Forgetron
Current classifier: f_t(x) = Σ_{i ∈ I} α_i y_i K(x_i, x)
On a mistake at round t:
- Step (1), Perceptron: I' = I ∪ {t}
- Step (2), shrinking: scale every weight, α_i ← φ_t α_i
- Step (3), remove oldest: drop r = min I'
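The three-step update can be sketched as follows (a sketch only: this slide does not give the formula for the shrinking coefficient φ_t, so here it is supplied by the caller):

```python
# Forgetron-style update on a mistake round: (1) Perceptron step,
# (2) shrink all weights by phi_t, (3) drop the oldest example if over budget.
from collections import deque

def forgetron_mistake_update(active, x_t, y_t, phi_t, budget):
    """active is a deque of [alpha_i, x_i, y_i], oldest entries first."""
    active.append([1.0, x_t, y_t])     # (1) Perceptron: add x_t with weight 1
    for entry in active:               # (2) shrinking: alpha_i <- phi_t * alpha_i
        entry[0] *= phi_t
    if len(active) > budget:           # (3) removal: discard the oldest example
        active.popleft()
    return active
```

A deque keeps both the append and the oldest-removal at O(1), matching the "remove oldest" rule.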
Slide 11: Shrinking, a Two-Edged Sword
- If φ_t is small, the weight α_r of the oldest example is small, so the "deviation" due to its removal is negligible
- But if φ_t is small, the "deviation" due to shrinking itself is large
- The Forgetron formalizes "deviation" and automatically balances this tradeoff
Slide 12: Quantifying Deviation
"Progress" measure: Δ_t = ||f_t - g||² - ||f_{t+1} - g||²
Measure the progress of each update step; "deviation" is negative progress. With f' denoting the classifier after the Perceptron step and f'' the classifier after shrinking:
Δ_t = (||f_t - g||² - ||f' - g||²) + (||f' - g||² - ||f'' - g||²) + (||f'' - g||² - ||f_{t+1} - g||²)
the three terms being the progress of the Perceptron, shrinking, and removal steps, respectively.
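The three-term decomposition is a telescoping identity, which a quick numeric check confirms (the vectors are arbitrary, chosen only for illustration):

```python
# Check that Delta_t = ||f_t - g||^2 - ||f_{t+1} - g||^2 splits exactly into
# the progress of the Perceptron, shrinking, and removal steps.

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

g      = [0.5, 1.0]
f_t    = [2.0, -1.0]     # before the update
f1     = [2.0, 0.0]      # f'  : after the Perceptron step
f2     = [1.0, 0.0]      # f'' : after shrinking
f_next = [1.0, 0.25]     # f_{t+1}: after removal

delta = sqdist(f_t, g) - sqdist(f_next, g)
parts = (sqdist(f_t, g) - sqdist(f1, g)) \
      + (sqdist(f1, g) - sqdist(f2, g)) \
      + (sqdist(f2, g) - sqdist(f_next, g))
print(abs(delta - parts) < 1e-12)   # True: the sum telescopes
```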
Slide 13: Quantifying Deviation (continued)
The Forgetron quantifies, on each mistake round, the gain from the Perceptron step, the damage from the shrinking step, and the damage from the removal step.
Slide 14: Resulting Mistake Bound
For any g satisfying the norm bound above, the number of prediction mistakes the Forgetron makes is at most
2 ||g||² + 4 Σ_t ℓ_t(g)
Slide 15: Small Deviation ⇒ Mistake Bound
Assume the total deviation is small, at most M/2. On each mistake, the Perceptron step makes progress
||f_t - g||² - ||f' - g||² ≥ 1 - 2 ℓ_t(g)
Slide 16: Small Deviation ⇒ Mistake Bound (continued)
- On one hand, positive progress toward a good competitor: Σ_t Δ_t ≥ M/2 - 2 Σ_t ℓ_t(g)
- On the other hand, the total possible progress is at most ||f_1 - g||² = ||g||²
- Corollary: small deviation yields the mistake bound M ≤ 2 ||g||² + 4 Σ_t ℓ_t(g)
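The pieces on Slides 15-16 assemble as follows (my reconstruction from the stated ingredients, using a total deviation budget of M/2 as on Slide 20):

```latex
% The Perceptron steps contribute at least 1 - 2\ell_t(g) per mistake,
% while shrinking and removal cost at most M/2 in total:
\sum_t \Delta_t \;\ge\; M - 2\sum_t \ell_t(g) - \tfrac{M}{2}
                \;=\; \tfrac{M}{2} - 2\sum_t \ell_t(g).
% The total progress telescopes (f_1 = 0):
\sum_t \Delta_t \;=\; \|f_1 - g\|^2 - \|f_{T+1} - g\|^2 \;\le\; \|g\|^2.
% Combining the two:
\tfrac{M}{2} - 2\sum_t \ell_t(g) \le \|g\|^2
\;\Longrightarrow\;
M \le 2\|g\|^2 + 4\sum_t \ell_t(g).
```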
Slide 17: Deviation due to Removal
Assume that on round t we remove example r with weight σ. The resulting deviation, denoted Ψ_t, can be bounded against the Perceptron step's progress.
Remarks:
- If σ is small then Ψ_t is small
- Ψ_t decreases with y_r f''_t(x_r): removing an example the current classifier already gets right does little damage
Slide 18: Deviation due to Shrinking
Case I: after shrinking, ||f''_t|| ≥ ||g||. Then the shrinking term of Δ_t is nonnegative: Δ ≥ 0.
Slide 19: Deviation due to Shrinking (continued)
Case II: after shrinking, ||f''_t|| ≤ ||g|| = U. Then the shrinking term of Δ_t is at least -U² (1 - φ_t).
Slide 20: Self-Tuning Shrinking Mechanism
- The Forgetron sets φ_t to the maximal value in (0, 1] for which the deviation from removal remains small
- This maximization has an analytic solution
- By construction, the total deviation caused by removal is at most (15/32) M
- It can be shown (by strong induction) that the total deviation caused by shrinking is at most (1/32) M
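The self-tuning mechanism can be sketched generically (heavily hedged: the slide says the maximization has an analytic solution but does not give the deviation formula, so this sketch takes a caller-supplied `removal_dev` function, assumed nondecreasing in φ, and uses a plain binary search in its place):

```python
# Pick the largest phi in (0, 1] whose removal deviation keeps the running
# total within the (15/32) * M budget; binary search stands in for the
# analytic solution mentioned on the slide.

def choose_phi(removal_dev, spent, mistakes, iters=50):
    """removal_dev(phi): removal deviation if we shrink by phi (nondecreasing).
    spent: deviation already accumulated on earlier rounds."""
    budget = (15.0 / 32.0) * mistakes
    if removal_dev(1.0) <= budget - spent:   # no shrinking needed this round
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):                   # binary search for the boundary
        mid = (lo + hi) / 2.0
        if removal_dev(mid) <= budget - spent:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, with a hypothetical `removal_dev(phi) = phi**2` and no prior spending, the largest admissible value on the first mistake round is √(15/32).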
Slide 21: Experiments
- Gaussian kernel
- Compare performance to Crammer, Kandola & Singer (CKS), NIPS '03
- Measure the number of prediction mistakes as a function of the budget B
- The baseline is the performance of the (unbudgeted) Perceptron
Slide 22: Experiment I: the MNIST dataset
Slide 23: Experiment II: Census-Income (Adult) … (the Perceptron makes 16,000 mistakes)
Slide 24: Experiment III: Synthetic Data with Label Noise
Slide 25: Summary
- No budget algorithm can compete with arbitrary hypotheses
- The Forgetron can compete with norm-bounded hypotheses
- Works well in practice
- Requires no parameters
- Future work: the Forgetron for batch learning