
Slide 1: Online Learning with a Memory Harness using the Forgetron
Shai Shalev-Shwartz, joint work with Ofer Dekel and Yoram Singer
The Hebrew University, Jerusalem, Israel
Large Scale Kernel Machines workshop, NIPS'05, Whistler

Slide 2: Overview
- Online learning with kernels
- Goal: a strict limit on the number of "support vectors"
- The Forgetron algorithm
- Analysis
- Experiments

Slide 3: Kernel-based Perceptron for Online Learning
Current classifier: f_t(x) = Σ_{i∈I} y_i K(x_i, x)
(Diagram: on round t the online learner, which tracks its mistake count M and the current active set, e.g. I = {1,3}, receives x_t, predicts sign(f_t(x_t)), and then receives the true label y_t. On a mistake, round t is added to the active set: I = {1,3,4}.)

Slide 4: Kernel-based Perceptron for Online Learning
(Animation step of the same diagram: after the mistake, the learner predicts with the updated active set I = {1,3,4}.)
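To make the protocol on these two slides concrete, here is a minimal Python sketch of the kernel-based Perceptron (our code, not the authors'): it keeps only the mistake rounds in the active set I and predicts with sign(f_t(x_t)). The Gaussian kernel matches the one used in the experiments later in the deck; all names are ours.

```python
import math

def gaussian_kernel(x1, x2, sigma=1.0):
    """Gaussian kernel; note K(x, x) = 1, as the mistake bound assumes."""
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-d2 / (2 * sigma ** 2))

def kernel_perceptron(stream, kernel=gaussian_kernel):
    """stream yields (x_t, y_t) pairs with y_t in {-1, +1}."""
    active = []           # the active set I: stored mistake examples (x_i, y_i)
    mistakes = 0          # the mistake count M
    for x_t, y_t in stream:
        # f_t(x_t) = sum_{i in I} y_i K(x_i, x_t)
        score = sum(y_i * kernel(x_i, x_t) for x_i, y_i in active)
        if y_t * score <= 0:             # prediction mistake
            active.append((x_t, y_t))    # add round t to the active set
            mistakes += 1
    return active, mistakes
```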

Slide 5: Learning on a Budget
- |I| = number of mistakes up to round t.
- Memory and time inefficient: |I| might grow unboundedly.
- Goal: construct a kernel-based online algorithm for which:
  - |I| ≤ B for each t,
  - it still performs "well", i.e. comes with a performance guarantee.

Slide 6: Mistake Bound for the Perceptron
- {(x_1,y_1),…,(x_T,y_T)}: a sequence of examples.
- A kernel K s.t. K(x_t,x_t) ≤ 1.
- g: a fixed competitor classifier in the RKHS.
- Define ℓ_t(g) = max(0, 1 − y_t g(x_t)).
- Then the number of mistakes M satisfies M ≤ ‖g‖² + 2 Σ_t ℓ_t(g), the standard kernel Perceptron mistake bound (a derivation sketch follows below).
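For reference, the per-round argument behind this bound can be written out as follows (our sketch of the standard reasoning, not the slide's exact derivation):

```latex
% On a mistake round, f' = f_t + y_t K(x_t, \cdot), and by the reproducing
% property \langle K(x_t,\cdot), h\rangle = h(x_t):
\|f_t - g\|^2 - \|f' - g\|^2
  = 2 y_t \bigl(g(x_t) - f_t(x_t)\bigr) - K(x_t, x_t)
  \ge 2\bigl(1 - \ell_t(g)\bigr) - 1
  = 1 - 2\ell_t(g),
% using y_t g(x_t) \ge 1 - \ell_t(g), y_t f_t(x_t) \le 0 (a mistake round),
% and K(x_t, x_t) \le 1. Summing over the M mistake rounds and telescoping
% against \|f_1 - g\|^2 = \|g\|^2 gives M \le \|g\|^2 + 2\sum_t \ell_t(g).
```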

Slide 7: Previous Work
- Crammer, Kandola & Singer (2003)
- Kivinen, Smola & Williamson (2004)
- Weston, Bordes & Bottou (2005)
- Previous online budget algorithms do not provide a mistake bound. Is our goal attainable?

Slide 8: Mission Impossible
- Input space: {e_1,…,e_{B+1}}.
- Linear kernel: K(e_i,e_j) = e_i · e_j = δ_{ij}.
- Budget constraint: |I| ≤ B. Therefore there always exists a j s.t. Σ_{i∈I} α_i K(e_i,e_j) = 0, so we might err on every round.
- But the competitor g = Σ_i e_i never errs! (The unbudgeted Perceptron makes only B+1 mistakes on this sequence.)
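The counting argument is easy to see numerically. Below is a tiny illustration of the construction (our code): with budget B = 3 and the four standard basis vectors, any size-B active set misses some e_j, which then scores exactly zero.

```python
# B is the budget; e_1..e_{B+1} are standard basis vectors; linear kernel = dot.
B = 3
dim = B + 1
basis = [[float(i == j) for j in range(dim)] for i in range(dim)]
dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# Any active set of size <= B, e.g. the first three basis vectors with weight 1:
active = [(basis[0], 1.0), (basis[1], 1.0), (basis[2], 1.0)]

# Since |I| <= B < B+1, some e_j is absent from the active set, so the
# learner's score on it is exactly 0 and its prediction can be forced wrong:
for e in basis:
    score = sum(alpha * dot(x, e) for x, alpha in active)
    print(score)   # -> 1.0, 1.0, 1.0, 0.0: the last basis vector scores 0

# The competitor g = sum_i e_i has g(e_j) = 1 for every j, so it never errs.
```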

Slide 9: Redefine the Goal
- We must restrict the competitor g somehow. One way: restrict ‖g‖.
- The counterexample implies that we cannot compete with ‖g‖ ≥ (B+1)^{1/2}.
- Main result: the Forgetron algorithm can compete with any classifier g s.t. ‖g‖ ≤ U, where U is on the order of √(B / log B).

Slide 10: The Forgetron
Current classifier: f_t(x) = Σ_{i∈I} α_i y_i K(x_i, x). On each mistake round:
- Step (1), Perceptron: add the new example, I ← I ∪ {t}.
- Step (2), Shrinking: scale every weight, α_i ← φ_t α_i.
- Step (3), Remove Oldest: if |I| > B, remove r = min I.
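Here is a Python sketch of one Forgetron mistake-round update (our code; it assumes φ_t has already been chosen — the self-tuned choice appears on slide 20 — and uses our own variable names rather than the paper's notation):

```python
from collections import deque

def forgetron_update(active, x_t, y_t, phi_t, budget):
    """active: deque of [x_i, y_i, alpha_i] entries, oldest first.
    Called only on mistake rounds; mutates and returns the active set."""
    # Step (1) - Perceptron: insert the new mistake with weight 1.
    active.append([x_t, y_t, 1.0])
    # Step (2) - Shrinking: scale every weight by phi_t in (0, 1].
    for example in active:
        example[2] *= phi_t
    # Step (3) - Remove Oldest: if the budget is exceeded, drop r = min I.
    if len(active) > budget:
        active.popleft()
    return active
```

Prediction on round t is sign(Σ_{i∈I} α_i y_i K(x_i, x_t)), exactly as on slide 3 but with the shrunken weights α_i.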

Slide 11: Shrinking, a Two-edged Sword
- φ_t small ⇒ α_r small ⇒ the "deviation" due to removal is negligible.
- φ_t small ⇒ the "deviation" due to shrinking is large.
- The Forgetron formalizes "deviation" and automatically balances this tradeoff.

Slide 12: Quantifying Deviation
- "Progress" measure: Δ_t = ‖f_t − g‖² − ‖f_{t+1} − g‖².
- Progress is split over the three update steps, and "deviation" is measured by negative progress:
  Δ_t = Δ_t^(1) + Δ_t^(2) + Δ_t^(3), where
  Δ_t^(1) = ‖f_t − g‖² − ‖f′ − g‖² (after the Perceptron step),
  Δ_t^(2) = ‖f′ − g‖² − ‖f″ − g‖² (after shrinking),
  Δ_t^(3) = ‖f″ − g‖² − ‖f_{t+1} − g‖² (after removal).

Slide 13: Quantifying Deviation
The Forgetron controls three quantities, each bounded on the slides that follow:
- the gain from the Perceptron step,
- the damage from shrinking,
- the damage from removal.

Slide 14: Resulting Mistake Bound
For any g s.t. ‖g‖ ≤ U, the number of prediction mistakes the Forgetron makes is at most roughly 2‖g‖² + 4 Σ_t ℓ_t(g); the chain of inequalities reconstructed after slide 16 shows how this follows from the deviation budget.

Slide 15: Small Deviation ⇒ Mistake Bound
- Assume low deviation.
- Perceptron's progress on each mistake round: Δ_t^(1) ≥ 1 − 2ℓ_t(g).
(Diagram: the Perceptron step moves f toward g, from ‖f − g‖² down to ‖f′ − g‖².)

Slide 16: Small Deviation ⇒ Mistake Bound
- On one hand: positive progress toward good competitors on every mistake round.
- On the other hand: the total possible progress is at most ‖f_1 − g‖² = ‖g‖² (since f_1 ≡ 0).
- Corollary: small deviation ⇒ mistake bound (combined below).
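The corollary can be reconstructed as one chain of inequalities (our sketch; the "M/2" uses the total-deviation budget of (15/32 + 1/32)M established on slide 20):

```latex
% The progress telescopes:
\sum_t \Delta_t = \|f_1 - g\|^2 - \|f_{T+1} - g\|^2 \le \|g\|^2
  \qquad (\text{since } f_1 \equiv 0).
% The Perceptron steps gain at least M - 2\sum_t \ell_t(g) (slide 15), while
% slide 20 budgets the total deviation at (15/32 + 1/32)M = M/2. Hence
M - 2\sum_t \ell_t(g) - \tfrac{M}{2} \le \|g\|^2
  \;\Longrightarrow\;
  M \le 2\|g\|^2 + 4\sum_t \ell_t(g).
```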

Slide 17: Deviation due to Removal
- Assume that on round t we remove example r with weight σ; denote the resulting deviation by Ψ_t (expanded below).
- Remarks: φ_t small ⇒ σ small ⇒ Ψ_t small, and Ψ_t decreases as y_r f″_t(x_r) grows.
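Expanding the removal step directly makes both remarks visible (our reconstruction; f_{t+1} = f″ − σ y_r K(x_r,·), and we use K(x_r,x_r) ≤ 1 and ‖g‖ = U):

```latex
\Psi_t = \|f_{t+1} - g\|^2 - \|f'' - g\|^2
       = \sigma^2 K(x_r, x_r) - 2\sigma y_r \bigl(f''(x_r) - g(x_r)\bigr)
       \le \sigma^2 + 2\sigma U - 2\sigma\, y_r f''(x_r),
% using K(x_r, x_r) \le 1 and y_r g(x_r) \le \|g\| = U. Both remarks follow:
% \Psi_t vanishes with \sigma, and decreases as y_r f''(x_r) grows.
```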

Slide 18: Deviation due to Shrinking
Case I: after shrinking, ‖f″_t‖ ≥ ‖g‖. Then Δ_t^(2) ≥ 0: shrinking costs nothing.
(Diagram comparing ‖f′ − g‖² with ‖f″ − g‖².)

Slide 19: Deviation due to Shrinking
Case II: after shrinking, ‖f″_t‖ ≤ ‖g‖ = U. Then Δ_t^(2) ≥ −U²(1 − φ_t).
(Diagram comparing ‖f′ − g‖² with ‖f″ − g‖².)
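The Case II bound follows from a short calculation (our reconstruction, minimizing a quadratic in ‖f′‖):

```latex
% With f'' = \phi_t f' and \langle f', g\rangle \le \|f'\| U:
\|f' - g\|^2 - \|\phi_t f' - g\|^2
  = (1 - \phi_t)\bigl((1 + \phi_t)\|f'\|^2 - 2\langle f', g\rangle\bigr)
  \ge (1 - \phi_t)\bigl((1 + \phi_t)\|f'\|^2 - 2\|f'\| U\bigr)
  \ge -(1 - \phi_t)\,\frac{U^2}{1 + \phi_t}
  \ge -U^2 (1 - \phi_t),
% where the quadratic (1+\phi_t)a^2 - 2Ua in a = \|f'\| \ge 0 is minimized
% at a = U/(1+\phi_t), with minimum value -U^2/(1+\phi_t).
```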

Slide 20: Self-tuning Shrinking Mechanism
- The Forgetron sets φ_t to the maximal value in (0,1] for which the deviation from removal stays small.
- The above has an analytic solution (a numeric sketch of the idea follows below).
- By construction, the total deviation caused by removal is at most (15/32) M.
- It can be shown (by strong induction) that the total deviation caused by shrinking is at most (1/32) M.
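The closed-form choice of φ_t is in the paper; the sketch below only illustrates the self-tuning idea numerically (our code, with a hypothetical removal_deviation callback): pick the largest φ in (0,1] that keeps this round's removal deviation within its budget.

```python
def choose_phi(removal_deviation, budget, tol=1e-9):
    """removal_deviation(phi): the deviation caused by removing the oldest
    example after shrinking by phi; assumed non-decreasing in phi (a smaller
    phi means a smaller removed weight sigma). Returns the largest feasible
    phi in (0, 1], found by bisection instead of the paper's closed form."""
    if removal_deviation(1.0) <= budget:
        return 1.0                      # no shrinking needed this round
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if removal_deviation(mid) <= budget:
            lo = mid                    # mid is feasible; try larger phi
        else:
            hi = mid                    # mid deviates too much; shrink more
    return lo
```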

Slide 21: Experiments
- Gaussian kernel.
- Compare performance to Crammer, Kandola & Singer (CKS), NIPS'03.
- Measure the number of prediction mistakes as a function of the budget B.
- The baseline is the performance of the (unbudgeted) Perceptron.

Slide 22: Experiment I: the MNIST dataset (results figure).

Slide 23: Experiment II: Census-income (adult) … (results figure; the Perceptron makes 16,000 mistakes).

Slide 24: Experiment III: synthetic data with label noise (results figure).

Slide 25: Summary
- No budget algorithm can compete with arbitrary hypotheses.
- The Forgetron can compete with norm-bounded hypotheses.
- Works well in practice.
- Requires no parameters.
- Future work: the Forgetron for batch learning.

