Forgetron Slide 1: Online Learning with a Memory Harness using the Forgetron
Shai Shalev-Shwartz, joint work with Ofer Dekel and Yoram Singer. The Hebrew University, Jerusalem, Israel.
Large Scale Kernel Machines workshop, NIPS'05, Whistler.

Forgetron Slide 2: Overview
- Online learning with kernels
- Goal: a strict limit on the number of "support vectors"
- The Forgetron algorithm
- Analysis
- Experiments

Forgetron Slide 3: Kernel-based Perceptron for Online Learning
Current classifier: $f_t(x) = \sum_{i \in I} y_i K(x_i, x)$, where I is the current active set (e.g. I = {1,3}).
On each round the online learner receives $x_t$, predicts $\mathrm{sign}(f_t(x_t))$, and then receives the true label $y_t$. On a mistake, the mistake counter M is incremented and t is added to the active set, $I \leftarrow I \cup \{t\}$ (e.g. I = {1,3,4}).

Forgetron Slide 4: Kernel-based Perceptron for Online Learning (cont.)
After the update the classifier is $f_t(x) = \sum_{i \in I} y_i K(x_i, x)$ with the enlarged active set I = {1,3,4}, and prediction proceeds as before: receive $x_t$, output $\mathrm{sign}(f_t(x_t))$, observe $y_t$.
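As a concrete reference point, here is a minimal sketch of the kernel Perceptron update described on these slides. It is plain Python with a Gaussian kernel; the function and variable names are illustrative and not taken from the original slides.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    """K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)); note K(x, x) = 1."""
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

def kernel_perceptron(stream, kernel=gaussian_kernel):
    """Online kernel Perceptron: stream yields (x_t, y_t) with y_t in {-1, +1}."""
    support = []       # active set: (x_i, y_i) pairs that caused mistakes
    mistakes = 0
    for x_t, y_t in stream:
        # current classifier f_t(x_t) = sum_{i in I} y_i K(x_i, x_t)
        f_t = sum(y_i * kernel(x_i, x_t) for x_i, y_i in support)
        y_hat = 1 if f_t >= 0 else -1
        if y_hat != y_t:                 # prediction mistake
            mistakes += 1
            support.append((x_t, y_t))   # add t to the active set I
    return support, mistakes
```

The active set grows by one on every mistake, which is exactly the memory problem the next slide raises.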

Forgetron Slide 5: Learning on a Budget
- |I| = the number of mistakes made up to round t; this is memory and time inefficient, and |I| may grow without bound.
- Goal: construct a kernel-based online algorithm for which |I| <= B on every round t, yet which still performs "well", i.e. comes with a performance guarantee.

Forgetron Slide 6: Mistake Bound for the Perceptron
- $(x_1, y_1), \dots, (x_T, y_T)$: a sequence of examples
- A kernel K such that $K(x_t, x_t) \le 1$
- g: a fixed competitor classifier in the RKHS
- Define the hinge loss $\ell_t(g) = \max(0, 1 - y_t g(x_t))$
- Then the Perceptron's number of prediction mistakes M is bounded (see below).
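A standard form of the kernel Perceptron mistake bound under the assumptions above, stated here as a reference point (it may differ in presentation from the exact statement on the slide), is:

$$ M \;\le\; \|g\|^2 + 2\sum_{t=1}^{T} \ell_t(g), $$

where M is the number of prediction mistakes made over the T rounds.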

Forgetron Slide 7: Previous Work
- Crammer, Kandola, Singer (2003)
- Kivinen, Smola, Williamson (2004)
- Weston, Bordes, Bottou (2005)
Previous online budget algorithms do not provide a mistake bound. Is our goal attainable?

Forgetron Slide 8: Mission Impossible
- Input space: $\{e_1, \dots, e_{B+1}\}$ (the standard basis vectors)
- Linear kernel: $K(e_i, e_j) = e_i \cdot e_j = \delta_{i,j}$
- Budget constraint: $|I| \le B$. Therefore there exists j such that $\sum_{i \in I} \alpha_i K(e_i, e_j) = 0$, so the budgeted learner might err on $e_j$ on every round.
- But the competitor $g = \sum_i e_i$ never errs! (The unconstrained Perceptron makes only B+1 mistakes on such a sequence.)

Forgetron Slide 9: Redefine the Goal
- We must restrict the competitor g somehow. One way: restrict $\|g\|$.
- The counterexample implies that we cannot compete with competitors of norm $\|g\| \ge (B+1)^{1/2}$.
- Main result: the Forgetron algorithm can compete with any classifier g satisfying a norm bound of order $\sqrt{B / \log B}$.

Forgetron Slide 10: The Forgetron
Current classifier: $f_t(x) = \sum_{i \in I} \alpha_i y_i K(x_i, x)$. On each mistake the Forgetron performs three steps:
- Step (1) - Perceptron: add the new example to the active set, $I' = I \cup \{t\}$.
- Step (2) - Shrinking: scale all the weights by a factor $\phi_t \in (0,1]$, i.e. $\alpha_i \leftarrow \phi_t \alpha_i$.
- Step (3) - Remove Oldest: if the budget is exceeded, remove the oldest example, $r = \min I$.
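Below is a minimal sketch of this three-step update in plain Python. It follows the structure above but uses a deliberately simplified, hypothetical choice of the shrinking coefficient (a fixed constant phi) rather than the self-tuning rule of slide 20, so it illustrates the mechanics rather than reproducing the analyzed algorithm.

```python
import numpy as np
from collections import deque

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

def forgetron(stream, budget, phi=0.9, kernel=gaussian_kernel):
    """Budget-constrained kernel Perceptron with shrinking + oldest-removal.

    `phi` is a fixed illustrative shrinking factor; the actual Forgetron
    tunes phi_t on every mistake (see slide 20).
    """
    support = deque()   # oldest first; entries are [alpha_i, x_i, y_i]
    mistakes = 0
    for x_t, y_t in stream:
        f_t = sum(a * y_i * kernel(x_i, x_t) for a, x_i, y_i in support)
        y_hat = 1 if f_t >= 0 else -1
        if y_hat != y_t:
            mistakes += 1
            support.append([1.0, x_t, y_t])        # step (1): Perceptron update
            for entry in support:                  # step (2): shrink all weights
                entry[0] *= phi
            if len(support) > budget:              # step (3): remove the oldest
                support.popleft()
    return support, mistakes
```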

Forgetron Slide 11: Shrinking - a Two-Edged Sword
- If $\phi_t$ is small, then $\alpha_r$ (the weight of the oldest example) is small, so the "deviation" due to removal is negligible.
- But if $\phi_t$ is small, the "deviation" due to shrinking itself is large.
- The Forgetron formalizes "deviation" and automatically balances this tradeoff.

Forgetron Slide 12: Quantifying Deviation
"Progress" measure: $\Delta_t = \|f_t - g\|^2 - \|f_{t+1} - g\|^2$, measured separately for each update step; "deviation" is negative progress.
With $f'$ the hypothesis after the Perceptron step and $f''$ the hypothesis after shrinking, the progress decomposes as
$$ \Delta_t = \underbrace{\|f_t - g\|^2 - \|f' - g\|^2}_{\text{Perceptron step}} + \underbrace{\|f' - g\|^2 - \|f'' - g\|^2}_{\text{shrinking}} + \underbrace{\|f'' - g\|^2 - \|f_{t+1} - g\|^2}_{\text{removal}}. $$

Forgetron Slide 13: Quantifying Deviation (cont.)
The Forgetron tracks three quantities per mistake: the gain from the Perceptron step, the damage from shrinking, and the damage from removal. These are the three terms of the decomposition above, and they are bounded on the following slides.

Forgetron Slide 14: Resulting Mistake Bound
For any competitor g satisfying the norm bound of slide 9, the number of prediction mistakes the Forgetron makes is bounded in terms of $\|g\|^2$ and the cumulative hinge loss $\sum_t \ell_t(g)$; the argument is sketched on the next slides.

Forgetron Slide 15: Small Deviation ⇒ Mistake Bound
Assume the total deviation is low. On every mistake the Perceptron step moves the hypothesis from f to f', shrinking the squared distance to the competitor from $\|f - g\|^2$ to $\|f' - g\|^2$ (calculation below).
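The calculation behind this picture is the standard one: writing $f' = f_t + y_t K(x_t, \cdot)$ and using $y_t f_t(x_t) \le 0$ on a mistake, $y_t g(x_t) \ge 1 - \ell_t(g)$, and $K(x_t, x_t) \le 1$,

$$ \|f_t - g\|^2 - \|f' - g\|^2 \;=\; 2 y_t\big(g(x_t) - f_t(x_t)\big) - K(x_t, x_t) \;\ge\; 1 - 2\ell_t(g). $$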

Forgetron Slide 16: Small Deviation ⇒ Mistake Bound (cont.)
- On one hand: each mistake makes positive progress towards a good competitor.
- On the other hand: the total possible progress is bounded (see below).
- Corollary: small deviation ⇒ a mistake bound.
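The total-progress bound follows from telescoping, since the algorithm starts from $f_1 = 0$:

$$ \sum_{t=1}^{T} \Delta_t \;=\; \|f_1 - g\|^2 - \|f_{T+1} - g\|^2 \;\le\; \|g\|^2. $$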

Forgetron Slide 17: Deviation due to Removal
Assume that on round t we remove the oldest example r, whose (already shrunk) weight is $\sigma$. The deviation caused by the removal can then be bounded explicitly (see the calculation below).
Remarks:
- If $\phi_t$ is small then $\sigma$ is small, and the removal deviation is small.
- The removal deviation decreases as $y_r f''_t(x_r)$ grows, i.e. removal is cheap when the remaining hypothesis already classifies $x_r$ well.
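A short calculation makes both remarks concrete (a sketch: write $f_{t+1} = f'' - \sigma y_r K(x_r, \cdot)$ and use $K(x_r, x_r) \le 1$ and $y_r g(x_r) \le \|g\|$):

$$ \|f_{t+1} - g\|^2 - \|f'' - g\|^2 \;=\; \sigma^2 K(x_r, x_r) - 2\sigma y_r f''(x_r) + 2\sigma y_r g(x_r) \;\le\; \sigma^2 + 2\sigma\|g\| - 2\sigma y_r f''(x_r), $$

so the removal deviation shrinks with $\sigma$ and decreases as $y_r f''_t(x_r)$ grows.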

Forgetron Slide 18: Deviation due to Shrinking
Case I: if after shrinking $\|f''_t\| \ge \|g\|$, then the shrinking step causes no deviation: $\|f' - g\|^2 - \|f'' - g\|^2 \ge 0$.

Forgetron Slide 19: Deviation due to Shrinking (cont.)
Case II: if after shrinking $\|f''_t\| \le \|g\| = U$, then the shrinking progress can be negative, but it is bounded from below: $\|f' - g\|^2 - \|f'' - g\|^2 \ge -U^2(1 - \phi_t)$.
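One elementary way to obtain a bound of this form (a sketch, not necessarily the derivation used on the slide; it writes $f'' = \phi_t f'$, applies Cauchy-Schwarz, and minimizes over $\|f'\|$):

$$ \|f' - g\|^2 - \|f'' - g\|^2 \;=\; (1 - \phi_t^2)\|f'\|^2 - 2(1 - \phi_t)\langle f', g\rangle \;\ge\; (1 - \phi_t)\|f'\|\big((1 + \phi_t)\|f'\| - 2U\big) \;\ge\; -\frac{(1 - \phi_t)U^2}{1 + \phi_t} \;\ge\; -U^2(1 - \phi_t). $$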

Forgetron Slide 20: Self-tuning Shrinking Mechanism
- The Forgetron sets $\phi_t$ to the maximal value in (0,1] for which the deviation from removal stays small; this choice has an analytic (closed-form) solution.
- By construction, the total deviation caused by removal is at most (15/32) M.
- It can be shown (by strong induction) that the total deviation caused by shrinking is at most (1/32) M.
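Putting the pieces together (a sketch; the constants in the published bound may differ): the two budgets above cap the total deviation at $(15/32 + 1/32)\,M = M/2$, and combining the per-mistake Perceptron gain of at least $1 - 2\ell_t(g)$ with the telescoping bound $\sum_t \Delta_t \le \|g\|^2$ gives

$$ \|g\|^2 \;\ge\; \sum_t \Delta_t \;\ge\; M - 2\sum_t \ell_t(g) - \tfrac{1}{2}M \quad\Longrightarrow\quad M \;\le\; 2\|g\|^2 + 4\sum_t \ell_t(g). $$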

Forgetron Slide 21: Experiments
- Gaussian kernel
- Compare performance to Crammer, Kandola & Singer (CKS), NIPS'03
- Measure the number of prediction mistakes as a function of the budget B
- The baseline is the performance of the (unbounded) Perceptron
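A toy illustration of this kind of comparison (not the experiments from the slides), assuming the kernel_perceptron and forgetron sketches defined after slides 4 and 10 are in scope:

```python
import numpy as np

# Synthetic 2-D data with 10% label noise (hypothetical setup, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
noise = rng.random(len(y)) < 0.1
y[noise] = -y[noise]
stream = list(zip(X, y))

_, m_perceptron = kernel_perceptron(stream)
print("Perceptron mistakes:", m_perceptron)
for budget in (10, 50, 200):
    _, m_forgetron = forgetron(stream, budget=budget)
    print(f"Forgetron (B={budget}) mistakes:", m_forgetron)
```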

Forgetron Slide 22: Experiment I - MNIST dataset

Forgetron Slide 23: Experiment II - Census-income (adult) dataset (for reference, the Perceptron makes 16,000 mistakes)

Forgetron Slide 24: Experiment III - Synthetic data with label noise

Forgetron Slide 25: Summary
- No budget algorithm can compete with arbitrary hypotheses.
- The Forgetron can compete with norm-bounded hypotheses.
- It works well in practice and does not require tuning parameters.
- Future work: the Forgetron for batch learning.