Online Passive-Aggressive Algorithms Shai Shalev-Shwartz joint work with Koby Crammer, Ofer Dekel & Yoram Singer The Hebrew University Jerusalem, Israel.

Three Decision Problems: Classification, Regression, Uniclass

Online Setting (Classification / Regression / Uniclass)
On each round:
– Receive an instance (classification and regression; n/a for uniclass)
– Predict a target value
– Receive the true target and suffer a loss
– Update the hypothesis
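A minimal sketch of this protocol in Python (the learner interface, names, and stream format are illustrative, not from the slides); any learner exposing predict, loss, and update plugs into it, e.g. the PA learner sketched after the analytic-solution slide below:

def run_online(learner, stream):
    """Generic online protocol for classification/regression: predict, receive the
    true target, suffer loss, update.  `stream` yields (x_t, y_t) pairs."""
    cumulative = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)          # predict a target value (not needed for the update)
        cumulative += learner.loss(x_t, y_t)  # receive the true target, suffer loss
        learner.update(x_t, y_t)              # update the hypothesis
    return cumulative                         # the quantity the loss bounds control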

A Unified View
Define a discrepancy and an ε-insensitive hinge loss for each problem:
– Classification (y ∈ {−1, +1}): the prediction should attain margin y (w · x) ≥ ε; loss ℓ_ε(w; (x, y)) = max{0, ε − y (w · x)}
– Regression (y ∈ ℝ): the discrepancy is |w · x − y|; loss ℓ_ε(w; (x, y)) = max{0, |w · x − y| − ε}
– Uniclass (a point x, no label): the discrepancy is ‖w − x‖; loss ℓ_ε(w; x) = max{0, ‖w − x‖ − ε}
Notion of realizability: there exists a hypothesis w* that suffers zero loss on every example in the sequence.
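The three losses are easy to write down concretely. A small Python sketch under the conventions of the reconstruction above (margin ε for classification, a tube of width ε for regression and uniclass); the function names are illustrative:

import numpy as np

def loss_classification(w, x, y, eps=1.0):
    # zero loss once the signed margin y * <w, x> reaches eps
    return max(0.0, eps - y * np.dot(w, x))

def loss_regression(w, x, y, eps=0.1):
    # zero loss while the prediction is within eps of the target
    return max(0.0, abs(np.dot(w, x) - y) - eps)

def loss_uniclass(w, x, eps=0.1):
    # zero loss while the center w is within distance eps of the point x
    return max(0.0, np.linalg.norm(w - x) - eps)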

A Unified View (Cont.)
Online Convex Programming:
– Let f_1, …, f_T be a sequence of convex, non-negative functions f_t : ℝ^n → ℝ_+
– Let ε ≥ 0 be an insensitivity parameter
– For t = 1, …, T:
   Guess a vector w_t ∈ ℝ^n
   Get the current convex function f_t
   Suffer loss ℓ_t = max{0, f_t(w_t) − ε}
– Goal: minimize the cumulative loss Σ_t ℓ_t

The Passive-Aggressive Algorithm
Each example z_t defines a set C_t of consistent hypotheses (those that suffer zero loss on z_t):
– Classification: C_t = {w : y_t (w · x_t) ≥ ε}
– Regression: C_t = {w : |w · x_t − y_t| ≤ ε}
– Uniclass: C_t = {w : ‖w − x_t‖ ≤ ε}
The new vector w_{t+1} is set to be the Euclidean projection of w_t onto C_t.

Passive-Aggressive

An Analytic Solution
The projection has a closed form. Writing ℓ_t for the loss suffered on round t:
– Classification: w_{t+1} = w_t + τ_t y_t x_t, where τ_t = ℓ_t / ‖x_t‖²
– Regression: w_{t+1} = w_t + sign(y_t − w_t · x_t) τ_t x_t, where τ_t = ℓ_t / ‖x_t‖²
– Uniclass: w_{t+1} = w_t + τ_t (x_t − w_t) / ‖x_t − w_t‖, where τ_t = ℓ_t
In all three cases the update is passive (τ_t = 0) whenever the loss is zero.
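A compact Python sketch of the closed-form updates for classification and regression, following the reconstruction above (class names, default parameters, and the zero initial vector are illustrative choices, not from the slides):

import numpy as np

class PAClassifier:
    """Passive-Aggressive binary classifier.

    Passive: if the margin constraint y * <w, x> >= eps already holds, w is unchanged.
    Aggressive: otherwise w is moved just enough to satisfy the constraint.
    """

    def __init__(self, dim, eps=1.0):
        self.w = np.zeros(dim)
        self.eps = eps

    def predict(self, x):
        return 1.0 if np.dot(self.w, x) >= 0 else -1.0

    def loss(self, x, y):
        return max(0.0, self.eps - y * np.dot(self.w, x))

    def update(self, x, y):
        l = self.loss(x, y)
        if l > 0.0:                      # aggressive step only on a positive loss
            tau = l / np.dot(x, x)       # tau_t = loss_t / ||x_t||^2
            self.w += tau * y * x        # w_{t+1} = w_t + tau_t * y_t * x_t

class PARegressor:
    """Passive-Aggressive regression with an eps-insensitive tube."""

    def __init__(self, dim, eps=0.1):
        self.w = np.zeros(dim)
        self.eps = eps

    def predict(self, x):
        return np.dot(self.w, x)

    def loss(self, x, y):
        return max(0.0, abs(np.dot(self.w, x) - y) - self.eps)

    def update(self, x, y):
        l = self.loss(x, y)
        if l > 0.0:
            tau = l / np.dot(x, x)                             # same step-size formula
            self.w += np.sign(y - np.dot(self.w, x)) * tau * x

Either learner can be driven by the run_online loop sketched earlier.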

Loss Bounds
Theorem:
– Let z_1, …, z_T be a sequence of examples.
– Assumption: there exists w* with ℓ_ε(w*; z_t) = 0 for every t (the realizable case).
– Then, if the online algorithm is run with insensitivity parameter ε, the following bound holds for any such w*:
   Σ_t ℓ_t² ≤ B ‖w* − w_1‖²,
   where B = max_t ‖x_t‖² for classification and regression and B = 1 for uniclass.

Loss Bounds (Cont.)
For the case of classification we have one degree of freedom: if y_t (w* · x_t) ≥ ε for all t, then for any c > 0 the scaled vector c w* satisfies y_t (c w* · x_t) ≥ c ε. Therefore, we can fix the scale, e.g. set ε = 1, and get the following bounds:

Loss Bounds (Cont.)
– Classification (ε = 1, w_1 = 0): Σ_t ℓ_t² ≤ R² ‖w*‖² with R = max_t ‖x_t‖; since each prediction mistake incurs loss at least 1, the number of mistakes is at most R² ‖w*‖².
– Uniclass: Σ_t ℓ_t² ≤ ‖w* − w_1‖².
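A quick empirical sanity check of the classification bound (a sketch: it reuses the illustrative PAClassifier from the earlier snippet, and the synthetic linearly separable data below is our own construction, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
dim, gamma, T = 10, 0.2, 2000

# A fixed unit separator u; keep only points with margin |u . x| >= gamma,
# so w* = u / gamma attains margin 1 on every example.
u = rng.normal(size=dim)
u /= np.linalg.norm(u)
xs, ys = [], []
while len(xs) < T:
    x = rng.normal(size=dim)
    m = np.dot(u, x)
    if abs(m) >= gamma:
        xs.append(x)
        ys.append(np.sign(m))

learner = PAClassifier(dim, eps=1.0)
mistakes = 0
for x, y in zip(xs, ys):
    if learner.predict(x) != y:
        mistakes += 1
    learner.update(x, y)

R = max(np.linalg.norm(x) for x in xs)
bound = R ** 2 * (1.0 / gamma) ** 2     # R^2 * ||w*||^2 with ||w*|| = 1 / gamma
print(f"mistakes = {mistakes}, bound = {bound:.1f}")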

Proof Sketch
Define Δ_t = ‖w_t − w*‖² − ‖w_{t+1} − w*‖².
Upper bound: Σ_t Δ_t = ‖w_1 − w*‖² − ‖w_{T+1} − w*‖² ≤ ‖w_1 − w*‖² (the sum telescopes).
Lower bound: since w_{t+1} is the projection of w_t onto the convex set C_t, and w* ∈ C_t, we have Δ_t ≥ ‖w_t − w_{t+1}‖².
Lipschitz condition: the loss vanishes at w_{t+1} and is Lipschitz (with constant ‖x_t‖ for classification and regression, and 1 for uniclass), so ‖w_t − w_{t+1}‖ ≥ ℓ_t / ‖x_t‖ (respectively ℓ_t), hence Δ_t ≥ ℓ_t² / B.

Proof Sketch (Cont.)
Combining the upper and lower bounds: (1/B) Σ_t ℓ_t² ≤ Σ_t Δ_t ≤ ‖w_1 − w*‖², hence Σ_t ℓ_t² ≤ B ‖w_1 − w*‖², which is the stated bound.

The Unrealizable Case
Main idea: downsize (shrink) the step size used in each update.

Loss Bound
Theorem:
– Let z_1, …, z_T be any sequence of examples (no realizability assumption).
– The cumulative loss of the modified algorithm is bounded in terms of the cumulative loss of any fixed competitor w*; the bound holds for any w* and for any choice of ε.

Implications for Batch Learning
Batch setting:
– Input: a training set S = {z_1, …, z_m}, sampled i.i.d. according to an unknown distribution D
– Output: a hypothesis parameterized by a vector w
– Goal: minimize the expected loss E_{z ~ D}[ℓ(w; z)]
Online setting:
– Input: a sequence of examples z_1, z_2, …
– Output: a sequence of hypotheses w_1, w_2, …
– Goal: minimize the cumulative loss Σ_t ℓ(w_t; z_t)

Implications for Batch Learning (Cont.)
Convergence: let S be a fixed training set and let w^(k) be the vector obtained by PA after k epochs over S. Then, in the realizable case, the loss of w^(k) on every example in S tends to zero as k grows.
Large margin for classification: for all t we have ‖w_t‖ ≤ 2 ‖w*‖ (since ‖w_t − w*‖ never increases and w_1 = 0), which implies that the margin attained by PA for classification is at least half the optimal margin.

Derived Generalization Properties
Average hypothesis: let the average hypothesis be (1/T) Σ_t w_t. Then, with high probability over an i.i.d. sample, its expected loss is bounded by the average online loss plus a term that vanishes as T grows (an online-to-batch conversion).
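A minimal sketch of forming the averaged hypothesis (it assumes a learner exposing .w and .update(x, y), such as the illustrative PAClassifier above; the helper name is ours):

import numpy as np

def average_hypothesis(learner, data, epochs=1):
    """Run the online learner over the data and return the average weight vector,
    i.e. the hypothesis the generalization statement above refers to."""
    w_sum = np.zeros_like(learner.w)
    rounds = 0
    for _ in range(epochs):
        for x, y in data:
            w_sum += learner.w     # accumulate w_t before the round-t update
            learner.update(x, y)
            rounds += 1
    return w_sum / rounds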

A Multiplicative Version
Assumption: the hypothesis is a probability vector (non-negative weights that sum to one).
Multiplicative update: instead of an additive step, each weight is multiplied by an exponential factor determined by the example and the vector is re-normalized.
Loss bound: an analogous cumulative loss bound holds, with the squared Euclidean distance to w* replaced by the relative entropy to w*.
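For concreteness, a sketch of what a multiplicative variant can look like in Python. This follows the general exponentiated-gradient style (simplex-constrained weights, exponential factors, re-normalization) with a fixed rate eta; it is not the slides' exact step-size rule:

import numpy as np

class MultiplicativePA:
    """Illustrative multiplicative-update binary classifier on the probability simplex."""

    def __init__(self, dim, eps=1.0, eta=0.1):
        self.w = np.full(dim, 1.0 / dim)   # start at the uniform probability vector
        self.eps = eps                     # margin / insensitivity parameter
        self.eta = eta                     # fixed multiplicative learning rate

    def loss(self, x, y):
        return max(0.0, self.eps - y * np.dot(self.w, x))

    def update(self, x, y):
        if self.loss(x, y) > 0.0:                  # passive when the margin is met
            self.w *= np.exp(self.eta * y * x)     # multiplicative (exponential) step
            self.w /= self.w.sum()                 # re-normalize onto the simplex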

Summary
– Unified view of three decision problems
– New algorithms for prediction with hinge loss
– Competitive loss bounds for the hinge loss
– Unrealizable case: algorithms and analysis
– Multiplicative algorithms
– Implications for batch learning
Future work and extensions:
– Updates using general Bregman projections
– Applications of PA to other decision problems

Related Work
Projections Onto Convex Sets (POCS), e.g.:
– Y. Censor and S. A. Zenios, “Parallel Optimization”
– H. H. Bauschke and J. M. Borwein, “On Projection Algorithms for Solving Convex Feasibility Problems”
Online learning, e.g.:
– M. Herbster, “Learning additive models online with fast evaluating kernels”