Online Learning Algorithms

Koby Crammer Department of Electrical Engineering

Outline: online learning framework; design principles of online learning algorithms (additive updates); Perceptron, Passive-Aggressive and Confidence-Weighted classification; classification – binary, multiclass and structured prediction; hypothesis averaging and regularization; multiplicative updates – Weighted Majority, Winnow, and connections to Gradient Descent (GD) and Exponentiated Gradient Descent (EGD).

Formal setting – Classification. Instances x (e.g., images, sentences); labels y (e.g., parse trees, names); a prediction rule, here a linear prediction rule; and a loss, e.g., the number of mistakes.

Predictions. Linear classifiers make continuous predictions w·x; the predicted label is sign(w·x) and the confidence in that prediction is |w·x|.

Loss Functions. Natural loss: the zero-one loss, 1 if the predicted label differs from the true label and 0 otherwise. Losses for real-valued predictions: the hinge loss max(0, 1 − y(w·x)) and the exponential loss exp(−y(w·x)) (used in boosting).

Loss Functions (figure): the hinge loss and the zero-one loss as functions of the margin y(w·x); the hinge loss upper-bounds the zero-one loss, equals 1 at margin 0, and reaches 0 at margin 1.
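
A minimal sketch of these losses for a single example, assuming a NumPy weight vector w and a label y in {−1, +1}:

```python
import numpy as np

def zero_one_loss(w, x, y):
    # 1 if the predicted sign disagrees with the true label, else 0
    return float(np.sign(np.dot(w, x)) != y)

def hinge_loss(w, x, y):
    # max(0, 1 - y * (w . x)): zero once the margin reaches 1
    return max(0.0, 1.0 - y * np.dot(w, x))

def exp_loss(w, x, y):
    # exponential loss used in boosting
    return float(np.exp(-y * np.dot(w, x)))
```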

Online Framework. Initialize the classifier. The algorithm works in rounds; on round t it: receives an input instance x_t, outputs a prediction ŷ_t, receives the feedback label y_t, computes the loss, and updates the prediction rule. Goal: suffer small cumulative loss.
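
A schematic of this protocol in Python; `model.predict` and `model.update` stand for whatever online learner is plugged in (both names are illustrative, not from the slides):

```python
def online_learning(model, stream):
    """Run the online protocol over an iterable of (x, y) pairs."""
    cumulative_loss = 0.0
    for x, y_true in stream:                     # one round per example
        y_pred = model.predict(x)                # output a prediction
        cumulative_loss += float(y_pred != y_true)  # zero-one loss on this round
        model.update(x, y_true)                  # update the prediction rule
    return cumulative_loss                       # goal: keep this small
```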

Margin. The margin of an example (x, y) with respect to the classifier w is y(w·x). Note: the set of examples is separable iff there exists a vector u such that y(u·x) > 0 for every example.

Geometrical Interpretation (figure): examples are shown relative to the separating hyperplane with margins ranging from large negative (<<0) and slightly negative (<0) to slightly positive (>0) and large positive (>>0).

Hinge Loss

Why Online Learning? Fast; memory efficient (processes one example at a time); simple to implement; formal guarantees (mistake bounds); online-to-batch conversions; no statistical assumptions; adaptive.

Update Rules. Online algorithms are based on an update rule which defines w_{t+1} from w_t (and possibly other information). For linear classifiers: find w_{t+1} from w_t based on the input (x_t, y_t). Some update rules: Perceptron (Rosenblatt), ALMA (Gentile), ROMMA (Li & Long), NORMA (Kivinen et al.), MIRA (Crammer & Singer), EG (Littlestone and Warmuth), Bregman-based (Warmuth), CW (Dredze et al.).

Design Principles of Algorithms. If the learner suffers a non-zero loss on some round, then we want to balance two goals: (1) Corrective: change the weights enough so that we don't make this error again. (2) Conservative: don't change the weights too much. How do we define "too much"?

Design Principles of Algorithms. If we use Euclidean distance to measure the change between the old and new weights, enforcing (1) and minimizing (2) gives, e.g., the Perceptron, or for squared loss the Widrow-Hoff (Least Mean Squares) update. Passive-Aggressive algorithms do exactly the same, except (1) is much stronger: we want a correct classification with a margin of at least 1. Confidence-Weighted classifiers maintain a distribution over weight vectors; (1) is the same as Passive-Aggressive with a probabilistic notion of margin, and the change is measured by the KL divergence between the two distributions.

Design Principles of Algorithms. If we assume all weights are positive, we can use the (unnormalized) KL divergence to measure the change; this yields the multiplicative update, or EG algorithm (Kivinen and Warmuth).
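
A minimal sketch of the exponentiated-gradient flavour of this idea, assuming positive weights normalized to the simplex, a squared-loss gradient, and an illustrative learning rate eta:

```python
import numpy as np

def eg_update(w, x, y, eta=0.1):
    """One multiplicative (EG) step for a linear predictor with positive weights.

    w: current weight vector (positive, sums to 1); x, y: example and target.
    """
    y_hat = np.dot(w, x)
    grad = (y_hat - y) * x              # gradient of 0.5 * (y_hat - y)^2 w.r.t. w
    w_new = w * np.exp(-eta * grad)     # multiplicative, not additive, step
    return w_new / w_new.sum()          # renormalize to the simplex
```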

The Perceptron Algorithm. If there is no mistake, do nothing. If there is a mistake, update w_{t+1} = w_t + y_t x_t. Margin after the update: y_t(w_{t+1}·x_t) = y_t(w_t·x_t) + ||x_t||².
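
A minimal binary perceptron following this mistake-driven rule (labels assumed to be ±1):

```python
import numpy as np

class Perceptron:
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def predict(self, x):
        return 1 if np.dot(self.w, x) >= 0 else -1

    def update(self, x, y):
        # mistake-driven: change w only when the example is misclassified
        if y * np.dot(self.w, x) <= 0:
            self.w += y * x
```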

Passive-Aggressive Algorithms

Passive-Aggressive: Motivation. Perceptron: no guarantee on the margin after the update. PA: enforce a minimal non-zero margin after the update. In particular: if the margin is large enough (at least 1), do nothing; if the margin is less than 1, update so that the margin after the update is exactly 1.

Aggressive Update Step. Set w_{t+1} to be the solution of the following optimization problem: minimize (1/2)||w − w_t||² subject to the hinge loss of w on (x_t, y_t) being zero (i.e., margin at least 1). Closed-form update: w_{t+1} = w_t + τ_t y_t x_t, where τ_t = ℓ_t / ||x_t||² and ℓ_t is the hinge loss of w_t on (x_t, y_t).
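
A sketch of this closed-form PA step for a binary example (y in {−1, +1}):

```python
import numpy as np

def pa_update(w, x, y):
    """Passive-Aggressive update: smallest change to w that fixes the margin."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss of the current w
    if loss == 0.0:
        return w                               # passive: margin already >= 1
    tau = loss / np.dot(x, x)                  # aggressive: minimal corrective step
    return w + tau * y * x
```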

Passive-Aggressive Update

Unrealizable Case
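
For the non-separable (unrealizable) setting, the Passive-Aggressive family caps the step size with an aggressiveness parameter; a minimal sketch in the style of the PA-I variant, with C an illustrative value rather than something taken from the slides:

```python
import numpy as np

def pa1_update(w, x, y, C=1.0):
    """PA-I style step: like the PA update, but the step size is capped at C."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w
    tau = min(C, loss / np.dot(x, x))   # the cap limits damage from noisy examples
    return w + tau * y * x
```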

Confidence-Weighted Classification

Confidence-Weighted Classification: Motivation. Many positive reviews contain the word "best", so the weight w_best grows. A later negative review ("boring book – best if you want to sleep in seconds") makes a linear update reduce both w_best and w_boring. But "best" has appeared far more often than "boring", so we are more confident about its weight. How can we adjust different weights at different rates?

Update Rules. The weight vector is a linear combination of the examples, w = Σ_i α_i y_i x_i. Two rate schedules (among others): the Perceptron algorithm, conservative: α_i = 1 on mistakes and 0 otherwise; Passive-Aggressive: α_i = τ_i.

Distributions in Version Space (figure): a Gaussian distribution over weight vectors in version space, showing the mean weight vector and an example.

Margin as a Random Variable. With w drawn from N(μ, Σ), the signed margin y(w·x) is a Gaussian random variable; thus its mean is y(μ·x) and its variance is xᵀΣx.

PA-like Update. PA: w_{t+1} = argmin_w (1/2)||w − w_t||² subject to a margin of at least 1 on (x_t, y_t). New update: (μ_{t+1}, Σ_{t+1}) = argmin D_KL(N(μ, Σ) ‖ N(μ_t, Σ_t)) subject to Pr_{w∼N(μ,Σ)}[y_t(w·x_t) ≥ 0] ≥ η.

Weight Vector (Version) Space Place most of the probability mass in this region

Passive Step Nothing to do, most weight vectors already classify the example correctly

Aggressive Step. The mean is moved past the mistake line (to a large margin), the covariance is shrunk in the direction of the new example, and the current Gaussian distribution is projected onto the half-space of weight vectors that classify the example correctly.
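
The exact CW closed form is involved; as an illustration of the same idea, here is an AROW-style simplification (a related confidence-weighted update; the regularization parameter r is an assumption), which moves the mean more along uncertain directions and shrinks the variance along the new example:

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """AROW-style confidence-weighted step (a stand-in for the exact CW update).

    mu: mean weight vector; Sigma: covariance matrix; y in {-1, +1}.
    """
    margin = y * np.dot(mu, x)
    variance = x @ Sigma @ x                   # current uncertainty along x
    if margin >= 1.0:
        return mu, Sigma                       # passive step: confident and correct
    beta = 1.0 / (variance + r)
    alpha = (1.0 - margin) * beta              # hinge loss scaled by confidence
    Sigma_x = Sigma @ x
    mu = mu + alpha * y * Sigma_x              # larger moves in high-variance directions
    Sigma = Sigma - beta * np.outer(Sigma_x, Sigma_x)  # shrink variance along x
    return mu, Sigma
```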

Extensions: Multi-class and Structured Prediction

Multiclass Representation I. Keep k prototypes, one weight vector per class. For a new instance, compute the score of each class and predict the class achieving the highest score. Example scores: class 1: -1.08, class 2: 1.66, class 3: 0.37, class 4: -2.09 (prediction: class 2).

Multiclass Representation II. Map all inputs and labels into a joint vector space, and score a labeling by projecting the corresponding feature vector, e.g., F("Estimated volume was a light 2.4 million ounces .", "B I O B I I I I O") = (0 1 1 0 … ).

Multiclass Representation II. Predict the label with the highest score (inference). Naïve search is expensive if the set of possible labels is large: for BIO tagging of "Estimated volume was a light 2.4 million ounces ." (tags B I O B I I I I O), the number of labelings is 3^(number of words). Efficient Viterbi decoding handles sequences!
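
A compact Viterbi sketch for first-order sequence labeling, assuming per-position emission scores and a tag-transition score matrix (both inputs are illustrative; the slides do not specify a score decomposition):

```python
import numpy as np

def viterbi(emission, transition):
    """emission: (n_words, n_tags) scores; transition: (n_tags, n_tags) scores.

    Returns the highest-scoring tag sequence in O(n_words * n_tags^2) time,
    instead of enumerating all n_tags^n_words labelings.
    """
    n, k = emission.shape
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = emission[0]
    for t in range(1, n):
        for j in range(k):
            cand = score[t - 1] + transition[:, j] + emission[t, j]
            back[t, j] = int(np.argmax(cand))
            score[t, j] = cand[back[t, j]]
    tags = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):          # follow back-pointers
        tags.append(back[t, tags[-1]])
    return tags[::-1]
```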

Two Representations. Weight vector per class (Representation I): intuitive, with improved algorithms. Single weight vector (Representation II): generalizes Representation I and allows complex interactions between input and output, e.g., F(x, 4) places the features of x in the block corresponding to class 4.

Margin for Multiclass. Binary: y(w·x). Multiclass: the score of the correct label minus the score of the best incorrect label, w·F(x, y) − max_{y'≠y} w·F(x, y').

Margin for Multiclass. But different mistakes cost differently (as given by the loss function), so use it! Scale the required margin by the loss: the correct label must beat an incorrect label y' by a margin that grows with the cost of predicting y'.

Perceptron: multiclass online algorithm. Initialize w_1 = 0. For t = 1, 2, ...: receive an input instance x_t; output a prediction ŷ_t = argmax_y w_t·F(x_t, y); receive the feedback label y_t; compute the loss; update the prediction rule: on a mistake, w_{t+1} = w_t + F(x_t, y_t) − F(x_t, ŷ_t).
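
A minimal multiclass perceptron sketch in the one-weight-vector-per-class representation (Representation I), with plain NumPy feature vectors:

```python
import numpy as np

class MulticlassPerceptron:
    def __init__(self, n_classes, dim):
        self.W = np.zeros((n_classes, dim))   # one prototype per class

    def predict(self, x):
        return int(np.argmax(self.W @ x))     # class with the highest score

    def update(self, x, y):
        y_hat = self.predict(x)
        if y_hat != y:                        # mistake-driven update
            self.W[y] += x                    # promote the correct class
            self.W[y_hat] -= x                # demote the predicted class
```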

PA: multiclass online algorithm. Initialize w_1 = 0. For t = 1, 2, ...: receive an input instance x_t; output a prediction ŷ_t = argmax_y w_t·F(x_t, y); receive the feedback label y_t; compute the loss ℓ_t; update the prediction rule: w_{t+1} = w_t + τ_t (F(x_t, y_t) − F(x_t, ŷ_t)), with τ_t = ℓ_t / ||F(x_t, y_t) − F(x_t, ŷ_t)||².
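
A sketch of that multiclass PA step in the joint-feature representation; the feature map `F` is assumed to return a NumPy vector, and the margin requirement of 1 follows the binary case above:

```python
import numpy as np

def pa_multiclass_update(w, F, x, y, labels):
    """One multiclass Passive-Aggressive step. F(x, y) is the joint feature vector."""
    def score(lab):
        return np.dot(w, F(x, lab))
    y_hat = max((lab for lab in labels if lab != y), key=score)  # best wrong label
    loss = max(0.0, 1.0 - (score(y) - score(y_hat)))             # multiclass hinge loss
    if loss == 0.0:
        return w                                                  # passive: margin >= 1
    delta = F(x, y) - F(x, y_hat)
    tau = loss / np.dot(delta, delta)                             # minimal corrective step
    return w + tau * delta
```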

Regularization. Key idea: if an online algorithm works well on a sequence of i.i.d. examples, then an ensemble of the online hypotheses should generalize well. Popular choices: the averaged hypothesis, the majority vote, or using a validation set to pick a single hypothesis.
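
A small sketch of the averaged-hypothesis choice, applicable on top of any of the linear updates above:

```python
import numpy as np

def averaged_weights(weight_history):
    """Average the weight vectors produced after every online round.

    weight_history: iterable of weight vectors w_1, w_2, ..., one per round.
    """
    total, count = None, 0
    for w in weight_history:
        total = w.copy() if total is None else total + w
        count += 1
    return total / count
```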