Online Learning Rong Jin

Batch Learning
Given a collection of training examples D, learn a classification model from D.
What if training examples are received one at a time?

Online Learning
For t = 1, 2, …, T:
Receive an instance x_t
Predict its class label
Receive the true class label y_t
Incur a loss
Update the classification model
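A minimal sketch of this protocol in Python; the `model` object with `predict` and `update` methods and the zero-one loss are illustrative assumptions, not taken from the slides.

```python
# Generic online learning loop: one example arrives per round.
def online_learn(model, stream):
    total_loss = 0.0
    for x_t, y_t in stream:                # receive an instance, then its label
        y_hat = model.predict(x_t)         # predict before seeing the true label
        total_loss += float(y_hat != y_t)  # incur a loss (zero-one loss here)
        model.update(x_t, y_t)             # update the model with the revealed label
    return total_loss
```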

Objective
Minimize the total loss over the sequence of examples.
Loss functions: zero-one loss and hinge loss.
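For reference, the standard definitions of these two losses for a linear classifier w, written in LaTeX (the slides' own notation is not reproduced here):

```latex
% zero-one loss: counts a mistake
\ell_{0/1}\bigl(w;(x,y)\bigr) = \mathbb{1}\bigl[\, y\,\langle w, x\rangle \le 0 \,\bigr]
% hinge loss: a convex upper bound on the zero-one loss
\ell_{\mathrm{hinge}}\bigl(w;(x,y)\bigr) = \max\bigl(0,\; 1 - y\,\langle w, x\rangle\bigr)
```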

Loss Functions (figure: plots of the zero-one loss and the hinge loss)

Linear Classifiers
We restrict our discussion to linear classifiers.
Prediction: sign(⟨w, x⟩)
Confidence: |⟨w, x⟩|

Separable Set

Inseparable Sets

Why Online Learning?
Fast
Memory efficient: processes one example at a time
Simple to implement
Formal guarantees: regret/mistake bounds
Online-to-batch conversions
No statistical assumptions
Adaptive
But: not as good as a well-designed batch algorithm

Update Rules
Online algorithms are based on an update rule that defines w_{t+1} from w_t (and possibly other information).
Linear classifiers: find w_{t+1} from w_t based on the input (x_t, y_t).
Some update rules:
Perceptron (Rosenblatt)
ALMA (Gentile)
ROMMA (Li & Long)
NORMA (Kivinen et al.)
MIRA (Crammer & Singer)
EG (Littlestone and Warmuth)
Bregman-based (Warmuth)

Perceptron
Initialize w_1 = 0
For t = 1, 2, …, T:
Receive an instance x_t
Predict its class label ŷ_t = sign(⟨w_t, x_t⟩)
Receive the true class label y_t
If ŷ_t ≠ y_t then w_{t+1} = w_t + y_t x_t (otherwise w_{t+1} = w_t)
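A runnable sketch of this update in Python with NumPy; the class interface and variable names are mine, not from the slides.

```python
import numpy as np

class Perceptron:
    """Classic Perceptron: updates only on misclassified examples."""
    def __init__(self, dim):
        self.w = np.zeros(dim)              # initialize w_1 = 0

    def predict(self, x):
        return 1 if np.dot(self.w, x) >= 0 else -1

    def update(self, x, y):
        # conservative update: change w only when the prediction is wrong
        if self.predict(x) != y:
            self.w += y * x

# usage: labels are +1/-1
model = Perceptron(dim=3)
for x, y in [(np.array([1.0, 0.0, 1.0]), 1), (np.array([0.0, 1.0, 1.0]), -1)]:
    model.update(x, y)
```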

Geometrical Interpretation (figure)

Mistake Bound: Separable Case
Assume the data set D is linearly separable with margin γ, i.e., there exists a unit-norm vector u such that y_t⟨u, x_t⟩ ≥ γ for all t.
Assume ‖x_t‖ ≤ R for all t.
Then the maximum number of mistakes made by the Perceptron algorithm is bounded by R²/γ².

Mistake Bound: Separable Case
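A standard proof sketch of this bound, written in LaTeX; it uses the margin γ, the norm bound R, and the unit-norm comparator u introduced above, and the slide's own derivation may differ in notation.

```latex
% Let M be the number of mistakes; on each mistake, w_{t+1} = w_t + y_t x_t.
% Progress in the direction of u:
\langle u, w_{T+1}\rangle \;=\; \sum_{t \in \text{mistakes}} y_t \langle u, x_t\rangle \;\ge\; M\gamma
% Growth of the norm (on a mistake, y_t\langle w_t, x_t\rangle \le 0):
\|w_{t+1}\|^2 \;=\; \|w_t\|^2 + 2 y_t\langle w_t, x_t\rangle + \|x_t\|^2 \;\le\; \|w_t\|^2 + R^2
\;\;\Rightarrow\;\; \|w_{T+1}\|^2 \;\le\; M R^2
% Combine via Cauchy-Schwarz:
M\gamma \;\le\; \langle u, w_{T+1}\rangle \;\le\; \|u\|\,\|w_{T+1}\| \;\le\; \sqrt{M}\,R
\;\;\Rightarrow\;\; M \;\le\; R^2/\gamma^2
```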

Mistake Bound: Inseparable Case
Let u be the best linear classifier.
We measure our progress relative to u.
Consider a round t on which we make a mistake.

Mistake Bound: Inseparable Case Result 1:

Mistake Bound: Inseparable Case Result 2
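A commonly stated bound of this type relates the Perceptron's mistakes to the cumulative hinge loss of any comparator u. The following form, in LaTeX, is one standard version under the assumption ‖x_t‖ ≤ R; the slides' exact statements of Result 1 and Result 2 may differ.

```latex
% cumulative hinge loss of the comparator u over the sequence
L(u) \;=\; \sum_{t=1}^{T} \max\bigl(0,\; 1 - y_t\langle u, x_t\rangle\bigr)
% number of Perceptron mistakes M, for any u
M \;\le\; \bigl(R\,\|u\| + \sqrt{L(u)}\,\bigr)^{2}
```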

Perceptron with Projection
Initialize w_1 = 0
For t = 1, 2, …, T:
Receive an instance x_t
Predict its class label
Receive the true class label y_t
If the prediction is wrong, then update the classifier and apply the projection step
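One common way to realize such a projection step is to project the updated weight vector back onto an L2 ball of fixed radius. A sketch under that assumption; the radius and the choice of L2 projection are mine, not taken from the slides.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Project w onto the L2 ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

class ProjectedPerceptron:
    def __init__(self, dim, radius=1.0):    # radius is an assumed hyperparameter
        self.w = np.zeros(dim)
        self.radius = radius

    def predict(self, x):
        return 1 if np.dot(self.w, x) >= 0 else -1

    def update(self, x, y):
        # Perceptron update followed by projection back onto the ball
        if self.predict(x) != y:
            self.w = project_l2_ball(self.w + y * x, self.radius)
```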

Remarks
The mistake bound is measured for a sequence of classifiers.
The bound does not depend on the dimension of the feature vector.
The bound holds for all sequences (no i.i.d. assumption).
It is not tight for most real-world data, but it cannot be further improved in general.

Perceptron
Initialize w_1 = 0
For t = 1, 2, …, T:
Receive an instance x_t
Predict its class label
Receive the true class label y_t
If the prediction is wrong, then w_{t+1} = w_t + y_t x_t
Conservative: the classifier is updated only when it misclassifies an example.

Aggressive Perceptron
Initialize w_1 = 0
For t = 1, 2, …, T:
Receive an instance x_t
Predict its class label
Receive the true class label y_t
If the (aggressive) update condition holds, then update the classifier
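Aggressive variants typically update not only on mistakes but whenever the margin y_t⟨w_t, x_t⟩ falls below a threshold. A sketch under that assumption; the threshold value of 1 is my choice, not taken from the slides.

```python
import numpy as np

class AggressivePerceptron:
    def __init__(self, dim, margin=1.0):    # margin threshold is an assumption
        self.w = np.zeros(dim)
        self.margin = margin

    def predict(self, x):
        return 1 if np.dot(self.w, x) >= 0 else -1

    def update(self, x, y):
        # update whenever the example is misclassified OR classified
        # correctly but with margin below the threshold
        if y * np.dot(self.w, x) < self.margin:
            self.w += y * x
```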

Regret Bound
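For reference, the standard notion of regret compares the learner's cumulative loss against the best fixed classifier in hindsight; in LaTeX (this is the generic definition, not the slide's specific bound):

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell\bigl(w_t;(x_t,y_t)\bigr) \;-\; \min_{w} \sum_{t=1}^{T} \ell\bigl(w;(x_t,y_t)\bigr)
```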

Learning a Classifier
The evaluation (mistake bound or regret bound) concerns a sequence of classifiers.
But, at the end of the day, which classifier should be used? The last one? One chosen by cross-validation?

Learning with Expert Advice
Learning to combine the predictions from multiple experts.
An ensemble of d experts: h_1, …, h_d
Combination weights: w_1, …, w_d
Combined classifier: a weighted combination of the experts' predictions
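A common instantiation of such a combined classifier is the weighted-majority vote; this specific form is an assumption, since the slide only names the components:

```latex
% d experts h_1,\dots,h_d with nonnegative weights w_1,\dots,w_d
H(x) \;=\; \mathrm{sign}\Bigl(\,\sum_{i=1}^{d} w_i\, h_i(x)\Bigr)
```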

Hedge
Simple case: there exists one expert who can perfectly classify all the training examples. What is your learning strategy?
Difficult case: what if we do not have such a perfect expert?

Hedge Algorithm

Hedge Algorithm
Initialize the expert weights w_1, …, w_d
For t = 1, 2, …, T:
Receive a training example (x_t, y_t)
Prediction: combine the experts' predictions using the current weights
If the combined prediction is wrong, then for i = 1, 2, …, d: if expert i was wrong on this example, then discount its weight
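A sketch of this scheme in Python, under standard weighted-majority assumptions: weights initialized to 1, a multiplicative discount factor beta in (0, 1), and a weighted vote for the combined prediction. These specifics are assumptions, not taken from the slides.

```python
import numpy as np

class Hedge:
    """Weighted-majority style combination of d experts (labels in {-1, +1})."""
    def __init__(self, d, beta=0.5):      # beta in (0, 1) is an assumed parameter
        self.weights = np.ones(d)         # assumed initialization: all weights 1
        self.beta = beta

    def predict(self, expert_preds):
        # weighted vote over the experts' +1/-1 predictions
        score = float(np.dot(self.weights, expert_preds))
        return 1 if score >= 0 else -1

    def update(self, expert_preds, y):
        # following the slide: discount erring experts only when the
        # combined prediction is wrong
        if self.predict(expert_preds) != y:
            wrong = np.asarray(expert_preds) != y
            self.weights[wrong] *= self.beta
```

With beta close to 1 the discount is gentle; with beta close to 0 a single mistake nearly removes an expert from the vote.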

Mistake Bound

Measure the progress Lower bound

Mistake Bound Upper bound

Mistake Bound Upper bound

Mistake Bound
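The classic weighted-majority analysis yields a bound of the following form, relating the number of mistakes M of the combined classifier to the number of mistakes m of the best single expert among d experts. It assumes the multiplicative update with factor β sketched above; the slides' exact constants and derivation steps may differ.

```latex
M \;\le\; \frac{\ln d \;+\; m \ln\frac{1}{\beta}}{\ln\frac{2}{1+\beta}}
```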