Koby Crammer Department of Electrical Engineering

Presentation transcript:

Second Order Learning. Koby Crammer, Department of Electrical Engineering. ECML PKDD 2013, Prague.

Thanks: Mark Dredze, Alex Kulesza, Avihai Mejer, Edward Moroshko, Francesco Orabona, Fernando Pereira, Yoram Singer, Nina Vaitz.

Tutorial Context: Online Learning, SVMs, Optimization Theory, Real-World Data.

Outline. Background: online learning + notation, Perceptron, stochastic gradient descent, Passive-Aggressive. Second-Order Algorithms: Second-Order Perceptron, Confidence-Weighted and AROW, AdaGrad. Properties: kernels, analysis. Empirical Evaluation: synthetic data, real data.

Online Learning Tyrannosaurus rex

Online Learning Triceratops

Online Learning: Velociraptor, Tyrannosaurus rex

Formal Setting – Binary Classification. Instances: images, sentences. Labels: parse trees, names. Prediction rule: linear prediction rules. Loss: number of mistakes.

Predictions. Discrete predictions: $\hat{y} \in \{-1, +1\}$; hard to optimize. Continuous predictions: $\hat{y} \in \mathbb{R}$; the sign gives the label and the magnitude gives the confidence.

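Loss Functions. Natural loss: the zero-one loss, 1 if the predicted label differs from $y$ and 0 otherwise. Real-valued-prediction losses: hinge loss $\max\{0, 1 - y\hat{y}\}$; exponential loss $\exp(-y\hat{y})$ (Boosting); log loss $\log(1 + \exp(-y\hat{y}))$ (Max Entropy, Boosting).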

Loss Functions [Figure: hinge loss and zero-one loss]
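The losses named on these slides can be written down in a few lines. The sketch below is only an illustration: the function names are hypothetical, labels are assumed to be in {-1, +1}, and a score of exactly zero is counted as an error, which is a convention chosen here rather than something the slides specify.

```python
import math

def zero_one_loss(y, score):
    """1 if the sign of the real-valued score disagrees with y (zero counts as an error)."""
    return 1.0 if y * score <= 0 else 0.0

def hinge_loss(y, score):
    """Zero once the signed margin y * score reaches 1, linear below that."""
    return max(0.0, 1.0 - y * score)

def exp_loss(y, score):
    """Exponential loss, as used in Boosting."""
    return math.exp(-y * score)

def log_loss(y, score):
    """Log loss, as used in maximum-entropy models."""
    return math.log(1.0 + math.exp(-y * score))
```

All three real-valued losses depend on the data only through the signed margin `y * score`, which is why the margin is the central quantity in the rest of the tutorial.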

Online Framework. Initialize the classifier. The algorithm works in rounds. On each round the online algorithm: (1) receives an input instance, (2) outputs a prediction, (3) receives a feedback label, (4) computes the loss, (5) updates the prediction rule. Goal: suffer a small cumulative loss.

Online Learning. Maintain a model M, for example a linear model. Get an instance x. Predict the label ŷ = M(x): we make the prediction by computing the inner product, and the output is a bit (after thresholding) or can be a real number. Get the true label y. Suffer the loss l(ŷ, y): the loss can be 1 for an error and 0 otherwise, but quadratic loss and others can also be used. Update the model: change the weights, e.g. by adding the input x times a scalar.
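The protocol above maps directly onto a loop. This is a minimal sketch, not the tutorial's code: `run_online` and `add_scaled_input` are hypothetical names, the model is assumed linear, and the update shown (add the input times the label on an error) is just one instance of "adding the input x times a scalar".

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_online(stream, dim, update):
    """Run the online protocol; return the final weights and the cumulative zero-one loss."""
    w = [0.0] * dim                            # initialize the classifier
    cumulative_loss = 0
    for x, y in stream:                        # one round per instance
        y_hat = 1 if dot(w, x) >= 0 else -1    # predict by thresholding the inner product
        loss = int(y_hat != y)                 # the true label arrives; suffer the loss
        cumulative_loss += loss
        w = update(w, x, y, loss)              # update the prediction rule
    return w, cumulative_loss

def add_scaled_input(w, x, y, loss):
    """Hypothetical update: on an error, add the input times the scalar y."""
    if not loss:
        return w
    return [wi + y * xi for wi, xi in zip(w, x)]

stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.2], 1), ([0.1, 1.0], -1)]
w, total = run_online(stream, 2, add_scaled_input)
```

Passing the update as a function keeps the protocol fixed while the algorithm varies, which mirrors how the later slides swap in different update rules.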

Linear Classifiers. Any features. W.l.o.g., binary classifiers of the form $\operatorname{sign}(w \cdot x)$ (with some abuse of notation).

Linear Classifiers (cntd.) Prediction: $\hat{y} = \operatorname{sign}(w \cdot x)$. Confidence in prediction: $|w \cdot x|$.

Linear Classifiers. Input: the instance to be classified. Each instance is described by a finite set of numeric (float or integer) features. To make a prediction we weight the features according to the model, sum, and take a threshold. Model: the weight vector of the classifier. There is a duality between input and model.

Margin. Margin of an example $(x, y)$ with respect to the classifier $w$: $y\,(w \cdot x)$. Note: the set is separable iff there exists $w$ such that $y_i\,(w \cdot x_i) > 0$ for all $i$.

Geometrical Interpretation

Geometrical Interpretation [Figure: examples labeled by their margin: margin >> 0, margin > 0, margin < 0, margin << 0]

Hinge Loss

Why Online Learning? Fast. Memory efficient: processes one example at a time. Simple to implement. Formal guarantees: mistake bounds. Online-to-batch conversions. No statistical assumptions. Adaptive. Drawback: not as good as well-designed batch algorithms.

The Perceptron Algorithm (Rosenblatt, 1958). If no mistake: do nothing. If mistake: update $w_{t+1} = w_t + y_t x_t$. Margin after update: $y_t\,(w_{t+1} \cdot x_t) = y_t\,(w_t \cdot x_t) + \|x_t\|^2$.
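A single Perceptron round can be sketched as follows. The function name is hypothetical, and treating a margin of exactly zero as a mistake is an assumption made here. The example checks the standard identity that an update raises the margin on the current example by the squared norm of the input.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_step(w, x, y):
    """Return (new_w, mistake); the weights change only when y * (w . x) <= 0."""
    if y * dot(w, x) <= 0:
        return [wi + y * xi for wi, xi in zip(w, x)], True
    return w, False

w = [0.5, -1.0]
x, y = [1.0, 2.0], 1
margin_before = y * dot(w, x)            # 0.5 - 2.0 = -1.5, a mistake
w_new, mistake = perceptron_step(w, x, y)
margin_after = y * dot(w_new, x)         # margin_before + ||x||^2 = -1.5 + 5.0
```

Note that the margin after the update grows but is not guaranteed to be positive, which is the gap the Passive-Aggressive algorithm addresses later.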

Geometrical Interpretation

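Gradient Descent. Consider the batch problem $\min_w \sum_i \ell(w; (x_i, y_i))$. Simple algorithm: initialize $w_1$; iterate, for $t = 1, 2, \dots$: compute the gradient $g_t = \nabla_w \sum_i \ell(w_t; (x_i, y_i))$; set $w_{t+1} = w_t - \alpha\, g_t$.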

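Stochastic Gradient Descent. Consider the batch problem $\min_w \sum_i \ell(w; (x_i, y_i))$. Simple algorithm: initialize $w_1$; iterate, for $t = 1, 2, \dots$: pick a random index $i$; compute $g_t = \nabla_w \ell(w_t; (x_i, y_i))$; set $w_{t+1} = w_t - \alpha\, g_t$.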

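Stochastic Gradient Descent. "Hinge" loss: $\ell(w; (x, y)) = \max\{0, -y\,(w \cdot x)\}$. The gradient is $-y\,x$ when the margin is negative and $0$ otherwise. Simple algorithm: initialize $w_1$; iterate, for $t = 1, 2, \dots$: pick a random index $i$; if $y_i\,(w_t \cdot x_i) \le 0$ then $g_t = -y_i x_i$, else $g_t = 0$; set $w_{t+1} = w_t - \alpha\, g_t$. The perceptron is a stochastic gradient descent algorithm on a sum of "hinge" losses with a specific order of examples.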
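The closing claim of this slide can be checked numerically. The sketch below assumes the "hinge" here is the loss max(0, -y (w . x)), a step size of 1, and a fixed pass over the examples in place of random indices; under those assumptions the SGD iterates coincide with the Perceptron's. Function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_zero_hinge(data, dim, step=1.0):
    """SGD on max(0, -y * (w . x)), visiting examples in the given order."""
    w = [0.0] * dim
    for x, y in data:
        if y * dot(w, x) <= 0:
            g = [-y * xi for xi in x]      # (sub)gradient when the margin is not positive
        else:
            g = [0.0] * dim                # zero gradient otherwise
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

def perceptron(data, dim):
    """Plain Perceptron over the same ordered examples."""
    w = [0.0] * dim
    for x, y in data:
        if y * dot(w, x) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

data = [([1.0, 3.0], 1), ([2.0, -1.0], 1), ([-1.0, 2.0], -1), ([-1.5, -2.0], -1)]
w_sgd = sgd_zero_hinge(data, 2)
w_perc = perceptron(data, 2)
```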

Motivation. Perceptron: no guarantee on the margin after the update. PA: enforce a minimal non-zero margin after the update. In particular: if the margin is large enough (at least 1), do nothing; if the margin is less than 1, update such that the margin after the update is exactly 1.

Input Space

Input Space vs. Version Space. Input space: points are input data; one constraint is induced by a weight vector; primal space; half-space = all input examples that are classified correctly by a given predictor (weight vector). Version space: points are weight vectors; one constraint is induced by an input example; dual space; half-space = all predictors (weight vectors) that correctly classify a given input example.

Weight Vector (Version) Space. The algorithm forces $w$ to reside in this region.

Passive Step. Nothing to do: $w$ already resides on the desired side.

Aggressive Step. The algorithm projects $w$ onto the desired half-space.

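Aggressive Update Step. Set $w_{t+1}$ to be the solution of the following optimization problem: $\min_w \tfrac{1}{2}\|w - w_t\|^2$ subject to $y_t\,(w \cdot x_t) \ge 1$. Solution: $w_{t+1} = w_t + \alpha_t\, y_t\, x_t$ with $\alpha_t = \max\{0, 1 - y_t\,(w_t \cdot x_t)\} / \|x_t\|^2$.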
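The passive and aggressive steps can be sketched in one function. This follows the standard hard-margin Passive-Aggressive update; the function name is hypothetical and the input is assumed to be non-zero so that the division is defined.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pa_step(w, x, y):
    """Project w onto the half-space {w : y * (w . x) >= 1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))
    if loss == 0.0:
        return w                           # passive: the margin is already at least 1
    rate = loss / dot(x, x)                # smallest step that brings the margin to 1
    return [wi + rate * y * xi for wi, xi in zip(w, x)]

w_new = pa_step([0.0, 0.0], [3.0, 4.0], -1)   # aggressive step
margin_after = -1 * dot(w_new, [3.0, 4.0])    # equals 1 up to rounding
w_same = pa_step([1.0, 0.0], [2.0, 0.0], 1)   # passive step, weights unchanged
```

Because the update is a projection, the new weight vector is the closest one to the old vector that satisfies the margin constraint, so the margin lands on 1 rather than beyond it.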

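Perceptron vs. PA. Common update: $w_{t+1} = w_t + \alpha_t\, y_t\, x_t$. Perceptron: $\alpha_t = 1$ on a mistake and $0$ otherwise. Passive-Aggressive: $\alpha_t = \max\{0, 1 - y_t\,(w_t \cdot x_t)\} / \|x_t\|^2$.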

Perceptron vs. PA [Figure: three margin regimes along the margin axis: error; no error, small margin; no error, large margin]

Perceptron vs. PA

Geometrical Assumption All examples are bounded in a ball of radius R

Separability. There exists a unit vector $u$ that classifies the data correctly.

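Perceptron's Mistake Bound. The number of mistakes the algorithm makes is bounded by $R^2 / \gamma^2$, where $R$ is the radius of the ball containing the examples and $\gamma = \min_i y_i\,(u \cdot x_i)$ is the margin attained by the separating unit vector $u$. Simple case: positive points, negative points, separating hyperplane. Bound is: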
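The classical Perceptron bound of R^2 / gamma^2 mistakes can be checked on a toy set. Everything below is illustrative: the four points, the reference unit vector `u`, and the function name are made up for this sketch, and the data are chosen to be separable so that the loop terminates.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def count_perceptron_mistakes(data, dim, max_passes=100):
    """Cycle over the data until a pass makes no mistake; return the total mistakes."""
    w = [0.0] * dim
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in data:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return mistakes

data = [([1.0, 3.0], 1), ([2.0, -1.0], 1), ([-1.0, 2.0], -1), ([-1.5, -2.0], -1)]
u = [1.0, 0.0]                                   # a unit vector that separates the data
R_sq = max(dot(x, x) for x, _ in data)           # squared radius of the enclosing ball
gamma = min(y * dot(u, x) for x, y in data)      # margin attained by u
bound = R_sq / gamma ** 2
mistakes = count_perceptron_mistakes(data, 2)
```

On this set the large second coordinate inflates R without helping separation, which is the kind of geometry the second-order algorithms in the next section are meant to handle.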

Geometrical Motivation

SGD on such data

Second Order Perceptron (Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005). Assume all inputs are given. Compute the "whitening" matrix. Run the Perceptron on the "whitened" data. New "whitening" matrix:

Second Order Perceptron (Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005). Bound: Same simple case: Thus, the bound is:

Second Order Perceptron (Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005). If no mistake: do nothing. If mistake: update.
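An online version of the idea can be sketched as below. This follows the Second-Order Perceptron as commonly described: predict with the weights obtained from the inverse of aI plus the sum of x x^T over past mistakes and the current input, and store the input only on a mistake. The function name, the parameter `a`, and the use of the Sherman-Morrison identity to maintain the inverse are choices made for this sketch, not details taken from the slides.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def second_order_perceptron(data, dim, a=1.0):
    """Return (v, A_inv, mistakes) after one pass over the data."""
    v = [0.0] * dim                                   # sum of y * x over mistakes
    A_inv = [[(1.0 / a if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    mistakes = 0
    for x, y in data:
        # Sherman-Morrison: inverse of (A + x x^T) computed from the inverse of A.
        Ax = [dot(row, x) for row in A_inv]
        denom = 1.0 + dot(x, Ax)
        B_inv = [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(dim)]
                 for i in range(dim)]
        margin = y * dot(v, [dot(row, x) for row in B_inv])
        if margin <= 0:                               # mistake: keep the new statistics
            v = [vi + y * xi for vi, xi in zip(v, x)]
            A_inv = B_inv
            mistakes += 1
    return v, A_inv, mistakes

v, A_inv, mistakes = second_order_perceptron([([1.0, 0.0], 1)], 2)
```

After the single mistake in the usage line, the stored inverse shrinks only along the direction of the observed input, which is the per-direction adaptivity the first-order Perceptron lacks.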

SGD on whitened data

Span-based Update Rules. The weight vector is a linear combination of examples. Update: $w_f \leftarrow w_f + \alpha\, y\, x_f$, where $w_f$ is the weight of feature $f$, $\alpha$ is the learning rate, $y$ is the target label (either -1 or 1), and $x_f$ is the feature value of the input instance. Two rate schedules (among many others): the Perceptron algorithm (conservative) and Passive-Aggressive.

Sentiment Classification. Example review: "Who needs this Simpsons book? You DOOOOOOOO. This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!" Threshold the 0-5 stars at 3; later we change this. (Pang, Lee, Vaithyanathan, EMNLP 2002)

Sentiment Classification. Many positive reviews contain the word "best" (weight W_best). Later, a negative review arrives: "boring book – best if you want to sleep in seconds". A linear update will reduce both W_best and W_boring. But "best" appeared more often than "boring", so the model knows more about "best" than about "boring", and it is better to reduce the two weights at different rates. We should take the feature statistics into consideration: given evidence (a document), the information it contributes about a feature is monotonically decreasing in the number of past observations of that feature. Maintain a confidence parameter that measures the confidence in each weight.

Natural Language Processing. Big datasets, large number of features. Many features are only weakly correlated with the target label. Linear classifiers: features are associated with word counts. Heavy-tailed feature distribution [Figure: counts vs. feature rank]: many rare and weakly informative words, and some very frequent words. Need to take frequency into consideration.

Natural Language Processing

New Prediction Models. Gaussian distributions over weight vectors. The covariance is either full or diagonal. In NLP we have many features and use a diagonal covariance.

Classification. Given a new example $x$. Stochastic: draw a weight vector $w \sim \mathcal{N}(\mu, \Sigma)$ and make a prediction $\operatorname{sign}(w \cdot x)$. Collective: average weight vector, average margin, or average prediction.

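The Margin is a Random Variable. The signed margin is a random 1-d Gaussian: $y\,(w \cdot x) \sim \mathcal{N}\big(y\,(\mu \cdot x),\; x^\top \Sigma x\big)$. Thus: $\Pr[y\,(w \cdot x) \ge 0] = \Phi\big(y\,(\mu \cdot x) / \sqrt{x^\top \Sigma x}\big)$.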
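Since the signed margin is a one-dimensional Gaussian, the probability of a correct prediction is a normal CDF evaluated at the mean margin divided by its standard deviation. The sketch below assumes a diagonal covariance stored as a list of per-feature variances and a strictly positive margin variance; the function name is hypothetical.

```python
import math

def prob_correct(mu, sigma_diag, x, y):
    """P[y * (w . x) >= 0] for w drawn from N(mu, diag(sigma_diag))."""
    mean = y * sum(m * xi for m, xi in zip(mu, x))          # mean of the signed margin
    var = sum(s * xi * xi for s, xi in zip(sigma_diag, x))  # variance x^T Sigma x
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * var)))

p_confident = prob_correct([1.0, 0.0], [1.0, 1.0], [1.0, 0.0], 1)
p_unsure = prob_correct([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], 1)
```

Shrinking the variance along the input direction pushes this probability toward 1 for the same mean margin, which is how the covariance encodes confidence.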

Linear Model → Distribution over Linear Models. [Figure labels: mean weight vector, example.] Each green point is a single weight vector that classifies a given example into one class, and blue points are weight vectors that classify it into the other class. The majority goes with the mean.

Weight Vector (Version) Space. The algorithm forces most of the values of $w$ to reside in this region.

Passive Step. Nothing to do: most of the weight vectors already classify the example correctly.

Aggressive Step. The algorithm projects the current Gaussian distribution onto the half-space. The mean is moved beyond the mistake line (large margin). The covariance is shrunk in the direction of the input example.

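Projection Update. Vectors (aka PA): $\min_w \tfrac{1}{2}\|w - w_t\|^2$ subject to $y_t\,(w \cdot x_t) \ge 1$. Distributions (new update): $\min_{\mu, \Sigma} D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_t, \Sigma_t)\big)$ subject to $\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}[y_t\,(w \cdot x_t) \ge 0] \ge \eta$, where $\eta$ is the confidence parameter.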

Itakura-Saito Divergence. The objective is a sum of two divergences of the parameters: a matrix Itakura-Saito divergence on the covariances and a Mahalanobis distance on the means. It is convex in both arguments simultaneously.

Constraint. Probabilistic constraint: $\Pr[y\,(w \cdot x) \ge 0] \ge \eta$. Equivalent margin constraint: $y\,(\mu \cdot x) \ge \phi \sqrt{x^\top \Sigma x}$ with $\phi = \Phi^{-1}(\eta)$. Convex in $\mu$, concave in $\Sigma$. Solutions: linear approximation (Dredze, Crammer, Pereira. ICML 2008); change variables to get a convex formulation (Crammer, Dredze, Pereira. NIPS 2008); relax (AROW) (Crammer, Dredze, Kulesza. NIPS 2009).

Convexity (Crammer, Dredze, Pereira. NIPS 2008). Change variables. Equivalent convex formulation.

AROW (Crammer, Dredze, Kulesza. NIPS 2009). PA: CW: Similar update form as CW.

The Update. The optimization update can be solved analytically. The coefficients depend on the specific algorithm.

Definitions

Updates: AROW, CW (change of variables), CW (linearization).

Per-feature Learning Rate. The updates reduce the learning rate and the eigenvalues of the covariance matrix.

Diagonal Matrix. Given a matrix $A$, we define $\operatorname{diag}(A)$ to be only the diagonal part of the matrix. Two options: make the matrix diagonal, or make the inverse diagonal.
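A diagonal AROW step can be sketched as follows. It uses the AROW update as commonly stated (beta = 1 / (x^T Sigma x + r), alpha = hinge loss times beta, mean moved along Sigma x, covariance reduced by beta Sigma x x^T Sigma) and then keeps only the diagonal of the new covariance, i.e. the "make the matrix diagonal" option. The function name and the default `r` are assumptions of this sketch.

```python
def arow_diag_step(mu, sigma, x, y, r=1.0):
    """One AROW update with the covariance stored as a list of per-feature variances."""
    margin = y * sum(m * xi for m, xi in zip(mu, x))
    if margin >= 1.0:
        return mu, sigma                                  # no hinge loss: no update
    v = sum(s * xi * xi for s, xi in zip(sigma, x))       # margin variance x^T Sigma x
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    new_mu = [m + alpha * y * s * xi for m, s, xi in zip(mu, sigma, x)]
    # Diagonal of Sigma - beta * Sigma x x^T Sigma.
    new_sigma = [s - beta * s * s * xi * xi for s, xi in zip(sigma, x)]
    return new_mu, new_sigma

mu, sigma = arow_diag_step([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], 1)
```

Each variance acts as a per-feature learning rate: a feature seen in an update has its variance reduced, so later updates move its weight less, while unseen features keep their full rate.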

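(Back to) Stochastic Gradient Descent. Consider the batch problem $\min_w \sum_i \ell(w; (x_i, y_i))$. Simple algorithm: initialize $w_1$; iterate, for $t = 1, 2, \dots$: pick a random index $i$; compute $g_t = \nabla_w \ell(w_t; (x_i, y_i))$; set $w_{t+1} = w_t - \alpha\, g_t$.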

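Adaptive Stochastic Gradient Descent (Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010). Consider the batch problem $\min_w \sum_i \ell(w; (x_i, y_i))$. Simple algorithm: initialize $w_1$; iterate, for $t = 1, 2, \dots$: pick a random index $i$; compute $g_t = \nabla_w \ell(w_t; (x_i, y_i))$ and $A_t = \big(\sum_{s \le t} g_s g_s^\top\big)^{1/2}$; set $w_{t+1} = w_t - \alpha\, A_t^{-1} g_t$.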

Adaptive Stochastic Gradient Descent (Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010). Very general: can be used with various regularizations. The matrix A can be either full or diagonal. Comes with convergence and regret bounds. Similar performance to AROW.

Adaptive Stochastic Gradient Descent (Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010) [Figure: SGD vs. AdaGrad]
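The diagonal variant of AdaGrad is short enough to sketch in full. The version below uses the hinge loss, a fixed pass order instead of random indices, and a small `eps` for numerical safety; these choices and the function name are assumptions of the sketch rather than details from the slides.

```python
import math

def adagrad_diag(data, dim, step=1.0, eps=1e-8, epochs=1):
    """Diagonal AdaGrad on the hinge loss max(0, 1 - y * (w . x))."""
    w = [0.0] * dim
    G = [0.0] * dim                                    # running sum of squared gradients
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) >= 1.0:
                continue                               # zero gradient, nothing to do
            g = [-y * xi for xi in x]
            G = [Gi + gi * gi for Gi, gi in zip(G, g)]
            w = [wi - step * gi / (math.sqrt(Gi) + eps)
                 for wi, gi, Gi in zip(w, g, G)]
    return w

w_small = adagrad_diag([([2.0, 0.0], 1)], 2)
w_large = adagrad_diag([([200.0, 0.0], 1)], 2)
```

In both usage lines the first step along the active coordinate has size close to `step`, regardless of the feature's scale, while plain SGD would move 100 times further in the second case.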

Special Case of a General Framework (Orabona and Crammer, NIPS 2010). Any loss function. Assume: convex in the first argument, non-negative. Algorithm: online convex programming with a shifting link function.

Special Case of a General Framework (Orabona and Crammer, NIPS 2010)

Our Algorithms as a Special Case Loss: Regularization functions

Kernels

Proof Show that we can write Induction

Proof (cntd) By update rule : Thus

Proof (cntd) By update rule :

Proof (cntd) Thus
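The proof slides argue that the second-order algorithms can be written in terms of inner products between examples. The same idea is easiest to see on the first-order Perceptron, so the sketch below kernelizes that simpler algorithm instead; it is a swapped-in illustration, not the second-order construction from the slides. The RBF kernel, its `gamma`, the XOR-style data, and all function names are assumptions.

```python
import math

def rbf(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def score(alphas, data, kernel, x):
    """Prediction score written only through kernel evaluations with stored examples."""
    return sum(a * yj * kernel(xj, x) for a, (xj, yj) in zip(alphas, data) if a)

def kernel_perceptron(data, kernel, max_passes=20):
    """Return per-example mistake counts; the weight vector is never formed explicitly."""
    alphas = [0] * len(data)
    for _ in range(max_passes):
        mistakes = 0
        for i, (x, y) in enumerate(data):
            if y * score(alphas, data, kernel, x) <= 0:
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alphas

def predict(alphas, data, kernel, x):
    return 1 if score(alphas, data, kernel, x) > 0 else -1

xor = [([0.0, 0.0], -1), ([1.0, 1.0], -1), ([0.0, 1.0], 1), ([1.0, 0.0], 1)]
alphas = kernel_perceptron(xor, rbf)
```

The XOR labels are not linearly separable in the input space, yet the kernelized learner fits them because it operates in the feature space induced by the kernel.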

Properties. The eigenvalues of the covariance matrix monotonically decrease. The mean of the signed margin increases; its variance decreases.

Statistical Interpretation Margin Constraint : Distribution over weight-vectors : Assume input is corrupted with Gaussian noise

Statistical Interpretation [Figure: version space, showing the mean weight vector with a good realization and a bad realization; input space, showing the input instance (example) and the linear separator]

Mistake Bound (Orabona and Crammer, NIPS 2010). For any reference weight vector, the number of mistakes made by AROW is upper bounded by an expression over two sets: the set of example indices with a mistake, and the set of example indices with an update but not a mistake.

Comment I Separable case and no updates: where

Comment II For large the bound becomes: When no updates are performed: Perceptron

Bound for Diagonal Algorithm (Orabona and Crammer, NIPS 2010). The number of mistakes is bounded by a quantity that is low when a feature is either rare or non-informative, exactly as in NLP.

Synthetic Data. 20 features: 2 informative (rotated skewed Gaussian) and 18 noisy. Using a single feature is as good as a random prediction.

Synthetic Data (cntd.) Distribution after 50 examples (x1)

Synthetic Data (no noise) [Figure: comparison of Perceptron, PA, SOP, CW-full, CW-diag]

Synthetic Data (10% noise)

Data. Sentiment: reviews from 6 Amazon domains (Blitzer et al.); classify a product review as either positive or negative. Reuters, pairs of labels, three divisions: Insurance: Life vs. Non-Life; Business Services: Banking vs. Financial; Retail Distribution: Specialist Stores vs. Mixed Retail. Bag-of-words representation with binary features. 20 Newsgroups, pairs of labels: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast.

Experimental Design. Online to batch: multiple passes over the training data; evaluate on a different test set after each pass; compute error/accuracy. Set parameters using held-out data. 10-fold cross-validation. ~2000 instances per problem. Balanced class labels.

Results vs. Online – Sentiment. StdDev and Variance: always better than baseline. Variance: 5/6 significantly better.

Results vs. Online – 20NG + Reuters. StdDev and Variance: always better than baseline. Variance: 4/6 significantly better.

Results vs. Batch – Sentiment. Always better than batch methods; 3/6 significantly better.

Results vs. Batch – 20NG + Reuters. 5/6 better than batch methods: 3/5 significantly better, 1/1 significantly worse.

Results – Sentiment [Figure: six panels of accuracy vs. passes of training data, comparing PA and CW]. CW is better (5/6 cases), statistically significant (4/6). CW benefits less from many passes.

Results – Reuters + 20NG [Figure: six panels of accuracy vs. passes of training data, comparing PA and CW]. CW is better (5/6 cases), statistically significant (4/6). CW benefits less from many passes.

Error Reduction by Multiple Passes. PA benefits more from multiple passes (8/12). The amount of benefit is data dependent.

Bayesian Logistic Regression (T. Jaakkola and M. Jordan, 1997). BLR vs. CW/AROW: each maintains a covariance and a mean. Based on the variational approximation. Conceptually decoupled update. Function of the margin/hinge-loss.

Algorithms Summary. Different motivation, similar algorithms. 2nd order vs. 1st order: SOP vs. Perceptron; CW+AROW vs. PA; AdaGrad vs. SGD. LR: Logistic Regression. All algorithms can be kernelized. Work well for data that is NOT isotropic / symmetric. State-of-the-art results in various domains. Accompanied by theory.