1
Second Order Learning
Koby Crammer, Department of Electrical Engineering
ECML PKDD, Prague
2
Thanks: Mark Dredze, Alex Kulesza, Avihai Mejer, Edward Moroshko, Francesco Orabona, Fernando Pereira, Yoram Singer, Nina Vaitz
3
Tutorial Context: this tutorial sits at the intersection of online learning, SVMs, optimization theory, and real-world data.
4
Outline
Background: online learning and notation, Perceptron, stochastic gradient descent, passive-aggressive.
Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad.
Properties: kernels, analysis.
Empirical Evaluation: synthetic and real data.
5
Online Learning Tyrannosaurus rex
6
Online Learning Triceratops
7
Online Learning Velociraptor Tyrannosaurus rex
8
Formal Setting – Binary Classification
Instances: images, sentences. Labels: parse trees, names. Prediction rule: linear prediction rules. Loss: number of mistakes.
9
Predictions
Discrete predictions: ŷ = sign(w · x); hard to optimize directly.
Continuous predictions: ŷ = w · x; the sign gives the label and the magnitude gives the confidence.
10
Loss Functions
Natural loss: the zero-one loss.
Real-valued-prediction losses: the hinge loss max(0, 1 − y(w · x)); the exponential loss exp(−y(w · x)) (boosting); the log loss log(1 + exp(−y(w · x))) (max entropy, boosting).
11
Loss Functions: plot of the hinge loss and the zero-one loss as functions of the margin.
12
Online Framework
Initialize a classifier. The algorithm works in rounds; on round t the online algorithm: receives an input instance, outputs a prediction, receives a feedback label, computes the loss, and updates the prediction rule. Goal: suffer small cumulative loss.
13
Online Learning
The cycle: maintain a model M, get an instance x, predict a label ŷ = M(x), get the true label y, suffer a loss l(y, ŷ), and update the model. For example, with a linear model the prediction is an inner product; the output is either a bit (after thresholding) or a real number. The loss can be 1 on an error and 0 otherwise, but quadratic and other losses can also be used. The update changes the weights, e.g. by adding the input x times a scalar.
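To make the cycle concrete, here is a minimal Python sketch of the protocol; the names (online_loop, update, d) are hypothetical, not from the tutorial, and binary labels in {−1, +1} with a linear model are assumed.

```python
import numpy as np

def online_loop(examples, update, d):
    """Generic online protocol: predict, receive the label, suffer loss, update."""
    w = np.zeros(d)                       # initialize the classifier
    mistakes = 0
    for x, y in examples:                 # one round per (instance, label) pair
        margin = w @ x                    # real-valued prediction (confidence)
        y_hat = 1 if margin >= 0 else -1  # discrete prediction
        mistakes += int(y_hat != y)       # zero-one loss on this round
        w = update(w, x, y, margin)       # algorithm-specific update rule
    return w, mistakes
```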
14
Linear Classifiers
Any features; w.l.o.g. we consider binary classifiers of the form f(x) = sign(w · x). (Notation abuse: a bias term can be folded into w by adding a constant feature.)
15
Linear Classifiers (cntd.)
Prediction: ŷ = sign(w · x). Confidence in the prediction: |w · x|.
16
Linear Classifiers
Input: an instance to be classified. Each instance is described by a finite set of numeric (float or integer) features. To make a prediction we weight the features according to the model, sum, and take a threshold. There is a duality between the input instance and the weight vector of the classifier.
17
Margin
The margin of an example (x, y) with respect to the classifier w is y(w · x). Note: the margin is positive iff the example is classified correctly.
The set of examples is separable iff there exists a w such that y(w · x) > 0 for every example.
18
Geometrical Interpretation
19
Geometrical Interpretation
20
Geometrical Interpretation
21
Geometrical Interpretation
(Figure: examples at varying distances from the separator, labeled margin << 0, margin < 0, margin > 0, margin >> 0.)
22
Hinge Loss
23
Why Online Learning?
Fast; memory efficient (processes one example at a time); simple to implement. Formal guarantees: mistake bounds and online-to-batch conversions, with no statistical assumptions. Adaptive. But not as good as a well-designed batch algorithm.
24
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
25
The Perceptron Algorithm (Rosenblatt, 1958)
If no mistake: do nothing. If mistake: update w ← w + y x. Margin after the update: y((w + y x) · x) = y(w · x) + ||x||², i.e. the margin on the current example increases by ||x||².
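A sketch of the Perceptron update in this form, written as a plug-in rule for the hypothetical online_loop above.

```python
def perceptron_update(w, x, y, margin):
    if y * margin <= 0:      # mistake (or zero margin): the prediction was wrong
        return w + y * x     # add y*x, raising this example's margin by ||x||^2
    return w                 # no mistake: do nothing
```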
26
Geometrical Interpretation
27
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
28
Gradient Descent
Consider the batch problem: minimize the total loss L(w) over the training set. Simple algorithm: initialize w; iterate for t = 1, 2, ...: compute the gradient of L at the current w and take a step in the opposite direction.
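A minimal sketch of this batch procedure, assuming a gradient oracle grad_L for the total loss; the step size eta and iteration count T are placeholder constants.

```python
def gradient_descent(grad_L, w0, eta=0.1, T=100):
    w = w0
    for _ in range(T):
        w = w - eta * grad_L(w)   # step against the full-batch gradient
    return w
```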
30
Stochastic Gradient Descent
Consider the same batch problem. Simple algorithm: initialize w; iterate for t = 1, 2, ...: pick a random example index, compute the gradient of that single example's loss at the current w, and take a step in the opposite direction.
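A sketch of the stochastic variant, assuming grad_i(w, i) returns the gradient of example i's loss; n, eta, and T are placeholder names.

```python
import numpy as np

def sgd(grad_i, n, w0, eta=0.1, T=1000, seed=0):
    w = w0
    rng = np.random.default_rng(seed)
    for _ in range(T):
        i = rng.integers(n)            # pick a random example index
        w = w - eta * grad_i(w, i)     # step against that single example's gradient
    return w
```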
32
Stochastic Gradient Descent
"Hinge" loss max(0, −y(w · x)); its (sub)gradient is −y x when y(w · x) ≤ 0 and 0 otherwise. Simple algorithm: initialize w; iterate: pick a random index i; if y_i(w · x_i) ≤ 0 then set w ← w + η y_i x_i, else leave w unchanged. The Perceptron is a stochastic gradient descent algorithm applied to a sum of such "hinge" losses, with a specific order of examples.
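For concreteness, a sketch of one subgradient step on the "hinge" loss max(0, −y(w · x)); with a unit step size this reproduces the Perceptron update above.

```python
def sgd_perceptron_step(w, x, y, eta=1.0):
    if y * (w @ x) <= 0:        # loss is active: a subgradient is -y * x
        return w + eta * y * x  # eta = 1 gives exactly the Perceptron step
    return w                    # loss is zero: gradient is zero, no change
```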
33
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
34
Motivation
Perceptron: no guarantee on the margin after the update.
PA: enforce a minimal non-zero margin after the update. In particular: if the margin is already large enough (at least 1), do nothing; if the margin is less than 1, update so that the margin after the update is exactly 1.
35
Input Space
36
Input Space vs. Version Space
Input (primal) space: points are input examples; each weight vector induces one constraint; a half-space is the set of all input examples that are classified correctly by a given predictor (weight vector).
Version (dual) space: points are weight vectors; each input example induces one constraint; a half-space is the set of all predictors (weight vectors) that classify a given input example correctly.
37
Weight Vector (Version) Space
The algorithm forces the new weight vector to reside in this region (the half-space of weight vectors attaining margin at least 1 on the current example).
38
Passive Step: nothing to do; the current weight vector already resides on the desired side.
39
Aggressive Step: the algorithm projects the current weight vector onto the desired half-space.
40
Aggressive Update Step
Set the new weight vector to be the solution of: minimize ||w − w_t||² subject to y_t(w · x_t) ≥ 1. Solution: w_{t+1} = w_t + τ_t y_t x_t, with τ_t = (1 − y_t(w_t · x_t)) / ||x_t||².
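A sketch of this closed-form PA update (the unrelaxed variant); numpy arrays for w and x are assumed.

```python
import numpy as np

def pa_update(w, x, y):
    loss = max(0.0, 1.0 - y * (w @ x))   # hinge loss at the current weights
    if loss == 0.0:
        return w                         # passive step: margin is already >= 1
    tau = loss / (x @ x)                 # closed-form step size
    return w + tau * y * x               # aggressive step: enforce a unit margin
```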
41
Perceptron vs. PA: both share the update form w_{t+1} = w_t + τ_t y_t x_t. Perceptron: τ_t = 1, and only on mistakes. Passive-Aggressive: τ_t = max(0, 1 − y_t(w_t · x_t)) / ||x_t||².
42
Perceptron vs. PA
(Figure: update size as a function of the margin, across the regimes: error, no error with small margin, no error with large margin.)
43
Perceptron vs. PA
44
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
45
Geometrical Assumption
All examples are bounded in a ball of radius R
46
Separability: there exists a unit vector that classifies all the data correctly (with margin at least γ).
47
Perceptron's Mistake Bound
The number of mistakes the algorithm makes is bounded by (R/γ)², where R bounds the norm of the examples and γ is the separation margin. Simple case: positive points, negative points, and a separating hyperplane yield a concrete value for this bound.
48
Geometrical Motivation
49
SGD on such data
50
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
51
Second Order Perceptron (Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005)
Assume all inputs are given in advance: compute a "whitening" matrix from them and run the Perceptron on the "whitened" data. In the online setting, a new "whitening" matrix is maintained and updated as examples arrive.
52
Second Order Perceptron (Cesa-Bianchi, Conconi, Gentile, 2005)
Mistake bound: depends on the spectrum of the data correlation matrix rather than only on R². In the same simple case as before, the resulting bound can be much smaller than the Perceptron's (R/γ)².
53
Second Order Perceptron (Cesa-Bianchi, Conconi, Gentile, 2005)
Online version: if no mistake, do nothing; if mistake, update the running sum of y x and the correlation matrix of the mistake instances.
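A sketch of this mistake-driven second-order update, following the description above; the regularization constant a is an assumed parameter, and the matrix inverse is written as a linear solve for clarity rather than efficiency.

```python
import numpy as np

class SecondOrderPerceptron:
    def __init__(self, d, a=1.0):
        self.v = np.zeros(d)       # running sum of y * x over mistake rounds
        self.S = a * np.eye(d)     # a*I plus the correlation matrix of mistake instances

    def predict(self, x):
        # whiten with the correlation matrix, including the current instance
        w = np.linalg.solve(self.S + np.outer(x, x), self.v)
        return 1 if w @ x >= 0 else -1

    def update(self, x, y):
        if self.predict(x) != y:   # mistake-driven: change nothing on correct rounds
            self.v += y * x
            self.S += np.outer(x, x)
```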
54
SGD on whitened data
55
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
56
Span-based Update Rules
The weight vector is a linear combination of the examples. For each feature f, the update has the form w_f ← w_f + η y x_f, where η is the learning rate, y is the target label (either −1 or +1), and x_f is the feature value of the input instance. Two rate schedules (among many, many others): the Perceptron algorithm (conservative) and Passive-Aggressive.
57
Sentiment Classification
Example review: "Who needs this Simpsons book? You DOOOOOOOO. This is one of the most extraordinary volumes I've ever encountered … Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … Very highly recommended!"
Threshold the 0-5 star rating at 3 (later we change this). Pang, Lee, Vaithyanathan, EMNLP 2002.
58
Sentiment Classification
Many positive reviews contain the word "best", so the weight W_best grows. Later, a negative review arrives ("boring book – best if you want to sleep in seconds"). A linear update will reduce both W_best and W_boring, but "best" has appeared much more often than "boring": the model knows more about "best" than about "boring", so it is better to reduce the two weights at different rates. We should take the feature statistics into consideration: given new evidence (a document), the information it contributes about a feature is monotonically decreasing in the number of past observations of that feature. So we maintain a confidence parameter per weight that measures how confident we are in W_best, W_boring, and so on.
59
Natural Language Processing
Big datasets and a large number of features; many features are only weakly correlated with the target label. In linear classifiers, features are associated with word counts, and the feature distribution is heavy-tailed (counts vs. feature rank): many rare and weakly informative words, and some very frequent words. We need to take frequency into consideration.
60
Natural Language Processing
61
New Prediction Models
Gaussian distributions over weight vectors, w ~ N(μ, Σ). The covariance is either full or diagonal; in NLP we have many features, so we use a diagonal covariance.
62
Classification
Given a new example x:
Stochastic: draw a weight vector w ~ N(μ, Σ) and make a prediction with it.
Collective: use the average weight vector μ, i.e. the average margin μ · x and the corresponding average prediction.
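A sketch of the two prediction modes for a Gaussian N(μ, Σ) over weight vectors; the function names are illustrative only.

```python
import numpy as np

def predict_stochastic(mu, Sigma, x, rng):
    w = rng.multivariate_normal(mu, Sigma)   # draw a single weight vector
    return 1 if w @ x >= 0 else -1

def predict_collective(mu, x):
    return 1 if mu @ x >= 0 else -1          # predict with the average weight vector
```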
63
The Margin is a Random Variable
With w ~ N(μ, Σ), the signed margin y(w · x) is a one-dimensional Gaussian with mean y(μ · x) and variance xᵀΣx. Thus the probability of a correct prediction has a closed form in terms of the Gaussian CDF.
64
Distribution over Linear Models
(Figure: the mean weight vector and samples from the distribution. Each green point is a single weight vector that classifies the given example into one class; the blue points are weight vectors that classify it into the other class. The majority goes with the mean.)
65
Weight Vector (Version) Space
The algorithm forces most of the probability mass of w to reside in this region.
66
Passive Step: nothing to do; most of the weight vectors already classify the example correctly.
67
Aggressive Step
The mean is moved beyond the mistake line (to a large margin): the algorithm projects the current Gaussian distribution onto the half-space, and the covariance is shrunk in the direction of the input example.
68
Projection Update
Vectors (a.k.a. PA): the new weight vector stays as close as possible to the old one, subject to the margin constraint.
Distributions (the new update): the new Gaussian stays as close as possible (in divergence) to the old Gaussian, subject to making a correct prediction with probability at least η, the confidence parameter.
69
Itakura-Saito Divergence
The divergence between the two Gaussians decomposes into a sum of two parameter divergences: a matrix Itakura-Saito divergence between the covariances and a Mahalanobis distance between the means. It is convex in both arguments simultaneously.
70
Constraint
Probabilistic constraint: Pr_{w~N(μ,Σ)}[y(w · x) ≥ 0] ≥ η. Equivalent margin constraint: y(μ · x) ≥ φ √(xᵀΣx), where φ is the corresponding Gaussian quantile. The left-hand side is linear in μ while the right-hand side is concave in Σ, so the constraint is not convex as written. Solutions: linear approximation; change of variables to get a convex formulation; relaxation (AROW). Dredze, Crammer, Pereira, ICML 2008; Crammer, Dredze, Pereira, NIPS 2008; Crammer, Dredze, Kulesza, NIPS 2009.
71
Convexity (Crammer, Dredze, Pereira, NIPS 2008)
Change variables in the covariance to obtain an equivalent convex formulation.
72
AROW (Crammer, Dredze, Kulesza, NIPS 2009)
PA and CW enforce hard (margin or probabilistic) constraints; AROW instead relaxes the constraint, trading off a divergence to the previous distribution, a loss term on the mean, and a confidence term. The resulting update has a similar form to the CW update.
73
The Update
The optimization can be solved analytically; the update takes the form μ_{t+1} = μ_t + α y Σ_t x and Σ_{t+1} = Σ_t − β Σ_t x xᵀ Σ_t, where the coefficients α and β depend on the specific algorithm.
74
Definitions
75
Updates: the coefficients for AROW, CW (change of variables), and CW (linearization).
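As one concrete instance, here is a sketch of the AROW coefficients and update in this form, assuming the closed-form rule from Crammer, Dredze, Kulesza (NIPS 2009) with regularization parameter r.

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    Sx = Sigma @ x
    beta = 1.0 / (x @ Sx + r)                    # confidence coefficient
    alpha = max(0.0, 1.0 - y * (mu @ x)) * beta  # hinge loss scaled by beta
    mu = mu + alpha * y * Sx                     # move the mean toward the example
    Sigma = Sigma - beta * np.outer(Sx, Sx)      # shrink confidence along x
    return mu, Sigma
```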
76
Per-feature Learning Rate
The covariance acts as a per-feature learning rate: as the eigenvalues of the covariance matrix shrink, the effective learning rate for the corresponding directions is reduced.
77
Diagonal Matrix: given a matrix, define its diagonal part by zeroing the off-diagonal entries. Two options: make the matrix itself diagonal, or make its inverse diagonal.
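A sketch of the diagonal restriction of the AROW update above, keeping only per-feature variances (sigma2 is the vector of diagonal entries); this is an illustrative reduction, with r again an assumed parameter.

```python
import numpy as np

def arow_update_diag(mu, sigma2, x, y, r=1.0):
    sx = sigma2 * x                              # Sigma @ x for a diagonal Sigma
    beta = 1.0 / (x @ sx + r)
    alpha = max(0.0, 1.0 - y * (mu @ x)) * beta
    mu = mu + alpha * y * sx
    sigma2 = sigma2 - beta * sx * sx             # diagonal of beta * Sigma x x^T Sigma
    return mu, sigma2
```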
78
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
79
(Back to) Stochastic Gradient Descent
Recall the batch problem and the simple algorithm: initialize w; iterate: pick a random example index, compute that example's gradient, and take a step in the opposite direction.
80
Adaptive Stochastic Gradient Descent (Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010)
Consider the same batch problem and the same loop as SGD, except that the step is preconditioned: instead of a scalar step size, the gradient is multiplied by a matrix A built from the previously observed gradients.
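A sketch of the diagonal version of this adaptive step (the full-matrix variant replaces the per-coordinate scaling by a matrix); eta, T, and eps are placeholder constants.

```python
import numpy as np

def adagrad(grad_i, n, w0, eta=0.1, T=1000, eps=1e-8, seed=0):
    w = w0.copy()
    G = np.zeros_like(w)                      # running sum of squared gradients
    rng = np.random.default_rng(seed)
    for _ in range(T):
        g = grad_i(w, rng.integers(n))        # gradient on one random example
        G += g * g
        w -= eta * g / (np.sqrt(G) + eps)     # per-coordinate learning rate
    return w
```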
81
Adaptive Stochastic Gradient Descent (Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010)
Very general: it can be used with various regularizations. The matrix A can be either full or diagonal. It comes with convergence and regret bounds, and gives similar performance to AROW.
82
Adaptive Stochastic Gradient Descent (Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010)
(Figure: side-by-side comparison of the SGD and AdaGrad update rules.)
83
Special Case of a General Framework (Orabona and Crammer, NIPS 2010)
Any loss function, assumed convex in its first argument and non-negative. Algorithm: online convex programming with a shifting link function.
84
Special Case of a General Framework (Orabona and Crammer, NIPS 2010)
85
Our Algorithms as a Special Case: recovered from the framework by specific choices of the loss and of the regularization functions.
86
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
87
Kernels
88
Proof: show, by induction, that the model parameters can be written in the span of the examples, so that predictions depend only on inner products.
89
Proof (cntd) By update rule : Thus
90
Proof (cntd) By update rule :
91
Proof (cntd) Thus
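The slides give this argument for the second-order algorithms; as a simpler illustration of the same span property, here is a sketch of a kernelized first-order Perceptron, where kernel is any assumed kernel function. The second-order case additionally stores kernel values for the covariance term.

```python
import numpy as np

def kernel_perceptron(X, Y, kernel, passes=1):
    alpha = np.zeros(len(X))                  # one coefficient per training example
    for _ in range(passes):
        for t, (x, y) in enumerate(zip(X, Y)):
            # the prediction uses only kernel values K(x_i, x), never w explicitly
            score = sum(a * yi * kernel(xi, x)
                        for a, xi, yi in zip(alpha, X, Y) if a != 0)
            if y * score <= 0:                # mistake: add this example to the span
                alpha[t] += 1.0
    return alpha
```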
92
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
93
Properties Eigenvalues of covariance matrix monotonically decrease
Mean of signed-margin increases; variance decreases
94
Statistical Interpretation
The margin constraint with a distribution over weight vectors is equivalent to a margin constraint with a single weight vector where the input is assumed to be corrupted with Gaussian noise.
95
Statistical Interpretation
(Figure: the same example shown in version space, around the mean weight vector, and in input space, around the input instance, with good and bad noise realizations relative to the linear separator.)
96
Mistake Bound (Orabona and Crammer, NIPS 2010)
For any reference weight vector, the number of mistakes made by AROW is upper bounded in terms of two sets of rounds: the set of example indices with a mistake and the set of example indices with an update but not a mistake.
97
Comment I: in the separable case, and when no non-mistake updates are performed, the bound simplifies.
98
Comment II: for a large regularization parameter the bound reduces to a Perceptron-style bound; when no non-mistake updates are performed, it matches the Perceptron's bound.
99
Bound for the Diagonal Algorithm (Orabona and Crammer, NIPS 2010)
The number of mistakes is bounded by a quantity with a per-feature term that is low when a feature is either rare or non-informative, exactly the regime found in NLP data.
100
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
101
Synthetic Data
20 features: 2 informative (a rotated, skewed Gaussian) and 18 noisy. Using any single feature alone is as good as random prediction.
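A sketch of data in this spirit: a signal direction plus a high-variance nuisance direction, rotated so that no single coordinate separates the classes, plus 18 noise features. All specific constants here are assumptions for illustration, not the tutorial's actual values.

```python
import numpy as np

def make_synthetic(n, label_noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    signal = y + rng.normal(0, 0.3, n)          # separates the two classes
    nuisance = rng.normal(0, 5.0, n)            # large-variance, uninformative ("skew")
    theta = np.pi / 4                           # rotate so neither axis alone separates
    x1 = np.cos(theta) * signal - np.sin(theta) * nuisance
    x2 = np.sin(theta) * signal + np.cos(theta) * nuisance
    noise = rng.normal(0, 1.0, size=(n, 18))    # 18 noisy features
    y = np.where(rng.random(n) < label_noise, -y, y)  # optional label noise
    return np.column_stack([x1, x2, noise]), y
```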
102
Synthetic Data (cntd.) Distribution after 50 examples (x1)
103
Synthetic Data (no noise)
(Figure: learning curves for Perceptron, PA, SOP, CW-full, and CW-diag.)
104
Synthetic Data (10% noise)
105
Outline: Background (online learning and notation; Perceptron; stochastic gradient descent; passive-aggressive) / Second-Order Algorithms (second-order Perceptron; Confidence-Weighted and AROW; AdaGrad) / Properties (kernels; analysis) / Empirical Evaluation (synthetic and real data).
106
Data
Sentiment: reviews from 6 Amazon domains (Blitzer et al.); classify a product review as either positive or negative.
Reuters, pairs of labels, three divisions: Insurance (Life vs. Non-Life), Business Services (Banking vs. Financial), Retail Distribution (Specialist Stores vs. Mixed Retail); bag-of-words representation with binary features.
20 Newsgroups, pairs of labels: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast.
107
Experimental Design
Online-to-batch: multiple passes over the training data, evaluating on a separate test set after each pass and computing error/accuracy. Parameters are set using held-out data. 10-fold cross-validation, roughly 2000 instances per problem, balanced class labels.
108
Results vs. Online – Sentiment
The StdDev and Variance variants are always better than the baseline; the Variance variant is significantly better on 5 of 6 datasets.
109
Results vs. Online – 20NG + Reuters
The StdDev and Variance variants are always better than the baseline; the Variance variant is significantly better on 4 of 6 datasets.
110
Results vs. Batch – Sentiment
Always better than the batch methods; significantly better on 3 of 6 datasets.
111
Results vs. Batch – 20NG + Reuters
Better than the batch methods on 5 of 6 datasets; significantly better on 3 of 5 and significantly worse on 1.
115
Results – Sentiment
(Figure: accuracy vs. passes of training data, PA vs. CW on each dataset.) CW is better in 5 of 6 cases, statistically significantly in 4 of 6; CW benefits less from many passes.
116
Results – Reuters + 20NG
(Figure: accuracy vs. passes of training data, PA vs. CW on each dataset.) CW is better in 5 of 6 cases, statistically significantly in 4 of 6; CW benefits less from many passes.
117
Error Reduction by Multiple Passes
PA benefits more from multiple passes (8/12) Amount of benefit is data dependent
118
Bayesian Logistic Regression (T. Jaakkola and M. Jordan, 1997)
Comparison of the mean and covariance updates of BLR and CW/AROW: BLR is based on a variational approximation, the CW/AROW update is a function of the margin/hinge loss, and the two updates are conceptually decoupled.
119
Algorithms Summary
Different motivation, similar algorithms. Second-order vs. first-order counterparts: SOP vs. Perceptron, CW+AROW vs. PA, AdaGrad vs. SGD, Bayesian logistic regression vs. logistic regression. All algorithms can be kernelized, work well for data that is NOT isotropic/symmetric, achieve state-of-the-art results in various domains, and are accompanied by theory.