Efficient Logistic Regression with Stochastic Gradient Descent (William Cohen)



A FEW MORE COMMENTS ON SPARSIFYING THE LOGISTIC REGRESSION UPDATE

Learning as optimization for regularized logistic regression
Final algorithm:
  Initialize hashtable W
  For each iteration t = 1, …, T
    For each example (x_i, y_i)
      p_i = …
      For each feature W[j]
        W[j] = W[j] - λ·2μ·W[j]
        If x_i^j > 0 then
          W[j] = W[j] + λ(y_i - p_i)·x_i^j

Learning as optimization for regularized logistic regression
Final algorithm:
  Initialize hashtable W
  For each iteration t = 1, …, T
    For each example (x_i, y_i)
      p_i = …
      For each feature W[j]
        W[j] *= (1 - λ·2μ)
        If x_i^j > 0 then
          W[j] = W[j] + λ(y_i - p_i)·x_i^j

Learning as optimization for regularized logistic regression
Final algorithm:
  Initialize hashtable W
  For each iteration t = 1, …, T
    For each example (x_i, y_i)
      p_i = …
      For each feature W[j]
        If x_i^j > 0 then
          W[j] *= (1 - λ·2μ)^A
          W[j] = W[j] + λ(y_i - p_i)·x_i^j
A is the number of examples seen since the last time we did an x > 0 update on W[j].

Learning as optimization for regularized logistic regression
Final algorithm:
  Initialize hashtables W, A and set k = 0
  For each iteration t = 1, …, T
    For each example (x_i, y_i)
      p_i = … ; k++
      For each feature W[j]
        If x_i^j > 0 then
          W[j] *= (1 - λ·2μ)^(k - A[j])
          W[j] = W[j] + λ(y_i - p_i)·x_i^j
          A[j] = k
k - A[j] is the number of examples seen since the last time we did an x > 0 update on W[j].
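Here is a minimal Python sketch of this lazy-regularization update (a sketch, not the course's reference code): examples are assumed to be dicts of non-zero feature values, labels are 0/1, p_i is the usual logistic prediction sigmoid(W·x_i), and lam/mu mirror λ and μ above.

```python
import math
from collections import defaultdict

def sgd_lazy_l2(examples, T, lam, mu):
    """SGD for logistic regression with lazily applied L2 regularization.

    examples: list of (x, y) pairs, x a dict {feature: value}, y in {0, 1}.
    lam is the learning rate, mu the regularization strength, matching the
    (1 - lam*2*mu) shrinkage used above.
    """
    W = defaultdict(float)   # weights, stored sparsely
    A = defaultdict(int)     # A[j] = value of k at the last update of W[j]
    k = 0                    # number of examples seen so far
    shrink = 1.0 - lam * 2 * mu

    for _ in range(T):
        for x, y in examples:
            k += 1
            # p_i = sigmoid(W . x_i), summed over non-zero features only
            score = sum(W.get(j, 0.0) * v for j, v in x.items())
            p = 1.0 / (1.0 + math.exp(-score))
            for j, v in x.items():
                W[j] *= shrink ** (k - A[j])   # catch up on skipped shrinkage
                W[j] += lam * (y - p) * v
                A[j] = k
    for j in W:                                # apply any outstanding shrinkage
        W[j] *= shrink ** (k - A[j])
    return dict(W)
```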

Comments
The same trick can be applied in other contexts:
  - Other regularizers (e.g., L1, …)
  - Conjugate gradient (Langford)
  - FTRL (Follow the Regularized Leader)
  - Voted perceptron averaging
  - …?

SPARSIFYING THE AVERAGED PERCEPTRON UPDATE

Complexity of perceptron learning
Algorithm:
  v = 0                                  (init hashtable; a dense init would be O(n))
  for each example x, y:
    if sign(v.x) != y:
      v = v + y·x                        (for x_i != 0, v_i += y·x_i: O(|x|) = O(|d|))
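For concreteness, a short Python sketch of the sparse perceptron update, assuming each x is a dict of its non-zero features and y is +1 or -1, so a mistake touches only the |x| corresponding entries of the hashtable:

```python
def perceptron(examples, epochs=1):
    """Perceptron with a hashtable weight vector: a mistake costs
    O(number of non-zero features of x), not O(total dimension n)."""
    v = {}                                   # sparse weight vector
    for _ in range(epochs):
        for x, y in examples:                # x: {feature: value}, y: +1 or -1
            score = sum(v.get(j, 0.0) * xj for j, xj in x.items())
            if (1 if score >= 0 else -1) != y:
                for j, xj in x.items():      # v = v + y*x, sparsely
                    v[j] = v.get(j, 0.0) + y * xj
    return v
```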

Complexity of averaged perceptron
Algorithm:
  vk = 0; va = 0                        (init hashtables)
  for each example x, y:
    if sign(vk.x) != y:
      va = va + vk                      (for vk_i != 0, va_i += vk_i)
      vk = vk + y·x                     (for x_i != 0, vk_i += y·x_i)
      mk = 1
    else:
      nk++
Costs (annotations): O(n), O(n|V|), O(|x|) = O(|d|), O(|V|).

Alternative averaged perceptron
Algorithm:
  vk = 0; va = 0
  for each example x, y:
    va = va + vk
    t = t + 1
    if sign(vk.x) != y:
      vk = vk + y·x
  Return va/t
S_k is the set of examples up to and including the first k mistakes.
Observe: [equation not captured in the transcript]

Alternative averaged perceptron
Algorithm:
  vk = 0; va = 0
  for each example x, y:
    va = va + vk
    t = t + 1
    if sign(vk.x) != y:
      vk = vk + y·x
  Return va/t
So when there is a mistake at time t on (x, y), y·x gets added to va on every subsequent iteration.
Suppose you know T, the total number of examples in the stream…

Alternative averaged perceptron
Algorithm:
  vk = 0; va = 0
  for each example x, y:
    va = va + vk
    t = t + 1
    if sign(vk.x) != y:
      vk = vk + y·x
      va = va + (T - t)·y·x             (accounts for all subsequent additions of y·x to va)
  Return va/T
T = the total number of examples in the stream.
Unpublished? I figured this out recently; Leon Bottou knows it too.
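A Python sketch of this trick, under the reading that the per-example "va = va + vk" accumulation is dropped once the (T - t) correction is used, so va is only touched on mistakes; x is a dict of non-zero features, y is +1 or -1, and T = len(examples) is known up front.

```python
def averaged_perceptron(examples):
    """Averaged perceptron without adding vk into va on every example:
    a mistake at time t contributes y*x to va on each of the remaining
    T - t steps, so add (T - t)*y*x to va once, at the mistake."""
    T = len(examples)                # total number of examples in the stream
    vk, va = {}, {}
    t = 0
    for x, y in examples:            # x: {feature: value}, y: +1 or -1
        t += 1
        score = sum(vk.get(j, 0.0) * xj for j, xj in x.items())
        if (1 if score >= 0 else -1) != y:
            for j, xj in x.items():
                vk[j] = vk.get(j, 0.0) + y * xj
                va[j] = va.get(j, 0.0) + (T - t) * y * xj
    return {j: w / T for j, w in va.items()}   # return va / T
```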

Formalization of the "Hash Trick": first, a review of kernels

The kernel perceptron
  On instance x_i, compute ŷ_i = v_k · x_i; if there is a mistake, v_{k+1} = v_k + y_i·x_i.
  Mathematically the same as before … but allows use of the kernel trick.

The kernel perceptron
  On instance x_i, compute ŷ_i = v_k · x_i; if there is a mistake, v_{k+1} = v_k + y_i·x_i.
  Mathematically the same as before … but allows use of the "kernel trick".
  Other kernel methods (SVM, Gaussian processes) aren't constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
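A sketch of what "the same algorithm, kernelized" can look like in Python: the weight vector is never built explicitly; instead the mistake examples (x_i, y_i) are stored, and v_k · x is computed as a kernel sum. K is any kernel function, e.g. one of those on the next slide.

```python
def kernel_perceptron(examples, K, epochs=1):
    """Kernel perceptron: v_k is represented implicitly by the list of
    (x_i, y_i) pairs on which mistakes were made, and v_k . x is computed
    as sum_i y_i * K(x_i, x)."""
    mistakes = []                                  # mistake examples so far
    for _ in range(epochs):
        for x, y in examples:                      # y is +1 or -1
            score = sum(yi * K(xi, x) for xi, yi in mistakes)
            if (1 if score >= 0 else -1) != y:
                mistakes.append((x, y))            # implicit v = v + y*x
    return mistakes
```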

Some common kernels
  - Linear kernel: K(x, x') = x · x'
  - Polynomial kernel: K(x, x') = (x · x' + 1)^d
  - Gaussian kernel: K(x, x') = exp(-||x - x'||² / (2σ²))
  More later….
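The same kernels written out in numpy, assuming dense vectors; the degree d, offset c, and bandwidth σ are free parameters, and the polynomial form shown is the standard (x·x' + c)^d variant.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))                        # K(x, z) = x . z

def polynomial_kernel(x, z, d=2, c=1.0):
    return float(np.dot(x, z) + c) ** d               # K(x, z) = (x . z + c)^d

def gaussian_kernel(x, z, sigma=1.0):
    diff = x - z                                      # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))
```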

Kernels 101
  - Duality
    - and computational properties
    - Reproducing Kernel Hilbert Space (RKHS)
  - Gram matrix
  - Positive semi-definite
  - Closure properties

Kernels 101
Duality: two ways to look at this, i.e. two different computational ways of getting the same behavior:
  - Explicitly map from x to φ(x), i.e. to the point corresponding to x in the Hilbert space
  - Implicitly map from x to φ(x) by changing the kernel function K

Kernels 101: Duality
Gram matrix K: k_ij = K(x_i, x_j)
  - K(x, x') = K(x', x), so the Gram matrix is symmetric
  - K(x, x) > 0, so the diagonal of K is positive
  - K is "positive semi-definite": z^T K z >= 0 for all z

Kernels 101: Duality (continued)
The Gram matrix K (k_ij = K(x_i, x_j)) is symmetric and positive semi-definite: z^T K z >= 0 for all z.
Fun fact: the Gram matrix is positive semi-definite if and only if K(x_i, x_j) = <φ(x_i), φ(x_j)> for some φ.
Proof idea: φ(x) uses the eigenvectors of K to represent x.
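A small numpy illustration of the fun fact (a sketch, using an RBF kernel on a handful of random points): the eigendecomposition K = VΛV^T gives an explicit feature map for the training points, with φ(x_i) the i-th row of V·sqrt(Λ), and then φ(x_i)·φ(x_j) recovers K_ij.

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    d = x - z
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                              # 5 toy points in R^3
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])    # Gram matrix

# Symmetric, and positive semi-definite: eigenvalues >= 0 (up to round-off)
eigvals, V = np.linalg.eigh(K)
assert np.allclose(K, K.T)
assert eigvals.min() > -1e-10

# Explicit feature map for the training points: row i of Phi is phi(x_i)
Phi = V * np.sqrt(np.clip(eigvals, 0.0, None))           # V @ diag(sqrt(lambda))
assert np.allclose(Phi @ Phi.T, K)                       # phi(x_i).phi(x_j) = K_ij
```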

HASH KERNELS AND "THE HASH TRICK"

Question
Most of the weights in a classifier are small and not important.

Some details
  - A slightly different hash is used, to avoid systematic bias.
  - m is the number of buckets you hash into (R in my discussion).
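A sketch of the hashed feature map in Python. One assumption flagged here: I'm taking the "slightly different hash" to be the usual signed variant, where a second hash ξ(w) in {-1, +1} multiplies the value so that collisions cancel in expectation and inner products are unbiased; md5 is used only to get a hash that is stable across runs.

```python
import hashlib

def _h(s):
    """Deterministic string hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(s.encode("utf8")).hexdigest(), 16)

def hash_features(x, m):
    """Map a sparse example x = {feature_name: value} into m buckets.
    The sign hash xi(w) removes the systematic positive bias that plain
    bucket-counting would introduce into inner products."""
    phi = {}
    for w, value in x.items():
        bucket = _h("idx:" + w) % m                      # which of the m buckets
        sign = 1 if _h("sgn:" + w) % 2 == 0 else -1      # xi(w) in {-1, +1}
        phi[bucket] = phi.get(bucket, 0.0) + sign * value
    return phi
```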

Some details
I.e., a hashed vector is probably close to the original vector.

Some details
I.e., the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work.

The hash kernel: implementation
One problem: debugging is harder
  - Features are no longer meaningful
  - There's a new way to ruin a classifier: change the hash function
  - You can separately compute the set of all words that hash to h and guess what the features mean: build an inverted index h → w1, w2, …
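A sketch of that inverted index, assuming the vocabulary is available offline and the same bucket hash as in the hashing sketch above is used.

```python
import hashlib
from collections import defaultdict

def _h(s):
    return int(hashlib.md5(s.encode("utf8")).hexdigest(), 16)

def build_inverted_index(vocabulary, m):
    """Map each bucket h to the list of words that hash into it, so a
    heavily weighted hashed feature can be traced back to candidate words."""
    index = defaultdict(list)
    for w in vocabulary:
        index[_h("idx:" + w) % m].append(w)
    return index

# Usage: index = build_inverted_index(all_words, m); index[h] lists the
# words sharing bucket h, which is usually enough to guess what h "means".
```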

A variant of feature hashing
  - Hash each feature multiple times with different hash functions.
  - Now each w has k chances to not collide with another useful w'.
  - An easy way to get multiple hash functions:
    - Generate some random strings s_1, …, s_L
    - Let the k-th hash function for w be the ordinary hash of the concatenation w + s_k
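A sketch of that variant, built on the signed-hashing sketch above; the random salt strings play the role of s_1, …, s_L, and the 1/sqrt(k) scaling is one common choice (not necessarily the slide's) that keeps the expected squared norm of the hashed vector unchanged.

```python
import hashlib
import random
import string

def _h(s):
    return int(hashlib.md5(s.encode("utf8")).hexdigest(), 16)

random.seed(0)
SALTS = ["".join(random.choices(string.ascii_letters, k=8)) for _ in range(10)]

def multi_hash_features(x, m, k=2):
    """Hash each feature k times, using k different hash functions obtained
    by concatenating the feature name with k random salt strings."""
    phi = {}
    scale = 1.0 / (k ** 0.5)            # keeps E[||phi(x)||^2] roughly unchanged
    for w, value in x.items():
        for i in range(k):
            salted = w + SALTS[i]       # the i-th hash function of w
            bucket = _h("idx:" + salted) % m
            sign = 1 if _h("sgn:" + salted) % 2 == 0 else -1
            phi[bucket] = phi.get(bucket, 0.0) + sign * value * scale
    return phi
```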

A variant of feature hashing
Why would this work?
Claim: with 100,000 features and 100,000,000 buckets:
  - k = 1 → Pr(any duplication) ≈ 1
  - k = 2 → Pr(any duplication) ≈ 0.4
  - k = 3 → Pr(any duplication) ≈ [value cut off in the transcript]
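A back-of-the-envelope check of this claim, under the assumption (mine, not stated on the slide) that "duplication" means some feature whose k hashed copies all land in buckets used by other features, estimated with a union bound; it reproduces the ≈1 and ≈0.4 figures above.

```python
def pr_any_duplication(num_features, num_buckets, k):
    """Union-bound estimate of the probability that at least one feature has
    all k of its hashed copies collide with copies of other features."""
    p_copy = num_features * k / num_buckets      # one copy collides with someone
    p_feature = p_copy ** k                      # all k copies of one feature collide
    return min(1.0, num_features * p_feature)    # union bound over all features

for k in (1, 2, 3):
    print(k, pr_any_duplication(100_000, 100_000_000, k))
# prints (approximately): 1 -> 1.0, 2 -> 0.4, 3 -> 0.0027
```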

Experiments
  - RCV1 dataset
  - Use hash kernels with a TF-approx-IDF representation
    - TF and IDF done at the level of hashed features, not words
    - IDF is approximated from a subset of the data
  (Shi et al., JMLR)

Results (result figures not captured in the transcript)