Efficient Logistic Regression with Stochastic Gradient Descent – part 2
William Cohen
Learning as optimization: warmup. Goal: learn the parameter θ of a binomial. Dataset: D = {x1, …, xn}, where each xi is 0 or 1 and k of them are 1. Maximize Pr(D|θ) = θ^k (1−θ)^(n−k): setting the derivative to zero gives θ = 0, θ = 1, or k − kθ − nθ + kθ = 0, i.e. nθ = k, so θ = k/n (spelled out below).
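Spelled out (a standard maximum-likelihood derivation; the likelihood itself appeared only as a figure on the original slide):

```latex
\begin{align*}
\Pr(D \mid \theta) &= \theta^{k}(1-\theta)^{n-k} \\
\frac{d}{d\theta}\Pr(D \mid \theta)
  &= \theta^{k-1}(1-\theta)^{n-k-1}\bigl[k(1-\theta) - (n-k)\theta\bigr] = 0 \\
&\Rightarrow\ \theta = 0,\ \ \theta = 1,\ \ \text{or}\ \
  k - k\theta - n\theta + k\theta = 0
  \ \Rightarrow\ n\theta = k
  \ \Rightarrow\ \theta = \tfrac{k}{n}
\end{align*}
```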
Learning as optimization: general procedure. Goal: learn the parameter θ of a classifier (probably θ is a vector). Dataset: D = {x1, …, xn}. Write down Pr(D|θ) as a function of θ. Maximize by differentiating and setting to zero.
Learning as optimization: general procedure. Goal: learn the parameter θ of …. Dataset: D = {x1, …, xn}, or maybe D = {(x1,y1), …, (xn,yn)}. Write down Pr(D|θ) as a function of θ (it may be easier to work with log Pr(D|θ)). Maximize by differentiating and setting to zero, or use gradient descent: repeatedly take a small step in the direction of the gradient.
Learning as optimization: general procedure for SGD. Goal: learn the parameter θ of …. Dataset: D = {x1, …, xn}, or maybe D = {(x1,y1), …, (xn,yn)}. Write down Pr(D|θ) as a function of θ. Big-data problem: we don't want to load all of D into memory, and the gradient depends on all the data. Solution: pick a small subset of examples B with |B| << |D|, approximate the gradient using them ("on average" this is the right direction), take a step in that direction, and repeat…. B = one example is a very popular choice.
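A minimal sketch of this streaming loop in Python, under assumptions not on the slide: a hypothetical grad_one(theta, example) returning the per-example gradient, a list-of-floats parameter vector, and batch size B = one example:

```python
import random

def sgd(data, grad_one, theta, lam=0.1, n_passes=10):
    """Generic SGD loop: repeatedly take a small step along the gradient of
    a single example.  grad_one(theta, example) must return a gradient with
    the same shape as theta."""
    for _ in range(n_passes):
        random.shuffle(data)                 # process examples in random order
        for example in data:                 # B = one example per step
            g = grad_one(theta, example)
            # step "uphill": we are maximizing log Pr(D | theta)
            theta = [t + lam * gj for t, gj in zip(theta, g)]
    return theta
```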
Learning as optimization for logistic regression Goal: Learn the parameter θ of a classifier Which classifier? We’ve seen y = sign(x . w) but sign is not continuous… Convenient alternative: replace sign with the logistic function
Learning as optimization for logistic regression. Goal: learn the parameter w of the classifier. The probability of a single example (y, x) is p^y (1−p)^(1−y), where p = Pr(y=1|x,w) is the logistic function applied to x · w. Or with logs: log Pr(y|x,w) = y log p + (1−y) log(1−p).
An observation. Key computational point: if xj = 0 then the gradient with respect to wj is zero, so when processing an example you only need to update the weights for that example's non-zero features.
Learning as optimization for logistic regression. Final algorithm: process the examples in random order; for each example (xi, yi), compute pi = Pr(y=1|xi, w) and, for each non-zero feature xij, update wj ← wj + λ(yi − pi)xij. (A sketch in code follows.)
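A runnable sketch of this algorithm, assuming examples are (sparse feature dict, 0/1 label) pairs and λ (here lam) is the learning rate; an illustration, not the original course code:

```python
import math
import random
from collections import defaultdict

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd_logistic(examples, n_passes=10, lam=0.1):
    """examples: list of (x, y) where x is a dict {feature: value} and y is 0 or 1."""
    W = defaultdict(float)
    for _ in range(n_passes):
        random.shuffle(examples)                       # do this in random order
        for x, y in examples:
            p = sigmoid(sum(W[j] * xj for j, xj in x.items()))
            for j, xj in x.items():                    # only non-zero features
                W[j] += lam * (y - p) * xj
    return W
```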
Another observation: consider averaging the gradient over all the examples D = {(x1,y1), …, (xn,yn)}.
Learning as optimization for logistic regression. Goal: learn the parameter θ of a classifier. Which classifier? We've seen y = sign(x · w), but sign is not continuous… Convenient alternative: replace sign with the logistic function. Practical problem: this overfits badly with sparse features, e.g., if feature j appears only in positive examples, the gradient for wj is always positive!
Outline
Logistic regression and SGD
Learning as optimization
Logistic regression: a linear classifier optimizing P(y|x)
Stochastic gradient descent: "streaming optimization" for ML problems
Regularized logistic regression
Sparse regularized logistic regression
Memory-saving logistic regression
Regularized logistic regression. Replace LCL with LCL plus a penalty for large weights, e.g. LCL − μ||w||², so maximizing ∑i log Pr(yi|xi,w) becomes maximizing ∑i log Pr(yi|xi,w) − μ ∑j wj².
Regularized logistic regression. Replace LCL with LCL plus a penalty for large weights, e.g. LCL − μ||w||². So the update for wj becomes wj ← wj + λ((yi − pi)xij − 2μwj), or equivalently wj ← (1 − 2λμ)wj + λ(yi − pi)xij.
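In symbols, using the same λ and μ as the slides:

```latex
\begin{align*}
w_j &\leftarrow w_j + \lambda\bigl((y_i - p_i)\,x_{ij} - 2\mu w_j\bigr) \\
\text{or equivalently}\quad
w_j &\leftarrow (1 - 2\lambda\mu)\,w_j + \lambda (y_i - p_i)\,x_{ij}
\end{align*}
```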
Learning as optimization for logistic regression. Algorithm (unregularized, as before): process the examples in random order; for each example compute pi and update wj ← wj + λ(yi − pi)xij for each non-zero feature xij.
Learning as optimization for regularized logistic regression. Algorithm: as above, but now every weight is decayed on every example, so time goes from O(nT) to O(nmT), where n = number of non-zero entries, m = number of features, T = number of passes over the data.
Learning as optimization for regularized logistic regression. Final algorithm:
Initialize hashtable W
For each iteration t = 1,…,T
  For each example (xi, yi)
    pi = …
    For each feature W[j]
      W[j] = W[j] - λ2μW[j]
      If xij > 0 then W[j] = W[j] + λ(yi - pi)xij
Learning as optimization for regularized logistic regression. Final algorithm:
Initialize hashtable W
For each iteration t = 1,…,T
  For each example (xi, yi)
    pi = …
    For each feature W[j]
      W[j] *= (1 - λ2μ)
      If xij > 0 then W[j] = W[j] + λ(yi - pi)xij
Learning as optimization for regularized logistic regression. Final algorithm:
Initialize hashtable W
For each iteration t = 1,…,T
  For each example (xi, yi)
    pi = …
    For each feature W[j]
      If xij > 0 then
        W[j] *= (1 - λ2μ)^A
        W[j] = W[j] + λ(yi - pi)xij
A is the number of examples seen since the last time we did an x > 0 update on W[j]
Learning as optimization for regularized logistic regression. Final algorithm:
Initialize hashtables W, A and set k = 0
For each iteration t = 1,…,T
  For each example (xi, yi)
    pi = … ; k++
    For each feature W[j]
      If xij > 0 then
        W[j] *= (1 - λ2μ)^(k - A[j])
        W[j] = W[j] + λ(yi - pi)xij
        A[j] = k
k - A[j] is the number of examples seen since the last time we did an x > 0 update on W[j]
Learning as optimization for regularized logistic regression. Final algorithm:
Initialize hashtables W, A and set k = 0
For each iteration t = 1,…,T
  For each example (xi, yi)
    pi = … ; k++
    For each feature W[j]
      If xij > 0 then
        W[j] *= (1 - λ2μ)^(k - A[j])
        W[j] = W[j] + λ(yi - pi)xij
        A[j] = k
Time goes from O(nmT) back to O(nT), where n = number of non-zero entries, m = number of features, T = number of passes over the data; memory use doubles (m → 2m).
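A Python sketch of this final lazy-regularization algorithm, under the same assumptions as the earlier sketch (sparse dict examples); W and A are plain dicts standing in for the hashtables:

```python
import math
import random
from collections import defaultdict

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sparse_regularized_sgd(examples, n_passes=10, lam=0.1, mu=0.01):
    """Lazy L2 regularization: each weight is decayed only when its feature
    appears, by the factor it would have accumulated since its last update."""
    W = defaultdict(float)   # weights
    A = defaultdict(int)     # A[j] = value of k at the last update of W[j]
    k = 0
    for _ in range(n_passes):
        random.shuffle(examples)                          # random order
        for x, y in examples:
            k += 1
            p = sigmoid(sum(W[j] * xj for j, xj in x.items()))
            for j, xj in x.items():                       # only non-zero features
                W[j] *= (1 - 2 * lam * mu) ** (k - A[j])  # catch up on skipped decays
                W[j] += lam * (y - p) * xj
                A[j] = k
    # apply any outstanding decay before returning the weights
    for j in W:
        W[j] *= (1 - 2 * lam * mu) ** (k - A[j])
    return W
```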
Outline
[other stuff]
Logistic regression and SGD
Learning as optimization
Logistic regression: a linear classifier optimizing P(y|x)
Stochastic gradient descent: "streaming optimization" for ML problems
Regularized logistic regression
Sparse regularized logistic regression
Memory-saving logistic regression
Pop quiz: in text classification, most words are
  rare
  not correlated with any class
  given low weights in the LR classifier
  unlikely to affect classification
  not very interesting
Pop quiz: in text classification, most bigrams are
  rare
  not correlated with any class
  given low weights in the LR classifier
  unlikely to affect classification
  not very interesting
Pop quiz: most of the weights in a classifier are
  important
  not important
How can we exploit this? One idea: combine uncommon words together randomly. Examples: replace all occurrences of "humanitarianism" or "biopsy" with "humanitarianismOrBiopsy"; replace all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"; replace all occurrences of "gynecologist" or "constrictor" with "gynecologistOrConstrictor"; … For Naïve Bayes this breaks independence assumptions; it's not obviously a problem for logistic regression, though. I could combine two low-weight words (won't matter much), a low-weight and a high-weight word (won't matter much), or two high-weight words (not very likely to happen). How much of this can I get away with? Certainly a little. Is it enough to make a difference? How much memory does it save?
How can we exploit this? Another observation: the values in my hash table are weights; the keys in my hash table are strings for the feature names. We need the keys to avoid collisions. But maybe we don't care about collisions? Allowing "schizoid" and "duchy" to collide is equivalent to replacing all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy".
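A minimal illustration of dropping the string keys: index a fixed-size weight array by a hash of the feature name and simply accept collisions (the table size R and the choice of hash function are illustrative, not from the slides):

```python
import zlib

R = 2 ** 20                      # fixed number of weight slots

def bucket(feature_name):
    """Map a feature name to an array index; distinct names may collide,
    which is equivalent to merging those features into one."""
    return zlib.crc32(feature_name.encode("utf8")) % R

# If "schizoid" and "duchy" land in the same bucket, that is the same as
# replacing both with a single merged feature "schizoidOrDuchy".
print(bucket("schizoid"), bucket("duchy"))
```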
Learning as optimization for regularized logistic regression. Algorithm:
Initialize hashtables W, A and set k = 0
For each iteration t = 1,…,T
  For each example (xi, yi)
    pi = … ; k++
    For each feature j with xij > 0:
      W[j] *= (1 - λ2μ)^(k - A[j])
      W[j] = W[j] + λ(yi - pi)xij
      A[j] = k
Learning as optimization for regularized logistic regression. Algorithm:
Initialize arrays W, A of size R and set k = 0
For each iteration t = 1,…,T
  For each example (xi, yi)
    Let V be a hash table so that V[h] = Σ { xij : hash(j) mod R = h }
    pi = … ; k++
    For each hash value h with V[h] > 0:
      W[h] *= (1 - λ2μ)^(k - A[h])
      W[h] = W[h] + λ(yi - pi)V[h]
      A[h] = k
Learning as optimization for regularized logistic regression. Algorithm:
Initialize arrays W, A of size R and set k = 0
For each iteration t = 1,…,T
  For each example (xi, yi)
    Let V be a hash table so that V[h] = Σ { xij : hash(j) mod R = h }
    pi = … ; k++
    For each hash value h with V[h] > 0:
      W[h] *= (1 - λ2μ)^(k - A[h])
      W[h] = W[h] + λ(yi - pi)V[h]
      A[h] = k
???
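A Python sketch of this hashed variant, combining the hashing trick with the lazy regularization above; the hash function and the table size R are illustrative choices:

```python
import math
import random
import zlib

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hashed_regularized_sgd(examples, R=2**20, n_passes=10, lam=0.1, mu=0.01):
    """examples: list of (x, y) with x a dict {feature_name: value}, y in {0,1}.
    Weights live in a fixed-size array W indexed by hash value; colliding
    features simply share a weight."""
    W = [0.0] * R        # weights
    A = [0] * R          # A[h] = value of k at the last update of W[h]
    k = 0
    for _ in range(n_passes):
        random.shuffle(examples)
        for x, y in examples:
            # V[h] accumulates the values of all features hashing to bucket h
            V = {}
            for name, value in x.items():
                h = zlib.crc32(name.encode("utf8")) % R
                V[h] = V.get(h, 0.0) + value
            k += 1
            p = sigmoid(sum(W[h] * v for h, v in V.items()))
            for h, v in V.items():
                W[h] *= (1 - 2 * lam * mu) ** (k - A[h])   # lazy decay
                W[h] += lam * (y - p) * v
                A[h] = k
    return W
```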
A variant of feature hashing: hash each feature multiple times with different hash functions. Easy way to get multiple hash functions: generate some random strings s1,…,sL, and let the k-th hash function for w be the ordinary hash of the concatenation w·sk. Now each w has L chances to not collide with another useful w'.
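One way to realize this in code, with illustrative salt strings (the slides do not fix L or the hash function):

```python
import zlib

SALTS = ["a8x", "q0t", "m3z"]          # L = 3 random strings, fixed up front
R = 2 ** 20                            # size of the weight array

def buckets(feature_name):
    """Return one bucket per hash function: the ordinary hash of the
    feature name concatenated with each salt string."""
    return [zlib.crc32((feature_name + s).encode("utf8")) % R for s in SALTS]

# Each feature now gets L chances to avoid colliding with another useful feature.
print(buckets("gynecologist"))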
Weinberger et al., "Feature Hashing for Large Scale Multitask Learning", ICML 2009.
An interesting example: spam filtering for Yahoo mail, with lots of examples and lots of users. Two options: one filter for everyone (but users disagree) or one filter for each user (but some users are lazy and don't label anything). Third option: classify (msg, user) pairs. Features of message i are the words wi,1,…,wi,ki; the feature of the user is his/her id u; features of the pair are wi,1,…,wi,ki and uwi,1,…,uwi,ki. Based on an idea by Hal Daumé.
An example. E.g., this email to wcohen has features: dear, madam, sir, …, investment, broker, …, wcohendear, wcohenmadam, wcohen, …. Idea: the learner will figure out how to personalize my spam filter by using the wcohenX features.
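A small illustration of the personalized feature construction for a (message, user) pair; the function name and the exact encoding (plain concatenation of user id and word) are assumptions for the sketch:

```python
def personalized_features(words, user_id):
    """Return the feature list for a (message, user) pair:
    the raw words, user-prefixed copies of each word, and the user id itself."""
    feats = list(words)                        # w_i,1 ... w_i,ki
    feats += [user_id + w for w in words]      # u + w_i,1 ... u + w_i,ki
    feats.append(user_id)                      # the user id feature u
    return feats

print(personalized_features(["dear", "madam", "investment"], "wcohen"))
# e.g. ['dear', 'madam', 'investment', 'wcohendear', 'wcohenmadam',
#       'wcoheninvestment', 'wcohen']
```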
An example. Compute personalized features and multiple hashes on-the-fly: a great opportunity to use several processors and speed up I/O.
Experiments: 3.2M emails, 40M tokens, 430K users, 16T unique features (after personalization).
An example: 2^26 entries = 1 Gb @ 8 bytes/weight; 3.2M emails, 400K users, 40M tokens.