Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Reminder: your map-reduce assignments are mostly done.

Old NB learning (stream & sort): stream – sort – aggregate
– Counter-update “messages”
– Optimization: in-memory hash, periodically emptied
– Sort the messages
– Logic to aggregate them
Workflow runs on one machine.

New NB learning (map-reduce): map – shuffle – reduce
– Counter-update map
– Combiner
– (Hidden) shuffle & sort
– Sum-reduce
Workflow runs in parallel.
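A minimal Python sketch of the stream-and-sort version just described (not the course's Java code; the "label w1 w2 …" input format, the flush threshold, and the function names are illustrative): emit counter-update messages, buffer them in an in-memory hash that is periodically emptied, then sort and aggregate in a single pass.

import sys
from collections import defaultdict

MAX_BUFFER = 100_000  # illustrative flush threshold for the in-memory hash

def emit_counter_updates(lines, out=sys.stdout):
    # Stream: turn each training document into counter-update "messages".
    buf = defaultdict(int)
    for line in lines:
        label, *words = line.split()          # assumes "label w1 w2 ..." per line (illustrative)
        for w in words:
            buf[f"X={w}^Y={label}"] += 1      # one counter-update message per word occurrence
        if len(buf) > MAX_BUFFER:             # optimization: periodically empty the in-memory hash
            for event, n in buf.items():
                out.write(f"{event}\t{n}\n")
            buf.clear()
    for event, n in buf.items():
        out.write(f"{event}\t{n}\n")

def aggregate_sorted(lines, out=sys.stdout):
    # Aggregate: after an external `sort`, all updates for one event are adjacent,
    # so a single running counter suffices.
    prev, total = None, 0
    for line in lines:
        event, n = line.rstrip("\n").split("\t")
        if prev is not None and event != prev:
            out.write(f"{prev}\t{total}\n")
            total = 0
        prev = event
        total += int(n)
    if prev is not None:
        out.write(f"{prev}\t{total}\n")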

NB implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

train.dat: one document per line, e.g.
id1  w1,1 w1,2 w1,3 … w1,k1
id2  w2,1 w2,2 w2,3 …
id3  w3,1 w3,2 …
id4  w4,1 w4,2 …
id5  w5,1 w5,2 …
…

counts.dat: one event per line, e.g.
X=w1^Y=sports
X=w1^Y=worldNews
X=w2^Y=…
X=…
…

This first step is a map that generates counter updates, plus a sum combiner, plus a sum reducer.
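In map-reduce terms, that counting step is a counter-update mapper plus a summing combiner/reducer. A Python sketch of the logic (the real jobs are the Java classes above; function names and key formats here are illustrative):

def map_counter_updates(doc_id, doc):
    # Mapper: one training document in, a stream of (event, 1) counter updates out.
    # The "label w1 w2 ..." value format is illustrative.
    label, *words = doc.split()
    for w in words:
        yield (f"X={w}^Y={label}", 1)
    yield (f"Y={label}", 1)        # class-prior counter

def sum_reduce(event, counts):
    # Used as both combiner and reducer: sum every update for one event.
    yield (event, sum(counts))

Because the combiner and the reducer are the same summing function, most of the 1s are already added up on the mapper's machine before anything crosses the network in the shuffle.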

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

words.dat: one record per word w, holding the counts associated with w, e.g.

word       counts associated with w
aardvark   C[w^Y=sports]=2
agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
…          …
zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

This step is a map (CountsByWord) followed by a reduce (CollectRecords).
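A sketch, again in Python rather than the Java above, of what the CollectRecords pass does after the sort: gather every counter for one word into a single words.dat record. The tab-separated layout is an assumption.

import sys

def collect_records(sorted_lines, out=sys.stdout):
    # Input lines are sorted by word, e.g. "aardvark<TAB>C[w^Y=sports]=2";
    # output is one record per word with all of its counters joined together.
    prev, counters = None, []
    for line in sorted_lines:
        word, counter = line.rstrip("\n").split("\t")
        if prev is not None and word != prev:
            out.write(f"{prev}\t{','.join(counters)}\n")
            counters = []
        prev = word
        counters.append(counter)
    if prev is not None:
        out.write(f"{prev}\t{','.join(counters)}\n")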

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

words.dat (word, counts) looks like this:
aardvark   C[w^Y=sports]=2
agent      …
…
zynga      …

The requests generated from the test data look like this:
found      ~ctr to id1
aardvark   ~ctr to id2
…
today      ~ctr to idi
…

After merging and sorting, the input to answerWordCountRequests looks like this (each word's counter record comes before its requests):
aardvark   C[w^Y=sports]=2
aardvark   ~ctr to id1
agent      C[w^Y=sports]=…
agent      ~ctr to id345
agent      ~ctr to id9854
agent      ~ctr to id34742
…
zynga      C[…]
zynga      ~ctr to id1

In map-reduce terms: a map + identity reduce saves words.dat to temp files; an identity map over the two sets of inputs, with a custom secondary sort (used in the shuffle/sort), merges them; the reduce writes its output to temp.
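A sketch of the answerWordCountRequests pass, assuming the secondary sort has placed each word's counter record ahead of its "~ctr to …" requests; the handling of unseen words is illustrative, not the course's implementation.

import sys

def answer_requests(sorted_lines, out=sys.stdout):
    # The merged, sorted stream holds, for each word, its counter record first and
    # then any "~ctr to <docid>" requests; one pass answers every request.
    current_word, current_counts = None, None
    for line in sorted_lines:
        word, payload = line.rstrip("\n").split("\t", 1)
        if payload.startswith("~ctr to "):                 # a request from some test document
            doc_id = payload[len("~ctr to "):]
            counts = current_counts if word == current_word else "C[unseen]=0"  # illustrative default
            out.write(f"{doc_id}\t~ctr for {word} is {counts}\n")
        else:                                              # a counter record from words.dat
            current_word, current_counts = word, payload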

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

The output of answerWordCountRequests looks like this:
id1  ~ctr for aardvark is C[w^Y=sports]=2
…
id1  ~ctr for zynga is …
…

test.dat looks like this:
id1  found an aardvark in zynga's farmville today!
id2  …
id3  …
id4  …
id5  …

This step is an identity map over the two sets of inputs, with a custom secondary sort.

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

The input to testNBUsingRequests looks like this:

Key    Value
id1    found aardvark zynga farmville today
       ~ctr for aardvark is C[w^Y=sports]=2
       ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
       …
id2    w2,1 w2,2 w2,3 …
       ~ctr for w2,1 is …
…      …

This final step is a reduce.
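A sketch of what that final reduce might do with such input: separate the document's words from the answered counters, then score each class with the usual Naïve Bayes sum of log-probabilities. The add-alpha smoothing, the count-parsing regex, and the omission of the class-prior term are simplifications, not the course's exact implementation.

import math
import re
from collections import defaultdict

def classify_doc(doc_id, values, class_totals, vocab_size, alpha=1.0):
    # values mixes the document's words with "~ctr for w is C[w^Y=c]=n,..." answers;
    # class_totals maps class -> total word count; alpha is illustrative smoothing.
    words, counts = [], defaultdict(dict)
    for v in values:
        m = re.match(r"~ctr for (\S+) is (.*)", v)
        if m:
            w, payload = m.group(1), m.group(2)
            for c in re.finditer(r"C\[w\^Y=([^\]]+)\]=(\d+)", payload):
                counts[w][c.group(1)] = int(c.group(2))
        else:
            words.extend(v.split())
    best_class, best_score = None, float("-inf")
    for y, total in class_totals.items():
        # Sum of smoothed log P(w|y); the class-prior term is omitted for brevity.
        score = sum(
            math.log((counts.get(w, {}).get(y, 0) + alpha) / (total + alpha * vocab_size))
            for w in words
        )
        if score > best_score:
            best_class, best_score = y, score
    return doc_id, best_class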

Reminder: the map-reduce assignments are mostly done.

Old NB testing (request-and-answer):
– Produce requests
– Sort requests and records together
– Send answers to the requestors
– Sort answers and the requesting entities together
– …
– Reprocess the input with the answered requests

New NB testing (two map-reduces):
– Produce requests with a map
– (Hidden) shuffle & sort with a custom secondary sort
– Reduce, which sees records first, then requests
– Identity map with two inputs
– Custom secondary sort
– Reduce that sees the old input first, then the answers

The two workflows are isomorphic.

Outline
Reminder about early/next assignments
Logistic regression and SGD:
– Learning as optimization
– Logistic regression: a linear classifier optimizing P(y|x)
– Stochastic gradient descent: “streaming optimization” for ML problems
– Regularized logistic regression
– Sparse regularized logistic regression
– Memory-saving logistic regression

Learning as optimization: warmup
Goal: learn the parameter θ of a binomial.
Dataset: D = {x_1, …, x_n}, where each x_i is 0 or 1.
MLE estimate of θ = Pr(x_i = 1):
– θ̂ = #[flips where x_i = 1] / #[flips x_i] = k/n
Now: reformulate this as optimization…

Learning as optimization: warmup
Goal: learn the parameter θ of a binomial.
Dataset: D = {x_1, …, x_n}, where each x_i is 0 or 1 and k of them are 1.
Maximize the likelihood Pr(D | θ); easier to optimize: its logarithm.
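For reference, the standard binomial likelihood and its log, the quantity that is easier to optimize, written out in LaTeX:

\Pr(D \mid \theta) \;=\; \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} \;=\; \theta^{k}(1-\theta)^{n-k}
\qquad\text{and}\qquad
\log \Pr(D \mid \theta) \;=\; k\log\theta \;+\; (n-k)\log(1-\theta)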

Learning as optimization: warmup
Goal: learn the parameter θ of a binomial.
Dataset: D = {x_1, …, x_n}, where each x_i is 0 or 1 and k of them are 1.
Differentiate the log-likelihood with respect to θ and set the derivative to zero.

Learning as optimization: warmup
Goal: learn the parameter θ of a binomial.
Dataset: D = {x_1, …, x_n}, where each x_i is 0 or 1 and k of them are 1.
Set the derivative to zero and solve for θ (for 0 < θ < 1):
k − kθ − nθ + kθ = 0  ⇒  nθ = k  ⇒  θ = k/n
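The full derivation behind that last line, in LaTeX:

\frac{d}{d\theta}\Bigl[k\log\theta + (n-k)\log(1-\theta)\Bigr]
  \;=\; \frac{k}{\theta} - \frac{n-k}{1-\theta} \;=\; 0
\;\Longrightarrow\; k(1-\theta) \;=\; (n-k)\,\theta
\;\Longrightarrow\; k - k\theta - n\theta + k\theta \;=\; 0
\;\Longrightarrow\; n\theta \;=\; k
\;\Longrightarrow\; \theta \;=\; \frac{k}{n}

The endpoints θ = 0 and θ = 1 drive the log-likelihood to −∞ whenever 0 < k < n, so the interior stationary point θ = k/n is the maximum.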

Learning as optimization: general procedure
Goal: learn the parameter θ of a classifier – typically θ is a vector.
Dataset: D = {x_1, …, x_n}.
Write down Pr(D | θ) as a function of θ.
Maximize by differentiating and setting the derivative to zero.
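A small Python check of this recipe on the binomial warmup (the numbers are made up): the log-likelihood is maximized, numerically, at θ = k/n.

import math

def log_likelihood(theta, k, n):
    # log Pr(D | theta) for a binomial with k ones out of n flips
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

k, n = 7, 10                               # illustrative: 7 heads out of 10 flips
grid = [i / 1000 for i in range(1, 1000)]  # theta values strictly inside (0, 1)
best = max(grid, key=lambda t: log_likelihood(t, k, n))
print(best, k / n)                         # both are 0.7 (up to the grid spacing)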