Efficient Logistic Regression with Stochastic Gradient Descent William Cohen
Reminder: Your map-reduce assignments are mostly done

Old NB learning: Stream & sort
– Stream – sort – aggregate
  – Counter update “message”
  – Optimization: in-memory hash, periodically emptied
  – Sort of messages
  – Logic to aggregate them
– Workflow is on one machine

New NB learning: Map-reduce
– Map – shuffle – reduce
  – Counter update Map
  – Combiner
  – (Hidden) shuffle & sort
  – Sum-reduce
– Workflow is done in parallel
NB implementation summary
java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

train.dat:
id1 w_{1,1} w_{1,2} w_{1,3} … w_{1,k1}
id2 w_{2,1} w_{2,2} w_{2,3} …
id3 w_{3,1} w_{3,2} …
id4 w_{4,1} w_{4,2} …
id5 w_{5,1} w_{5,2} …
…

counts.dat:
X=w1^Y=sports
X=w1^Y=worldNews
X=..
X=w2^Y=…
X=…
…

Map-reduce version: Map to generate counter updates + Sum combiner + Sum reducer
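The "Map to generate counter updates + Sum combiner + Sum reducer" recipe can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not the course's Java code: the function names, the toy `train` examples, and the in-process `sorted`/`groupby` stand-in for the shuffle are all assumptions made for the example.

```python
from collections import Counter
from itertools import groupby

def map_counter_updates(label, words):
    """Map: emit one (event, 1) counter update per word occurrence."""
    for w in words:
        yield (f"X={w}^Y={label}", 1)

def combine(updates):
    """Sum combiner: pre-aggregate one mapper's updates in an in-memory hash."""
    partial = Counter()
    for event, delta in updates:
        partial[event] += delta
    return list(partial.items())

def shuffle_and_reduce(all_updates):
    """Stand-in for shuffle/sort by key, followed by a sum reducer per key."""
    counts = {}
    for event, group in groupby(sorted(all_updates), key=lambda kv: kv[0]):
        counts[event] = sum(delta for _, delta in group)
    return counts

def count_events(examples):
    """Run map + combine per example, then shuffle and reduce everything."""
    shuffled = []
    for _doc_id, label, words in examples:
        shuffled.extend(combine(map_counter_updates(label, words)))
    return shuffle_and_reduce(shuffled)

# Hypothetical toy training set: (id, label, words).
train = [
    ("id1", "sports", ["aardvark", "agent", "agent"]),
    ("id2", "worldNews", ["agent", "zynga"]),
    ("id3", "sports", ["agent"]),
]
event_counts = count_events(train)
```

The combiner only shrinks the number of messages sent to the shuffle; the reducer output is the same with or without it, because summation is associative and commutative.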
Implementation summary

words.dat:
w | Counts associated with w
aardvark | C[w^Y=sports]=2
agent | C[w^Y=sports]=1027, C[w^Y=worldNews]=564
… | …
zynga | C[w^Y=sports]=21, C[w^Y=worldNews]=4464

Map-reduce version: Map, Reduce
Implementation summary

words.dat:
w | Counts
aardvark | C[w^Y=sports]=2
agent | …
… | …
zynga | …

Output looks like this:
found | ~ctr to id1
aardvark | ~ctr to id2
… | …
today | ~ctr to id_i
… | …

Input looks like this:
w | Counts
aardvark | C[w^Y=sports]=2
aardvark | ~ctr to id1
agent | C[w^Y=sports]=…
agent | ~ctr to id345
agent | ~ctr to id9854
… | ~ctr to id345
agent | ~ctr to id34742
… | …
zynga | C[…]
zynga | ~ctr to id1

Map-reduce version:
– Map + IdentityReduce; save output in temp files
– Identity Map with two sets of inputs; custom secondary sort (used in shuffle/sort)
– Reduce, output to temp
Implementation summary

Output looks like this:
id1 ~ctr for aardvark is C[w^Y=sports]=2
…
id1 ~ctr for zynga is …
…

test.dat:
id1 found an aardvark in zynga’s farmville today!
id2 …
id3 …
id4 …
id5 …

Map-reduce version: Identity Map with two sets of inputs; custom secondary sort
Implementation summary

Input looks like this:
Key | Value
id1 | found aardvark zynga farmville today
    | ~ctr for aardvark is C[w^Y=sports]=2
    | ~ctr for found is C[w^Y=sports]=1027, C[w^Y=worldNews]=564
    | …
id2 | w_{2,1} w_{2,2} w_{2,3} …
    | ~ctr for w_{2,1} is …
… | …

Map-reduce version: Reduce
Reminder: The map-reduce assignments are mostly done

Old NB testing: Request-answer
– Produce requests
– Sort requests and records together
– Send answers to requestors
– Sort answers and sending entities together
– …
– Reprocess input with request answers

New NB testing: Two map-reduces
– Produce requests w/ Map
– (Hidden) shuffle & sort with custom secondary sort
– Reduce, which sees records first, then requests
– Identity map with two inputs
– Custom secondary sort
– Reduce that sees old input first, then answers

Workflows are isomorphic
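The request-answer workflow above can be sketched in single-process Python as follows. It is a rough illustration under stated assumptions: the function names, the numeric tags used to emulate the custom secondary sort (0 for records, 1 for requests), and the toy `word_counts` and `test_docs` are hypothetical and not taken from the assignment code.

```python
def produce_requests(test_docs):
    """Map: for every distinct word in a test document, emit a request keyed by the word."""
    for doc_id, words in test_docs.items():
        for w in set(words):
            yield (w, 1, doc_id)  # tag 1 sorts after the record's tag 0

def answer_requests(word_counts, requests):
    """Sort records and requests together; each word's record is seen before its requests."""
    records = [(w, 0, counts) for w, counts in word_counts.items()]
    merged = sorted(records + list(requests), key=lambda r: (r[0], r[1]))
    answers = []
    current_word, current_counts = None, None
    for word, tag, payload in merged:
        if tag == 0:
            # A record: remember the counts for the requests that follow it.
            current_word, current_counts = word, payload
        else:
            # A request: answer with the remembered counts, or {} if the word has no record.
            counts = current_counts if word == current_word else {}
            answers.append((payload, word, counts))
    return answers

def regroup_by_doc(test_docs, answers):
    """Second pass: sort answers by document id so each document sees its counts."""
    by_doc = {doc_id: {} for doc_id in test_docs}
    for doc_id, word, counts in sorted(answers, key=lambda a: (a[0], a[1])):
        by_doc[doc_id][word] = counts
    return by_doc

# Hypothetical toy inputs mirroring the slide examples.
word_counts = {
    "aardvark": {"sports": 2},
    "agent": {"sports": 1027, "worldNews": 564},
    "zynga": {"sports": 21, "worldNews": 4464},
}
test_docs = {
    "id1": ["found", "aardvark", "zynga", "farmville", "today"],
    "id2": ["agent", "zynga"],
}
answers = answer_requests(word_counts, produce_requests(test_docs))
by_doc = regroup_by_doc(test_docs, answers)
```

Because the record for a word always sorts ahead of the requests for that word, the answering step needs to hold only one word's counts in memory at a time.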
Outline
Reminder about early/next assignments
Logistic regression and SGD
– Learning as optimization
– Logistic regression: a linear classifier optimizing P(y|x)
– Stochastic gradient descent: “streaming optimization” for ML problems
– Regularized logistic regression
– Sparse regularized logistic regression
– Memory-saving logistic regression
Learning as optimization: warmup
Goal: Learn the parameter θ of a binomial
Dataset: $D = \{x_1, \dots, x_n\}$, each $x_i$ is 0 or 1
MLE estimate of $\theta = \Pr(x_i = 1)$:
– #[flips where $x_i = 1$] / #[flips $x_i$] = $k/n$
Now: reformulate as optimization…
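Learning as optimization: warmup
Goal: Learn the parameter θ of a binomial
Dataset: $D = \{x_1, \dots, x_n\}$, each $x_i$ is 0 or 1, $k$ of them are 1
$\Pr(D \mid \theta) = \theta^{k} (1-\theta)^{n-k}$
Easier to optimize: $\log \Pr(D \mid \theta) = k \log \theta + (n-k) \log(1-\theta)$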
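Learning as optimization: warmup
Goal: Learn the parameter θ of a binomial
Dataset: $D = \{x_1, \dots, x_n\}$, each $x_i$ is 0 or 1, $k$ of them are 1
$\frac{d}{d\theta} \Pr(D \mid \theta) = k\,\theta^{k-1}(1-\theta)^{n-k} - (n-k)\,\theta^{k}(1-\theta)^{n-k-1} = \theta^{k-1}(1-\theta)^{n-k-1}\,\bigl[k(1-\theta) - (n-k)\theta\bigr]$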
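Learning as optimization: warmup
Goal: Learn the parameter θ of a binomial
Dataset: $D = \{x_1, \dots, x_n\}$, each $x_i$ is 0 or 1, $k$ of them are 1
Set the derivative of $\Pr(D \mid \theta)$ to zero: $\theta^{k-1}(1-\theta)^{n-k-1}\,\bigl[k(1-\theta) - (n-k)\theta\bigr] = 0$
Solutions: $\theta = 0$, $\theta = 1$, or $k - k\theta - n\theta + k\theta = 0 \Rightarrow n\theta = k \Rightarrow \theta = k/n$
When $0 < k < n$, $\theta = 0$ and $\theta = 1$ give $\Pr(D \mid \theta) = 0$, so the maximum is at $\theta = k/n$.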
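As a quick numerical sanity check of the warmup result, the sketch below maximizes the binomial log-likelihood over a grid and compares the maximizer with k/n. The helper names, the grid resolution, and the toy `flips` data are assumptions made for illustration; a grid search is used only because it is easy to verify, not because it is how one would fit a model in practice.

```python
import math

def log_likelihood(theta, k, n):
    """Log-likelihood of k ones in n binary draws, for theta strictly inside (0, 1)."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def mle_by_grid_search(data, steps=10000):
    """Return the grid point in (0, 1) that maximizes the log-likelihood."""
    n, k = len(data), sum(data)
    grid = [i / steps for i in range(1, steps)]  # excludes the endpoints 0 and 1
    return max(grid, key=lambda t: log_likelihood(t, k, n))

# Hypothetical toy data: k = 6 ones out of n = 10 flips.
flips = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
theta_hat = mle_by_grid_search(flips)
closed_form = sum(flips) / len(flips)
```

The grid maximizer agrees with the closed form k/n up to the grid spacing.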
Learning as optimization: general procedure
Goal: Learn the parameter θ of a classifier
– probably θ is a vector
Dataset: $D = \{x_1, \dots, x_n\}$
Write down $\Pr(D \mid \theta)$ as a function of θ
Maximize by differentiating and setting to zero
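Written out, the general procedure might look like the following. This is a sketch that assumes the examples are independent and identically distributed, so the likelihood factorizes over examples, and it uses the fact that the logarithm is monotone, so maximizing the log-likelihood yields the same θ as maximizing the likelihood.

```latex
\begin{align*}
\hat{\theta}
  &= \arg\max_{\theta} \Pr(D \mid \theta)
   = \arg\max_{\theta} \log \Pr(D \mid \theta)
   = \arg\max_{\theta} \sum_{i=1}^{n} \log \Pr(x_i \mid \theta) \\
\nabla_{\theta} \log \Pr(D \mid \theta) \Big|_{\theta = \hat{\theta}}
  &= \sum_{i=1}^{n} \nabla_{\theta} \log \Pr(x_i \mid \theta) \Big|_{\theta = \hat{\theta}}
   = 0
\end{align*}
```

When θ is a vector, the gradient condition is one equation per component of θ.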