10-405 Big ML
Scalable out-of-core classification (of large test sets): can we do better than the current approach?
Testing Large-vocab Naïve Bayes [For assignment]
For each example id, y, x1,….,xd in train:
Sort the event-counter update “messages”
Scan and add the sorted messages and output the final counter values
Initialize a HashSet NEEDED and a hashtable C
For each example id, y, x1,….,xd in test:
    Add x1,….,xd to NEEDED
For each event, C(event) in the summed counters:
    If event involves a NEEDED term x, read it into C
For each y’ in dom(Y):
    Compute log Pr(y’,x1,….,xd) = ….
Horrible kludge: assumes test set fits in memory
We have to move a lot of data around; is it even possible with only stream-and-sort? Can we do better?
Test data:
    id1 w1,1 w1,2 w1,3 …. w1,k1
    id2 w2,1 w2,2 w2,3 ….
    id3 w3,1 w3,2 ….
    id4 w4,1 w4,2 …
    id5 w5,1 w5,2 ….
    ..
Event counts:
    X=w1^Y=sports      5245
    X=w1^Y=worldNews   1054
    X=..               2120
    X=w2^Y=…           37
    X=…                3
    …
What we’d like:
    id1 w1,1 w1,2 w1,3 …. w1,k1    C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..], C[X=w1,2^…]
    id2 w2,1 w2,2 w2,3 ….          C[X=w2,1^Y=….]=1054, …, C[X=w2,k2^…]
    id3 w3,1 w3,2 ….               C[X=w3,1^Y=….]=…
    id4 w4,1 w4,2 …                …
Can we do better?
Event counts:
    X=w1^Y=sports      5245
    X=w1^Y=worldNews   1054
    X=..               2120
    X=w2^Y=…           37
    X=…                3
    …
Step 1: group counters by word w
How: stream and sort:
    for each C[X=w^Y=y]=n, print “w C[Y=y]=n”
    sort and build a list of values associated with each key w
Like an inverted index:
    w          Counts associated with w
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
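Step 1 can be sketched in a few lines of Python. This is a minimal in-memory sketch, not the assignment's scripts: the function names `counts_by_word` and `collect_records` are hypothetical, the input is assumed to be (event, count) pairs with events written as `X=w^Y=y`, and Python's `sorted` stands in for the unix sort.

```python
from itertools import groupby

def counts_by_word(event_counts):
    """Map step: ("X=agent^Y=sports", 1027) -> ("agent", "C[w^Y=sports]=1027")."""
    for event, n in event_counts:
        x_part, y_part = event.split("^", 1)   # assumes the word itself contains no "^"
        word = x_part[len("X="):]
        yield word, "C[w^%s]=%d" % (y_part, n)

def collect_records(sorted_pairs):
    """Reduce step: append together all values that share the same word."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, ",".join(value for _, value in group)

events = [
    ("X=zynga^Y=worldNews", 4464),
    ("X=agent^Y=sports", 1027),
    ("X=aardvark^Y=sports", 2),
    ("X=agent^Y=worldNews", 564),
    ("X=zynga^Y=sports", 21),
]
# sorted() plays the role of the sort between the two streaming steps
records = list(collect_records(sorted(counts_by_word(events))))
for word, counts in records:
    print(word, counts)
```

Only the current word's values are held at any time inside `collect_records`, which is what lets the same logic run over a stream too large for memory.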
If these records were in a key-value DB we would know what to do….
Test data:
    id1 w1,1 w1,2 w1,3 …. w1,k1
    id2 w2,1 w2,2 w2,3 ….
    id3 w3,1 w3,2 ….
    id4 w4,1 w4,2 …
    id5 w5,1 w5,2 ….
    ..
Record of all event counts for each word:
    w          Counts associated with w
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
Step 2: stream through the test data, and for each test case
    idi wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers (the classification logic).
Is there a stream-and-sort analog of this request-and-answer pattern?
For naïve Bayes training we turned non-local operations (incrementing counters) into messages, then sorted the messages to get locality, and finally (sort of) executed the sorted operations.
The requests, written as messages:
    w1,1 counters to id1
    w1,2 counters to id1
    …
    wi,j counters to idi
Example test data:
    id1 found an aardvark in zynga’s farmville today!
    id2 …
    id3 ….
    id4 …
    id5 …
    ..
Requests for id1:
    found ctrs to id1
    aardvark ctrs to id1
    …
    today ctrs to id1
~ is the last printable ASCII character, so with
    % export LC_COLLATE=C
a value that starts with ~ will sort after anything else with unix sort:
    found ~ctrs to id1
    aardvark ~ctrs to id1
    …
    today ~ctrs to id1
Combine and sort the counter records with the requests:
    found ~ctr to id1
    aardvark ~ctr to id2
    …
    today ~ctr to idi
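Generating the requests and merging them with the counter records might look like the sketch below. It makes several assumptions that the slides do not state: the function name `request_word_counts` is hypothetical, test lines are taken to be an id followed by whitespace-separated words, each word is requested once per document, and Python string comparison (by code point) stands in for unix sort under `LC_COLLATE=C`.

```python
def request_word_counts(test_lines):
    """Emit one (word, "~ctr to <id>") request per distinct word of each test document."""
    for line in test_lines:
        doc_id, text = line.split(None, 1)
        # assumption: one request per distinct word in a document is enough
        for word in sorted(set(text.lower().split())):
            yield word, "~ctr to " + doc_id

test_lines = ["id1 found an aardvark", "id2 aardvark agent"]
records = [
    ("aardvark", "C[w^Y=sports]=2"),
    ("agent", "C[w^Y=sports]=1027,C[w^Y=worldNews]=564"),
]
# "~" (code point 126) is greater than "C" (67), so for every word the
# counter record sorts ahead of the requests for that word.
merged = sorted(records + list(request_word_counts(test_lines)))
for key, value in merged:
    print(key, value)
```

Words such as "found" have requests but no counter record; the request handler has to cope with that case.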
A stream-and-sort analog of the request-and-answer pattern…
Record of all event counts for each word:
    w          Counts
    aardvark   C[w^Y=sports]=2
    agent      …
    …
    zynga      …
Requests:
    found ~ctr to id1
    aardvark ~ctr to id1
    …
    today ~ctr to id1
Combine and sort, giving the input to the request-handling logic:
    w          Counts
    aardvark   C[w^Y=sports]=2
               ~ctr to id1
    agent      C[w^Y=sports]=…
               ~ctr to id345
               ~ctr to id9854
               …
               ~ctr to id34742
    zynga      C[…]
A stream-and-sort analog of the request-and-answer pattern…
Request-handling logic, run over the combined and sorted records and requests:
    previousKey = somethingImpossible
    For each (key,val) in input:
        If key==previousKey
            Answer(recordForPrevKey,val)
        Else
            previousKey = key
            recordForPrevKey = val

    define Answer(record,request):
        find id where “request = ~ctr to id”
        print “id ~ctr for request is record”
Output of the request-handling logic:
    id1 ~ctr for aardvark is C[w^Y=sports]=2
    …
    id1 ~ctr for zynga is ….
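The request-handling logic can be written as a short Python generator. This is a sketch rather than the course's answerWordCountRequests.py: the function name is hypothetical, input is assumed to be (key, value) pairs already sorted so that each word's counter record precedes its requests, and, unlike the pseudocode, a request for a word with no counter record is skipped instead of being mistaken for a record.

```python
def answer_word_count_requests(sorted_pairs):
    """Answer each "~ctr to <id>" request with the counter record for its word."""
    prev_key = None              # word whose counter record was seen most recently
    record_for_prev_key = None
    prefix = "~ctr to "
    for key, val in sorted_pairs:
        if val.startswith(prefix):
            doc_id = val[len(prefix):]
            if key == prev_key:
                yield "%s ~ctr for %s is %s" % (doc_id, key, record_for_prev_key)
            # else: no counter record for this word (unseen in training), so no answer
        else:
            prev_key, record_for_prev_key = key, val

merged = [
    ("aardvark", "C[w^Y=sports]=2"),
    ("aardvark", "~ctr to id1"),
    ("agent", "C[w^Y=sports]=1027,C[w^Y=worldNews]=564"),
    ("agent", "~ctr to id345"),
    ("agent", "~ctr to id9854"),
    ("today", "~ctr to id1"),
]
answers = list(answer_word_count_requests(merged))
for line in answers:
    print(line)
```

Each answer starts with the requesting document id, so a later sort brings all the answers for one document together.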
A stream-and-sort analog of the request-and-answer pattern…
Output of the request-handling logic:
    id1 ~ctr for aardvark is C[w^Y=sports]=2
    …
    id1 ~ctr for zynga is ….
Test data:
    id1 found an aardvark in zynga’s farmville today!
    id2 …
    id3 ….
    id4 …
    id5 …
    ..
Combine and sort these two, then ????
What we’d wanted:
    id1 w1,1 w1,2 w1,3 …. w1,k1    C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..], C[X=w1,2^…]
    id2 w2,1 w2,2 w2,3 ….          C[X=w2,1^Y=….]=1054, …, C[X=w2,k2^…]
    id3 w3,1 w3,2 ….               C[X=w3,1^Y=….]=…
    id4 w4,1 w4,2 …                …
What we ended up with … and it’s good enough!
    Key   Value
    id1   found aardvark zynga farmville today
          ~ctr for aardvark is C[w^Y=sports]=2
          ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
          …
    id2   w2,1 w2,2 w2,3 ….
          ~ctr for w2,1 is …
Implementation summary
    cat train.dat | CountForNB.py … > eventCounts.dat
    cat eventCounts.dat | CountsByWord.py | sort \
      | CollectRecords.py > words.dat
    cat test.dat | requestWordCounts.py \
      | cat - words.dat | sort | answerWordCountRequests.py \
      | cat - test.dat | sort | testNBUsingRequests.py
train.dat:
    id1 w1,1 w1,2 w1,3 …. w1,k1
    id2 w2,1 w2,2 w2,3 ….
    id3 w3,1 w3,2 ….
    id4 w4,1 w4,2 …
    id5 w5,1 w5,2 ….
    ..
counts.dat:
    X=w1^Y=sports      5245
    X=w1^Y=worldNews   1054
    X=..               2120
    X=w2^Y=…           37
    X=…                3
    …
words.dat:
    w          Counts associated with w
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
Output of requestWordCounts.py looks like this:
    found ~ctr to id1
    aardvark ~ctr to id2
    …
    today ~ctr to idi
words.dat:
    w          Counts
    aardvark   C[w^Y=sports]=2
    agent      …
    …
    zynga      …
Input to answerWordCountRequests.py looks like this:
    w          Counts
    aardvark   C[w^Y=sports]=2
               ~ctr to id1
    agent      C[w^Y=sports]=…
               ~ctr to id345
               ~ctr to id9854
               …
               ~ctr to id34742
    zynga      C[…]
Output of answerWordCountRequests.py looks like this:
    id1 ~ctr for aardvark is C[w^Y=sports]=2
    …
    id1 ~ctr for zynga is ….
test.dat:
    id1 found an aardvark in zynga’s farmville today!
    id2 …
    id3 ….
    id4 …
    id5 …
    ..
Input to testNBUsingRequests.py looks like this:
    Key   Value
    id1   found aardvark zynga farmville today
          ~ctr for aardvark is C[w^Y=sports]=2
          ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
          …
    id2   w2,1 w2,2 w2,3 ….
          ~ctr for w2,1 is …
Discussion That’s kind of messy
Abstractions For Map-Reduce
Abstractions On Top Of Map-Reduce
Some obvious streaming processes: for each row in a table
    Transform it and output the result
        Example: stem words in a stream of word-count pairs: (“aardvarks”,1) → (“aardvark”,1)
        Proposed syntax: table2 = MAP table1 BY λ row : f(row), where f(row) → row’
    Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test
        Example: apply stop words: (“aardvark”,1) → (“aardvark”,1), (“the”,1) → deleted
        Proposed syntax: table2 = FILTER table1 BY λ row : f(row), where f(row) → {true,false}
Abstractions On Top Of Map-Reduce
A non-obvious(?) streaming process: for each row in a table
    Transform it to a list of items
    Splice all the lists together to get the output table (flatten)
Proposed syntax: table2 = FLATMAP table1 BY λ row : f(row), where f(row) → list of rows
Example: tokenizing a line
    “I found an aardvark” → [“i”, “found”, “an”, “aardvark”]
    “We love zymurgy” → [“we”, “love”, “zymurgy”]
..but final table is one word per row:
    “i”
    “found”
    “an”
    “aardvark”
    “we”
    “love”
    …
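The three streaming operations are easy to mimic with Python generators. The sketch below is only an illustration of the proposed MAP, FILTER and FLATMAP semantics: the helper names `map_table`, `filter_table` and `flatmap_table` are hypothetical, and the "stemmer" is a toy that strips a trailing "s".

```python
from itertools import chain

def map_table(table, f):
    """MAP: transform every row."""
    return (f(row) for row in table)

def filter_table(table, f):
    """FILTER: keep only the rows that pass the boolean test f."""
    return (row for row in table if f(row))

def flatmap_table(table, f):
    """FLATMAP: f returns a list per row; splice the lists into one table."""
    return chain.from_iterable(f(row) for row in table)

def toy_stem(row):
    # toy stand-in for a real stemmer: drop one trailing "s"
    word, n = row
    return (word[:-1], n) if word.endswith("s") else (word, n)

stemmed = list(map_table([("aardvarks", 1), ("aardvark", 1)], toy_stem))
kept = list(filter_table([("aardvark", 1), ("the", 1)], lambda row: row[0] not in {"the"}))
words = list(flatmap_table(["I found an aardvark", "We love zymurgy"],
                           lambda line: line.lower().split()))
print(stemmed, kept, words)
```

All three work one row at a time, so none of them needs a sort or more than constant memory.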
NB Test Step
Event counts:
    X=w1^Y=sports      5245
    X=w1^Y=worldNews   1054
    X=..               2120
    X=w2^Y=…           37
    X=…                3
    …
How: stream and sort:
    for each C[X=w^Y=y]=n, print “w C[Y=y]=n”
    sort and build a list of values associated with each key w
Like an inverted index:
    w          Counts associated with w
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
NB Test Step
The general case:
    We’re taking rows from a table
        In a particular format (event,count)
    Applying a function to get a new value
        The word for the event
    And grouping the rows of the table by this new value
    Grouping operation: special case of a map-reduce
Proposed syntax: GROUP table BY λ row : f(row)
    Could define f via: a function, a field of a defined record structure, …
    f(row) → field
Input (event counts):
    X=w1^Y=sports      5245
    X=w1^Y=worldNews   1054
    X=..               2120
    X=w2^Y=…           37
    X=…                3
    …
Output:
    w          Counts associated with w
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
NB Test Step
The general case:
    We’re taking rows from a table
        In a particular format (event,count)
    Applying a function to get a new value
        The word for the event
    And grouping the rows of the table by this new value
    Grouping operation: special case of a map-reduce
Proposed syntax: GROUP table BY λ row : f(row)
    Could define f via: a function, a field of a defined record structure, …
    f(row) → field
Aside: you guys know how to implement this, right?
    Output pairs (f(row),row) with a map/streaming process
    Sort pairs by key, which is f(row)
    Reduce and aggregate by appending together all the values associated with the same key
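The three implementation steps in the aside translate almost line for line into Python. In this sketch `group_by` is a hypothetical helper, the table is an in-memory list, and `list.sort` stands in for the external sort.

```python
from itertools import groupby
from operator import itemgetter

def group_by(table, f):
    """GROUP table BY f, implemented as map, sort, reduce."""
    pairs = [(f(row), row) for row in table]          # map: output (f(row), row)
    pairs.sort(key=itemgetter(0))                     # sort pairs by key only (stable)
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [row for _, row in group]          # reduce: append values per key

def word_of(row):
    # event looks like "X=agent^Y=sports"; pull out "agent"
    event, _count = row
    return event.split("^", 1)[0][len("X="):]

event_counts = [
    ("X=agent^Y=sports", 1027),
    ("X=aardvark^Y=sports", 2),
    ("X=agent^Y=worldNews", 564),
]
grouped = list(group_by(event_counts, word_of))
for word, rows in grouped:
    print(word, rows)
```

Because the sort is stable and uses only the key, rows within a group keep their input order.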
Some other stream and sort tasks
Input: outlinks from page: src dst1 dst2 …
Algorithm:
    For each input line src dst1 dst2 … dstn print out
        dst1 inlinks.= src
        dst2 inlinks.= src
        …
        dstn inlinks.= src
    Sort this output
    Collect the messages and group to get
        dst src1 src2 … srcn
Some other stream and sort tasks
Summing the deltas for each event:
    prevKey = Null
    sumForPrevKey = 0
    For each (event, delta) in input:
        If event==prevKey
            sumForPrevKey += delta
        Else
            OutputPrevKey()
            prevKey = event
            sumForPrevKey = delta
    OutputPrevKey()

    define OutputPrevKey():
        If prevKey!=Null
            print prevKey, sumForPrevKey
Collecting the inlinks for each page:
    prevKey = Null
    docsForPrevKey = [ ]
    For each (dst, src) in input:
        If dst==prevKey
            docsForPrevKey.append(src)
        Else
            OutputPrevKey()
            prevKey = dst
            docsForPrevKey = [src]
    OutputPrevKey()

    define OutputPrevKey():
        If prevKey!=Null
            print prevKey, docsForPrevKey
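Both reducers follow the same previous-key pattern, which a runnable Python version makes easier to see. The function names below are hypothetical, the input is assumed to be already sorted by key, and the example runs the whole inlinks task in memory with `sorted` in place of unix sort.

```python
def sum_by_key(sorted_pairs):
    """Sum the deltas of consecutive pairs that share a key."""
    prev_key, total = None, 0
    for event, delta in sorted_pairs:
        if event == prev_key:
            total += delta
        else:
            if prev_key is not None:
                yield prev_key, total
            prev_key, total = event, delta
    if prev_key is not None:          # flush the last key
        yield prev_key, total

def collect_by_key(sorted_pairs):
    """Collect the values of consecutive pairs that share a key."""
    prev_key, docs = None, []
    for dst, src in sorted_pairs:
        if dst == prev_key:
            docs.append(src)
        else:
            if prev_key is not None:
                yield prev_key, docs
            prev_key, docs = dst, [src]
    if prev_key is not None:          # flush the last key
        yield prev_key, docs

def inlinks(outlink_lines):
    """src dst1 dst2 ... -> one (dst, src) message per link, sorted, then grouped."""
    messages = []
    for line in outlink_lines:
        src, *dsts = line.split()
        messages.extend((dst, src) for dst in dsts)
    return list(collect_by_key(sorted(messages)))

sums = list(sum_by_key([("x", 1), ("x", 2), ("y", 5)]))
links = inlinks(["a b c", "b c", "c a"])
print(sums, links)
```

The flush after the loop matters: without it the final key in the stream is never output.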
Abstractions On Top Of Map-Reduce And another example from the Naïve Bayes test program…
Request-and-answer
Test data:
    id1 w1,1 w1,2 w1,3 …. w1,k1
    id2 w2,1 w2,2 w2,3 ….
    id3 w3,1 w3,2 ….
    id4 w4,1 w4,2 …
    id5 w5,1 w5,2 ….
    ..
Record of all event counts for each word:
    w          Counts associated with w
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
Step 2: stream through the test data, and for each test case
    idi wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers (the classification logic).
Request-and-answer
Break down into stages:
    Generate the data being requested (indexed by key, here a word)
        Eg with group … by
    Generate the requests as (key, requestor) pairs
        Eg with flatmap … to
    Join these two tables by key
        Join defined as (1) cross-product and (2) filter out pairs with different values for keys
        This replaces the step of concatenating two different tables of key-value pairs, and reducing them together
    Postprocess the joined result
[Figure: a table of requests (w, Request), with rows such as “zynga ~ctr to id2”, and a table of counters (w, Counters) are joined on w to give a table (w, Counters, Requests).]
Counters:
    w          Counters
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
Joined result:
    w          Counters          Requests
    aardvark   C[w^Y=sports]=2   ~ctr to id1
    agent      C[w^Y=sports]=…   ~ctr to id345
                                 ~ctr to id9854
                                 …
                                 ~ctr to id34742
    zynga      C[…]
Proposed syntax: JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)
Example (using python syntax for functions):
    JOIN wordInDoc BY lambda (word,docid):word, wordCounters BY lambda (word,counters):word
wordInDoc (w, Request): found id1 aardvark … zynga id2
wordCounters:
    w          Counters
    aardvark   C[w^Y=sports]=2
    agent      C[w^Y=sports]=1027,C[w^Y=worldNews]=564
    …
    zynga      C[w^Y=sports]=21,C[w^Y=worldNews]=4464
Joined result:
    w          Counters          Requests
    aardvark   C[w^Y=sports]=2   id1
    agent      C[w^Y=sports]=…   id345
                                 id9854
                                 …
                                 id34742
    zynga      C[…]
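Taken literally, the definition of a join as a cross-product followed by a filter on equal keys is a one-liner. The sketch below implements exactly that definition for small in-memory tables; `join` is a hypothetical helper, and the example rows are illustrative. The tuple-unpacking form `lambda (word,docid): word` is Python 2 syntax, so the key functions here index into the row instead.

```python
from itertools import product

def join(table1, f, table2, g):
    """JOIN table1 BY f, table2 BY g: cross-product, then keep pairs with equal keys."""
    return [(r1, r2) for r1, r2 in product(table1, table2) if f(r1) == g(r2)]

word_in_doc = [("found", "id1"), ("aardvark", "id1"), ("zynga", "id2")]
word_counters = [
    ("aardvark", "C[w^Y=sports]=2"),
    ("zynga", "C[w^Y=sports]=21,C[w^Y=worldNews]=4464"),
]
joined = join(word_in_doc, lambda row: row[0],      # key: the word of (word, docid)
              word_counters, lambda row: row[0])    # key: the word of (word, counters)
for pair in joined:
    print(pair)
```

The cross-product costs time proportional to the product of the table sizes, which is why real systems implement the same result with a sort or a hash table instead.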
Map vs Reduce-side Joins
Two ways to join
    Reduce-side join
    Map-side join
Two ways to join
Reduce-side join for A,B
A:
    term       df
    found      2456
    aardvark   7
    …
B:
    term       docId
    aardvark   d15
    …          ...
    found      d7
               d23
Concat and sort, then do the join:
    (aardvark, 7)
    (aardvark, d15)
    … ...
    (found,2456)
    (found,d7)
    (found,d23)
Two ways to join
Reduce-side join for A,B
Tricky bit: need to sort by the first two values (aardvark, AB): we want the DF’s to come first, but all tuples with key “aardvark” should go to the same worker
A, tagged:
    term       tag   df
    found      A     2456
    aardvark   A     7
    …
B, tagged:
    term       tag   docId
    aardvark   B     d15
    …                ...
    found      B     d7
               B     d23
Concat and sort, then do the join:
    (aardvark, 7)
    (aardvark, d15)
    … ...
    (found,2456)
    (found,d7)
    (found,d23)
Two ways to join
Reduce-side join for A,B
Tricky bit: need to sort by the first two values (aardvark, AB): we want the DF’s to come first, but all tuples with key “aardvark” should go to the same worker
    custom sort (secondary sort key): Writable with your own Comparator
    custom Partitioner (specified for job like the Mapper, Reducer, ..)
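The tagging trick can be sketched in plain Python without Hadoop. Everything here is illustrative: `partition` and `reduce_side_join` are hypothetical names, the sort key (term, tag) plays the role of the custom comparator, and a CRC of the term alone plays the role of the custom partitioner.

```python
import zlib

def partition(term, n_workers):
    """Choose a worker from the term only, never the tag, so A and B rows meet."""
    return zlib.crc32(term.encode("utf-8")) % n_workers

def reduce_side_join(a_rows, b_rows):
    """Join A=(term, df) with B=(term, docId) by tagging, sorting and scanning."""
    tagged = [(term, "A", df) for term, df in a_rows]
    tagged += [(term, "B", doc_id) for term, doc_id in b_rows]
    tagged.sort(key=lambda t: (t[0], t[1]))     # secondary key: "A" sorts before "B"
    current_term, current_df = None, None
    for term, tag, value in tagged:
        if tag == "A":
            current_term, current_df = term, value   # the DF arrives first
        elif term == current_term:
            yield term, current_df, value            # pair each docId with the DF

a = [("found", 2456), ("aardvark", 7)]
b = [("aardvark", "d15"), ("found", "d7"), ("found", "d23")]
joined = list(reduce_side_join(a, b))
print(joined)
```

Because the DF row is seen before any docId row for the same term, the reducer holds only one DF value at a time rather than buffering all rows for a key.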
Two ways to join
Map-side join
    write the smaller relation out to disk
    send it to each Map worker
        DistributedCache
    when you initialize each Mapper, load in the small relation
        Configure(…) is called at initialization time
    map through the larger relation and do the join
    faster, but requires one relation to go in memory for every mapper
Two ways to join
Map-side join for A (small) and B (large)
Load A into the mapper:
    term       df
    found      2456
    aardvark   7
    …
Join as you map over B:
    term       docId
    aardvark   d15
    …          ...
    found      d7
               d23
Tuples shown in the figure:
    (aardvark, 7)
    (aardvark, d15)
    … ...
    (found,2456)
    (found,d7)
    (found,d23)
Two ways to join
Map-side join for A (small) and B (large)
Duplicate A and load it into every mapper:
    term       df
    found      2456
    aardvark   7
    …
Join as you map over each partition of B:
    B1:
        term       docId
        aardvark   d15
        …          ...
    B2:
        term       docId
        found      d7
                   d23
        …
Tuples shown in the figure:
    with B1: (aardvark, 7) (aardvark, d15) … ...
    with B2: (found,2456) (found,d7) (found,d23) …
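A map-side join needs no sort at all, which the following sketch tries to show. It is a plain-Python stand-in, not Hadoop code: `map_side_join` is a hypothetical function, the dict built at the top plays the role of loading the small relation at mapper initialization, and the two lists stand for two partitions of B handled by different mappers.

```python
def map_side_join(small_a, b_partition):
    """Join one partition of the large relation B against the in-memory relation A."""
    df_by_term = dict(small_a)            # "mapper initialization": load A into memory
    for term, doc_id in b_partition:      # then a single streaming pass over B
        if term in df_by_term:
            yield term, df_by_term[term], doc_id

a = [("found", 2456), ("aardvark", 7)]
b1 = [("aardvark", "d15")]                # partition seen by mapper 1
b2 = [("found", "d7"), ("found", "d23")]  # partition seen by mapper 2

out1 = list(map_side_join(a, b1))
out2 = list(map_side_join(a, b2))
print(out1, out2)
```

Each mapper builds its own copy of the dictionary, so the memory cost of A is paid once per mapper.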
PIG: A Workflow Language
PIG: word count example
A PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. allowed.
Tokenize: built-in function
Flatten: special keyword, which applies to the next step in the process, so output is a stream of words w/o document boundaries
Built-in regex matching ….
Group produces a stream of bags of identical words… bags, tuples, dictionaries are primitive types
Group by … foreach generate count(…) will be optimized into a single map-reduce
Guinea PIG
GuineaPig: PIG in Python
Pure Python (< 1500 lines)
Streams Python data structures
    strings, numbers, tuples (a,b), lists [a,b,c]
    No records: operations defined functionally
Compiles to Hadoop streaming pipeline
    Optimizes sequences of MAPs
Runs locally without Hadoop
    compiles to stream-and-sort pipeline
    intermediate results can be viewed
Can easily run parts of a pipeline
http://curtis.ml.cmu.edu/w/courses/index.php/Guinea_Pig
GuineaPig: PIG in Python
Pure Python, streams Python data structures
    not too much new to learn (eg field/record notation, special string operations, UDFs, …)
    codebase is small and readable
Compiles to Hadoop or stream-and-sort, can easily run parts of a pipeline
    intermediate results often are (and always can be) stored and inspected
    plan is fairly visible
Syntax includes high-level operations but also fairly detailed description of an optimized map-reduce step
    Flatten | Group(by=…, retaining=…, reducingTo=…)
A wordcount example
Class variables in the planner are data structures
Wordcount example ….
A program is converted to a data structure
The data structure can be converted to a series of “abstract map-reduce tasks” and then shell commands
Steps in the compiled plan invoke your script with special args
Wordcount example …. Data structure can be converted to commands for streaming hadoop
Wordcount example …. Of course you won’t access local files with Hadoop, so you need to specify an HDFS location for inputs and outputs
More examples of GuineaPig
Join syntax, macros, Format command
Incremental debugging, when intermediate views are stored:
    % python wrdcmp.py --store result
    …
    % python wrdcmp.py --store result --reuse cmp
More examples of GuineaPig
Full syntax for Group:
    Group(wc, by=lambda (word,count):word[:k],
          retaining=lambda (word,count):count,
          combiningTo=ReduceToSum(),
          reducingTo=ReduceToSum())
equivalent to:
    Group(wc, by=lambda (word,count):word[:k],
          reducingTo=ReduceTo(int, lambda accum,(word,count): accum+count))
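Why the two Group calls agree can be checked with a small plain-Python emulation. This sketch does not import or reproduce GuineaPig: the `group` function below is a hypothetical stand-in that keeps accumulators in a dict, models a reducer as an (initial-value factory, step function) pair, and ignores the combiner.

```python
def group(rows, by, reducing_to, retaining=lambda row: row):
    """Emulate Group: reduce the retained part of each row into a per-key accumulator."""
    init, step = reducing_to
    accum = {}
    for row in rows:
        key = by(row)
        if key not in accum:
            accum[key] = init()                      # int() == 0 for a sum
        accum[key] = step(accum[key], retaining(row))
    return sorted(accum.items())

k = 2
wc = [("aardvark", 3), ("aaron", 2), ("zynga", 5), ("zymurgy", 1)]

# Variant 1: retain only the count, then sum the retained values.
with_retaining = group(wc, by=lambda row: row[0][:k],
                       retaining=lambda row: row[1],
                       reducing_to=(int, lambda accum, value: accum + value))

# Variant 2: retain the whole row and pick the count out inside the reducer.
without_retaining = group(wc, by=lambda row: row[0][:k],
                          reducing_to=(int, lambda accum, row: accum + row[1]))

print(with_retaining, without_retaining)
```

Both variants total the counts of all words that share a k-letter prefix; they differ only in where the count is extracted from the row.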
Another Example: Computing TFIDF in Guinea Pig
Implementation
Implementation
    docId   term
    d123    found
            aardvark
    …
Rows are pairs such as (d123,found)
Implementation
    docId   term
    d123    found
            aardvark
    key        value
    found      (d123,found),(d134,found),…    2456
    aardvark   (d123,aardvark),…              7
Implementation
    docId   term
    d123    found
            aardvark
    key        value
    found      2456
    aardvark   7
Sample rows from the views:
    ('1', 'quite') ('1', 'a') ('1', 'difference.') … ('3', 'alcohol')
    ("'94", 1) ("'94,", 1) ("'a", 1) ("'alcohol", 1) …
    (('2', "'alcohol"), ("'alcohol", 1)) (('550', "'cause"), ("'cause", 1)) …
Implementation
    docId   term       df
    d123    found      2456
            aardvark   7
Sample rows:
    ('2', "'confabulation'.", 2)
    ('3', "'confabulation'.", 2)
    ('209', "'controversy'", 1)
    ('181', "'em", 3)
    ('434', "'em", 3)
    ('452', "'em", 3)
    ('113', "'fancy", 1)
    ('212', "'franchise',", 1)
    ('352', "'honest,", 1)
Implementation: Map-side join
Augment: loads a preloaded object b at mapper initialization time, cycles thru the input, and generates pairs (a,b)
    docId   term       df     b
    d123    found      2456   arbitrary python object
            aardvark   7
    …
Implementation
Augment: loads a preloaded object b at mapper initialization time, cycles thru the input, and generates pairs (a,b), where b points to the preloaded object
    docId   term       df     b
    d123    found      2456   ptr to an arbitrary python object
            aardvark   7
    …
Sample rows:
    (('2', "'confabulation'.", 2), ('ndoc', 964))
    (('3', "'confabulation'.", 2), ('ndoc', 964))
    (('209', "'controversy'", 1), ('ndoc', 964))
    (('181', "'em", 3), ('ndoc', 964))
    (('434', "'em", 3), ('ndoc', 964))
Implementation
Augment: loads a preloaded object b at mapper initialization time, cycles thru the input, and generates pairs (a,b), where b points to the preloaded object
This looks like a join. But it’s different.
    It’s a single map, not a map-shuffle/sort-reduce
    The loaded object is paired with every a, not just ones where the join keys match (but you can use it for a map-side join!)
    The loaded object has to be distributed to every mapper (so, copied!)
Sample rows:
    (('2', "'confabulation'.", 2), ('ndoc', 964))
    (('3', "'confabulation'.", 2), ('ndoc', 964))
    (('209', "'controversy'", 1), ('ndoc', 964))
    (('181', "'em", 3), ('ndoc', 964))
    (('434', "'em", 3), ('ndoc', 964))
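The behaviour described for Augment can be imitated with a short generator. This is a plain-Python sketch, not GuineaPig's implementation: `augment` and `load_side_object` are hypothetical names, and the loader simply returns the ('ndoc', 964) pair seen in the sample rows.

```python
def augment(rows, load_side_object):
    """Pair every input row a with one preloaded object b."""
    b = load_side_object()      # done once, like mapper initialization
    for a in rows:
        yield (a, b)            # no key comparison: every row gets the same b

rows = [
    ("2", "'confabulation'.", 2),
    ("3", "'confabulation'.", 2),
    ("209", "'controversy'", 1),
]
pairs = list(augment(rows, lambda: ("ndoc", 964)))
same_object = all(b is pairs[0][1] for _, b in pairs)
for pair in pairs:
    print(pair)
```

To turn this into a map-side join, the preloaded object could be a dictionary and a following map step could look up each row's key in it.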
Implementation
Gotcha: if you store an augment, it’s printed on disk, and Python writes the object pointed to, not the pointer. So when you store you make a copy of the object for every row.
    docId   term       df     b
    d123    found      2456   ptr to an arbitrary python object
            aardvark   7
    …
Stored rows:
    (('2', "'confabulation'.", 2), printed-object)
    (('3', "'confabulation'.", 2), printed-object)
    (('209', "'controversy'", 1), printed-object)
    (('181', "'em", 3), printed-object)
    (('434', "'em", 3), printed-object)
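The gotcha is easy to reproduce in a few lines. The sketch assumes that "printed on disk" means each row is written as its `repr` and read back with `ast.literal_eval`; that storage format is an assumption made for illustration, and the side object is a made-up dictionary.

```python
import ast

side_object = {"ndoc": 964}
rows = [("2", "'confabulation'.", 2), ("3", "'confabulation'.", 2), ("209", "'controversy'", 1)]

# In memory, every pair points at the one shared object.
in_memory = [(a, side_object) for a in rows]
shares_one_object = all(b is side_object for _, b in in_memory)

# "Storing" prints the object itself into every row...
stored_lines = [repr(pair) for pair in in_memory]
# ...so reading the rows back produces a separate copy per row.
reloaded = [ast.literal_eval(line) for line in stored_lines]
copies_after_reload = len({id(b) for _, b in reloaded})

print(shares_one_object, copies_after_reload)
print(stored_lines[0])
```

With a large side object, the stored view therefore grows by roughly the object's printed size for every row.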
Full Implementation
TFIDF with map-side joins