Some Details About Stream- and-Sort Operations William W. Cohen.

Slides:



Advertisements
Similar presentations
CS 400/600 – Data Structures External Sorting.
Advertisements

Overview of this week Debugging tips for ML algorithms
Rocchio’s Algorithm 1. Motivation Naïve Bayes is unusual as a learner: – Only one pass through data – Order doesn’t matter 2.
Sorting Really Big Files Sorting Part 3. Using K Temporary Files Given  N records in file F  M records will fit into internal memory  Use K temp files,
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
CS4432: Database Systems II Buffer Manager 1. 2 Covered in week 1.
CS252: Systems Programming Ninghui Li Program Interview Questions.
Lecture 8 Join Algorithms. Intro Until now, we have used nested loops for joining data – This is slow, n^2 comparisons How can we do better? – Sorting.
Naïve Bayes for Large Vocabularies William W. Cohen.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
Basics of MapReduce Shannon Quinn. Today Naïve Bayes with huge feature sets – i.e. ones that don’t fit in memory Pros and cons of possible approaches.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
External Sorting R & G Chapter 13 One of the advantages of being
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
File Systems Implementation. 2 Recap What we have covered: –User-level view of FS –Storing files: contiguous, linked list, memory table, FAT, I-nodes.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
A. Frank - P. Weisberg Operating Systems Introduction to Cooperating Processes.
The Computer Processor
Chapter 8 File Processing and External Sorting. Primary vs. Secondary Storage Primary storage: Main memory (RAM) Secondary Storage: Peripheral devices.
External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.
Using Large-Vocabulary Classifiers William W. Cohen.
Phrase Finding with “Request-and-Answer” William W. Cohen.
Notice: Changed TA Office hour Thursday 11am-1pm  noon-2pm.
Using Large-Vocabulary Classifiers William W. Cohen.
Phrase Finding; Stream-and-Sort vs “Request-and-Answer” William W. Cohen.
External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.
1 I/O Management and Disk Scheduling Chapter Categories of I/O Devices Human readable –Used to communicate with the user –Printers –Video display.
Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS Fall 2013.
Database Management Systems, R. Ramakrishnan and J. Gehrke 1 External Sorting Chapter 13.
MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –
Pipes and Filters Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Naïve Bayes for Large Vocabularies William W. Cohen.
Memory Management during Run Generation in External Sorting – Larson & Graefe.
Algorithms and Abstractions for Stream-and-Sort. Announcements Thursday: there will be a short quiz quiz will close at midnight Thursday, but probably.
Hashing Hashing is another method for sorting and searching data.
Naïve Bayes with Large Feature sets: (1)Stream and Sort (2)Request and Answer Pattern (3) Rocchio’s Algorithm COSC 526 Class 4 Arvind Ramanathan Computational.
CPSC 461 Final Review I Hessam Zakerzadeh Dina Said.
CS4432: Database Systems II Query Processing- Part 2.
Naïve Bayes with Large Vocabularies (aka, stream- and-sort) Shannon Quinn.
Some More Efficient Learning Methods William W. Cohen.
Scaling up: Naïve Bayes on Hadoop What, How and Why?
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.
Phrase Finding; Stream-and-Sort vs “Request-and-Answer” William W. Cohen.
Naïve Bayes for Large Vocabularies and Stream-and-Sort Programming William W. Cohen 1.
Static block can be used to check conditions before execution of main begin, Suppose we have developed an application which runs only on Windows operating.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Chapter 9: Sorting1 Sorting & Searching Ch. # 9. Chapter 9: Sorting2 Chapter Outline  What is sorting and complexity of sorting  Different types of.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
Phrase Finding; Stream-and-Sort to “Request-and-Answer” William W. Cohen.
CS Class 04 Topics  Selection statement – IF  Expressions  More practice writing simple C++ programs Announcements  Read pages for next.
Mistake Bounds William W. Cohen. One simple way to look for interactions Naïve Bayes – two class version dense vector of g(x,y) scores for each word in.
Announcements HW1b is going out today You should now be on autolab
Announcements Working AWS codes will be out soon
Basics of digital systems
10-605: Map-Reduce Workflows
CS179G, Project In Computer Science
CS222P: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Big ML Hadoop.
Lecture 2- Query Processing (continued)

GCSE OCR 1 The CPU Computer Science J276 Unit 1
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
These notes were largely prepared by the text’s author
External Sorting Adapt fastest internal-sort methods.
External Sorting Chapter 13
Presentation transcript:

Some Details About Stream- and-Sort Operations William W. Cohen

MERGE SORTS

Bottom-Up Merge Sort use: input array A[n] ; buffer array B[n] assert: A[ ] contains sorted runs of length r=1 for run-length r=1, 2,4,8,… merge adjacent length- r runs in A[ ], copying the result into the buffer B[ ] assert: B[ ] contains sorted runs of length 2*r swap roles of A and B

Wikipedia on Old-School Merge Sort Use four tape drives A,B,C,D 1. merge runs from A,B and write them alternately into C,D 2. merge runs from C,D and write them alternately into A,B 3. And so on…. Requires only constant memory. Use four tape drives A,B,C,D 1. merge runs from A,B and write them alternately into C,D 2. merge runs from C,D and write them alternately into A,B 3. And so on…. Requires only constant memory.

21 st Century Sorting

Unix Sort Load as much as you can [actually --buffer-size=SIZE] into memory and do an in- memory sort [usually quicksort]. If you have more to do, then spill this sorted buffer out on to disk, and get a another buffer’s worth of data. Finally, merge your spill buffers. Load as much as you can [actually --buffer-size=SIZE] into memory and do an in- memory sort [usually quicksort]. If you have more to do, then spill this sorted buffer out on to disk, and get a another buffer’s worth of data. Finally, merge your spill buffers.

PIPES

How Unix Pipes Work Processes are all started at the same time Data streaming thru the paper is held in a queue: writer  […queue…]  reader If the queue is full : – the writing process is blocked If the queue is empty : – the reading process is blocked (I think) queues are usually smallish: 64k

How stream-and-sort works Pipeline is stream  […queue…]  sort Algorithm you get: – sort reads --buffer-size lines in, sorts them, spills them to disk – sort merges spill files after stream closes – stream is blocked when sort falls behind – and sort is blocked if it gets ahead

LOOKING AHEAD TO PARALLELIZATION…

Stream and Sort Counting  Distributed Counting example 1 example 2 example 3 …. Counting logic “C[ x] +=D” Machines A1,… Sort C[x1] += D1 C[x1] += D2 …. Logic to combine counter updates Machines C1,.., Machines B1,…, Trivial to parallelize! Easy to parallelize! Standardized message routing logic

Stream and Sort Counting  Distributed Counting example 1 example 2 example 3 …. Counting logic “C[ x] +=D” Machines A1,… Sort C[x1] += D1 C[x1] += D2 …. Logic to combine counter updates Machines C1,.., Spill 1 Spill 2 Spill 3 … Merge Spill Files

Stream and Sort Counting  Distributed Counting example 1 example 2 example 3 …. Counting logic “C[ x] +=D” Counter Machine Sort C[x1] += D1 C[x1] += D2 …. Logic to combine counter updates Combiner Machine Spill 1 Spill 2 Spill 3 … Merge Spill Files

Stream and Sort Counting  Distributed Counting example 1 example 2 example 3 …. Counting logic Counter Machine 1 Partition/Sort Spill 1 Spill 2 Spill 3 … C[x1] += D1 C[x1] += D2 …. C[x1] += D1 C[x1] += D2 …. Logic to combine counter updates Combiner Machine 1 Merge Spill Files example 1 example 2 example 3 …. example 1 example 2 example 3 …. Counting logic Counter Machine 2 Partition/Sort Spill 1 Spill 2 Spill 3 … C[x1] += D1 C[x1] += D2 …. C[x1] += D1 C[x1] += D2 …. combine counter updates Combiner Machine 2 Merge Spill Files Spill n

COMMENTS ON BUFFERING

Review: Large-vocab Naïve Bayes Create a hashtable C For each example id, y, x 1,….,x d in train: – C(“ Y =ANY”) ++; C(“ Y=y”) ++ – For j in 1..d : C(“ Y=y ^ X=x j ”) ++

Large-vocabulary Naïve Bayes Create a hashtable C For each example id, y, x 1,….,x d in train: – C(“ Y =ANY”) ++; C(“ Y=y”) ++ – Print “Y=ANY += 1” – Print “Y= y += 1” – For j in 1..d : C(“ Y=y ^ X=x j ”) ++ Print “ Y=y ^ X=x j += 1” Sort the event-counter update “messages” Scan the sorted messages and compute and output the final counter values Think of these as “messages” to another component to increment the counters java MyTrainertrain | sort | java MyCountAdder > model

Large-vocabulary Naïve Bayes Create a hashtable C For each example id, y, x 1,….,x d in train: – C(“ Y =ANY”) ++; C(“ Y=y”) ++ – Print “Y=ANY += 1” – Print “Y= y += 1” – For j in 1..d : C(“ Y=y ^ X=x j ”) ++ Print “ Y=y ^ X=x j += 1” Sort the event-counter update “messages” – We’re collecting together messages about the same counter Scan and add the sorted messages and output the final counter values Y=business+=1 … Y=business ^ X =aaa+=1 … Y=business ^ X=zynga +=1 Y=sports ^ X=hat+=1 Y=sports ^ X=hockey+=1 … Y=sports ^ X=hoe+=1 … Y=sports+=1 …

Large-vocabulary Naïve Bayes Y=business+=1 … Y=business ^ X =aaa+=1 … Y=business ^ X=zynga +=1 Y=sports ^ X=hat+=1 Y=sports ^ X=hockey+=1 … Y=sports ^ X=hoe+=1 … Y=sports+=1 … previousKey = Null sumForPreviousKey = 0 For each ( event,delta ) in input: If event ==previousKey sumForPreviousKey += delta Else OutputPreviousKey() previousKey = event sumForPreviousKey = delta OutputPreviousKey() define OutputPreviousKey(): If PreviousKey!=Null print PreviousKey,sumForPreviousKey Accumulating the event counts requires constant storage … as long as the input is sorted. streaming Scan-and-add:

Distributed Counting  Stream and Sort Counting example 1 example 2 example 3 …. Counting logic Hash table1 “C[ x] +=D” Hash table2 Machine 1 Machine 2 Machine K... Machine 0 Message-routing logic

Distributed Counting  Stream and Sort Counting example 1 example 2 example 3 …. Counting logic “C[ x] +=D” Machine A Sort C[x1] += D1 C[x1] += D2 …. Logic to combine counter updates Machine C Machine B BUFFER

Review: Large-vocab Naïve Bayes Create a hashtable C For each example id, y, x 1,….,x d in train: – C.inc(“ Y =ANY”); C.inc(“ Y=y”) – For j in 1..d : C.inc(“ Y=y ^ X=x j ”) class EventCounter { void inc(String event) { // increment the right hashtable slot if (hashtable.size()>BUFFER_SIZE) { for (e,n) in hashtable.entries : print e + “\t” + n hashtable.clear(); }

How much does buffering help? small-events.txt: nb.jar time java -cp nb.jar com.wcohen.SmallStreamNB < RCV1.small_train.txt \ | sort -k1,1 \ | java -cp nb.jar com.wcohen.StreamSumReducer> small-events.txt test-small: small-events.txt nb.jar time java -cp nb.jar com.wcohen.SmallStreamNB \ RCV1.small_test.txt MCAT,CCAT,GCAT,ECAT 2000 < small-events.txt \ | cut -f3 | sort | uniq -c small-events.txt: nb.jar time java -cp nb.jar com.wcohen.SmallStreamNB < RCV1.small_train.txt \ | sort -k1,1 \ | java -cp nb.jar com.wcohen.StreamSumReducer> small-events.txt test-small: small-events.txt nb.jar time java -cp nb.jar com.wcohen.SmallStreamNB \ RCV1.small_test.txt MCAT,CCAT,GCAT,ECAT 2000 < small-events.txt \ | cut -f3 | sort | uniq -c BUFFER_SIZETimeMessage Size none 1.7M words 10047s1.2M 1,00042s1.0M 10,00030s0.7M 100,00016s0.24M 1,000,00013s0.16M limit 0.05M