Naïve Bayes with Large Vocabularies (aka, stream-and-sort) Shannon Quinn


Complexity of Naïve Bayes
You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C.
For each example id, y, x1,…,xd in train:
 – C("Y=ANY")++; C("Y=y")++
 – For j in 1..d: C("Y=y ^ X=xj")++; C("Y=y ^ X=ANY")++
For each example id, y, x1,…,xd in test:
 – For each y' in dom(Y), compute
   log Pr(y',x1,…,xd) = log [ (C("Y=y'") + m·qy) / (C("Y=ANY") + m) ] + Σj log [ (C("Y=y' ^ X=xj") + m·qx) / (C("Y=y' ^ X=ANY") + m) ]
 – Return the best y'
where qx = 1/|V|, qy = 1/|dom(Y)|, m = 1
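
To make the pseudocode concrete, here is a minimal in-memory Java sketch of the two loops. The class and method names are illustrative (not from the course code); the smoothing follows the constants above (qx = 1/|V|, qy = 1/|dom(Y)|, m = 1).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the counting and scoring loops above.
public class NaiveBayesCounter {
    private final Map<String, Integer> c = new HashMap<>();  // the "event counter" C

    public void train(String y, List<String> tokens) {
        inc("Y=ANY"); inc("Y=" + y);
        for (String xj : tokens) {
            inc("Y=" + y + " ^ X=" + xj);
            inc("Y=" + y + " ^ X=ANY");
        }
    }

    // log Pr(y', x1,...,xd) with the smoothing above
    public double logScore(String y, List<String> tokens, double qx, double qy, double m) {
        double s = Math.log(count("Y=" + y) + m * qy) - Math.log(count("Y=ANY") + m);
        for (String xj : tokens)
            s += Math.log(count("Y=" + y + " ^ X=" + xj) + m * qx)
               - Math.log(count("Y=" + y + " ^ X=ANY") + m);
        return s;
    }

    private void inc(String event) { c.merge(event, 1, Integer::sum); }
    private int count(String event) { return c.getOrDefault(event, 0); }
}

At test time you call logScore once per candidate label y' and return the best one.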

SCALING TO LARGE VOCABULARIES: WHY

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Why? Consider the memory on typical (EC2-era) machine types:
 – Micro: 0.6Gb
 – Standard: S: 1.7Gb, L: 7.5Gb, XL: 15Gb
 – Hi-Memory: XXL: 34.2Gb, XXXXL: …

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Why? Zipf's law: many of the words you see, you don't see often.

[Figure (via Bruce Croft): word-frequency distribution illustrating Zipf's law]

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Why? Heaps' law: if V is the size of the vocabulary and n is the length of the corpus in words, then
  V = K·n^β
Typical constants:
 – K ≈ 1/10 … 1/100
 – β ≈ 0.4 … 0.6 (approximately square-root growth)
Why does the vocabulary keep growing? Proper names, misspellings, neologisms, …
Summary:
 – For text classification on a corpus with O(n) words, expect to use O(sqrt(n)) storage for the vocabulary.
 – Scaling might be worse for other cases (e.g., hypertext, phrases, …)

SCALING TO LARGE VOCABULARIES: POSSIBLE ANSWERS TO "HOW"

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Possible approaches:
 – Use a database? (or at least a key-value store)

Numbers (Jeff Dean says) Everyone Should Know
[Table of approximate latencies for cache, memory, disk, and network operations; the slide highlights the ratios between them (~10x, ~15x, ~100,000x, 40x). The key one here: a disk seek (~10ms) is roughly 100,000x slower than a main-memory reference (~100ns).]


Numbers (Jeff Dean says) Everyone Should Know
[The same latency table, with one caveat highlighted: the quoted disk-read figure is the best case, where the data is in the same sector/block.]

A single large file can be spread out among many non-adjacent blocks/sectors… and then you need to seek around to scan the contents of the file.

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Possible approaches:
 – Use a database? Counts are stored on disk, not in memory, so accessing a count might involve some seeks.
   Caveat: many DBs are good at caching frequently-used values, so seeks might be infrequent.
Cost: O(n·scan) → O(n·scan·seek)

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Possible approaches:
 – Use a memory-based distributed database? Now the counts live in memory rather than on disk, so the seek cost changes:
Cost: O(n·scan) → O(n·scan·???)

Counting
[Diagram: a stream of examples (example 1, example 2, example 3, …) feeds counting logic, which sends "increment C[x] by D" messages to a store: a hash table, database, etc.]

Counting
[Same diagram as above.]
 – Hashtable issue: memory is too small
 – Database issue: seeks are slow

Distributed Counting
[Diagram: Machine 0 runs the counting logic over the example stream and routes "increment C[x] by D" messages to hash tables spread across Machines 1, 2, …, K.]
Now we have enough memory….

Distributed Counting
[Same diagram.] New issues:
 – Machines and memory cost $$!
 – Routing increment requests to the right machine
 – Sending increment requests across the network
 – Communication complexity

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Possible approaches:
 – Use a memory-based distributed database?
   Extra cost: communication, O(n) … but that's "ok"
   Extra complexity: routing requests correctly
   Note: if the increment requests were ordered, seeks would not be needed!
Cost: O(n·scan) → O(n·scan + n·send)
1) Distributing data in memory across machines is not as cheap as accessing memory locally, because of communication costs.
2) The problem we're dealing with is not size. It's the interaction between size and locality: we have a large structure that's being accessed in a non-local way.

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory.
Possible approaches:
 – Use a memory-based distributed database?
   Extra cost: communication, O(n) … but that's "ok"
   Extra complexity: routing requests correctly
 – Compress the counter hash table?
   Use integers as keys instead of strings?
   Use approximate counts?
   Discard infrequent/unhelpful words?
 – Trade off time for space somehow?
   Observation: if the counter updates were better-ordered, we could avoid using disk
Great ideas, which we'll touch on later…
Cost: O(n·scan) → O(n·scan + n·send)

Large-vocabulary counting
Start with:
 – Q: what can we do for large sets quickly?
 – A: sorting
 – It's O(n log n), not much worse than linear
 – You can do it for very large datasets using a merge sort (see the sketch below):
   » sort k subsets that fit in memory,
   » merge the results, which can be done in linear time
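
As a concrete illustration of the merge step, here is a minimal Java sketch that merges k already-sorted runs in one linear pass, holding only k lines in memory. The class name and the caller-supplied readers are assumptions for illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.util.List;
import java.util.PriorityQueue;

// Merge phase of an external merge sort: k sorted runs, one size-k heap.
public class MergeRuns {
    public static void merge(List<BufferedReader> runs) throws IOException {
        // heap entries: {current line, index of the run it came from}
        PriorityQueue<String[]> heap =
            new PriorityQueue<>((a, b) -> a[0].compareTo(b[0]));
        for (int i = 0; i < runs.size(); i++) {
            String first = runs.get(i).readLine();
            if (first != null) heap.add(new String[]{first, Integer.toString(i)});
        }
        while (!heap.isEmpty()) {
            String[] smallest = heap.poll();
            System.out.println(smallest[0]);  // emit next line in sorted order
            String next = runs.get(Integer.parseInt(smallest[1])).readLine();
            if (next != null) heap.add(new String[]{next, smallest[1]});  // refill from same run
        }
    }
}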

Wikipedia on Old-School Merge Sort
Use four tape drives A, B, C, D:
1. Merge runs from A and B, writing them alternately onto C and D.
2. Merge runs from C and D, writing them alternately onto A and B.
3. And so on….
Requires only constant memory.

THE STREAM-AND-SORT DESIGN PATTERN FOR NAIVE BAYES

Large-vocabulary Naïve Bayes
Create a hashtable C
For each example id, y, x1,…,xd in train:
 – C("Y=ANY")++; C("Y=y")++
 – For j in 1..d: C("Y=y ^ X=xj")++

Large-vocabulary Naïve Bayes
Replace the in-memory counter updates with printed messages:
For each example id, y, x1,…,xd in train:
 – Print "Y=ANY += 1"
 – Print "Y=y += 1"
 – For j in 1..d: Print "Y=y ^ X=xj += 1"
Sort the event-counter update "messages"
Scan the sorted messages and compute and output the final counter values
Think of these as "messages" to another component that increments the counters:
java MyTrainer train | sort | java MyCountAdder > model
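
A sketch of what the MyTrainer stage might look like. It keeps no counts at all; it only prints one "increment" message per event to stdout. The input format (label, tab, space-separated tokens) and the tab-separated message format are assumptions, not the course's actual formats.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Trainer stage of the stream-and-sort pipeline: emit messages, hold nothing.
public class MyTrainer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);  // label<TAB>tokens...
            String y = parts[0];
            System.out.println("Y=ANY\t+=1");
            System.out.println("Y=" + y + "\t+=1");
            for (String xj : parts[1].split("\\s+"))
                System.out.println("Y=" + y + " ^ X=" + xj + "\t+=1");
        }
    }
}

Piped as java MyTrainer < train.txt | sort | java MyCountAdder > model, the sort puts all messages for the same event next to each other.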

Large-vocabulary Naïve Bayes
For each example id, y, x1,…,xd in train:
 – Print "Y=ANY += 1"
 – Print "Y=y += 1"
 – For j in 1..d: Print "Y=y ^ X=xj += 1"
Sort the event-counter update "messages"
 – We're collecting together messages about the same counter
Scan and add the sorted messages and output the final counter values
Example of the sorted message stream:
Y=business += 1
…
Y=business ^ X=aaa += 1
…
Y=business ^ X=zynga += 1
Y=sports ^ X=hat += 1
Y=sports ^ X=hockey += 1
…
Y=sports ^ X=hoe += 1
…
Y=sports += 1
…

Large-vocabulary Naïve Bayes
Scan-and-add over the sorted message stream. This is streaming: accumulating the event counts requires constant storage, as long as the input is sorted.

previousKey = Null
sumForPreviousKey = 0
For each (event, delta) in input:
  If event == previousKey:
    sumForPreviousKey += delta
  Else:
    OutputPreviousKey()
    previousKey = event
    sumForPreviousKey = delta
OutputPreviousKey()

define OutputPreviousKey():
  If previousKey != Null:
    print previousKey, sumForPreviousKey
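
Here is the same scan-and-add logic as runnable Java, a minimal sketch of the MyCountAdder stage named in the pipeline. The message format (event, tab, "+=delta") matches the trainer sketch above and is an assumption.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Scan-and-add over sorted messages: constant memory, because all
// messages for the same event are adjacent after sorting.
public class MyCountAdder {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String previousKey = null;
        long sum = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");  // event<TAB>+=delta
            String event = parts[0];
            long delta = Long.parseLong(parts[1].replace("+=", ""));
            if (event.equals(previousKey)) {
                sum += delta;
            } else {
                if (previousKey != null) System.out.println(previousKey + "\t" + sum);
                previousKey = event;
                sum = delta;
            }
        }
        if (previousKey != null) System.out.println(previousKey + "\t" + sum);
    }
}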

Distributed Counting → Stream and Sort Counting
[Diagram (before): examples feed counting logic on Machine 0, which routes "C[x] += D" messages through message-routing logic to hash tables on Machines 1, 2, …, K.]

Distributed Counting → Stream and Sort Counting
[Diagram (after): examples feed counting logic on Machine A, which emits "C[x] += D" messages; the messages are sorted on Machine B; logic on Machine C combines the sorted counter updates (C[x1] += D1, C[x1] += D2, …).]

Stream and Sort Counting → Distributed Counting
[Same pipeline, parallelized: counting logic on Machines A1,…; sort on Machines B1,…; combining on Machines C1,…]
 – Counting is trivial to parallelize!
 – Combining is easy to parallelize!
 – Standardized message-routing logic

Locality is good
Machine memory sizes, again:
 – Micro: 0.6Gb
 – Standard: S: 1.7Gb, L: 7.5Gb, XL: 15Gb
 – Hi-Memory: XXL: 34.2Gb, XXXXL: …

Large-vocabulary Naïve Bayes
For each example id, y, x1,…,xd in train:
 – Print Y=ANY += 1
 – Print Y=y += 1
 – For j in 1..d: Print Y=y ^ X=xj += 1
[Complexity: O(n), n = size of train]
Sort the event-counter update "messages"
[Complexity: O(n log n)]
Scan and add the sorted messages and output the final counter values
[Complexity: O(n)]
(Assuming a constant number of labels apply to each document.)
java MyTrainer train | sort | java MyCountAdder > model
Model size: min(O(n), O(|V||dom(Y)|))

STREAM-AND-SORT + LOCAL PARTIAL COUNTING

Today
Naïve Bayes with huge feature sets, i.e. ones that don't fit in memory.
Pros and cons of possible approaches:
 – Traditional "DB" (actually, key-value store)
 – Memory-based distributed DB
 – Stream-and-sort counting
Optimizations
Other tasks for stream-and-sort

Optimizations
java MyTrainer train | sort | java MyCountAdder > model
 – Trainer: O(n); input size n, output size n
 – Sort: O(n log n); input size n, output size n
 – Count adder: O(n); input size n, output size m, where m << n … say O(sqrt(n))
A useful optimization: decrease the size of the input to the sort. This reduces it from O(n) to O(m).
1. Compress the output by using simpler messages: "C[event] += 1" becomes "event 1"
2. Compress the output more, e.g. string → integer code (see the sketch below)
Tradeoff: ease of debugging vs. efficiency. Are messages meaningful on their own, or only in context?
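
A sketch of optimization 2, string → integer coding. The class name and the "#DICT" dictionary-message convention are illustrative assumptions; note the dictionary itself uses memory, so this only pays off when events repeat often.

import java.util.HashMap;
import java.util.Map;

// Replace event strings with small integer codes; emit each dictionary
// entry once so the final counts can be decoded later.
public class EventCoder {
    private final Map<String, Integer> codes = new HashMap<>();

    public int code(String event) {
        Integer id = codes.get(event);
        if (id == null) {
            id = codes.size();
            codes.put(event, id);
            System.out.println("#DICT\t" + id + "\t" + event);  // emit dictionary entry once
        }
        return id;
    }
}

Usage: instead of printing "Y=sports ^ X=hockey" plus "+=1", the trainer would print coder.code("Y=sports ^ X=hockey") plus "+=1".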

Optimization: partial local counting
Baseline:
For each example id, y, x1,…,xd in train:
 – Print "Y=y += 1"
 – For j in 1..d: Print "Y=y ^ X=xj += 1"
Sort the event-counter update "messages"
Scan and add the sorted messages and output the final counter values

With partial local counting:
Initialize hashtable C
For each example id, y, x1,…,xd in train:
 – C[Y=y] += 1
 – For j in 1..d: C[Y=y ^ X=xj] += 1
 – If memory is getting full: output all values from C as messages and re-initialize C
Sort the event-counter update "messages"
Scan and add the sorted messages
java MyTrainer train | sort | java MyCountAdder > model

Review: Large-vocab Naïve Bayes
Create a hashtable C
For each example id, y, x1,…,xd in train:
 – C.inc("Y=y")
 – For j in 1..d: C.inc("Y=y ^ X=xj")

class EventCounter {
  private static final int BUFFER_SIZE = 100000;  // flush threshold (value is illustrative)
  private final java.util.Map<String,Integer> hashtable = new java.util.HashMap<>();
  void inc(String event) {
    hashtable.merge(event, 1, Integer::sum);  // increment the right hashtable slot
    if (hashtable.size() > BUFFER_SIZE) {     // memory getting full: spill partial counts
      for (java.util.Map.Entry<String,Integer> e : hashtable.entrySet())
        System.out.println(e.getKey() + "\t" + e.getValue());
      hashtable.clear();
    }
  }
}

Distributed Counting → Stream and Sort Counting
[Same pipeline diagram as before, but the counting logic on Machine A now keeps a BUFFER of partial counts before emitting "C[x] += D" messages to the sort.]

How much does buffering help?
small-events.txt: nb.jar
	time java -cp nb.jar com.wcohen.SmallStreamNB < RCV1.small_train.txt \
	| sort -k1,1 \
	| java -cp nb.jar com.wcohen.StreamSumReducer > small-events.txt
test-small: small-events.txt nb.jar
	time java -cp nb.jar com.wcohen.SmallStreamNB \
	RCV1.small_test.txt MCAT,CCAT,GCAT,ECAT 2000 < small-events.txt \
	| cut -f3 | sort | uniq -c

BUFFER_SIZE   Time   Message Size
none                 1.7M words
100           47s    1.2M
1,000         42s    1.0M
10,000        30s    0.7M
100,000       16s    0.24M
1,000,000     13s    0.16M
limit                0.05M

CONFESSION: THIS NAÏVE BAYES HAS A PROBLEM….

Today
Naïve Bayes with huge feature sets, i.e. ones that don't fit in memory.
Pros and cons of possible approaches:
 – Traditional "DB" (actually, key-value store)
 – Memory-based distributed DB
 – Stream-and-sort counting
Optimizations
Other tasks for stream-and-sort
Finally: a "detail" about large-vocabulary Naïve Bayes…

Complexity of Naïve Bayes
You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C.
For each example id, y, x1,…,xd in train:
 – C("Y=y")++
 – For j in 1..d: C("Y=y ^ X=xj")++
 – ….
[Complexity: O(n), n = size of train]
For each example id, y, x1,…,xd in test:
 – For each y' in dom(Y), compute
   log Pr(y',x1,…,xd) = log [ (C("Y=y'") + m·qy) / (C("Y=ANY") + m) ] + Σj log [ (C("Y=y' ^ X=xj") + m·qx) / (C("Y=y' ^ X=ANY") + m) ]
 – Return the best y'
where qx = 1/|V|, qy = 1/|dom(Y)|, m = 1
[Complexity: O(|dom(Y)|·n'), n' = size of test]
This assumes the hashtable holding all counts fits in memory, and uses only sequential reads.

Using Large-vocabulary Naïve Bayes
For each example id, y, x1,…,xd in train: [emit counter-update messages]
Sort the event-counter update "messages"
Scan and add the sorted messages and output the final counter values
For each example id, y, x1,…,xd in test:
 – For each y' in dom(Y): compute log Pr(y',x1,…,xd)
Model size: max(O(n), O(|V||dom(Y)|))

Using Large-vocabulary Naïve Bayes [for the assignment]
For each example id, y, x1,…,xd in train: [emit counter-update messages]
Sort the event-counter update "messages"
Scan and add the sorted messages and output the final counter values
[Model size: O(|V|)]
Initialize a HashSet NEEDED and a hashtable C
For each example id, y, x1,…,xd in test:
 – Add x1,…,xd to NEEDED
[Time: O(n2), n2 = size of test; memory: same]
For each event, C(event) in the summed counters:
 – If event involves a NEEDED term x, read it into C
[Time: O(n2); memory: same]
For each example id, y, x1,…,xd in test:
 – For each y' in dom(Y): compute log Pr(y',x1,…,xd)
[Time: O(n2); memory: same]
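
A Java sketch of this NEEDED-set filtering. The file formats (label, tab, tokens for test data; event, tab, count for the summed counters) are assumptions consistent with the earlier sketches, and the class name is illustrative.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Collect the terms the test set NEEDs, then keep only matching counters.
public class NeededCounts {
    public static Map<String, Long> load(String testFile, String countsFile) throws Exception {
        Set<String> needed = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader(testFile))) {
            String line;
            while ((line = in.readLine()) != null)  // label<TAB>tokens...
                for (String x : line.split("\t", 2)[1].split("\\s+"))
                    needed.add(x);
        }
        Map<String, Long> c = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(countsFile))) {
            String line;
            while ((line = in.readLine()) != null) {  // event<TAB>count
                String[] parts = line.split("\t");
                String event = parts[0];
                int i = event.indexOf("X=");
                String term = (i < 0) ? null : event.substring(i + 2);
                // keep label-only events, X=ANY events, and events whose term is NEEDED
                if (term == null || term.equals("ANY") || needed.contains(term))
                    c.put(event, Long.parseLong(parts[1]));
            }
        }
        return c;
    }
}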

Large-Vocabulary Naïve Bayes
Learning/Counting options:
 – Counts on disk with a key-value store
 – Counts as messages to a set of distributed processes
 – Repeated scans to build up partial counts
 – Counts as messages in a stream-and-sort system
 – Assignment: counts as messages, but buffered in memory
Using Counts options:
 – Put counts in a database
 – Use partial counts and repeated scans of the test data?
 – Re-organize the counts and test set so that you can classify in a stream
 – Assignment: scan through counts to find those needed for the test set, then classify with counts in memory

LOOKING AHEAD TO PARALLELIZATION…

Stream and Sort Counting → Distributed Counting
[Recap diagram: counting logic on Machines A1,…; sort on Machines B1,…; combining on Machines C1,…]
 – Counting is trivial to parallelize!
 – Combining is easy to parallelize!
 – Standardized message-routing logic

Stream and Sort Counting → Distributed Counting
[Diagram: the counting logic on Machines A1,… writes sorted spill files (Spill 1, Spill 2, Spill 3, …); the spill files are merged; the combined counter updates (C[x1] += D1, C[x1] += D2, …) flow to Machines C1,…]

Stream and Sort Counting → Distributed Counting
[Diagram: a Counter Machine runs the counting logic and sort, producing spill files; a Combiner Machine merges the spill files and combines the counter updates.]

Stream and Sort Counting → Distributed Counting
[Diagram: Counter Machines 1 and 2 each run counting logic over their share of the examples and partition/sort their messages into spill files; Combiner Machines 1 and 2 each merge the spill files for their partition and combine the counter updates.]

Assignment 1
Memory-restricted Streaming Naïve Bayes
 – Out: tomorrow!
 – Due: Friday, Sept 4, 11:59:59pm
 – Java JVM memory usage cannot exceed 128M
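
As a usage note: one way to enforce the 128M limit when testing locally is the JVM's -Xmx flag. The class and file names below come from the earlier timing example; the pipeline itself is a sketch, not the assignment's required invocation.

java -Xmx128m -cp nb.jar com.wcohen.SmallStreamNB < RCV1.small_train.txt \
| sort -k1,1 \
| java -Xmx128m -cp nb.jar com.wcohen.StreamSumReducer > small-events.txt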

Assignment 1
Memory-restricted Streaming Naïve Bayes
Dataset:
 – Reuters corpus
 – News topics split into a hierarchy of categories
 – Given the text, can you predict the category?