Naïve Bayes for Large Vocabularies William W. Cohen

Summary to date
– Computational complexity: what and how to count
  – Memory vs. disk access
  – Cost of scanning vs. seeks, for disk (and memory)
– Probability review
– Classification with a "density estimator"
– Naïve Bayes as a density estimator/classifier
– How to evaluate classifiers and density estimators
– How to implement Naïve Bayes
  – Time is linear in the size of the data (one scan!)
  – Assuming the event counters fit in memory
  – We need to count C(Y=label), C(X=word ^ Y=label), C(X=word)

Today
– Naïve Bayes with huge feature sets, i.e. ones that don't fit in memory
– Pros and cons of possible approaches
  – Traditional "DB" (actually, a key-value store)
  – Memory-based distributed DB
  – Stream-and-sort counting
– Optimizations
– Other tasks for stream-and-sort
– A detail about large-vocabulary Naïve Bayes…

Complexity of Naïve Bayes
You have a train dataset and a test dataset.
Initialize an "event counter" (hashtable) C.
For each example id, y, x_1,…,x_d in train:
  – C("Y=ANY")++; C("Y=y")++
  – For j in 1..d: C("Y=y ^ X=x_j")++
Complexity: O(n), n = size of train. Sequential reads; assume the hashtable holding all counts fits in memory.
For each example id, y, x_1,…,x_d in test:
  – For each y' in dom(Y), compute

    log Pr(y', x_1,…,x_d) =
        [ sum_{j=1..d} log( (C(X=x_j ^ Y=y') + m*q_j) / (C(X=ANY ^ Y=y') + m) ) ]
      + log( (C(Y=y') + m*q_y) / (C(Y=ANY) + m) )

    where q_j = 1/|V|, q_y = 1/|dom(Y)|, m = 1, and C(X=ANY ^ Y=y') is the total word count in class y' (derivable from the word counters).
  – Return the best y'.
Complexity: O(|dom(Y)| * n'), n' = size of test.
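A minimal in-memory sketch of the loops above (Python; names are illustrative, and the per-class word total C(X=ANY ^ Y=y) is maintained as an extra counter here):

  from collections import defaultdict
  from math import log

  def train(examples):
      """examples: iterable of (label, [word1, ..., wordd]) pairs."""
      C = defaultdict(int)
      for y, xs in examples:
          C["Y=ANY"] += 1
          C[f"Y={y}"] += 1
          for x in xs:
              C[f"Y={y} ^ X={x}"] += 1
              C[f"Y={y} ^ X=ANY"] += 1   # per-class word total: the word-term denominator
      return C

  def classify(C, xs, labels, V, m=1.0):
      """Return argmax_y' log Pr(y', x_1,...,x_d) with the smoothing above."""
      qj, qy = 1.0 / V, 1.0 / len(labels)
      def score(y):
          s = log((C[f"Y={y}"] + m * qy) / (C["Y=ANY"] + m))
          for x in xs:
              s += log((C[f"Y={y} ^ X={x}"] + m * qj) / (C[f"Y={y} ^ X=ANY"] + m))
          return s
      return max(labels, key=score)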

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Why worry? Even big machines have limited RAM (e.g., EC2 instance memory sizes):
– Micro: 0.6 Gb
– Standard: S: 1.7 Gb, L: 7.5 Gb, XL: 15 Gb
– Hi-Memory: XXL: 34.2 Gb, XXXXL: 68.4 Gb

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Why is the vocabulary so big?
– Zipf's law: many of the words you see, you don't see often.

[Figure: word frequency vs. rank on a log-log scale, illustrating Zipf's law. Via Bruce Croft]

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Why does the vocabulary keep growing?
Heaps' Law: if V is the size of the vocabulary and n is the length of the corpus in words, then

  V = K * n^β

Typical constants:
– K ≈ 1/10 to 1/100
– β ≈ 0.4 to 0.6 (approximately square-root growth)
Why? Proper names, misspellings, neologisms, …
Summary:
– For text classification on a corpus with O(n) words, expect to use O(sqrt(n)) storage for the vocabulary.
– Scaling might be worse in other cases (e.g., hypertext, phrases, …).
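Since Heaps' law is a power law, a two-line check shows how slowly the vocabulary grows for the exponents quoted above (Python; purely illustrative):

  # With beta ~ 0.5, vocabulary grows roughly like sqrt(n):
  # growing the corpus 100x grows the vocabulary only ~10x.
  for beta in (0.4, 0.5, 0.6):
      growth = 100 ** beta
      print(f"beta = {beta}: a 100x larger corpus -> ~{growth:.0f}x larger vocabulary")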

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Possible approaches:
– Use a database? (or at least a key-value store)

Numbers (Jeff Dean says) Everyone Should Know
[The slide shows Jeff Dean's table of latency numbers, annotated with the relative costs (~10x, ~15x, ~100,000x, 40x) of the highlighted operations; the key gap is a disk seek (~10ms) vs. a main-memory reference (~100ns), about 100,000x.]

Using a database for Big ML
We often want to do random access on big data:
– E.g., different versions of examples for Q/A
– E.g., spot-checking parameter weights to see if they are sensible
Simplest approach:
– Sort the data and use binary search → O(log2 n) seeks to find the query row

Using a database for Big ML
We often want to do random access on big data:
– E.g., different versions of examples for Q/A
– E.g., spot-checking parameter weights to see if they are sensible
An almost-as-simple idea, based on the fact that one disk seek costs about as much time as reading 1Mb sequentially:
– Let K = rows per Mb (e.g., K = 1000)
– Scan through the data once and record the seek position of every K-th row in an index file (or in memory); the index is cheap, since it is 1/1000 the size of the data
– To find row r: find r', the last item in the index no larger than r; seek to r', then read the next megabyte. Cost is ~2 seeks. (A sketch follows below.)
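A minimal sketch of such an indexed store over a file of sorted key<TAB>value lines (Python; the file format and spacing constant are illustrative):

  import bisect

  def build_index(path, every=1000):
      """Scan once, recording the byte offset of every `every`-th row."""
      index = []                                   # (first_key_in_block, byte_offset)
      with open(path, "rb") as f:
          pos = 0
          for i, line in enumerate(f):
              if i % every == 0:
                  index.append((line.split(b"\t", 1)[0], pos))
              pos += len(line)
      return index

  def lookup(path, index, key):
      """Find the last index entry <= key (in memory), then one seek
      plus a short sequential read."""
      keys = [k for k, _ in index]
      j = bisect.bisect_right(keys, key) - 1
      if j < 0:
          return None
      with open(path, "rb") as f:
          f.seek(index[j][1])                      # the single disk seek
          for line in f:
              k, _, v = line.partition(b"\t")
              if k == key:
                  return v.rstrip(b"\n")
              if k > key:                          # sorted file: we passed it
                  return None
      return None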

A simple indexed store (e.g., Hadoop MapFile)

Data on disk (sorted by key, ~1Mb per block):

  Key                       Value
  Y=sports                  134,242
  Y=sports^X=aardvark       2
  Y=sports^X=aaron          7,212
  …
  Y=sports^X=bengals        5,432
  …
  Y=sports^X=zymurgist      7
  Y=worldNews               345,007
  Y=worldNews^…             …

Index in memory (one entry per block):

  Key                       Ptr
  Y=sports                  0
  Y=sports^X=bengals        …
  Y=sports^X=zymurgist      …
  …

Accessing r:
– Scan the index to find the last r' <= r (no disk access)
– Seek to r''s position in the DB (10ms)
– Read the next 1Mb sequentially (10ms)

Using a database for Big ML
Summary: we've gone from ~1 seek (best possible) to ~2 seeks, plus finding r' in the index.
– If the index is O(1Mb), then finding r' is also like 1 seek, so we're paying about 3 seeks per random access into a Gb.
What if the index is still large?
– Build the same sort of index for the index! Now we're paying 4 seeks for each random access into a Tb…
– …and repeat recursively if you need to.
This is called a B-tree. It only gets complicated when we want to delete and insert.

Numbers (Jeff Dean says) Everyone Should Know
[The same latency table, with a note that these numbers are the best case: the data being read is in the same sector/block.]

A single large file can be spread out among many non-adjacent blocks/sectors… and then you need to seek around to scan the contents of the file. Question: what could you do to reduce this cost?

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Possible approaches:
– Use a database? Counts are stored on disk, not in memory, so accessing a count might involve some seeks.
  – Caveat: many DBs are good at caching frequently-used values, so seeks might be infrequent.
Cost: O(n*scan) → O(n*scan*4*seek)

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Possible approaches:
– Use a memory-based distributed database? Counts now live in memory, spread across machines, so the disk seeks go away, but each access crosses the network.
Cost: O(n*scan) → O(n*scan*???)

Counting
[Diagram: a stream of examples (example 1, example 2, example 3, …) feeds counting logic, which sends "increment C[x] by D" messages to a store: a hash table, database, etc.]

Counting
[Same diagram as above.]
– Hashtable issue: memory is too small.
– Database issue: seeks are slow.

Distributed Counting
[Diagram: Machine 0 runs the counting logic over the example stream and routes "increment C[x] by D" messages to hash tables partitioned across Machines 1…K.]
Now we have enough memory…

Distributed Counting
[Same diagram as above.]
New issues:
– Machines and memory cost $$!
– Routing increment requests to the right machine
– Sending increment requests across the network
– Communication complexity

Numbers (Jeff Dean says) Everyone Should Know
[The same latency table again: network round-trips, like disk seeks, are orders of magnitude more expensive than local memory access.]

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Possible approaches:
– Use a memory-based distributed database?
  – Extra cost: communication, O(n)… but that's "ok"
  – Extra complexity: routing requests correctly
  – Note: if the increment requests were ordered, seeks would not be needed!
Cost: O(n*scan) → O(n*scan + n*send)
Two lessons:
1. Distributing data in memory across machines is not as cheap as accessing memory locally, because of communication costs.
2. The problem we're dealing with is not size. It's the interaction between size and locality: we have a large structure that's being accessed in a non-local way.

What's next
How to implement Naïve Bayes, assuming the event counters do not fit in memory. Possible approaches:
– Use a memory-based distributed database?
  – Extra cost: communication, O(n)… but that's "ok"
  – Extra complexity: routing requests correctly
– Compress the counter hash table?
  – Use integers as keys instead of strings?
  – Use approximate counts?
  – Discard infrequent/unhelpful words?
  (Great ideas, which we'll discuss more later.)
– Trade off time for space somehow?
  – Observation: if the counter updates were better-ordered, we could avoid using disk.
Cost: O(n*scan) → O(n*scan + n*send)

Large-vocabulary Naïve Bayes (counting)
One way to trade off time for space:
– Assume you need K times as much memory as you actually have.
– Method:
  – Construct a hash function h(event).
  – For i = 0,…,K-1:
    – Scan through the train dataset.
    – Increment counters for an event only if h(event) mod K == i.
    – Save this counter set to disk at the end of the scan.
– After K scans you have a complete counter set.
Comments:
– This works for any counting task, not just Naïve Bayes.
– What we're really doing here is organizing our "messages" to get more locality. (A sketch follows below.)
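A minimal sketch of the K-pass trick (Python; the hash choice, file names, and pickle format are illustrative):

  import pickle
  from collections import defaultdict

  def count_in_k_passes(read_events, K, out_prefix="counts"):
      """read_events() re-reads the training data, yielding one event string
      per occurrence. Each of the K passes keeps only the events whose hash
      falls in the current slice, so peak memory is roughly 1/K of a full
      counter table."""
      for i in range(K):
          C = defaultdict(int)
          for event in read_events():          # one full scan of train per pass
              if hash(event) % K == i:         # consistent within one process
                  C[event] += 1
          with open(f"{out_prefix}.{i}", "wb") as f:
              pickle.dump(dict(C), f)          # flush this slice to disk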

Large-vocabulary counting
Another approach: start with the question "what can we do for large sets quickly?" Answer: sorting.
– It's O(n log n), not much worse than linear.
– You can do it for very large datasets using a merge sort:
  – sort k subsets that fit in memory,
  – merge the results, which can be done in linear time.
(A sketch follows below.)
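A minimal external merge sort in the spirit of the slide (Python; the chunk size and temp-file handling are illustrative):

  import heapq, itertools, tempfile

  def external_sort(lines, chunk_size=1_000_000):
      """Sort an iterator of '\n'-terminated lines that may not fit in memory:
      sort chunks that do fit, spill each to a temp file, then k-way merge."""
      runs = []
      it = iter(lines)
      while True:
          chunk = sorted(itertools.islice(it, chunk_size))   # in-memory sort
          if not chunk:
              break
          f = tempfile.TemporaryFile(mode="w+")
          f.writelines(chunk)
          f.seek(0)
          runs.append(f)
      return heapq.merge(*runs)    # linear-time merge of the sorted runs

This is essentially what the Unix sort command does for inputs larger than RAM.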

Alternative visualization
[Figure only; no transcript text.]

Large-vocabulary Naïve Bayes
Create a hashtable C.
For each example id, y, x_1,…,x_d in train:
– C("Y=ANY")++; C("Y=y")++
– For j in 1..d: C("Y=y ^ X=x_j")++

Large-vocabulary Naïve Bayes
Instead of incrementing counters, print the increments. Think of these as "messages" to another component that maintains the counters.
For each example id, y, x_1,…,x_d in train:
– Print "Y=ANY += 1"
– Print "Y=y += 1"
– For j in 1..d: Print "Y=y ^ X=x_j += 1"
Sort the event-counter update "messages".
Scan the sorted messages and compute and output the final counter values.

  java MyTrainer train | sort | java MyCountAdder > model
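A minimal Python stand-in for the MyTrainer stage (the input format, one tab-separated example per line, is an assumption):

  import sys

  # Reads examples of the form: id<TAB>label<TAB>word1 word2 ... wordd
  # and emits one counter-update message per line, ready to pipe into `sort`.
  for line in sys.stdin:
      _id, y, words = line.rstrip("\n").split("\t")
      print("Y=ANY += 1")
      print(f"Y={y} += 1")
      for x in words.split():
          print(f"Y={y} ^ X={x} += 1")

Usage, with hypothetical script and file names: python my_trainer.py < train.tsv | sort | python my_count_adder.py > model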

Large-vocabulary Naïve Bayes
(Same algorithm as above.) Sorting the event-counter update "messages" collects together all the messages about the same counter:

  Y=business += 1
  …
  Y=business ^ X=aaa += 1
  …
  Y=business ^ X=zynga += 1
  Y=sports ^ X=hat += 1
  Y=sports ^ X=hockey += 1
  …
  Y=sports ^ X=hoe += 1
  …
  Y=sports += 1
  …

Then scan and add the sorted messages and output the final counter values.

Large-vocabulary Naïve Bayes
Scan-and-add, run over the sorted messages shown above. Accumulating the event counts requires constant storage, as long as the input is sorted; this is streaming.

  previousKey = Null
  sumForPreviousKey = 0
  for each (event, delta) in input:
      if event == previousKey:
          sumForPreviousKey += delta
      else:
          OutputPreviousKey()
          previousKey = event
          sumForPreviousKey = delta
  OutputPreviousKey()

  define OutputPreviousKey():
      if previousKey != Null:
          print previousKey, sumForPreviousKey

Distributed Counting → Stream and Sort Counting
[Diagram: as in the distributed-counting picture, Machine 0 runs the counting logic and emits "C[x] += D" messages, but message-routing logic now stands in front of the partitioned hash tables on Machines 1…K.]

Distributed Counting → Stream and Sort Counting
[Diagram: Machine A runs the counting logic, emitting "C[x] += D" messages; Machine B sorts them (C[x1] += D1, C[x1] += D2, …); Machine C runs the logic to combine counter updates.]

Stream and Sort Counting → Distributed Counting
[Diagram: the same pipeline, parallelized: counting logic on Machines A1,…; sort on Machines B1,…; logic to combine counter updates on Machines C1,… .]
– The counting logic is trivial to parallelize; combining is easy to parallelize; the message-routing logic is standardized.

Locality is good
(Recall the machine memory sizes:)
– Micro: 0.6 Gb
– Standard: S: 1.7 Gb, L: 7.5 Gb, XL: 15 Gb
– Hi-Memory: XXL: 34.2 Gb, XXXXL: 68.4 Gb

Large-vocabulary Naïve Bayes

  java MyTrainer train | sort | java MyCountAdder > model

For each example id, y, x_1,…,x_d in train:
– Print "Y=ANY += 1"
– Print "Y=y += 1"
– For j in 1..d: Print "Y=y ^ X=x_j += 1"
(Complexity: O(n), n = size of train.)
Sort the event-counter update "messages".
(Complexity: O(n log n).)
Scan and add the sorted messages and output the final counter values.
(Complexity: O(n) time; the output is the model, of size at most O(n) and at most O(|V||dom(Y)|), assuming a constant number of labels apply to each document.)

Today
– Naïve Bayes with huge feature sets, i.e. ones that don't fit in memory
– Pros and cons of possible approaches
  – Traditional "DB" (actually, a key-value store)
  – Memory-based distributed DB
  – Stream-and-sort counting
– Optimizations
– Other tasks for stream-and-sort

Optimizations

  java MyTrainer train | sort | java MyCountAdder > model

Stage costs: the trainer is O(n), input size n, output size n; the sort is O(n log n), input size n, output size n; the adder is O(n), input size n, output size m, where m << n… say O(sqrt(n)).
A useful optimization: decrease the size of the input to the sort, reducing it from O(n) to O(m).
1. Compress the output by using simpler messages: "C[event] += 1" → "event 1"
2. Compress the output more, e.g. replace each string with an integer code.
Tradeoff: ease of debugging vs. efficiency. Are the messages meaningful on their own, or only meaningful in context? (A sketch of the integer-coding idea follows below.)
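A hedged sketch of the second compression idea, a string-to-integer dictionary applied to the message stream (Python; the encoding scheme and side-file name are illustrative):

  import sys

  # Replace each event string in the message stream with a small integer code
  # before sorting; the code table is written to a side file so the adder's
  # output can be decoded back to strings. The code table needs O(|V|) memory,
  # which by Heaps' law is far smaller than the O(n) message stream.
  codes = {}
  for line in sys.stdin:
      event = line.rstrip("\n").rsplit(" += ", 1)[0]
      print(f"{codes.setdefault(event, len(codes))} 1")   # "event += 1" -> "<int> 1"
  with open("codes.tsv", "w") as f:                       # hypothetical side file
      for event, code in codes.items():
          f.write(f"{code}\t{event}\n")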

Optimization…
Instead of printing one message per event occurrence, keep a hashtable C in memory and flush it as messages whenever memory fills.

Before:
  For each example id, y, x_1,…,x_d in train:
  – Print "Y=ANY += 1"
  – Print "Y=y += 1"
  – For j in 1..d: Print "Y=y ^ X=x_j += 1"
  Sort the event-counter update "messages"; scan and add the sorted messages and output the final counter values.

After:
  Initialize hashtable C
  For each example id, y, x_1,…,x_d in train:
  – C[Y=ANY] += 1
  – C[Y=y] += 1
  – For j in 1..d: C[Y=y ^ X=x_j] += 1
  – If memory is getting full: output all values from C as messages and re-initialize C
  Sort the event-counter update "messages"; scan and add the sorted messages.

  java MyTrainer train | sort | java MyCountAdder > model

(A sketch follows below.)
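A minimal sketch of the buffered trainer (Python; the buffer limit and input format are illustrative assumptions):

  import sys
  from collections import defaultdict

  MAX_KEYS = 1_000_000          # illustrative memory budget

  def flush(C):
      for event, delta in C.items():
          print(f"{event} += {delta}")
      C.clear()

  C = defaultdict(int)
  for line in sys.stdin:        # id<TAB>label<TAB>word1 word2 ...
      _id, y, words = line.rstrip("\n").split("\t")
      C["Y=ANY"] += 1
      C[f"Y={y}"] += 1
      for x in words.split():
          C[f"Y={y} ^ X={x}"] += 1
      if len(C) > MAX_KEYS:     # "memory is getting full"
          flush(C)
  flush(C)                      # emit whatever is left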

Some other stream and sort tasks
Coming up: classify Wikipedia pages.
– Features:
  – words on page: src w1 w2 …
  – outlinks from page: src dst1 dst2 …
  – how about inlinks to the page?

Some other stream and sort tasks
Computing inlinks from the outlinks (src dst1 dst2 …):
– Algorithm:
  – For each input line src dst1 dst2 … dstn, print out:
    – dst1 inlinks.= src
    – dst2 inlinks.= src
    – …
    – dstn inlinks.= src
  – Sort this output.
  – Collect the messages and group them to get: dst src1 src2 … srcn
(A sketch of the emit step follows below.)
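A minimal Python version of the emit step (input format as in the slide, whitespace-separated):

  import sys

  # Each input line: src dst1 dst2 ... dstn
  # Emit one "dst inlinks.= src" message per outlink; piping through
  # `sort` then groups all the inlinks of the same dst together.
  for line in sys.stdin:
      src, *dsts = line.split()
      for dst in dsts:
          print(f"{dst} inlinks.= {src}")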

Some other stream and sort tasks
The scan-and-add logic for counting and the logic for collecting inlinks have the same shape:

Counting:
  prevKey = Null
  sumForPrevKey = 0
  for each (event += delta) in input:
      if event == prevKey:
          sumForPrevKey += delta
      else:
          OutputPrevKey()
          prevKey = event
          sumForPrevKey = delta
  OutputPrevKey()

  define OutputPrevKey():
      if prevKey != Null:
          print prevKey, sumForPrevKey

Collecting inlinks:
  prevKey = Null
  linksToPrevKey = []
  for each (dst inlinks.= src) in input:
      if dst == prevKey:
          linksToPrevKey.append(src)
      else:
          OutputPrevKey()
          prevKey = dst
          linksToPrevKey = [src]
  OutputPrevKey()

  define OutputPrevKey():
      if prevKey != Null:
          print prevKey, linksToPrevKey

Some other stream and sort tasks
What if we run this same program on the words on a page?
– Features:
  – words on page: src w1 w2 …
  – outlinks from page: src dst1 dst2 …
Out2In.java then gives: w1 src1,1 src1,2 src1,3 …; w2 src2,1 …; and so on: an inverted index for the documents!


Some other stream and sort tasks Later on: distributional clustering of words

Some other stream and sort tasks
Later on: distributional clustering of words.
Algorithm:
– For each word w_i in a corpus, print w_i and the words in a window around it:
  – Print "w_i context.= (w_{i-k},…,w_{i-1},w_{i+1},…,w_{i+k})"
– Sort the messages and collect all contexts for each w, thus creating an instance associated with w.
– Cluster the dataset, or train a classifier and classify it.
(A sketch of the emit step follows below.)
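A minimal emit step for the context messages (Python; the window size and whitespace tokenization are illustrative):

  import sys

  K = 2   # illustrative window size
  for line in sys.stdin:
      words = line.split()
      for i, w in enumerate(words):
          ctx = words[max(0, i - K):i] + words[i + 1:i + 1 + K]
          print(f"{w} context.= ({','.join(ctx)})")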

Some other stream and sort tasks
The same skeleton as the counting logic shown earlier, now collecting the contexts of each word:

  prevKey = Null
  ctxOfPrevKey = []
  for each (w context.= (w1,…,wk)) in input:
      if w == prevKey:
          ctxOfPrevKey.append((w1,…,wk))
      else:
          OutputPrevKey()
          prevKey = w
          ctxOfPrevKey = [(w1,…,wk)]
  OutputPrevKey()

  define OutputPrevKey():
      if prevKey != Null:
          print prevKey, ctxOfPrevKey

Some other stream and sort tasks
Finding unambiguous geographical names.
GeoNames.org: for each place in its database, it stores:
– several alternative names
– latitude/longitude
– …
This lets you put places on a map (e.g., Google Maps).
Problem: many names are ambiguous, especially if you allow an approximate match. Paris, London, … even Carnegie Mellon.

[Map figure: examples of ambiguous place names, e.g. "Point Park (College|University)" and "Carnegie Mellon [University [School]]".]

Some other stream and sort tasks
Finding almost-unambiguous geographical names.
GeoNames.org: for each place in the database, print all plausible soft-match substrings of each alternative name, paired with the lat/long, e.g.:

  Carnegie Mellon University at lat1,lon1
  Carnegie Mellon at lat1,lon1
  Mellon University at lat1,lon1
  Carnegie Mellon School at lat2,lon2
  Carnegie Mellon at lat2,lon2
  Mellon School at lat2,lon2
  …

Then sort and collect… and filter.

Some other stream and sort tasks
The same skeleton once more, now fitting a Gaussian to the locations reported for each name, and keeping only the names whose locations cluster tightly:

  prevKey = Null
  locOfPrevKey = Gaussian()
  for each (place at lat,lon) in input:
      if place == prevKey:
          locOfPrevKey.observe(lat, lon)
      else:
          OutputPrevKey()
          prevKey = place
          locOfPrevKey = Gaussian()
          locOfPrevKey.observe(lat, lon)
  OutputPrevKey()

  define OutputPrevKey():
      if prevKey != Null and locOfPrevKey.stdDev() < 1 mile:
          print prevKey, locOfPrevKey.avg()
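The Gaussian() helper above can be a simple online mean/variance accumulator; a hedged sketch using Welford's algorithm (Python; treating lat/lon as independent coordinates, and leaving the degrees-to-miles conversion for the "< 1 mile" test to the caller):

  import math

  class Gaussian:
      """Online mean/variance of observed (lat, lon) points, per coordinate."""
      def __init__(self):
          self.n = 0
          self.mean = [0.0, 0.0]
          self.m2 = [0.0, 0.0]          # sums of squared deviations (Welford)

      def observe(self, lat, lon):
          self.n += 1
          for i, x in enumerate((lat, lon)):
              d = x - self.mean[i]
              self.mean[i] += d / self.n
              self.m2[i] += d * (x - self.mean[i])

      def avg(self):
          return tuple(self.mean)

      def std_dev(self):                # pooled std dev over both coordinates, in degrees
          if self.n < 2:
              return 0.0
          return math.sqrt(sum(self.m2) / (2 * (self.n - 1)))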

Today
– Naïve Bayes with huge feature sets, i.e. ones that don't fit in memory
– Pros and cons of possible approaches
  – Traditional "DB" (actually, a key-value store)
  – Memory-based distributed DB
  – Stream-and-sort counting
– Optimizations
– Other tasks for stream-and-sort
– A detail about large-vocabulary Naïve Bayes…

Complexity of Naïve Bayes (recap)
Training: one sequential scan over train, incrementing C("Y=ANY"), C("Y=y"), and C("Y=y ^ X=x_j"); complexity O(n), n = size of train, assuming the hashtable holding all counts fits in memory.
Testing: for each test example and each y' in dom(Y), compute log Pr(y', x_1,…,x_d) as before (with q_j = 1/|V|, q_y = 1/|dom(Y)|, m = 1) and return the best y'; complexity O(|dom(Y)| * n'), n' = size of test.

Using large-vocabulary Naïve Bayes
Train (stream-and-sort, as before):
– For each example id, y, x_1,…,x_d in train: print the event-counter update "messages".
– Sort the event-counter update "messages".
– Scan and add the sorted messages and output the final counter values.
Model size: at most O(n), at most O(|V||dom(Y)|).
Test:
– For each example id, y, x_1,…,x_d in test:
  – For each y' in dom(Y): compute log Pr(y', x_1,…,x_d) from the counters, as before.

Using large-vocabulary Naïve Bayes [for the assignment]
Train as before: print, sort, and scan-and-add the event-counter update "messages". Model size: O(|V|).
Test, when the model does not fit in memory either:
– Initialize a HashSet NEEDED and a hashtable C.
– For each example id, y, x_1,…,x_d in test: add x_1,…,x_d to NEEDED. (Time: O(n'), n' = size of test; memory: same.)
– For each (event, C(event)) in the summed counters: if the event involves a NEEDED term x, read it into C. (Time: O(n'); memory: same.)
– For each example id, y, x_1,…,x_d in test:
  – For each y' in dom(Y): compute log Pr(y', x_1,…,x_d) as before. (Time: O(n'); memory: same.)
(A sketch follows below.)
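A hedged sketch of the assignment's trick, loading only the counters that the test set needs (Python; the counter file format is an assumption):

  def load_needed_counts(counts_path, test_examples):
      """Only pull into memory the counters mentioning some test-set word."""
      needed = set()
      for _y, xs in test_examples:          # pass 1: collect the test vocabulary
          needed.update(xs)
      C = {}
      with open(counts_path) as f:          # summed counters, one "event<TAB>count" per line
          for line in f:
              event, count = line.rstrip("\n").split("\t")
              # class-prior events (no "X=") are always kept; word events only if needed
              if "X=" not in event or event.split("X=", 1)[1] in needed:
                  C[event] = int(count)
      return C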

Large-Vocabulary Naïve Bayes: summary

Learning/Counting:
– Counts on disk with a key-value store
– Counts as messages to a set of distributed processes
– Repeated scans to build up partial counts
– Counts as messages in a stream-and-sort system
– Assignment: counts as messages, but buffered in memory

Using Counts:
– Assignment: scan through the counts to find those needed for the test set, then classify with those counts in memory
– Put counts in a database
– Use partial counts and repeated scans of the test data?
– Re-organize the counts and test set so that you can classify in a stream