1 Naïve Bayes with Large Feature Sets: (1) Stream and Sort, (2) Request and Answer Pattern, (3) Rocchio's Algorithm
COSC 526, Class 4
Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266, E-mail: ramanathana@ornl.gov
2 Verification, Validation and Uncertainty Quantification of Machine Learning Algorithms: Phase I Demonstration
Slides pilfered from… William Cohen's class (10-605), Andrew Moore's class (10-601), Shannon Quinn's class (CSCI-6900), Jure Leskovec (mmds.org), and more (attributed as they come)!
3 Recalling Naïve Bayes
You have a train dataset and a test dataset. Initialize an "event counter" (hashtable) C.
For each example id, y, x1,…,xd in train:
–C("Y=ANY")++; C("Y=y")++
–For j in 1..d: C("Y=y ^ X=xj")++
For each example id, y, x1,…,xd in test:
–For each y' in dom(Y): compute log Pr(y', x1,…,xd)
–Return the best y'
where qj = 1/|V|, qy = 1/|dom(Y)|, m = 1.
Assumptions we made in last class: the hashtable will fit in memory, and the vocabulary (the words seen while learning) is limited.
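The counting loop above can be sketched in runnable form. This is a minimal in-memory version; the (label, words) tuple shape and function name are illustrative, not from the slides:

```python
from collections import Counter

def train_counts(train):
    """Accumulate the Naive Bayes event counts in an in-memory hashtable.
    `train` is an iterable of (label, words) pairs, a simplification of
    the (id, y, x1..xd) examples on the slide."""
    C = Counter()
    for y, words in train:
        C["Y=ANY"] += 1
        C[f"Y={y}"] += 1
        for x in words:
            C[f"Y={y} ^ X={x}"] += 1
    return C

C = train_counts([("sports", ["hockey", "hat"]), ("business", ["zynga"])])
```

The rest of the lecture asks what happens when this Counter no longer fits in memory.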
4 Implementing Naïve Bayes with Big Data
Assume the hashtable/counters do not fit into memory.
Motivation, Zipf's law: most of the words you see occur only rarely, so the vocabulary (and the counter table) keeps growing, increasing costs…
5 (Zipf's law plot) Via Bruce Croft
6 Implementing Naïve Bayes with constraints
Assume the hashtable/counters do not fit into memory.
Motivation, Heaps' law: if V is the size of the vocabulary and n is the length of the corpus in words, then V = K·n^β, with K typically between 10 and 100 and β between 0.4 and 0.6 for English text. (source: Wikipedia, Heaps' law)
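Heaps' law can be sketched directly. The constants below are illustrative mid-range values, not measurements:

```python
def heaps_vocab(n, K=30.0, beta=0.5):
    """Heaps' law: expected vocabulary size V = K * n**beta for a corpus
    of n tokens. K and beta here are illustrative mid-range values."""
    return K * n ** beta

# With beta = 0.5, vocabulary grows like the square root of corpus size:
v = heaps_vocab(1_000_000)
```

So a 100x bigger corpus only gives a ~10x bigger vocabulary, which is the O(√n) storage claim on the next slide.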
7 Storage and Compute Requirements
For text classification on a corpus of O(n) words, expect to use O(√n) storage for the vocabulary.
What about other features? Hypertext, phrases, keywords…
8 Possible Approaches…
Use a relational database, or a key-value store.
Numbers that every CS person should know, according to Jeff Dean (accessing data: times):
–L1 cache reference: 0.5 ns
–Branch mispredict: 5 ns
–L2 cache reference: 7 ns
–Mutex lock/unlock: 100 ns
–Main memory reference: 100 ns
–Compress 1 KB with Zippy: 10,000 ns
–Send 2 KB over a 1 Gbps network: 20,000 ns
–Read 1 MB sequentially from memory: 250,000 ns
–Transfer and copy data within a datacenter: 500,000 ns
–Disk seek: 10,000,000 ns
–Read 1 MB sequentially from network: 10,000,000 ns
–Read 1 MB sequentially from disk: 30,000,000 ns
–Send packet CA → Netherlands → CA: 150,000,000 ns
9 Using a relational DB for Big ML
We'd prefer random access to big data: e.g., for checking whether parameter weights are sensible.
Simplest approach: sort the data and use binary search.
–Advantage: O(log2 n) to access a query row
–Let K = rows per MB. Scan the data once and record the seek position of every K-th row in an index file (or in memory)
–To find a row r: find r', the last item in the index smaller than r; seek to r' and read the next MB
–Cheap, since the index is size n/1000; the cost is ~2 seeks
10 Using a relational DB for Big ML
Access went from ~1 seek (the best possible) to ~2 seeks plus finding r' in the index.
–If the index is O(1 MB), then finding r' is also ~1 seek: 3 seeks per random access into a GB
What if the index is still large? Build the same sort of index for the index! Now we pay 4 seeks for each random access into a TB. Doing this recursively gives a B-tree structure: good for access, but bad for inserting and deleting records!
11 Where does the latency come from?
The best case occurs when the data is in the same sector or block; in general, the data can be spread among non-adjacent blocks/sectors. How do we reduce the cost of access?
12 Using a relational DB for Big ML
Counts are stored on disk, not in memory, so access times go from O(n·scan) to O(n·scan·4·seek). DBs are good at caching frequently-used values, so seeks may be infrequent. However, with random access, the cost of accessing and updating (deletion/insertion) can be prohibitive!
13 Counting
example 1, example 2, example 3, … → Counting logic → "increment C[x] by D" → hash table, database, etc.
Hashtable issue: memory is too small. Database issue: seeks are slow.
14 Distributed Counting
example 1, example 2, example 3, … → Counting logic (Machine 0) → "increment C[x] by D" → Hash table 1..K on Machines 1..K.
Now we have enough memory…
New issues: machines and memory cost $$; routing increment requests to the right machine; sending increment requests across the network; communication complexity.
15 Using a memory-distributed database
Distributed machines now hold the counts, with an extra O(n) communication cost [it's okay]: O(n·scan) becomes O(n·scan + n·send/rcv). Complexity: routing needs to be done correctly.
Distributing data in memory across machines is not as cheap as accessing memory locally, because of communication costs. The problem we're dealing with is not size; it's the interaction between size and locality: we have a large structure that's being accessed in a non-local way.
16 What other options do we have…
Compress the counter hash-table:
–use integers as keys instead of strings?
–use approximate counts?
–discard infrequent/unhelpful words?
Trade off time for space: if the counter updates are better ordered, we can avoid using the disk.
17 Large-Vocabulary Naïve Bayes / Counting
Assume the counters need K times as much memory as we have. How?
–Construct a hash function h(event)
–For i = 0,…,K-1: scan through the train dataset, incrementing counters for an event only if h(event) mod K == i; save this counter set to disk at the end of the scan
–After K scans you have a complete counter set
This works for any counting task, not just naïve Bayes. What we're really doing here is organizing our "messages" to get more locality…
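The K-scan trick can be sketched as follows (an in-memory stand-in for the save-to-disk step; Python's built-in hash stands in for h(event)):

```python
from collections import Counter

def k_pass_counts(events, K):
    """K-pass counting: when the full counter table is K times bigger
    than memory, make K scans of the data and in each pass keep only
    the events whose hash falls in the current shard. Each pass's
    table is ~1/K the size of the full one."""
    shards = []
    for i in range(K):
        C = Counter(e for e in events if hash(e) % K == i)
        shards.append(C)  # the slide saves each shard to disk here
    return shards

shards = k_pass_counts(["a", "b", "a", "c"], K=2)
total = sum((s for s in shards), Counter())  # union of all shards
```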
18 How can we do Naïve Bayes with a large vocabulary?
Sort the intermediate counts: worst case O(n log n)! For very large datasets, use merge sort: sort k subsets that fit in memory, then merge the results in linear time.
19 Large-Vocabulary Naïve Bayes
Create a hashtable C. For each example id, y, x1,…,xd in train:
–C("Y=ANY")++; C("Y=y")++
–Print "Y=ANY += 1"; print "Y=y += 1"
–For j in 1..d: C("Y=y ^ X=xj")++; print "Y=y ^ X=xj += 1"
Sort the event-counter update "messages", then scan the sorted messages, and compute and output the final counter values. We are streaming the messages to another component that increments the counters.
20 Distributed Counting → Stream and Sort Counting
example 1, example 2, example 3, … → Counting logic (Machine 0) → "increment C[x] by D" → message-routing logic → Hash table 1..K on Machines 1..K.
21 Distributed Counting → Stream and Sort Counting
example 1, example 2, example 3, … → Counting logic (Machine A) → "increment C[x] by D" → Sort (Machine B) → "C[x1] += D1; C[x1] += D2; …" → logic to combine counter updates (Machine C).
22 Large-Vocabulary Naïve Bayes
Create a hashtable C. For each example id, y, x1,…,xd in train:
–C("Y=ANY")++; C("Y=y")++
–Print "Y=ANY += 1"; print "Y=y += 1"
–For j in 1..d: C("Y=y ^ X=xj")++; print "Y=y ^ X=xj += 1"
Sort the event-counter update "messages": this collects together messages about the same counter, e.g.
Y=business += 1 … Y=business ^ X=aaa += 1 … Y=business ^ X=zynga += 1, Y=sports ^ X=hat += 1, Y=sports ^ X=hockey += 1 … Y=sports ^ X=hoe += 1 … Y=sports += 1 …
Then scan and add the sorted messages and output the final counter values.
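The message-emitting training pass might look like this (a sketch; in the lecture's pipeline the printed messages would go to `sort` rather than into a list):

```python
def emit_messages(train):
    """Training phase of stream-and-sort naive Bayes: instead of
    updating an in-memory hashtable, emit one 'counter += 1' message
    per event. The messages are then sorted and summed downstream."""
    out = []
    for y, words in train:
        out.append("Y=ANY += 1")
        out.append(f"Y={y} += 1")
        for x in words:
            out.append(f"Y={y} ^ X={x} += 1")
    return out

# sorted() plays the role of Unix `sort`: it groups equal counters together
msgs = sorted(emit_messages([("sports", ["hockey"]), ("sports", ["hat"])]))
```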
23 Large-Vocabulary Naïve Bayes
Scan-and-add: streaming over the sorted messages (Y=business += 1 … Y=business ^ X=aaa += 1 … Y=sports ^ X=hat += 1 … Y=sports += 1 …):
previousKey = Null; sumForPreviousKey = 0
For each (event, delta) in input:
  If event == previousKey: sumForPreviousKey += delta
  Else: OutputPreviousKey(); previousKey = event; sumForPreviousKey = delta
OutputPreviousKey()
define OutputPreviousKey():
  If previousKey != Null: print previousKey, sumForPreviousKey
Accumulating the event counts requires constant storage… as long as the input is sorted.
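The scan-and-add pseudocode above is easy to make runnable; this sketch takes (event, delta) pairs directly instead of parsing message strings:

```python
def scan_and_add(sorted_messages):
    """Sum consecutive (event, delta) pairs, emitting each total when
    the key changes. Uses O(1) working storage because the input is
    sorted by event."""
    totals = []
    prev_key, prev_sum = None, 0
    for event, delta in sorted_messages:
        if event == prev_key:
            prev_sum += delta
        else:
            if prev_key is not None:
                totals.append((prev_key, prev_sum))
            prev_key, prev_sum = event, delta
    if prev_key is not None:  # flush the last key
        totals.append((prev_key, prev_sum))
    return totals

counts = scan_and_add([("Y=business", 1), ("Y=sports", 1), ("Y=sports", 1)])
```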
24 Large-Vocabulary Naïve Bayes
For each example id, y, x1,…,xd in train: print "Y=ANY += 1"; print "Y=y += 1"; for j in 1..d, print "Y=y ^ X=xj += 1". Complexity: O(n), n = size of train.
Sort the event-counter update "messages". Complexity: O(n log n).
Scan and add the sorted messages and output the final counter values. Complexity: O(n).
Model size: max O(n), O(|V||dom(Y)|) (assuming a constant number of labels apply to each document).
java MyTrainer train | sort | java MyCountAdder > model
25 Distributed Counting → Stream and Sort Counting
example 1, example 2, example 3, … → Counting logic (Machine A, easy to parallelize) → "increment C[x] by D" → Sort (Machine B, standardized message-routing logic) → "C[x1] += D1; C[x1] += D2; …" → logic to combine counter updates (Machines C1..CK, easy to parallelize).
26 Part II: Phrase Finding with the Request and Answer Pattern
27 ACL Workshop 2003
28 Why phrase-finding?
There are lots of phrases, there's no supervised data, and it's hard to articulate:
–What makes a phrase a phrase, vs. just an n-gram? A phrase is independently meaningful ("test drive", "red meat") or not ("are interesting", "are lots")
–What makes a phrase interesting?
30 In computational (i.e., learning) terms, what is a good phrase?
Phraseness: the degree to which a sequence of words "qualifies" as a phrase. Statistics: how often do the words co-occur vs. occur separately?
Informativeness: the degree to which the phrase captures or illustrates a key idea within a document, relative to a domain. Statistics: two aspects of co-occurrence, measured against a background corpus (training) and a foreground corpus (testing).
31 "Phraseness" 1 – based on BLRT
Binomial Log-Likelihood Ratio Test (BLRT). Draw samples: n1 draws with k1 successes, and n2 draws with k2 successes. Are they from one binomial (i.e., k1/n1 and k2/n2 differ only by chance) or from two distinct binomials?
Define pi = ki/ni, p = (k1+k2)/(n1+n2), and L(p,k,n) = p^k (1-p)^(n-k). The test statistic is 2·[log L(p1,k1,n1) + log L(p2,k2,n2) − log L(p,k1,n1) − log L(p,k2,n2)].
33 "Phraseness" 1 – based on BLRT
Define pi = ki/ni, p = (k1+k2)/(n1+n2), L(p,k,n) = p^k (1-p)^(n-k). For the phrase x y (W1=x ^ W2=y):
–k1 = C(W1=x ^ W2=y): how often bigram x y occurs in corpus C
–n1 = C(W1=x): how often word x occurs in corpus C
–k2 = C(W1≠x ^ W2=y): how often y occurs in C after a non-x
–n2 = C(W1≠x): how often a non-x occurs in C
Does y occur at the same frequency after x as in other positions?
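With the L(p,k,n) definition above, the BLRT statistic in its standard two-binomial form, 2·[log L(p1,k1,n1) + log L(p2,k2,n2) − log L(p,k1,n1) − log L(p,k2,n2)], can be computed directly:

```python
from math import log

def log_l(p, k, n):
    """log L(p,k,n) = k*log(p) + (n-k)*log(1-p), guarding the 0/1 edges."""
    if p in (0.0, 1.0):
        return 0.0 if k in (0, n) else float("-inf")
    return k * log(p) + (n - k) * log(1 - p)

def blrt(k1, n1, k2, n2):
    """Binomial log-likelihood ratio: near 0 when k1/n1 and k2/n2 could
    come from one shared binomial p, large when they could not."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (log_l(p1, k1, n1) + log_l(p2, k2, n2)
                - log_l(p, k1, n1) - log_l(p, k2, n2))
```

For phraseness, plug in the k1, n1, k2, n2 counts from the table above.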
34 "Informativeness" 1 – based on BLRT
Same definitions: pi = ki/ni, p = (k1+k2)/(n1+n2), L(p,k,n) = p^k (1-p)^(n-k). For the phrase x y (W1=x ^ W2=y) and two corpora, C (foreground) and B (background):
–k1 = C(W1=x ^ W2=y): how often bigram x y occurs in corpus C
–n1 = C(W1=* ^ W2=*): how many bigrams are in corpus C
–k2 = B(W1=x ^ W2=y): how often x y occurs in the background corpus
–n2 = B(W1=* ^ W2=*): how many bigrams are in the background corpus
Does x y occur at the same frequency in both corpora?
35 Phrase finding, with BLRT
36 Intuition of phraseness and informativeness
Goal: distinguish between distributions and quantify how different they are.
–Phraseness: compare the estimate of x y under a bigram model vs. a unigram model
–Informativeness: compare the estimate on the foreground vs. the background corpus
Use pointwise KL divergence.
37 Phraseness: defining our metric using KL divergence
Measure the difference between the bigram and unigram language models on the foreground corpus:
–Bigram model: P(x y) = P(x) P(y|x)
–Unigram model: P(x y) = P(x) P(y)
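A sketch of the pointwise-KL phraseness score, assuming the usual form δ(x y) = P(x y)·log(P(x y) / (P(x)·P(y))) with all probabilities estimated on the foreground corpus (the exact formula on the slide was an image, so this form is an assumption):

```python
from math import log

def pointwise_kl(p, q):
    """Pointwise KL contribution of one event: p * log(p/q)."""
    return p * log(p / q)

def phraseness(p_xy, p_x, p_y):
    """Compare the bigram estimate P(x y) against the unigram
    independence estimate P(x)*P(y); large means 'acts like a phrase'."""
    return pointwise_kl(p_xy, p_x * p_y)

# e.g. a bigram that is 100x more common than independence predicts
score = phraseness(p_xy=1e-4, p_x=1e-3, p_y=1e-3)
```

Informativeness reuses `pointwise_kl` with p from the foreground and q from the background corpus.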
38 Informativeness: defining our metric using KL divergence
Informativeness compares the foreground estimate of x y to the background estimate, again with pointwise KL divergence. Why not combine phraseness and informativeness (e.g., by adding or multiplying the two scores)?
39 Phrase finding with pointwise KL divergence, combined
40 Advantages of pointwise KL divergence over BLRT
–BLRT scores frequency in the foreground and background symmetrically; pointwise KL divergence does not!
–Phraseness and informativeness are comparable: combination is straightforward, and we don't need a classifier
–Language modeling: extends to n-grams and smoothing methods, and the software can be modularized
41 Now, how do we implement it? The Request and Answer Pattern
Main data structure: tables of key-value pairs, where the key is a phrase x y and the value is a mapping from attribute names to values. Send messages to this "table" and get results back. For example:
–Key: old man → Value: freq(B)=10, freq(F)=13, I=1.3, P=740
–Key: bad service → Value: freq(B)=8, freq(F)=25, I=560, P=254
42 From the foreground, generate and score phrases
–Count events "W1=x and W2=y" with the stream-and-sort approach from naïve Bayes
–Stream through the output, converting it to {phrase, attributes-of-phrase} records with a single attribute, freq_in_F=n
–Stream through the foreground and count events "W1=x" in a (memory-based) hashtable
–Scan through the phrase table to add an extra attribute holding the word frequencies, then compute phraseness
43 From the background, count events
–Count events "W1=x and W2=y" with the stream-and-sort approach from naïve Bayes
–Stream through the output, converting it to {phrase, attributes-of-phrase} records with a single attribute, freq_in_B=n
–Sort the two phrase tables (freq_in_B and freq_in_F) together and run the output through another reduce, appending together attributes with the same key
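The append-attributes-with-the-same-key reduce can be sketched as follows (an in-memory stand-in for the sorted-stream version; the record shape is illustrative):

```python
from itertools import groupby
from operator import itemgetter

def merge_phrase_tables(records):
    """Given (phrase, attributes) records from both corpora, sort them
    so equal keys are adjacent, then append together the attributes
    that share the same key -- the 'reduce' on the slide."""
    merged = {}
    for phrase, group in groupby(sorted(records, key=itemgetter(0)),
                                 key=itemgetter(0)):
        attrs = {}
        for _, attr in group:
            attrs.update(attr)
        merged[phrase] = attrs
    return merged

table = merge_phrase_tables([
    ("old man", {"freq_in_F": 13}),
    ("old man", {"freq_in_B": 10}),
    ("bad service", {"freq_in_F": 25}),
])
```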
44 Putting things together!
Scan the phrase table once more, adding the informativeness scores. Assuming the word vocabulary nW is small (word counts fit in memory):
–Scan foreground corpus C for phrases: O(nC), producing mC phrase records (of course mC << nC)
–Compute phraseness: O(mC)
–Scan background corpus B for phrases: O(nB), producing mB
–Sort together and combine records: O(m log m), m = mB + mC
–Compute informativeness and combined quality: O(m)
45 Can we do better? Keeping word counts out of memory
Assume we have word and phrase tables; how do we incorporate the word records with the phrases?
–For each phrase x y, request the word counts: print "x ~request=freq-in-F,from=xy" and print "y ~request=freq-in-F,from=xy"
–Sort all the word requests in with the word tables
–Scan through the result and generate the answers: for each word w with attributes a1=n1, a2=n2, …, print "xy ~request=freq-in-C,from=w"
–Sort again, and augment the xy records appropriately
46 Final Analysis
1.Scan foreground corpus F for phrases and words: O(nF), producing mF phrase records and vF word records
2.Scan phrase records, producing word-frequency requests: O(mF), producing 2mF requests
3.Sort requests with word records: O((2mF + vF) log(2mF + vF)) = O(mF log mF), since vF < mF
4.Scan through and answer requests: O(mF)
5.Sort answers with phrase records: O(mF log mF)
6.Repeat 1–5 for the background corpus: O(nB + mB log mB)
7.Combine the two phrase tables: O(m log m), m = mB + mF
8.Compute all the statistics: O(m)
47 Part III: Rocchio's Algorithm
48 Naïve Bayes is unusual
–Only one pass through the data, and we don't necessarily care about the order of the data
Rocchio's algorithm: instead of "binary" or "counts" for words, we use a vector space model, and then classify!
Reference: Relevance Feedback in Information Retrieval, in The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall, 1971.
49 Vector Space Model for Documents
Three aspects are needed:
–Word weighting method: we will use a term-frequency model
–Document length normalization: we will use Euclidean vector length
–Similarity measure: we will use cosine similarity
50 Rocchio's algorithm
Many variants of these formulae exist; a common one weights word w in document d as u(w,d) = log(1 + TF(w,d)) · log(D / DF(w)), where D is the total number of documents and DF(w) is the number of documents containing w, normalizes v(d) = u(d)/||u(d)||, and builds the class prototype v(y) by averaging the v(d) of documents labeled y.
…as long as u(w,d) = 0 for words not in d! Store only the non-zeros in u(d), so its size is O(|d|); but the size of u(y) is O(nV).
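A minimal end-to-end sketch of Rocchio under the TF-IDF instantiation above. The slide's exact formulas were images, so the log(1+TF)·log(D/DF) weighting is an assumption, and the toy corpus is invented for illustration:

```python
from collections import Counter, defaultdict
from math import log, sqrt

def tfidf(doc, df, n_docs):
    """v(d): log(1+TF) * log(D/DF) weights, L2-normalized.
    Only non-zero entries are stored, so size is O(|d|)."""
    tf = Counter(doc)
    u = {w: log(1 + c) * log(n_docs / df[w]) for w, c in tf.items()}
    norm = sqrt(sum(x * x for x in u.values())) or 1.0
    return {w: x / norm for w, x in u.items()}

def train_rocchio(docs):
    """v(y): average of the v(d) over documents labeled y."""
    df = Counter(w for _, d in docs for w in set(d))
    n_y = Counter(y for y, _ in docs)
    proto = defaultdict(lambda: defaultdict(float))
    for y, d in docs:
        for w, x in tfidf(d, df, len(docs)).items():
            proto[y][w] += x / n_y[y]
    return df, len(docs), dict(proto)

def classify(doc, df, n_docs, proto):
    """Pick the class whose prototype has the largest dot product
    (cosine similarity, since both vectors are normalized-scale)."""
    v = tfidf(doc, {w: df.get(w, 1) for w in doc}, n_docs)
    return max(proto, key=lambda y: sum(v.get(w, 0.0) * x
                                        for w, x in proto[y].items()))

docs = [("sports", ["hockey", "game"]), ("sports", ["hockey", "hat"]),
        ("business", ["stock", "market"]), ("business", ["stock", "zynga"])]
df, n_docs, proto = train_rocchio(docs)
```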
51 Rocchio's algorithm
Given a table mapping w to DF(w), we can compute v(d) from the words in d… and the rest of the learning algorithm is just adding.
52 Rocchio vs. Bayes
Train data: id1 y1 w1,1 w1,2 w1,3 … w1,k1; id2 y2 w2,1 w2,2 w2,3 …; id3 y3 w3,1 w3,2 …; …
Event counts: C[X=w1 ^ Y=sports]=5245; C[X=w1 ^ Y=worldNews]=1054; C[X=w2 ^ Y=…]=2120; …
Recall the naïve Bayes test process? Imagine a similar process, but for labeled documents: pair each document with the counters for its words, e.g. id1 y1 w1,1 w1,2 w1,3 … with C[X=w1,1 ^ Y=sports]=5245, C[X=w1,1 ^ Y=..], C[X=w1,2 ^ …]; id2 y2 w2,1 w2,2 w2,3 … with C[X=w2,1 ^ Y=….]=1054, …, C[X=w2,k2 ^ …]; and so on.
53 Rocchio Summary
–Compute DF: one scan through the docs
–Compute v(idi) for each document: output size O(n)
–Add up vectors to get v(y)
–Classification ~= disk NB
Time: O(n), n = corpus size, like NB event-counting:
–one scan if DF fits in memory (like the first part of the NB test procedure otherwise)
–one scan if the v(y)'s fit in memory (like NB training otherwise)
54 Summary
Three approaches for working with big "streaming" data:
–Stream and sort
–Request and answer
–Rocchio's algorithm
Implementing phrase finding: correlations between words identify phrases.
Where can these be used? NLP-related tasks (e.g., complex named-entity resolution) and other counting applications!
55 More about Data Streams…
56 Streaming Data
Data streams are "infinite" and "continuous", and their distribution can evolve over time. Managing streaming data is important:
–Input is coming extremely fast!
–We don't see all of the data
–(Sometimes) input rates are limited (e.g., Google queries, Twitter and Facebook status updates)
Key question in streaming data: how do we make critical calculations about the stream using limited (secondary) memory?
57 Is Naïve Bayes a Streaming Algorithm?
Yes! In machine learning we call this online learning: it lets us build the model as the data becomes available, adapting slowly as the data streams in. How? NB makes slow updates to the model: learn from the training data, and for every example in the stream, update the model (with a learning rate).
58 General Stream Processing Model
Streams enter over time, each composed of elements/tuples (e.g., "1, 5, 2, 7, 0, 9, 3…", "a, r, v, t, y, h, b…", "0, 0, 1, 0, 1, 1, 0…"). A processor with limited working storage serves standing queries and ad-hoc queries and produces output; the full stream can only go to archival storage.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
59 Types of queries on Data Streams
–Sampling from a data stream: construct a random sample
–Queries over a sliding window: number of items of type x in the last k elements of the stream
–Filtering a data stream: select elements with property x from the stream
–Counting distinct elements: number of distinct elements in the last k elements of the stream
–Estimating moments
–Finding frequent items
60 Applications
–Finding relevant products using clickstreams: Amazon, Google, all of them want to surface relevant products based on user interactions
–Trending topics on social media
–Health monitoring using sensor networks
–Monitoring molecular dynamics simulations:
A. Ramanathan, J.-O. Yoo and C.J. Langmead, On-the-fly identification of conformational sub-states from molecular dynamics simulations, J. Chem. Theory Comput. (2011), Vol. 7(3): 778-789.
A. Ramanathan, P.K. Agarwal, M.G. Kurnikova, and C.J. Langmead, An online approach for mining collective behaviors from molecular dynamics simulations, J. Comp. Biol. (2010), Vol. 17(3): 309-324.
61 Sampling from a Data Stream
Practical consideration: we cannot store all of the data, but we can store a sample! Two problems:
1.Sample a fixed proportion of the elements in the stream (say 1 in 10)
2.Maintain a random sample of fixed size over a potentially infinite stream
–At any time k, we would like a random sample of s elements
–Desirable property: for all time steps k, each of the k elements seen so far has equal probability of being sampled
62 Sampling a fixed proportion
Scenario: a search engine query stream of (user, query, time) tuples. Why is this interesting? It answers questions like "how often did a user have to run the same query in a single day?" We have storage for only 1/10th of the query stream.
Simple solution: generate a random integer in [0..9] for each query; store the query if the integer is 0, otherwise discard it.
63 What is wrong with our simple approach?
Suppose each user issues x singleton queries and d queries twice, for (x + 2d) total queries. What fraction of queries by an average search-engine user are duplicates? The true answer is d/(x+d).
If we keep only 10% of the queries:
–The sample will contain about x/10 of the singleton queries
–Of the d duplicated queries, only d/100 appear twice in the sample (d/100 = 1/10 · 1/10 · d), and 18d/100 appear exactly once (18d/100 = (1/10 · 9/10 + 9/10 · 1/10) · d)
–So the sample-based estimate is (d/100) / (x/10 + d/100 + 18d/100) = d/(10x + 19d), which is not d/(x+d)
64 Instead of queries, let's sample users…
Pick 1/10th of the users and take all their searches into the sample: use a hash function that hashes the user name or user id uniformly into 10 buckets. Why is this a better solution? Each user is either entirely in or entirely out of the sample, so within-user statistics (like the duplicate fraction) are unbiased.
65 Generalized Solution
Stream of tuples with keys:
–The key is some subset of each tuple's components: e.g., if the tuple is (user, search, time), the key is user
–The choice of key depends on the application
To get a sample of a/b fraction of the stream: hash each tuple's key uniformly into b buckets, and pick the tuple if it falls in one of the first a buckets (hash value < a).
How to generate a 30% sample? Hash into b = 10 buckets, and take the tuple if it hashes to one of the first 3 buckets.
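The generalized keyed sampler can be sketched with a stable hash (md5 here, so the decision is reproducible across runs; the user names are invented for illustration):

```python
import hashlib

def in_sample(key, a, b):
    """Keyed a/b sampling: hash the tuple's key uniformly into b buckets
    and keep the tuple when its bucket is one of the first a. Every
    tuple sharing a key gets the same keep/drop decision."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % b
    return bucket < a

# A 30% sample of users: b=10 buckets, keep the first 3.
kept = [u for u in ("alice", "bob", "carol", "dave") if in_sample(u, 3, 10)]
```

Because the decision is a function of the key alone, all of a user's tuples land in or out of the sample together, which fixes the bias of the previous slide.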
66 Maintaining a fixed-size sample
Suppose we need to maintain a random sample S of size exactly s tuples (e.g., because of a main-memory size constraint). Why is this hard? We don't know the length of the stream in advance. At time n we have seen n items, and we want each item to be in the sample S with equal probability s/n.
How to think about the problem: say s = 2, and the stream is a x c y z k c d e g…
–At n = 5, each of the first 5 tuples is included in the sample S with equal probability
–At n = 7, each of the first 7 tuples is included in the sample S with equal probability
An impractical solution would be to store all n tuples seen so far and pick s of them at random.
67 Fixed-Size Sample Algorithm (a.k.a. Reservoir Sampling)
–Store the first s elements of the stream in S
–When the n-th element arrives (n > s): with probability s/n keep it, else discard it; if kept, it replaces one of the s elements in S, picked uniformly at random
Claim: this algorithm maintains a sample S with the desired property: after n elements, the sample contains each element seen so far with probability s/n.
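The reservoir-sampling algorithm above, as runnable code:

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Reservoir sampling: keep the first s elements, then keep element
    n with probability s/n, replacing a uniformly random victim in the
    reservoir. Uses O(s) memory for a stream of unknown length."""
    S = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            S.append(x)
        elif rng.random() < s / n:
            S[rng.randrange(s)] = x
    return S

sample = reservoir_sample(range(1000), s=10)
```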
68 Proof: By Induction
We prove the claim by induction:
–Inductive hypothesis: after n elements, the sample contains each element seen so far with probability s/n
–We need to show that after seeing element n+1, the sample contains each element seen so far with probability s/(n+1)
Base case: after we see n = s elements, the sample S has the desired property; each of the n = s elements is in the sample with probability s/s = 1.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
69 Proof: By Induction (continued)
Inductive hypothesis: after n elements, the sample S contains each element seen so far with probability s/n. Now element n+1 arrives.
Inductive step: for an element already in S, the probability that the algorithm keeps it in S is
(1 − s/(n+1)) + (s/(n+1))·((s−1)/s) = n/(n+1)
(either element n+1 is discarded, or element n+1 is kept but this element of the sample is not the one picked for replacement). So a tuple that was in S with probability s/n at time n stays in S with probability n/(n+1), and the probability it is in S at time n+1 is (s/n)·(n/(n+1)) = s/(n+1).
70 Take-home messages…
Part of ML is a good grasp of theory; part of ML is a good grasp of which hacks tend to work. These are not always the same, especially in big-data situations.
Techniques covered so far:
–Brute-force estimation of a joint distribution
–Naïve Bayes
–Stream-and-sort and request-and-answer patterns
–BLRT and KL divergence (and when to use them)
–TF-IDF weighting, especially IDF: it's often useful even when we don't understand why