Approximate Frequency Counts over Data Streams. Gurmeet Singh Manku, Rajeev Motwani. Proceedings of the 28th VLDB Conference, 2002. 2019/1/1. Presenter: 吳建良
Motivation: In some new applications, data arrive as a continuous “stream”. The sheer volume of a stream over its lifetime is huge, yet query response times should be small. Examples: network traffic measurements, market data.
Network Traffic Management: Frequent items: frequent-flow identification at an IP router, for short-term monitoring and long-term management. (Figure: a router raises the alert "RED flow exceeds 1% of all traffic through me, check it!")
Mining Market Data: Frequent itemsets at a supermarket inform store layout, catalog design, … Among 100 million records: (1) at least 1% of customers buy both beer and diapers at the same time; (2) 51% of customers who buy beer also buy diapers!
Challenges: single pass; limited memory (network management); enumeration of itemsets (mining market data).
General Solution: Data Streams → Stream Processing Engine (Summary in Memory) → (Approximate) Answer
Approximate Algorithms: The paper proposes two algorithms for frequent items (Sticky Sampling and Lossy Counting) and one algorithm for frequent itemsets (Extended Lossy Counting).
Properties of the proposed algorithms: All item(set)s whose true frequency exceeds sN are output. No item(set) whose true frequency is less than (s - ε)N is output. Estimated frequencies are less than the true frequencies by at most εN.
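To make the three guarantees concrete, the following minimal sketch evaluates the thresholds for one setting. The helper name `guarantee_thresholds` and the stream length of one million are illustrative assumptions; s = 1% and ε = 0.1% are the values used on the space-comparison slide. Exact fractions are used only to avoid floating-point rounding in the printed values.

```python
from fractions import Fraction


def guarantee_thresholds(s, eps, n):
    """Return the three frequency thresholds implied by support s and error eps.

    s and eps are passed as strings (e.g. "0.01") so they convert exactly.
    """
    s, eps = Fraction(s), Fraction(eps)
    return {
        "must_output_above": s * n,           # true frequency > sN: always reported
        "never_output_below": (s - eps) * n,  # true frequency < (s - eps)N: never reported
        "max_undercount": eps * n,            # estimate is low by at most eps * N
    }


g = guarantee_thresholds("0.01", "0.001", 1_000_000)
print(g)
```

With these inputs, any item seen more than 10,000 times must be reported, no item seen fewer than 9,000 times can be reported, and each reported count is low by at most 1,000.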
Sticky Sampling Algorithm: User input includes three parameters: support threshold s, error parameter ε, and probability of failure δ. Counts are kept in a data structure S. Each entry in S has the form (e,f), where e is the item and f is the estimated frequency of e in the stream. When queried about the frequent items, the algorithm outputs all entries (e,f) such that f ≥ (s - ε)N, where N denotes the current length of the stream.
Sticky Sampling Algorithm (cont’d): Example (figure: items from the stream are sampled into an initially empty S).
Sticky Sampling Algorithm (cont’d)
1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
2. e ← next item; N ← N + 1
3. if (e,f) exists in S do increment the count f
4. else if random(0,1) < 1/r do insert (e,1) to S endif
5. if N = 2t · 2^n (n = 0, 1, 2, ...) do r ← 2r; Prune(S) endif
6. Goto 2
S: the set of all counts; e: item; N: current length of the stream; r: sampling rate (a new item is sampled with probability 1/r); t: (1/ε) log(1/(sδ))
When S is pruned: at each sampling-rate change
Sticky Sampling Algorithm: Prune S
function Prune(S)
  for every entry (e,f) in S do
    while random(0,1) < 0.5 and f > 0 do
      f ← f - 1
    if f = 0 do
      remove the entry from S
    endif
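The Sticky Sampling steps and the Prune routine can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names and the injectable `rng` are hypothetical, the logarithm is taken as natural, and because t is generally not an integer the rate change fires at the first N that reaches 2t, 4t, 8t, and so on.

```python
import math
import random


class StickySampling:
    """Sketch of Sticky Sampling; names and the rng parameter are illustrative."""

    def __init__(self, s, eps, delta, rng=None):
        self.s, self.eps = s, eps
        self.t = math.log(1.0 / (s * delta)) / eps  # t = (1/eps) * log(1/(s*delta))
        self.r = 1                                  # sampling rate
        self.n = 0                                  # N: stream length so far
        self.next_change = 2 * self.t               # rate doubles at N = 2t, 4t, 8t, ...
        self.counts = {}                            # S: item -> estimated frequency f
        self.rng = rng or random.Random()

    def add(self, e):
        self.n += 1
        if e in self.counts:
            self.counts[e] += 1
        elif self.rng.random() < 1.0 / self.r:      # sample a new item with probability 1/r
            self.counts[e] = 1
        if self.n >= self.next_change:              # sampling-rate change: double r, prune S
            self.r *= 2
            self.next_change *= 2
            self._prune()

    def _prune(self):
        for e in list(self.counts):
            f = self.counts[e]
            # Repeated fair coin tosses: each "heads" diminishes f by one.
            while f > 0 and self.rng.random() < 0.5:
                f -= 1
            if f == 0:
                del self.counts[e]
            else:
                self.counts[e] = f

    def query(self):
        """Return all entries with f >= (s - eps) * N."""
        threshold = (self.s - self.eps) * self.n
        return {e: f for e, f in self.counts.items() if f >= threshold}
```

While the stream is shorter than 2t, the rate stays at r = 1 and every count is exact; after that, a stored count can only undercount the true frequency, never exceed it.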
Lossy Counting Algorithm: The incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions. The current bucket id is denoted bcurrent = ⌈N/w⌉. fe: the true frequency of e in the stream. Counts are kept in a data structure D. Each entry in D has the form (e, f, Δ), where e is the item, f is the estimated frequency of e in the stream, and Δ is the maximum possible error in f.
Lossy Counting Algorithm (cont’d): Example: ε = 0.2, w = 5, N = 17, bcurrent = 4
Bucket 1: items A B C A B; D before prune: (A,2,0) (B,2,0) (C,1,0); D after prune: (A,2,0) (B,2,0)
Bucket 2: items E A C C D; D before prune: (A,3,0) (B,2,0) (C,2,1) (E,1,1) (D,1,1); D after prune: (A,3,0) (C,2,1)
Bucket 3: items D A B E D; D before prune: (A,4,0) (B,1,2) (C,2,1) (D,2,2) (E,1,2); D after prune: (A,4,0) (D,2,2)
Bucket 4 (bcurrent = 4, not yet full): items F C; D: (A,4,0) (C,1,3) (D,2,2) (F,1,3)
Lossy Counting Algorithm (cont’d)
1. D ← ∅; N ← 0
2. w ← ⌈1/ε⌉; bcurrent ← 1
3. e ← next item; N ← N + 1
4. if (e,f,Δ) exists in D do f ← f + 1
5. else do insert (e,1,bcurrent-1) to D endif
6. if N mod w = 0 do prune(D, bcurrent); bcurrent ← bcurrent + 1 endif
7. Goto 3
D: the set of all counts; N: current length of the stream; e: item; w: bucket width; bcurrent: current bucket id
When D is pruned: at each bucket boundary
Lossy Counting Algorithm: prune D
function prune(D, bcurrent)
  for each entry (e,f,Δ) in D do
    if f + Δ ≤ bcurrent do
      remove the entry from D
    endif
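The Lossy Counting loop and its prune step can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and `frequent_items` assumes the same output rule, f ≥ (s - ε)N, that the slides state for Sticky Sampling. The usage lines replay the 17-item example stream with ε = 0.2.

```python
import math


def lossy_count(stream, eps):
    """Return D as {item: (f, delta)} after one pass over `stream`."""
    w = math.ceil(1 / eps)              # bucket width
    d = {}                              # D: item -> [f, delta]
    b_current = 1
    n = 0
    for e in stream:
        n += 1
        if e in d:
            d[e][0] += 1
        else:
            d[e] = [1, b_current - 1]   # delta = maximum count missed so far
        if n % w == 0:                  # bucket boundary: prune, then advance
            for key in [k for k, (f, delta) in d.items() if f + delta <= b_current]:
                del d[key]
            b_current += 1
    return {e: tuple(v) for e, v in d.items()}


def frequent_items(d, s, eps, n):
    """Items whose estimated frequency satisfies f >= (s - eps) * n (assumed rule)."""
    return sorted(e for e, (f, _) in d.items() if f >= (s - eps) * n)


d = lossy_count("ABCABEACCDDABEDFC", eps=0.2)
print(d)
print(frequent_items(d, s=0.4, eps=0.2, n=17))
```

On this stream the final D holds A with (4, 0), C with (1, 3), D with (2, 2), and F with (1, 3), and with s = 0.4 only A is reported.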
Lossy Counting Algorithm (cont’d): Four Lemmas
Lemma 1: Whenever deletions occur, bcurrent ≤ εN
Lemma 2: Whenever an entry (e,f,Δ) gets deleted, fe ≤ bcurrent
Lemma 3: If e does not appear in D, then fe ≤ εN
Lemma 4: If (e,f,Δ) ∈ D, then f ≤ fe ≤ f + εN
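As a brief sketch of why Lemma 1 holds and how Lemma 3 follows from Lemmas 1 and 2 (a reconstruction of the reasoning from the definitions of w and the bucket id, not a quotation of the paper's proofs; N' is an auxiliary symbol introduced here):

```latex
% Lemma 1. Deletions happen only at bucket boundaries, where N = b_{current} \cdot w.
\[
  w = \left\lceil \tfrac{1}{\varepsilon} \right\rceil \ge \tfrac{1}{\varepsilon}
  \quad\Longrightarrow\quad
  b_{current} = \frac{N}{w} \le \varepsilon N .
\]

% Lemma 3. If e never occurred, f_e = 0. Otherwise let N' \le N be the stream
% length at the last deletion of e. No occurrence of e follows N' (it would have
% re-created the entry), so f_e is unchanged since then, and by Lemmas 2 and 1:
\[
  f_e \le b_{current}(N') \le \varepsilon N' \le \varepsilon N .
\]
```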
Extended Lossy Counting for Frequent Itemsets: The incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions. Counts are kept in a data structure D. Multiple buckets (β of them, say) are processed in a batch. Each entry in D has the form (set, f, Δ), where set is the itemset, f is the approximate frequency of set in the stream, and Δ is the maximum possible error in f.
Extended Lossy Counting for Frequent Itemsets (cont’d): (Figure: Bucket 1, Bucket 2, and Bucket 3.) Three buckets of data are loaded into main memory at one time.
Overview of the algorithm: D is updated by the operations UPDATE_SET and NEW_SET.
UPDATE_SET updates and deletes entries in D: for each entry (set, f, Δ), count the occurrences of set in the batch and update the entry; if an updated entry satisfies f + Δ ≤ bcurrent, the entry is removed from D.
NEW_SET inserts new entries into D: if a set has frequency f ≥ β in the batch and does not occur in D, create a new entry (set, f, bcurrent-β).
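The two operations can be sketched for a single batch as follows. This is a naive illustration under stated assumptions, not the paper's implementation: `process_batch` is a hypothetical name, every subset of every transaction is enumerated (which the real system avoids), `b_current` is taken to be the bucket id after the batch, and NEW_SET is assumed to test membership in D after UPDATE_SET has run.

```python
from collections import Counter
from itertools import combinations


def process_batch(d, batch, b_current, beta):
    """Apply UPDATE_SET then NEW_SET to D for one batch of `beta` buckets.

    d maps frozenset -> (f, delta); a new dict is returned.
    """
    # Count every non-empty subset in the batch (naive; tiny examples only).
    batch_counts = Counter()
    for txn in batch:
        items = sorted(set(txn))
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                batch_counts[frozenset(subset)] += 1

    updated = {}
    # UPDATE_SET: add batch counts; drop entries with f + delta <= b_current.
    for itemset, (f, delta) in d.items():
        f += batch_counts[itemset]          # Counter yields 0 for unseen itemsets
        if f + delta > b_current:
            updated[itemset] = (f, delta)
    # NEW_SET: insert absent itemsets whose batch frequency is at least beta.
    for itemset, f in batch_counts.items():
        if itemset not in updated and f >= beta:
            updated[itemset] = (f, b_current - beta)
    return updated


d1 = process_batch({}, ["AB", "AB", "AC", "B"], b_current=2, beta=2)
d2 = process_batch(d1, ["AB", "AB", "A", "C"], b_current=4, beta=2)
print(d2)
```

In the second batch the entry for AB reaches f + Δ = 4 and is deleted by UPDATE_SET, then re-created by NEW_SET as (AB, 2, 2) because its batch frequency equals β.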
Implementation: Challenges: avoid enumerating all subsets of a transaction; the data structure must be compact for better space efficiency. 3 major modules: Buffer, Trie, SetGen.
Implementation (cont’d): Buffer: repeatedly reads a batch of buckets of transactions into available main memory. Trie: maintains the data structure D. SetGen: generates subsets of item-ids along with their frequency counts in the current batch. Not all possible subsets need to be generated: if a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered.
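The superset-pruning rule can be illustrated with a simplified level-wise sketch. This is not the SetGen module itself, whose internals the slides do not describe: `count_with_pruning` is a hypothetical name, and the `keep` callback stands in for the decision made by UPDATE_SET and NEW_SET about whether an itemset ends up in D.

```python
from collections import Counter
from itertools import combinations


def count_with_pruning(batch, keep):
    """Count itemsets level by level, skipping supersets of rejected sets.

    keep(itemset, count) -> bool says whether the itemset enters D.
    Returns {frozenset: batch count} for the itemsets that were kept.
    """
    kept = {}
    prev_level = None
    size = 1
    while True:
        counts = Counter()
        for txn in batch:
            items = sorted(set(txn))
            for cand in combinations(items, size):
                # Count a candidate only if every subset one item smaller was kept.
                if size == 1 or all(
                    frozenset(sub) in prev_level
                    for sub in combinations(cand, size - 1)
                ):
                    counts[frozenset(cand)] += 1
        prev_level = {c: f for c, f in counts.items() if keep(c, f)}
        if not prev_level:
            return kept
        kept.update(prev_level)
        size += 1


batch = ["ACE", "BCD", "AB", "ABC", "AD", "BCE"]
result = count_with_pruning(batch, lambda itemset, f: f >= 3)
print(result)
```

With a threshold of 3 on these six transactions, D and E are rejected at the first level, so pairs such as AD, AE, or CE are never counted; only A, B, C, and BC survive.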
Example
Main memory holds bucket3 and bucket4: ACE BCD AB ABC AD BCE
SetGen output: ACE: AC, A, C; BCD: BC, B, C; AB: AB, A, B; ABC: AB, AC, BC, A, B, C; AD: A; BCE: BC, B, C
D before the batch: (A,5,0) (B,3,0) (C,3,0) (D,2,0) (AB,2,0) (AC,3,0) (AD,2,0) (BC,2,0)
D after UPDATE_SET: (A,9,0) (B,7,0) (C,7,0) (AC,5,0) (BC,5,0)
NEW_SET: add (AB,2,2) into D
Experiments
IBM synthetic dataset T10.I4.1000K: N = 1 million, avg. transaction size = 10, input size = 49 MB
IBM synthetic dataset T15.I6.1000K: N = 1 million, avg. transaction size = 15, input size = 69 MB
Frequent word pairs in 100K web documents: N = 100K, avg. transaction size = 134, input size = 54 MB
Frequent word pairs in 806K Reuters news reports: N = 806K, avg. transaction size = 61, input size = 210 MB
Varying support s and BUFFER B. (Figure, two panels: IBM 1M transactions and Reuters 806K docs. x-axis: BUFFER size in MB; y-axis: time in seconds; one curve per support threshold, with S between 0.001 and 0.020 across the panels.) Fixed: stream length N. Varying: BUFFER size B, support threshold s.
Varying length N and support s. (Figure, two panels: IBM 1M transactions and Reuters 806K docs. x-axis: length of stream in thousands; y-axis: time in seconds; curves labeled S = 0.002 and S = 0.004.) Fixed: BUFFER size B. Varying: stream length N, support threshold s.
Varying BUFFER B and support s. (Figure, two panels: IBM 1M transactions and Reuters 806K docs. x-axis: support threshold s; y-axis: time in seconds; curves for B = 4 MB, 16 MB, 28 MB, and 40 MB in each panel.) Fixed: stream length N. Varying: BUFFER size B, support threshold s.
Comparison with fast A-priori
Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.
Support   A-priori          Our Algorithm, 4MB Buffer   Our Algorithm, 44MB Buffer
          Time    Memory    Time     Memory             Time    Memory
0.001     99 s    82 MB     111 s    12 MB              27 s    45 MB
0.002     25 s    53 MB     94 s     10 MB              15 s    --
0.004     14 s    48 MB     65 s     7 MB               8 s     --
0.006     13 s    --        46 s     6 MB               6 s     --
0.008     --      --        34 s     5 MB               4 s     --
0.010     --      --        26 s     --                 --      --
(-- : value missing from the slide text)
Sticky Sampling, expected: (2/ε) log(1/(sδ)) counters. Lossy Counting, worst case: (1/ε) log(εN) counters. (Figure: number of counters versus stream length N, also plotted against log10 of N, for support s = 1% and error ε = 0.1%.)
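The two space bounds can be evaluated numerically with a short sketch. The function names are illustrative, the logarithm is assumed to be natural, and δ = 0.01% is an assumed value because the slide does not state the δ used for the plot; s = 1% and ε = 0.1% are the plotted values.

```python
import math


def sticky_sampling_expected_counters(s, eps, delta):
    """Expected number of entries: (2/eps) * log(1/(s*delta)); independent of N."""
    return (2 / eps) * math.log(1 / (s * delta))


def lossy_counting_max_counters(eps, n):
    """Worst-case number of entries: (1/eps) * log(eps*N); grows with log N."""
    return (1 / eps) * math.log(eps * n)


s, eps, delta = 0.01, 0.001, 1e-4       # delta is an assumed value
print(round(sticky_sampling_expected_counters(s, eps, delta)))
for n in (10**5, 10**6, 10**7):
    print(n, round(lossy_counting_max_counters(eps, n)))
```

The Sticky Sampling bound does not depend on the stream length, while the Lossy Counting bound grows logarithmically in N, which is the contrast the plot illustrates.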