Mining of Massive Datasets Ch4. Mining Data Streams
Outline What is data stream The stream data model Example of stream sources Stream queries : standing queries & ad-hoc queries Sampling data in a stream Obtaining a Representative Sample Varying the sample size Filtering streams Bloom filtering
What is data stream Database Data stream Data is available when and if we want it Data stream Data arrives in a stream Stream is composed of elements/tuples If it is not processed immediately, then it is lost forever (0, 7, k),(1,5,n),(0,3,d)
The stream data model Streams need not have the same data rates or data types Working storage : main memory/disk Problem : cannot store all the data from all the streams
Example of stream sources Sensor data Temperature sensor in the ocean Give the sensor a GPS unit : report surface height Image data Satellites Surveillance cameras Internet and web traffic Google – hundred million search queries per day Yahoo! – billion of clicks per day
Stream queries Standing queries Ad-hoc queries permanently executing produce outputs at appropriate times
Standing queries
Ad-hoc queries A question asked once about the current state of the stream We do not store all streams =>we cannot answer arbitrary queries Solution : store a sliding window Elements or time q w e r t y u i o p a s d f g h j k l z x c v b n m q w e r t y u i o p a s d f g h j k l z x c v b n m q w e r t y u i o p a s d f g h j k l z x c v b n m Past Future
Example
Sampling data in a stream We cannot store all streams in main memory Solution : to get a approximate answer than an exact solution We ask queries about the sampled data
Example Scenario: Search engine query stream Obvious solution: Stream of tuples: (user, query, time) Answer questions such as: How often did a user run the same query in a single days Wish to store 1/10th of query stream Obvious solution: Generate a random integer in [0...9] for each query Store the query if the integer is 0, otherwise discard This solution is wrong
Suppose each user issues x queries once and d queries twice (total of x+2d queries) Correct answer: d/(x + d) Proposed solution: We keep 10% of the queries Sample will contain x/10 of the singleton queries and 2d/10 of the duplicate queries at least once But only d/100 pairs of duplicates : d/100 = 1/10 ∙ 1/10 ∙ d Of d “duplicates” 18d/100 appear exactly once 18d/100 = ((1/10 ∙ 9/10)+(9/10 ∙ 1/10)) ∙ d So the sample-based answer is 𝑑 100 𝑥 10 + 𝑑 100 + 18𝑑 100 = 𝒅 𝟏𝟎𝒙+𝟏𝟗𝒅 d/(x + d) ≠ 𝒅 𝟏𝟎𝒙+𝟏𝟗𝒅
Obtaining a Representative Sample Pick 1/10th of the users and take all of their searches Store a list of users Generate a random integer between 0 and 9 0 =>value : in ; others =>value : out Hash function Hash each user name to one of ten buckets,0 through 9 1 2 3 4 5 6 7 8 9
The general sampling problem The stream consists of tuples with n components A subset of components are the key If the key consists of more than one component, the hash function needs to combine the values to make a single hash-value Stream of tuples: (user, query, time)
Varying the sample size Because the sample will grow Values 0,1,2,…,B-1 Threshold t We sample the tuples whose key K satisfies h(K) ≦ t Lower t to t-1, if the samples exceeds the allotted space t=4 1 2 3 4 5 6 7 8 9
Filtering streams We want to accept those tuples that meet a crierion. Accept tuples are passed to another process, while others are dropped Email spam filtering Suppose we have a set S of one billion allowed email address Email address is 20 bytes or more We have 1 GB of available main memory
Bloom filtering Use the main memory as a bit array B : B = [0,1,0,0,1,0,1,…,0] 1 GB main memory => 8 billion bits All the bit is 0 in the beginning : B = [0,0,0,0,…,0] Hash each member of S to a bit, and set that bit to 1 : B[h(s)] = 1 Approximately 1/8th of the bits will be 1 When a element a arrives, we hash its email address If B[h(a)] == 1, we let it through ; If B[h(a)] == 0, we drop this email Approximately 1/8th of the spam email will get through
Analysis of Bloom filtering Suppose we have x targets and y darts The probability that a given dart will not hit a given target is (x-1)/x The probability that none of the darts will hit a given target is ((𝑥−1)/𝑥) 𝑦 Rewrite ((𝑥−1)/𝑥) 𝑦 = (1− 1 𝑥 ) 𝑥( 𝑦 𝑥 ) Because (1−𝜖) 1 𝜖 = 1 𝑒 for small 𝜖 => (1− 1 𝑥 ) 𝑥( 𝑦 𝑥 ) = 𝑒 − 𝑦 𝑥 The probability of a false positive is 1 - 𝑒 − 𝑦 𝑥
Example Consider the spam email 8 billion bits => x = 8× 10 9 targets 1 billion members of S => y = 10 9 darts The probability of a 1 is (1 - 𝑒 − 1 8 ) ≒ 0.1175
There are k hash functions The number of targets is x = n Number of hash functions, k False positive prob. Set S has m members The array has n bits There are k hash functions The number of targets is x = n The number of darts is y = km The probability of a 1 is (1 - 𝑒 − 𝑘𝑚 𝑛 ) Optimal” value of k: n/m ln(2)