Published by Derek Reynolds. Modified over 9 years ago.
1
Mining of Massive Datasets Ch4. Mining Data Streams
2
Outline
What is a data stream
The stream data model
  Examples of stream sources
  Stream queries: standing queries and ad-hoc queries
Sampling data in a stream
  Obtaining a representative sample
  Varying the sample size
Filtering streams
  Bloom filtering
3
What is a data stream
Database
  Data is available when and if we want it
Data stream
  Data arrives in a stream
  A stream is composed of elements (tuples), e.g. (0, 7, k), (1, 5, n), (0, 3, d)
  If an element is not processed immediately, it is lost forever
4
The stream data model
Streams need not have the same data rates or data types
Working storage: main memory or disk
Problem: we cannot store all the data from all the streams
5
Examples of stream sources
Sensor data
  Temperature sensor in the ocean
  Give the sensor a GPS unit: it can also report surface height
Image data
  Satellites
  Surveillance cameras
Internet and web traffic
  Google: several hundred million search queries per day
  Yahoo!: billions of clicks per day
6
Stream queries
Standing queries
  Permanently executing
  Produce outputs at appropriate times
Ad-hoc queries
7
Standing queries
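A standing query can be sketched as a small object that sees every element as it arrives and emits output only when something changes. The example below is an illustration, not taken from the slides: the class name `RunningMaxQuery` and the choice of "report each new maximum reading" are assumptions.

```python
class RunningMaxQuery:
    """Hypothetical standing query: report every new maximum reading."""

    def __init__(self):
        self.maximum = None  # the only state kept between elements

    def on_element(self, value):
        # Called once per arriving element; returns an output or None.
        if self.maximum is None or value > self.maximum:
            self.maximum = value
            return value
        return None


def run_standing_query(stream):
    """Feed a stream to the query and collect the outputs it produces."""
    query = RunningMaxQuery()
    outputs = []
    for value in stream:
        result = query.on_element(value)
        if result is not None:
            outputs.append(result)
    return outputs


print(run_standing_query([20.5, 19.0, 21.0, 21.0, 25.5]))  # [20.5, 21.0, 25.5]
```

The query keeps constant state (one number) no matter how long the stream is, which is what lets it run permanently.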
8
Ad-hoc queries
A question asked once about the current state of the stream
We do not store all streams => we cannot answer arbitrary queries
Solution: store a sliding window
  The window holds the most recent n elements, or all elements that arrived within the last t time units
[Figure: a stream of characters (q w e r t y u i o p a s d f g h j k l z x c v b n m ...) with a sliding window over the most recent elements; the axis runs from Past to Future]
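An element-count sliding window can be sketched with a bounded deque. This is a minimal illustration; the class name `SlidingWindow` and the vowel-counting query are assumptions chosen to match the character stream in the figure.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the most recent `size` elements of a stream."""

    def __init__(self, size):
        # A deque with maxlen silently drops the oldest element when full.
        self.items = deque(maxlen=size)

    def add(self, element):
        self.items.append(element)

    def count(self, predicate):
        # Ad-hoc query: answered from the window only, not the whole stream.
        return sum(1 for e in self.items if predicate(e))


window = SlidingWindow(5)
for ch in "qwertyuiop":
    window.add(ch)

print(list(window.items))                    # ['y', 'u', 'i', 'o', 'p']
print(window.count(lambda c: c in "aeiou"))  # 3
```

Any question about elements older than the window cannot be answered, which is the price of bounded storage.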
9
Example
10
Sampling data in a stream
We cannot store all streams in main memory
Solution: accept an approximate answer rather than an exact one
We ask queries about the sampled data
11
Example
Scenario: search engine query stream
  Stream of tuples: (user, query, time)
  Answer questions such as: how often did a user run the same query in a single day?
  Wish to store 1/10th of the query stream
Obvious solution:
  Generate a random integer in [0..9] for each query
  Store the query if the integer is 0, otherwise discard it
  This solution is wrong
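For reference, the per-query sampling scheme described above can be sketched as follows. The function name `naive_sample` and the explicit `rng` parameter are assumptions made so the example is reproducible; this is the flawed approach, shown only to make the later analysis concrete.

```python
import random


def naive_sample(stream, rng):
    """Keep each (user, query, time) tuple independently with probability 1/10."""
    sample = []
    for tup in stream:
        # Random integer in 0..9; keep the tuple only when it is 0.
        if rng.randrange(10) == 0:
            sample.append(tup)
    return sample


rng = random.Random(0)  # fixed seed, purely for repeatability
stream = [("alice", "weather", t) for t in range(100)]
sample = naive_sample(stream, rng)
```

Because the decision is made per query rather than per user, the two occurrences of a repeated query are kept or dropped independently of each other.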
12
Why the obvious solution is wrong
Suppose each user issues x queries once and d queries twice (total of x + 2d queries)
Correct answer: d/(x + d)
Proposed solution: we keep 10% of the queries
  The sample will contain x/10 of the singleton queries and 2d/10 occurrences of the duplicate queries
  But only d/100 pairs of duplicates: d/100 = 1/10 ∙ 1/10 ∙ d
  Of the d "duplicates", 18d/100 appear exactly once: 18d/100 = ((1/10 ∙ 9/10) + (9/10 ∙ 1/10)) ∙ d
So the sample-based answer is $\frac{d/100}{x/10 + d/100 + 18d/100} = \frac{d}{10x + 19d}$
$\frac{d}{x + d} \neq \frac{d}{10x + 19d}$
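The arithmetic above can be checked with exact fractions. This is a sketch under the slide's own assumptions (each query kept independently with probability p = 1/10); the function names are illustrative.

```python
from fractions import Fraction


def true_fraction(x, d):
    """Fraction of distinct queries that are duplicates in the full stream."""
    return Fraction(d, x + d)


def naive_sample_fraction(x, d, p=Fraction(1, 10)):
    """Expected duplicate fraction seen in a per-query sample of rate p."""
    singles = p * x                 # singleton queries that survive
    pairs = p * p * d               # duplicates with both copies kept
    dup_once = 2 * p * (1 - p) * d  # duplicates with exactly one copy kept
    return pairs / (singles + pairs + dup_once)


x, d = 100, 100
print(true_fraction(x, d))          # 1/2
print(naive_sample_fraction(x, d))  # 1/29, i.e. d / (10x + 19d)
```

With x = d = 100, the sample suggests that about 3% of queries are repeats when the true figure is 50%.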
13
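Obtaining a Representative Sample
Pick 1/10th of the users and take all of their searches
Option 1: store a list of users
  Generate a random integer between 0 and 9 for each new user
  0 => value: in; others => value: out
Option 2: hash function
  Hash each user name to one of ten buckets, 0 through 9
  Keep the tuple if the user hashes to bucket 0; no list of users needs to be stored
[Figure: ten hash buckets, numbered 0 through 9]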
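The hash-based variant can be sketched as below. The use of MD5 from `hashlib` is an assumption made only to get a hash that is stable across runs (Python's built-in `hash` for strings is randomized per process); the function names are illustrative.

```python
import hashlib


def bucket(user, num_buckets=10):
    """Map a user name to a bucket in 0..num_buckets-1, deterministically."""
    digest = hashlib.md5(user.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def sample_by_user(stream, num_buckets=10):
    """Keep every (user, query, time) tuple whose user falls in bucket 0."""
    return [tup for tup in stream if bucket(tup[0], num_buckets) == 0]


stream = [("user%d" % i, "query%d" % j, j) for i in range(100) for j in range(3)]
sample = sample_by_user(stream)
```

Since the decision depends only on the user name, a sampled user contributes all of their queries and an unsampled user contributes none, so duplicate pairs are never split.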
14
The general sampling problem
The stream consists of tuples with n components
A subset of the components is the key
If the key consists of more than one component, the hash function needs to combine the values to make a single hash value
Example: stream of tuples (user, query, time)
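One way to combine several key components into a single hash value is to hash a serialized tuple of them. The sketch below makes several assumptions not stated on the slide: the helper names, the use of `repr` plus SHA-256 for serialization and hashing, and the rule "keep a tuple when its hash is below a out of b buckets" for a sample of rate a/b.

```python
import hashlib


def key_hash(tup, key_indices, num_buckets):
    """Hash the key components of a tuple to one value in 0..num_buckets-1."""
    key = tuple(tup[i] for i in key_indices)
    digest = hashlib.sha256(repr(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def in_sample(tup, key_indices, a, b):
    """Keep the tuple if its key hashes into the first a of b buckets."""
    return key_hash(tup, key_indices, b) < a


# Key = (user, query): the time component does not affect the decision.
t1 = ("alice", "weather", 1001)
t2 = ("alice", "weather", 2002)
print(in_sample(t1, (0, 1), 1, 10) == in_sample(t2, (0, 1), 1, 10))  # True
```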
15
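Varying the sample size
The sample will grow as more of the stream arrives, so the sampling rate must be adjustable
The hash function h maps keys to the values 0, 1, 2, …, B-1
Maintain a threshold t
We sample the tuples whose key K satisfies h(K) ≤ t
If the sample exceeds the allotted space, lower t to t-1 and remove the tuples whose key hashes to t
[Figure: hash values 0 through 9 with threshold t = 4]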
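The threshold scheme can be sketched as a small class. The class name, the MD5-based `h`, and the `capacity` parameter (standing in for "the allotted space", measured here in tuples) are assumptions; the sketch also assumes that when t is lowered, tuples whose key hashes to the old t are evicted.

```python
import hashlib


class ThresholdSampler:
    """Keeps tuples whose key hashes to a value <= t, shrinking t on overflow."""

    def __init__(self, num_values, capacity):
        self.num_values = num_values     # B: hash values are 0..B-1
        self.threshold = num_values - 1  # t: start by accepting every key
        self.capacity = capacity         # maximum number of stored tuples
        self.sample = []

    def h(self, key):
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_values

    def add(self, key, tup):
        if self.h(key) <= self.threshold:
            self.sample.append((key, tup))
        # Over budget: drop everything hashing to t, then lower t to t-1.
        while len(self.sample) > self.capacity and self.threshold >= 0:
            t = self.threshold
            self.sample = [(k, v) for k, v in self.sample if self.h(k) != t]
            self.threshold = t - 1


sampler = ThresholdSampler(num_values=100, capacity=50)
for i in range(1000):
    user = "user%d" % i
    sampler.add(user, (user, "query", i))
```

At any moment the sample is exactly the set of seen tuples whose key hashes to at most t, so it remains a consistent key-based sample after every shrink.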
16
Filtering streams
We want to accept those tuples that meet a criterion
Accepted tuples are passed to another process, while others are dropped
Example: spam filtering
  Suppose we have a set S of one billion allowed email addresses
  Each address is 20 bytes or more
  We have 1 GB of available main memory
  Storing S itself would take at least 20 GB, so it does not fit in main memory
17
Bloom filtering
Use the main memory as a bit array B: B = [0,1,0,0,1,0,1,…,0]
1 GB main memory => 8 billion bits
All bits are 0 in the beginning: B = [0,0,0,0,…,0]
Hash each member s of S to a bit, and set that bit to 1: B[h(s)] = 1
Approximately 1/8th of the bits will be 1
When an element a arrives, we hash its address
  If B[h(a)] == 1, we let it through
  If B[h(a)] == 0, we drop it
Approximately 1/8th of the spam will get through
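A scaled-down version of this single-hash filter can be sketched with a `bytearray` as the bit array. The class name, the SHA-256-based hash, and the sizes (1,000 allowed addresses, 8,000 bits, keeping the slide's 1:8 ratio) are assumptions for illustration.

```python
import hashlib


class SingleHashFilter:
    """Bit array B with one hash function h, as in the one-hash Bloom filter."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _position(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        pos = self._position(item)
        self.bits[pos // 8] |= 1 << (pos % 8)  # B[h(s)] = 1

    def might_contain(self, item):
        pos = self._position(item)
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))


allowed = ["good%d@example.com" % i for i in range(1000)]
filt = SingleHashFilter(8000)
for address in allowed:
    filt.add(address)

spam = ["spam%d@example.com" % i for i in range(5000)]
passed = sum(filt.might_contain(a) for a in spam)
print(passed / len(spam))  # expected to be roughly 1/8
```

Every allowed address always passes (no false negatives); only addresses outside S can be misclassified.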
18
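Analysis of Bloom filtering
Suppose we have x targets and y darts
The probability that a given dart will not hit a given target is $(x-1)/x$
The probability that none of the y darts will hit a given target is $\left(\frac{x-1}{x}\right)^{y}$
Rewrite: $\left(\frac{x-1}{x}\right)^{y} = \left(1 - \frac{1}{x}\right)^{x(y/x)}$
Because $(1-\epsilon)^{1/\epsilon} \approx 1/e$ for small $\epsilon$ => $\left(1 - \frac{1}{x}\right)^{x(y/x)} \approx e^{-y/x}$
The probability that a given target is hit at least once, and hence the probability of a false positive, is $1 - e^{-y/x}$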
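The quality of the approximation can be checked numerically. The sketch below compares the exact miss probability with its exponential approximation; the function names and the scaled-down sizes (8,000 targets, 1,000 darts) are assumptions.

```python
import math


def prob_target_missed_exact(x, y):
    """Probability that none of y darts hits a given one of x targets."""
    return ((x - 1) / x) ** y


def prob_target_missed_approx(x, y):
    """Approximation e^(-y/x), valid when 1/x is small."""
    return math.exp(-y / x)


x, y = 8000, 1000
exact = prob_target_missed_exact(x, y)
approx = prob_target_missed_approx(x, y)
print(abs(exact - approx) < 1e-4)  # True
```

Even at this small scale the two values agree to several decimal places, and the agreement improves as x grows.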
19
Example
Consider the spam email example
8 billion bits => $x = 8 \times 10^{9}$ targets
1 billion members of S => $y = 10^{9}$ darts
The probability of a 1 is $1 - e^{-1/8} \approx 0.1175$
20
Bloom filtering with k hash functions
Set S has m members
The array has n bits
There are k hash functions
The number of targets is x = n
The number of darts is y = km
The probability of a 1 is $1 - e^{-km/n}$
A non-member passes only if all k of its bits are 1, so the false positive probability is $\left(1 - e^{-km/n}\right)^{k}$
"Optimal" value of k: $(n/m) \ln 2$
[Figure: false positive probability versus number of hash functions k]
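These formulas can be evaluated for the spam example (n = 8 billion bits, m = 1 billion members). The sketch assumes the usual step that a non-member gets through only when all k of its bits are 1, giving a false positive probability of (1 - e^(-km/n))^k; the function names are illustrative.

```python
import math


def false_positive_rate(n, m, k):
    """(1 - e^(-km/n))^k for n bits, m members, and k hash functions."""
    return (1 - math.exp(-k * m / n)) ** k


def optimal_k(n, m):
    """Real-valued k that minimizes the false positive rate: (n/m) ln 2."""
    return (n / m) * math.log(2)


n, m = 8e9, 1e9
for k in (1, 2, 5, 6):
    print(k, round(false_positive_rate(n, m, k), 4))
print(optimal_k(n, m))
```

In practice k must be an integer, so the real-valued optimum is rounded to a neighboring integer, and each additional hash function also costs extra computation per element.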