Mining of Massive Datasets Ch4. Mining Data Streams

Name: Mining of Massive Datasets Ch4. Mining Data Streams
Uploaded: 2017-09-19T18:58:28+00:00
Duration: PTM9S10
Channel: Derek Reynolds
Description: Mining of Massive Datasets Ch4. Mining Data Streams

Mining of Massive Datasets Ch4. Mining Data Streams

Outline What is data stream The stream data model
Example of stream sources Stream queries : standing queries & ad-hoc queries Sampling data in a stream Obtaining a Representative Sample Varying the sample size Filtering streams Bloom filtering

What is data stream Database Data stream
Data is available when and if we want it Data stream Data arrives in a stream Stream is composed of elements/tuples If it is not processed immediately, then it is lost forever (0, 7, k),(1,5,n),(0,3,d)

The stream data model Streams need not have the same
data rates or data types Working storage : main memory/disk Problem : cannot store all the data from all the streams

Example of stream sources
Sensor data Temperature sensor in the ocean Give the sensor a GPS unit : report surface height Image data Satellites Surveillance cameras Internet and web traffic Google – hundred million search queries per day Yahoo! – billion of clicks per day

Stream queries Standing queries Ad-hoc queries permanently executing
produce outputs at appropriate times

Standing queries

Ad-hoc queries A question asked once about the current state of the stream We do not store all streams =>we cannot answer arbitrary queries Solution : store a sliding window Elements or time q w e r t y u i o p a s d f g h j k l z x c v b n m q w e r t y u i o p a s d f g h j k l z x c v b n m q w e r t y u i o p a s d f g h j k l z x c v b n m Past Future

Example

Sampling data in a stream
We cannot store all streams in main memory Solution : to get a approximate answer than an exact solution We ask queries about the sampled data

Example Scenario: Search engine query stream Obvious solution:
Stream of tuples: (user, query, time) Answer questions such as: How often did a user run the same query in a single days Wish to store 1/10th of query stream Obvious solution: Generate a random integer in [0...9] for each query Store the query if the integer is 0, otherwise discard This solution is wrong

Suppose each user issues x queries once and d queries twice
(total of x+2d queries) Correct answer: d/(x + d) Proposed solution: We keep 10% of the queries Sample will contain x/10 of the singleton queries and 2d/10 of the duplicate queries at least once But only d/100 pairs of duplicates : d/100 = 1/10 ∙ 1/10 ∙ d Of d “duplicates” 18d/100 appear exactly once 18d/100 = ((1/10 ∙ 9/10)+(9/10 ∙ 1/10)) ∙ d So the sample-based answer is 𝑑 𝑥 10 + 𝑑 𝑑 100 = 𝒅 𝟏𝟎𝒙+𝟏𝟗𝒅 d/(x + d) ≠ 𝒅 𝟏𝟎𝒙+𝟏𝟗𝒅

Obtaining a Representative Sample
Pick 1/10th of the users and take all of their searches Store a list of users Generate a random integer between 0 and 9 0 =>value : in ; others =>value : out Hash function Hash each user name to one of ten buckets,0 through 9 1 2 3 4 5 6 7 8 9

The general sampling problem
The stream consists of tuples with n components A subset of components are the key If the key consists of more than one component, the hash function needs to combine the values to make a single hash-value Stream of tuples: (user, query, time)

Varying the sample size
Because the sample will grow Values 0,1,2,…,B-1 Threshold t We sample the tuples whose key K satisfies h(K) ≦ t Lower t to t-1, if the samples exceeds the allotted space t=4 1 2 3 4 5 6 7 8 9

Filtering streams We want to accept those tuples that meet a crierion.
Accept tuples are passed to another process, while others are dropped spam filtering Suppose we have a set S of one billion allowed address address is 20 bytes or more We have 1 GB of available main memory

Bloom filtering Use the main memory as a bit array B : B = [0,1,0,0,1,0,1,…,0] 1 GB main memory => 8 billion bits All the bit is 0 in the beginning : B = [0,0,0,0,…,0] Hash each member of S to a bit, and set that bit to 1 : B[h(s)] = 1 Approximately 1/8th of the bits will be 1 When a element a arrives, we hash its address If B[h(a)] == 1, we let it through ; If B[h(a)] == 0, we drop this Approximately 1/8th of the spam will get through

Analysis of Bloom filtering
Suppose we have x targets and y darts The probability that a given dart will not hit a given target is (x-1)/x The probability that none of the darts will hit a given target is ((𝑥−1)/𝑥) 𝑦 Rewrite ((𝑥−1)/𝑥) 𝑦 = (1− 1 𝑥 ) 𝑥( 𝑦 𝑥 ) Because (1−𝜖) 1 𝜖 = 1 𝑒 for small 𝜖 => (1− 1 𝑥 ) 𝑥( 𝑦 𝑥 ) = 𝑒 − 𝑦 𝑥 The probability of a false positive is 1 - 𝑒 − 𝑦 𝑥

Example Consider the spam email
8 billion bits => x = 8× targets 1 billion members of S => y = darts The probability of a 1 is (1 - 𝑒 − 1 8 ) ≒

There are k hash functions The number of targets is x = n
Number of hash functions, k False positive prob. Set S has m members The array has n bits There are k hash functions The number of targets is x = n The number of darts is y = km The probability of a 1 is (1 - 𝑒 − 𝑘𝑚 𝑛 ) Optimal” value of k: n/m ln(2)

Mining of Massive Datasets Ch4. Mining Data Streams

Similar presentations

Presentation on theme: "Mining of Massive Datasets Ch4. Mining Data Streams"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Mining of Massive Datasets Ch4. Mining Data Streams

Similar presentations

Presentation on theme: "Mining of Massive Datasets Ch4. Mining Data Streams"— Presentation transcript:

Similar presentations

About project

Feedback