Mining Data Streams
Some of these slides are based on the Stanford Mining Massive Datasets course slides.
Infinite Data: Data Streams
Until now, we assumed that the data is finite: data is crawled and stored, and queries are issued against an index. Sometimes, however, data is infinite (it is constantly being created) and therefore becomes too big to store.
Examples of data streams?
Later: How can we query such data?
Sensor Data
Millions of sensors, each sending data every (fraction of a) second
Analyze trends over time
Example: the LOBO Observatory
Image Data
Satellites send terabytes of data a day
Cameras have lower resolution, but there are more of them…
Internet and Web Traffic
Streams of HTTP requests, IP addresses, etc. can be used to monitor traffic for suspicious activity. Streams of Google queries can be used to determine query trends, or to understand disease spread. Can the spread of the flu be predicted accurately from Google queries?
Social Updates
Twitter, Facebook, Instagram, Flickr, …
The Stream Model
Tuples in the stream enter at a rapid rate. We have a fixed working memory size and a fixed (but larger) archival storage. How do you make critical calculations quickly, i.e., using only working memory?
The Stream Model: More details
(Diagram: several streams enter a processor, each composed of tuples/elements, e.g., a numeric stream …, 5, 2, 7, 0, 9, 3, a character stream a, r, v, t, y, h, b, and a bit stream …, 0, 1, 0, 1, 1, 0, arriving over time. The processor handles standing queries and ad-hoc queries, produces output, and is backed by limited working storage plus larger archival storage.)
Stream Queries
Standing queries versus ad-hoc queries. We usually store a sliding window for ad-hoc queries.
Examples:
Output an alert when an element > 25 appears
Compute a running average of the last 20 elements
Compute a running average of the elements from the past 48 hours
The maximum value ever seen
The number of different elements seen
Are these easy or hard to answer?
Constraints and Issues
Data appears rapidly. Elements must be processed in real time (i.e., in main memory), or they are lost. We may be willing to accept an approximate answer, instead of an exact answer, to save time. Hashing will be useful, as it adds randomness to algorithms!
Sampling Data in a Stream
Goal
Create a sample of the stream, answer queries over the sample, and have the answers be representative of the entire stream.
Two different problems:
A fixed proportion of the elements (e.g., 1 in 10)
A random sample of fixed size (e.g., at any time k, we should have a random sample of s elements)
Motivating Example
A search engine stream of tuples (user, query, time). We have room to store 1/10 of the stream.
Query: What fraction of the typical user's queries were repeated over the past month?
Ideas?
Solution Attempt
For each tuple, choose a random number from 0 to 9, and store the tuples for which the random value was 0. On average, per user, we store 1/10 of the queries. Can this be used to answer our question on repeated queries?
Wrong! Suppose that in the past month a user issued x queries once and d queries twice (x + 2d queries in total). The correct answer is d/(x+d). In our 10% sample, x/10 of the singleton queries appear. Of the d duplicated queries, d/100 appear twice in the sample (probability 1/10 * 1/10), and 18d/100 appear exactly once, since 18/100 = (1/10 * 9/10) + (9/10 * 1/10). So we will give the following wrong answer:
(d/100) / (x/10 + 18d/100 + d/100) = d / (10x + 19d)
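A quick Monte Carlo check of this bias (an illustrative sketch, not from the slides; the setup and names are made up):

```python
# One user with x singleton queries and d duplicated queries; sample each
# tuple independently with probability 1/10 and compare the apparent
# fraction of repeated queries with the truth d/(x+d).
import random

random.seed(0)
x, d, trials = 1000, 500, 2000

est_num = est_den = 0
for _ in range(trials):
    queries = [("s", i) for i in range(x)] + [("d", i) for i in range(d)] * 2
    counts = {}
    for q in queries:
        if random.random() < 0.1:            # keep 1/10 of the tuples
            counts[q] = counts.get(q, 0) + 1
    est_num += sum(1 for c in counts.values() if c == 2)
    est_den += len(counts)

print("true fraction of repeated queries:", d / (x + d))        # 1/3
print("estimate from 10% tuple sample:   ", est_num / est_den)  # ~0.026
print("predicted biased value:           ", d / (10 * x + 19 * d))
```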
Solution
Pick 1/10 of the users and take all their searches into the sample: use a hash function that hashes the user name (or user id) into 10 buckets, and store the data if the user is hashed into bucket 0. How would you store a fixed fraction a/b of the stream? (See the sketch below.)
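A minimal sketch of fraction-a/b sampling by user (the use of hashlib.md5 and the helper keep_user are illustrative assumptions, not from the slides): a user's entire stream is kept iff the user hashes into one of the first a of b buckets.

```python
import hashlib

def keep_user(user_id: str, a: int, b: int) -> bool:
    # Hash the user id into b buckets; accept buckets 0..a-1 (a fraction a/b).
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % b
    return bucket < a

stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q1")]
sample = [(u, q) for (u, q) in stream if keep_user(u, 1, 10)]  # ~1/10 of users
```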
Variant of the Problem
Store a random sample of fixed size: after seeing k elements, we should have a random sample of s of them. Each of the k elements seen so far should have the same probability, s/k, of being in the sample. Ideas?
Reservoir Sampling
Store the first s elements of the stream in S. Suppose we have seen n-1 elements, and now the n-th arrives (n > s):
With probability s/n, keep the n-th element; otherwise discard it
If we keep the n-th element, randomly choose one of the elements already in S to discard
Claim: After n elements, the sample contains each element seen so far with probability s/n
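A direct sketch of reservoir sampling as just described (the function name is made up):

```python
import random

def reservoir_sample(stream, s):
    sample, n = [], 0
    for element in stream:
        n += 1
        if n <= s:
            sample.append(element)                 # keep the first s elements
        elif random.random() < s / n:              # keep the n-th with prob s/n
            sample[random.randrange(s)] = element  # evict a uniform victim
    return sample

print(reservoir_sample(range(10_000), s=5))        # 5 uniformly chosen elements
```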
Proof By Induction
Inductive claim: If, after n elements, the sample S contains each element seen so far with probability s/n, then, after n+1 elements, the sample S contains each element seen so far with probability s/(n+1).
Base case: n = s; each of the first n elements is in the sample with probability s/n = 1.
Proof By Induction (cont)
Inductive step: For an element already in the sample, the probability that it remains when element n+1 arrives is
(1 - s/(n+1)) + (s/(n+1)) * ((s-1)/s) = n/(n+1)
(either element n+1 is discarded, or it is kept and the discarded element is one of the other s-1). At time n the element is in the sample with probability s/n, so at time n+1 its probability of being in the sample is (s/n) * (n/(n+1)) = s/(n+1).
Filtering Stream Data
The Problem
We would like to filter the stream according to some criterion. If it is a simple property of the tuple (e.g., < 10): easy! If it requires a lookup in a large set that does not fit in memory: hard!
Example: a stream of URLs (for crawling); filter out a URL if it has been "seen" already
Example: a stream of emails; we have addresses believed to be non-spam, and we filter out "spam" (perhaps for further processing)
Filtering Example
Suppose we have 1 billion "allowed" email addresses, each taking ~20 bytes, and 1 GB of available main memory. We cannot store a hash table of the allowed emails in main memory! Instead, view main memory as a bit array: we have 8 billion bits available.
Simple Filtering Solution
Let h be a hash function from email addresses to 8 billion buckets. Hash each allowed email address and set the corresponding bit to 1. There are 1 billion addresses, so about 1/8 of the bits will be 1 (perhaps fewer, due to hash collisions!). Given a new email, compute its hash: if the bit is 1, let it through; if it is 0, consider it spam. There will be false positives! (About 1/8 of spam gets through.)
Bloom Filter
A Bloom filter consists of:
An array of n bits, all initialized to 0
A collection of hash functions h1,…,hk, each mapping a key to one of the n bit positions
A set S of m key values
Goal: Allow all stream elements whose keys are in S; reject all others.
Using a Bloom Filter
To set up the filter: take each key K in S and set bit hi(K) to 1, for all hash functions h1,…,hk.
To use the filter: given a new key K in the stream, compute hi(K) for all hash functions h1,…,hk. If all of these bits are 1, allow the element; if at least one of these bits is 0, reject it.
Can we have false negatives? Can we have false positives?
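A minimal Bloom filter sketch along these lines; simulating the k hash functions by salting a single hash is an illustrative assumption, not something the slides prescribe:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int, k: int):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits // 8 + 1)   # the n-bit array, all zeros

    def _positions(self, key: str):
        for i in range(self.k):                  # k "hash functions" via salting
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n

    def add(self, key: str):
        for p in self._positions(key):           # set h_i(key) to 1 for all i
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # Allow only if every h_i(key) bit is 1
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

bf = BloomFilter(n_bits=10_000, k=4)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True: no false negatives
print(bf.might_contain("spam@example.com"))   # usually False; can be a false positive
```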
Probability of a False Positive
Model: throwing darts at targets. Suppose we have x targets and y darts, and any dart is equally likely to hit any target. How many targets will be hit at least once?
Probability that a given dart does not hit a given target: (x-1)/x
Probability that none of the y darts hits a given target: ((x-1)/x)^y
Probability of a False Positive (cont)
As x goes to infinity, it is well known that (1 - 1/x)^x → e^(-1). So, if x is large, we have
((x-1)/x)^y = ((1 - 1/x)^x)^(y/x) ≈ e^(-y/x)
Back to the Running Example
There are 1 billion email addresses (= 1 billion darts) and 8 billion bits (= 8 billion targets), with, for now, a single hash function. The probability that a given bit is not hit is e^(-1/8); the probability that it is hit is approximately 1 - e^(-1/8) ≈ 0.1175.
Now, multiple hash functions
Suppose we use k hash functions; the number of targets is n and the number of darts is km. The probability that a bit remains 0 is e^(-km/n). Choose k so as to have enough 0 bits. The probability of a false positive is now (1 - e^(-km/n))^k.
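Plugging the running example into these formulas (the closed-form optimum k = (n/m) ln 2 is a standard Bloom filter fact, not stated on the slide):

```python
import math

n, m = 8e9, 1e9               # 8 billion bits, 1 billion allowed addresses
for k in range(1, 9):
    fp = (1 - math.exp(-k * m / n)) ** k   # false-positive probability
    print(f"k={k}: false positive rate ~ {fp:.4f}")
# k=1 reproduces the ~0.1175 from the previous slide.

print("optimal k =", (n / m) * math.log(2))  # ~5.55, so use k = 5 or 6
```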
Counting Distinct Stream Elements
Problem
The stream elements are drawn from some universal set. Estimate the number of different elements we have seen, either from the beginning or from a specific point in time. Why is this hard?
Applications
How many different words appear in the Web pages at a site? (Why is this interesting?)
How many different Web pages does a customer request in a week?
How many distinct products did we sell last month?
Obvious Solution
Keep all elements seen so far in a hash table; when an element appears, check the hash table. But what if there are too many elements to store, or too many streams to keep a hash table for each? Solution: use a small amount of memory and estimate the correct value, limiting the probability that the error is large.
Flajolet-Martin Algorithm: Intuition
We will hash the elements of the stream to bit-strings. The bit-string length must be chosen so that there are more possible bit-strings than members of the universal set. Example: IPv4 addresses form a universal set of size 2^32 (4 * 8 bits), so we need strings of length at least 32 bits. The more elements we see, the more likely we are to see "unusual" hash values.
Flajolet-Martin Algorithm: Intuition (cont)
Pick a hash function h that maps the universe of N elements to at least log2 N bits. For each stream element a, let r(a) be the number of trailing 0-s in h(a), i.e., the position of the first 1 counting from the right. Example: if h(a) = 12, then 12 is 1100 in binary and r(a) = 2. Let R denote the maximum tail length seen so far. Estimate the number of distinct elements as 2^R.
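A sketch of the single-hash-function version of this estimate (the choice of hashlib.sha256 as h is an illustrative assumption):

```python
import hashlib

def tail_length(v: int) -> int:
    # r(a): number of trailing zeros in the binary expansion of v
    return (v & -v).bit_length() - 1 if v else 0

def fm_estimate(stream) -> int:
    R = 0
    for a in stream:
        h = int.from_bytes(hashlib.sha256(str(a).encode()).digest()[:8], "big")
        R = max(R, tail_length(h))             # track the longest tail seen
    return 2 ** R                              # estimate of #distinct elements

# 1000 distinct values; the estimate is typically right up to a power of 2
print(fm_estimate(i % 1000 for i in range(100_000)))
```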
Why it Works: Very rough and heuristic
h hashes a with equal probability to any of N values, so h(a) is a sequence of log2 N bits, in which a 2^(-r) fraction of all a-s have a tail of r zeros:
About 50% of a-s hash to ***0
About 25% of a-s hash to **00
So we need to hash about 2^r distinct items before we see one with a zero-suffix of length r; if the longest tail is R = 2, then we have probably seen about 4 distinct items so far.
Why it Works: More formally
Let m be the number of distinct items. We will show that the probability of finding a tail of r zeros:
Goes to 1 if m is much greater than 2^r
Goes to 0 if m is much smaller than 2^r
So, 2^R will be around m!
Why it Works: More formally (cont)
What is the probability that a given h(a) ends in at least r 0-s? h hashes elements uniformly at random, and the probability that a random number ends in at least r 0-s is 2^(-r). So the probability of not seeing a tail of length at least r among m distinct elements is (1 - 2^(-r))^m.
Why it Works: More formally (cont)
The probability of not finding a tail of length r is
(1 - 2^(-r))^m = ((1 - 2^(-r))^(2^r))^(m * 2^(-r)) ≈ e^(-m / 2^r)
If m << 2^r, this probability tends to 1, so the probability of finding a tail of length r tends to 0
If m >> 2^r, this probability tends to 0, so the probability of finding a tail of length r tends to 1
So, 2^R will be around m!
Why it Doesn’t Work E[2R] is infinite!
Probability halves when R→R+1, but value doubles Work around uses many hash functions hi and many samples of Ri Details omitted
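For completeness, one standard way to combine the samples (following the MMDS text on which these slides are based): average the 2^Ri estimates within small groups, then take the median of the group averages. A sketch; fm_estimate_i below is a hypothetical per-hash-function estimator:

```python
import statistics

def combine_estimates(estimates, group_size):
    # Median of group means: the median is robust to the occasional huge
    # 2^R outlier, while averaging within groups smooths the power-of-2 grid.
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    return statistics.median(sum(g) / len(g) for g in groups)

# e.g., 200 hash functions split into 20 groups of 10:
# combine_estimates([fm_estimate_i(stream) for i in range(200)], group_size=10)
```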
Counting Ones in a Window
Problem
We have a window of length N on a binary stream, where N is too big to store. At all times, we want to be able to answer: "How many 1-s are there in the last k bits?" (k ≤ N).
Example: For each spam email seen, we emit a 1; we want to always know how many of the last million emails were spam
Example: For each tweet seen, we emit a 1 if it is anti-Semitic; we want to always know how many of the last billion tweets were anti-Semitic
Cost of Exact Counts
For a precise count, we must store all N bits of the window. This is easy to show even if we can only ask about the number of 1-s in the entire window. Suppose we use j < N bits to store the information. Then there are two different bit sequences that are represented in the same manner. Suppose these sequences agree on their last k-1 bits but differ on the k-th (from the right). Follow the window with N-k further 1-s: the bits beyond position k fall out, so the two windows now differ in exactly one 1, yet we still have the same representation for both. The representation must therefore answer at least one of them incorrectly.
The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
Uses O(log^2 N) bits and gives an estimate that is off by no more than 50%; later we improve this to get a better bound. Assume each bit has a timestamp (the position at which it arrives). Represent timestamps modulo N, which requires log N bits, and store the total number of bits ever seen, modulo N.
DGIM Algorithm
Divide the window into buckets, each consisting of:
The timestamp of its right (most recent) end
The size of the bucket = the number of 1-s in the bucket, which must be a power of 2
Bucket representation: log N bits for the timestamp + log log N bits for the size. We know the size is some 2^j, so we only store j; since j is at most log N, it needs log log N bits.
Rules for Representing a Stream by Buckets
The right end of a bucket always has a 1
Every position with a 1 is in some bucket
No position is in more than one bucket
There are one or two buckets of any given size, up to some maximum size
All sizes must be a power of 2
Buckets cannot decrease in size as we move to the left
Example: Bucketized Stream
(Figure: a bit stream divided into buckets: one of size 8, two of size 4, one of size 2, and two of size 1.) Observe that all of the DGIM rules hold.
Space Requirements
Each bucket requires O(log N) bits, as we saw earlier. For a window of length N, there can be at most N 1-s, so if the largest bucket has size 2^j, then j cannot be more than log N. There are at most two buckets of each of the sizes 2^(log N) down to 1, i.e., O(log N) buckets. Total space for buckets: O(log^2 N).
Query Answering
Given k ≤ N, we want to know how many of the last k bits were 1-s. Find the bucket b with the earliest timestamp that includes at least some of the last k bits. Estimate the number of 1-s as the sum of the sizes of the buckets to the right of b, plus half of the size of b.
Example
(Same bucketized stream as before: one bucket of size 8, two of size 4, one of size 2, and two of size 1.) What would be your estimate if k = 10? What if k = 15?
How Good an Estimate is This?
Suppose the leftmost bucket b included has size 2^j, and let c be the correct answer.
Suppose the estimate is less than c. In the worst case, all the 1-s in the leftmost bucket are in the query range, so the estimate misses half of bucket b, i.e., 2^(j-1). But then c is at least 2^j; actually c is at least 2^(j+1) - 1, since there is at least one bucket of each of the sizes 2^(j-1), 2^(j-2), …, 1. So our estimate is at least 50% of c.
How Good an Estimate is This? (cont)
Suppose the leftmost bucket b included has size 2^j, and let c be the correct answer.
Suppose the estimate is more than c. In the worst case, only the rightmost 1 of the leftmost bucket is in the query range, and there is only one bucket of each size less than 2^j. Then c = 1 + 2^(j-1) + 2^(j-2) + … + 1 = 2^j, while our estimate is 2^(j-1) + 2^(j-1) + 2^(j-2) + … + 1 = 2^j + 2^(j-1) - 1. So our estimate is at most 50% more than c.
Maintaining DGIM Conditions
Suppose we have a window of length N represented by buckets satisfying the DGIM conditions, and a new bit comes in:
Check the leftmost bucket; if its timestamp is now N steps before the current timestamp (it has fallen out of the window), remove it
If the new bit is 0, do nothing
If the new bit is 1, create a new bucket of size 1:
If there are now only 2 buckets of size 1, stop
Otherwise, merge the two oldest buckets of size 1 into a bucket of size 2
If there are now only 2 buckets of size 2, stop
Otherwise, merge the two oldest buckets of size 2 into a bucket of size 4
Etc.
Time: at most log N, since there are log N different sizes. (A sketch of the maintenance and query logic follows.)
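A compact, illustrative DGIM sketch covering both maintenance and query answering (timestamps are kept as plain integers rather than modulo N, to keep the code short):

```python
class DGIM:
    def __init__(self, N: int):
        self.N, self.t, self.buckets = N, 0, []  # buckets: (right-end ts, size), newest first

    def add(self, bit: int):
        self.t += 1
        # Drop the leftmost bucket if it fell out of the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, (self.t, 1))      # new bucket of size 1
        size = 1
        while True:                               # cascade of merges
            idx = [j for j, b in enumerate(self.buckets) if b[1] == size]
            if len(idx) <= 2:
                break
            j1, j2 = idx[-2], idx[-1]             # the two oldest of this size
            merged = (self.buckets[j1][0], 2 * size)  # keep the newer timestamp
            del self.buckets[j2]
            self.buckets[j1] = merged
            size *= 2

    def count_ones(self, k: int) -> int:
        total, last = 0, 0
        for ts, size in self.buckets:             # newest first
            if ts > self.t - k:                   # bucket touches the last k bits
                total, last = total + size, size
            else:
                break
        return total - last // 2                  # count only half of bucket b

d = DGIM(N=1000)
for b in [1, 0, 1, 1, 1, 0, 1, 1]:
    d.add(b)
print(d.count_ones(8))   # estimates the 6 ones among the last 8 bits as 5
```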
Example: Updating Buckets
(Worked figure omitted; slide by Jure Leskovec, Mining Massive Datasets.)
Example
(Same bucketized stream: one bucket of size 8, two of size 4, one of size 2, and two of size 1.) What happens if the next 3 bits are 1, 1, 1?
Reducing the Error
Instead of allowing 1 or 2 buckets of each size, allow r-1 or r buckets of each size 1, 2, 4, 8, … (for some integer r > 2). Of the smallest and largest sizes present, we allow any number of buckets, from 1 to r. Use a propagation algorithm similar to the one above. Buckets are smaller, so there is a tighter bound on the error: one can prove that the error is at most 1/r.
Counting Frequent Items
Problem
Given a stream, which items are currently popular, i.e., appear more than s times in the window? E.g., popular movie tickets, popular items being sold on Amazon, etc.
Possible solution: keep a stream per item, which at each time point has a "1" if the item appears in the original stream and a "0" otherwise, and use DGIM to estimate the count of 1-s for each item.
Example
Original stream: 1, 2, 1, 1, 3, 2, 4
Stream for 1: 1, 0, 1, 1, 0, 0, 0
Stream for 2: 0, 1, 0, 0, 0, 1, 0
Stream for 3: 0, 0, 0, 0, 1, 0, 0
Stream for 4: 0, 0, 0, 0, 0, 0, 1
Problem with this approach? Too many streams!
Exponentially Decaying Windows
A heuristic for selecting frequent items that gives more weight to more recent items. Instead of computing a count over the last N elements, compute a smooth aggregation over the entire stream. If the stream is a1, a2, … and we are taking the sum over the stream, the answer at time t is defined as
Σ_{i=1}^{t} a_i (1 - c)^(t-i)
where c is a tiny constant, ~10^(-6).
Exponentially Decaying Windows (cont)
With the sum at time t defined as Σ_{i=1}^{t} a_i (1 - c)^(t-i), when a new element a_{t+1} arrives we (1) multiply the current sum by (1 - c) and (2) add a_{t+1}.
Example
Suppose c = 10^(-6). What is the running score of each stream?
Stream for 1: 1, 0, 1, 1, 0, 0, 0
Stream for 4: 0, 0, 0, 0, 0, 0, 1
Back to Our Problem
For each distinct item x, we compute the running score of the stream defined by the characteristic function of the item: the stream that has a 1 at position i if a_i = x, and a 0 otherwise.
Retaining the Running Scores
Each time we see some item x in the stream:
Multiply all running scores by (1 - c)
Add 1 to the running score corresponding to x (creating a new running score with value 1 if there is none)
How much memory do we need for the running scores? (See the sketch below.)
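A literal sketch of this update rule; dropping a score once it falls below 1/2 (a refinement along the lines of the MMDS text, not stated on the slide) bounds the number of live scores, as the next slides explain:

```python
def decayed_scores(stream, c=1e-6, threshold=0.5):
    scores = {}
    for x in stream:
        for key in list(scores):
            scores[key] *= (1 - c)             # decay every running score
            if scores[key] < threshold:
                del scores[key]                # keeps at most ~2/c scores alive
        scores[x] = scores.get(x, 0.0) + 1.0   # credit the arriving item
    return scores

print(decayed_scores([1, 2, 1, 1, 3, 2, 4]))   # item 1 scores ~3, item 2 ~2, ...
```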
Property of Decaying Windows
Remember, for each item x we have a running score. Summing over all running scores, we get
Σ_x score_x(t) = Σ_{i=1}^{t} (1 - c)^(t-i) < 1/c
since each position i contributes (1 - c)^(t-i) to exactly one item's score.
Memory Requirements
Suppose we want to find items with weight greater than ½. Since the sum of all scores is at most 1/c, there can't be more than 2/c items with weight ½ or more! So 2/c is a limit on the number of scores being maintained at any time. For other weight thresholds, we would get a different bound in a similar manner. Think about it: How would you choose c?