Mining Data Streams Some of these slides are based on Stanford Mining Massive Data Sets Course slides at http://www.stanford.edu/class/cs246/

Infinite Data: Data Streams Until now, we assumed that the data is finite: data is crawled and stored, and queries are issued against an index. However, sometimes data is infinite (it is constantly being created) and therefore too big to be stored. Examples of data streams? Later: How can we query such data?

Sensor Data Millions of sensors Each sending data every (fraction of a) second Analyze trends over time LOBO Observatory http://www.mbari.org/lobo/Intro.html

Image Data Satellites send terabytes a day of data Cameras have lower resolution, but there are more of them… http://www.defenceweb.co.za/images/stories/AIR/Air_new/satellite.jpg http://en.wikipedia.org/wiki/File:Three_Surveillance_cameras.jpg

Internet and Web Traffic Streams of HTTP requests, IP addresses, etc., can be used to monitor traffic for suspicious activity. Streams of Google queries can be used to determine query trends, or to understand disease spread. Can the spread of the flu be predicted accurately by Google queries?

Social Updates Twitter Facebook Instagram Flickr …

The Stream Model Tuples in the stream enter at a rapid rate Have a fixed working memory size Have a fixed (but larger) archival storage How do you make critical calculations quickly, i.e., using only working memory?

The Stream Model: More details (diagram) Streams entering over time (e.g., . . . 1, 5, 2, 7, 0, 9, 3 . . . ; . . . a, r, v, t, y, h, b . . . ; . . . 0, 0, 1, 0, 1, 1, 0 . . .), each composed of tuples/elements, feed a processor that answers standing queries and ad-hoc queries and produces output, using limited working storage plus a larger archival storage.

Stream Queries Standing queries versus ad-hoc queries. Usually, a sliding window is stored for ad-hoc queries. Examples: Output an alert when an element > 25 appears. Compute a running average of the last 20 elements. Compute a running average of the elements from the past 48 hours. Maximum value ever seen. Compute the number of different elements seen. Are these easy or hard to answer?

Constraints and Issues Data appears rapidly Elements must be processed in real time (= in main memory), or are lost We may be willing to get an approximate answer, instead of an exact answer, to save time Hashing will be useful as it adds randomness to algorithms!

Sampling Data in a Stream

Goal Create a sample of the stream Answer queries over the subset and have it be representative of the entire stream Two Different Problems: Fixed proportion of elements (e.g., 1 in 10) Random sample of fixed size (e.g., at any time k, we should have a random sample of s elements)

Motivating Example Search engine stream of tuples (user, query, time). We have room to store 1/10 of the stream. Query: What fraction of the typical user’s queries were repeated over the past month? Ideas?

Solution Attempt For each tuple, choose a random number from 0 to 9 Store tuples for which random value was 0 On average, per user, we store 1/10 of queries Can this be used to answer our question on repeated queries?

Wrong! Suppose in the past month, a user issued x queries once and d queries twice (in total x+2d queries). Correct answer: d/(x+d). We have a 10% sample of occurrences, so the sample contains x/10 of the singleton queries. Of the d duplicate queries, d/100 appear twice in the sample (both copies kept, probability 1/10 * 1/10), and 18d/100 appear exactly once: 18d/100 = ((1/10 * 9/10) + (9/10 * 1/10)) * d. We will give the following wrong answer: (d/100) / (x/10 + d/100 + 18d/100) = d / (10x + 19d).

Solution Pick 1/10 of users, and take all their searches in sample Use a hash function that hashes the user name (or user id) into 10 buckets Store data if user is hashed into bucket 0 How would you store a fixed fraction a/b of the stream?
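A minimal Python sketch of this hash-based user sampling (the example stream tuples and the md5-based bucketing are illustrative assumptions, not part of the original slides):

    import hashlib

    def in_sample(user_id, a, b):
        # Keep a user iff their id hashes into one of the first a of b buckets.
        # A stable hash is used deliberately: Python's built-in hash() is
        # salted per process and would not give a consistent sample.
        digest = hashlib.md5(user_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % b < a

    stream = [("alice", "flu symptoms", 1), ("bob", "weather", 2),
              ("alice", "flu remedy", 3)]
    # Keep ~1/10 of users; every kept user's queries are complete:
    sample = [t for t in stream if in_sample(t[0], a=1, b=10)]

Because the decision depends only on the user, a sampled user's queries are all in or all out, so per-user statistics such as the repeated-query fraction come out unbiased.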

Variant of the Problem Store a random sample of fixed size (e.g., at any time k, we should have a random sample of s elements). Each of the k elements seen so far should have the same probability, s/k, to be in the sample. Ideas?

Reservoir Sampling Store the first s elements of the stream in S. Suppose we have seen n-1 elements and now the nth arrives (n > s). With probability s/n, keep the nth element; otherwise discard it. If we keep the nth element, randomly choose one of the elements already in S to discard. Claim: After n elements, the sample contains each element seen so far with probability s/n.
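A short Python sketch of reservoir sampling as just described (the stream and sample size are illustrative):

    import random

    def reservoir_sample(stream, s):
        sample = []
        for n, element in enumerate(stream, start=1):
            if n <= s:
                sample.append(element)                 # store the first s elements
            elif random.random() < s / n:              # keep the n-th w.p. s/n ...
                sample[random.randrange(s)] = element  # ... evicting a random member
        return sample

    print(reservoir_sample(range(1_000_000), s=10))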

Proof By Induction Inductive Claim: If, after n elements, the sample S contains each element seen so far with probability s/n, then, after n+1 elements, the sample S contains each element with probability s/(n+1) Base Case: n=s, each element of the first n has probability s/n = 1 to be in sample

Proof By Induction (cont) Inductive Step: For an element already in the sample, the probability that it remains in the sample when element n+1 arrives is (1 - s/(n+1)) + (s/(n+1)) * ((s-1)/s) = n/(n+1). The first term is the case where we discard element n+1; the second is the case where we keep element n+1 but the element in question is not the one evicted. At time n, a tuple is in the sample with probability s/n, so at time n+1 its probability to be in the sample is (s/n) * (n/(n+1)) = s/(n+1).

Filtering Stream Data

The Problem We would like to filter the stream according to some criterion. If it is a simple property of the tuple (e.g., value < 10): Easy! If it requires a lookup in a large set that does not fit in memory: Hard! Example: Stream of URLs (for crawling); filter out URLs already “seen”. Example: Stream of emails; we have addresses believed to be non-spam; filter out “spam” (perhaps for further processing).

Filtering Example Suppose we have 1 billion “allowed” email addresses Each email address takes ~ 20 bytes We have 1GB available main memory We cannot store a hash table of allowed emails in main memory! View main memory as a bit array We have 8 billion bits available

Simple Filtering Solution Let h be a hash function from email addresses to the 8 billion bit positions. Hash each allowed email address, and set the corresponding bit to 1. There are 1 billion email addresses, so about 1/8 of the bits will be 1 (perhaps fewer, due to hash collisions!). Given a new email, compute its hash. If the bit there is 1, let the email through. If 0, consider it spam. There will be false positives! (about 1/8th of spam gets through)

Bloom Filter A Bloom filter consists of: An array of n bits, all initialized to 0. A collection of hash functions h1, …, hk; each function maps a key to one of the n bit positions. A set S of m key values. Goal: Allow all stream elements whose keys are in S; reject all others.

Using a Bloom Filter To set up the filter: Take each key K in S and set the bit hi(K) to 1, for each of the hash functions h1, …, hk. To use the filter: Given a new key K in the stream, compute hi(K) for all hash functions h1, …, hk. If all of these bits are 1, allow the element; if at least one of these bits is 0, reject the element. Can we have false negatives? Can we have false positives?
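A minimal Bloom filter sketch in Python. Deriving the k hash functions by salting a single SHA-256 digest is an implementation convenience assumed here, not something the slides prescribe:

    import hashlib

    class BloomFilter:
        def __init__(self, n_bits, k):
            self.n, self.k = n_bits, k
            self.bits = bytearray(n_bits // 8 + 1)   # n bits, all 0

        def _positions(self, key):
            # k bit positions for this key, one per (salted) hash function
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.n

        def add(self, key):                          # setup: key is in S
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):                # use: test a stream key
            # All k bits set -> allow (maybe a false positive); any 0 -> reject
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(key))

    bf = BloomFilter(n_bits=10_000, k=4)
    bf.add("alice@example.com")
    print(bf.might_contain("alice@example.com"))   # True: no false negatives
    print(bf.might_contain("spam@example.com"))    # almost certainly False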

Probability of a False Positive Model: Throwing darts at targets. Suppose we have x targets and y darts; any dart is equally likely to hit any target. How many targets will be hit at least once? Probability that a given dart will not hit a given target: (x-1)/x. Probability that none of the y darts hits a given target: ((x-1)/x)^y = (1 - 1/x)^y.

Probability of a False Positive (cont) As x goes to infinity, it is well known that (1 - 1/x)^x tends to e^(-1). So, if x is large, we have ((x-1)/x)^y = ((1 - 1/x)^x)^(y/x) ≈ e^(-y/x).

Back to the Running Example There are 1 billion email addresses (= 1 billion darts). There are 8 billion bits (= 8 billion targets). Meanwhile, we use 1 hash function. Probability of a bit not being hit: e^(-1/8) ≈ 0.8825. Probability of a bit being hit is approx 0.1175.

Now, multiple hash functions Suppose we use k hash functions. The number of targets is n (bits); the number of darts is km. The probability that a bit remains 0 is e^(-km/n). Choose k so as to leave enough 0 bits. The probability of a false positive is now (1 - e^(-km/n))^k.
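Plugging the running example into these formulas (n = 8 billion bits, m = 1 billion keys), a few lines of Python show how the false-positive rate varies with k; the best k is near (n/m) * ln 2 ≈ 5.5:

    import math

    n, m = 8_000_000_000, 1_000_000_000      # bits, keys
    for k in range(1, 9):
        p_zero = math.exp(-k * m / n)        # P(a given bit stays 0)
        fp = (1 - p_zero) ** k               # P(false positive)
        print(f"k={k}: false positive rate ~ {fp:.4f}")
    # k=1 reproduces the ~0.1175 above; the minimum (~0.0216) is at k=5 or 6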

Counting Distinct Stream Elements

Problem The stream elements come from some universal set. Estimate the number of different elements we have seen, from the beginning or from a specific point in time. Why is this hard?

Applications How many different words in Web pages at a site? Why is this interesting? How many different Web pages does a customer request a week? How many distinct products have we sold last month?

Obvious Solution Keep all elements seen so far in a hash table. When an element appears, check whether it is already in the hash table. But what if there are too many elements to store in the hash table? Or too many streams to maintain a hash table for each? Solution: Use a small amount of memory and estimate the correct value; limit the probability that the error is large.

Flajolet-Martin Algorithm: Intuition We will hash elements of the stream to a bit-string. We must choose the bit-string length so there are at least as many possible bit-strings as members of the universal set. Example: IPv4 addresses form a universal set of size 2^32 (4 bytes = 32 bits), so we need strings of length at least 32 bits. The more elements we see, the more likely we are to see “unusual” hash values.

Flajolet-Martin Algorithm: Intuition Pick a hash function h that maps the universe of N elements to at least log2(N) bits. For each stream element a, let r(a) be the number of trailing 0-s in h(a): r(a) = position of the first 1, counting from the right. Example: h(a) = 12; 12 is 1100 in binary, so r(a) = 2. Use R to denote the maximum tail length seen so far. Estimate the number of distinct elements as 2^R.
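A compact Python sketch of Flajolet-Martin (a 64-bit md5 prefix stands in for the hash function h; any uniform hash would do):

    import hashlib

    def tail_length(x):
        # Number of trailing 0-s in x; x & -x isolates the lowest set bit.
        return (x & -x).bit_length() - 1 if x else 0

    def fm_estimate(stream):
        R = 0                                 # longest tail seen so far
        for a in stream:
            h = int.from_bytes(
                hashlib.md5(str(a).encode()).digest()[:8], "big")
            R = max(R, tail_length(h))
        return 2 ** R                         # estimate of the distinct count

    print(fm_estimate([1, 2, 3, 2, 1, 4, 5, 3] * 100))  # crude estimate of 5

Note that repetitions do not change the estimate: a repeated element always hashes to the same value, so only distinct elements can grow R.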

Why it Works: Very rough and heuristic h(a) hashes a with equal probability to any of N values. Then h(a) is a sequence of log2(N) bits, where a 2^(-r) fraction of all a-s have a tail of r zeros: about 50% of a-s hash to ***0, about 25% of a-s hash to **00. So, we need to hash about 2^r distinct items before we see one with a zero-suffix of length r. Thus, if the longest tail is R = 2, we have probably seen about 4 distinct items so far.

Why it Works: More formally Let m be the number of distinct items. We will show that the probability of finding a tail of r zeros: Goes to 1 if m is much greater than 2^r. Goes to 0 if m is much smaller than 2^r. So, 2^R will be around m!

Why it Works: More formally What is the probability that a given h(a) ends in at least r 0-s? h(a) hashes elements uniformly at random, and the probability that a random number ends in at least r 0-s is 2^(-r). The probability of not seeing a tail of length r among m distinct elements: (1 - 2^(-r))^m.

Why it Works: More formally The probability of not finding a tail of length r is (1 - 2^(-r))^m = ((1 - 2^(-r))^(2^r))^(m * 2^(-r)) ≈ e^(-m * 2^(-r)). If m << 2^r, this probability tends to 1, so the probability of finding a tail of length r tends to 0. If m >> 2^r, this probability tends to 0, so the probability of finding a tail of length r tends to 1. So, 2^R will be around m!

Why it Doesn’t Work E[2^R] is infinite! The probability halves when R → R+1, but the value doubles. The workaround uses many hash functions hi and many samples of Ri; details omitted (a sketch of the standard combination follows).
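For completeness, a sketch of the standard workaround, assuming the usual "median of group averages" combination (the group sizes are illustrative; this detail is omitted from the slides): averaging within groups smooths the power-of-2 granularity of single estimates, and taking the median across groups is robust to the rare huge 2^R outlier that makes the plain expectation infinite.

    import hashlib, statistics

    def fm_combined(stream, groups=5, per_group=10):
        seeds = [(g, j) for g in range(groups) for j in range(per_group)]
        R = dict.fromkeys(seeds, 0)           # one max-tail per hash function
        for a in stream:
            for seed in seeds:
                h = int.from_bytes(
                    hashlib.md5(f"{seed}:{a}".encode()).digest()[:8], "big")
                t = (h & -h).bit_length() - 1 if h else 0
                R[seed] = max(R[seed], t)
        averages = [sum(2 ** R[(g, j)] for j in range(per_group)) / per_group
                    for g in range(groups)]
        return statistics.median(averages)    # median of the group averages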

Counting Ones in a Window

Problem We have a window of length N on a binary stream. N is too big to store. At all times we want to be able to answer: “How many 1-s are there in the last k bits?” (k ≤ N). Example: For each email seen, we emit a 1 if it is spam; we want to always know how many of the last million emails were spam. Example: For each tweet seen, we emit a 1 if it is anti-Semitic; we want to always know how many of the last billion tweets were anti-Semitic.

Cost of Exact Counts For a precise count, we must store all N bits of the window. This is easy to show even if we can only ask about the number of 1-s in the entire window. Suppose we use j < N bits to store information. Then there are two different bit sequences that are represented in the same manner. Suppose the sequences agree on their last k-1 bits but differ on the k-th from the right. Follow the window with N-k 1-s: the two different strings still have the same representation (the same bits were fed to the same state), but the windows must contain a different number of 1-s!

The Datar-Gionis-Indyk-Motwani Algorithm (DGIM Algorithm) Use O(log^2 N) bits and get an estimate that is off by no more than 50%. Later, improve this to get a better estimate. Assume each bit has a timestamp (i.e., the position at which it arrives). Represent timestamps modulo N, which requires log(N) bits each. Also store the total number of bits ever seen, modulo N.

DGIM Algorithm Divide the window into buckets, each consisting of: The timestamp of its right (most recent) end. The size of the bucket = the number of 1-s in the bucket; this number must be a power of 2. Bucket representation: log(N) bits for the timestamp + log(log(N)) bits for the size. We know the size is some 2^j, so we only store j; since j is at most log(N), it needs log(log(N)) bits.

Rules for Representing a Stream by Buckets Right end of a bucket always has a 1 Every position with 1 is in some bucket No position is in more than one bucket There are one or two buckets of any given size, up to some maximum size All sizes must be a power of 2 Buckets cannot decrease in size as we move to the left

Example: Bucketized Stream … 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0 One of size 8, two of size 4, one of size 2, two of size 1. Observe that all of the DGIM rules hold.

Space Requirements Each bucket requires O(log N) bits, as we saw earlier. For a window of length N, there can be at most N 1-s, so if the largest bucket has size 2^j, then j cannot be more than log(N). There are at most 2 buckets of each of the O(log(N)) possible sizes. Total space for buckets: O(log^2(N)).

Query Answering Given k ≤ N, we want to know how many of the last k bits were 1-s. Find the bucket b with the earliest timestamp that includes at least some of the last k bits. Estimate the number of 1-s as the sum of the sizes of the buckets to the right of b, plus half the size of b.
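In code, with buckets kept oldest-first as (timestamp, size) pairs (this list representation is an assumption of the sketch; the slides only specify what each bucket records):

    def dgim_count(buckets, current_time, k):
        # buckets: list of (right-end timestamp, size), oldest first
        total, leftmost_size = 0, 0
        for ts, size in buckets:
            if ts > current_time - k:        # bucket overlaps the last k bits
                if leftmost_size == 0:
                    leftmost_size = size     # bucket b: earliest overlapping one
                total += size
        return total - leftmost_size // 2    # count only half of bucket b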

Example … 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0 One of size 8, two of size 4, one of size 2, two of size 1. What would be your estimate if k = 10? What if k = 15?

How Good an Estimate is This? Suppose that the leftmost bucket b included has size 2^j, and let c be the correct answer. Suppose the estimate is less than c: In the worst case, all the 1-s in the leftmost bucket are in the query range, so the estimate misses half of bucket b, i.e., 2^(j-1) ones. But then c is at least 2^j; in fact, c is at least 2^(j+1) - 1, since there is at least one bucket of each of the sizes 2^(j-1), 2^(j-2), …, 1. So our estimate is at least 50% of c.

How Good an Estimate is This? Suppose that the leftmost bucket b included has size 2^j, and let c be the correct answer. Suppose the estimate is more than c: In the worst case, only the rightmost 1 in the leftmost bucket is in the query range, and there is only one bucket of each size less than 2^j. Then c = 1 + 2^(j-1) + 2^(j-2) + … + 1 = 2^j, while our estimate was 2^(j-1) + (2^(j-1) + 2^(j-2) + … + 1) = 2^j + 2^(j-1) - 1. So our estimate is at most 50% more than c.

Maintaining DGIM Conditions Suppose we have a window of length N represented by buckets satisfying the DGIM conditions. Then a new bit comes in: Check the leftmost (oldest) bucket; if its timestamp is N or more before the current timestamp, remove it (it has fallen out of the window). If the new bit is 0, do nothing. If the new bit is 1, create a new bucket of size 1. If there are now at most 2 buckets of size 1, stop. Otherwise, merge the two earliest buckets of size 1 into a bucket of size 2. If there are now at most 2 buckets of size 2, stop. Otherwise, merge the two earliest buckets of size 2 into a bucket of size 4. Etc. Time: At most log(N) per bit, since there are log(N) different sizes.
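A sketch of this update in Python, using the same oldest-first (timestamp, size) representation as above (plain integer timestamps are used for clarity; a real implementation would store them modulo N, as noted earlier):

    from collections import deque

    def dgim_update(buckets, current_time, bit, N):
        # 1. Expire the oldest bucket if it fell out of the window.
        if buckets and buckets[0][0] <= current_time - N:
            buckets.popleft()
        if bit == 0:
            return
        # 2. A new 1: create a size-1 bucket, then merge in a cascade.
        buckets.append((current_time, 1))
        size = 1
        while sum(1 for _, s in buckets if s == size) > 2:
            # Merge the two OLDEST buckets of this size into one of double
            # size, keeping the more recent of their right-end timestamps.
            i, j = [idx for idx, (_, s) in enumerate(buckets) if s == size][:2]
            merged = (buckets[j][0], 2 * size)
            del buckets[j], buckets[i]
            buckets.insert(i, merged)
            size *= 2

Calling dgim_update once per arriving bit maintains the DGIM invariants; dgim_count from the query slide then answers "how many 1-s in the last k bits?" at any time.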

Example: Updating Buckets (an animation in the original slides: the same stream is shown over several steps, with the bucket boundaries re-drawn as each new bit arrives) Stream: 1001010110001011010101010101011010101010101110101010111010100010110010 Slide by Jure Leskovec: Mining Massive Datasets

Example … 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0 One of size 8, two of size 4, one of size 2, two of size 1. What happens if the next 3 bits are 1, 1, 1?

Reducing the Error Instead of allowing 1 or 2 buckets of each size, allow r-1 or r buckets of each size, for sizes 1, 2, 4, 8, … (where r is an integer > 2). Of the smallest and largest sizes present, we allow any number of buckets, from 1 to r. Use a merging algorithm similar to the one above. Buckets are smaller, so there is a tighter bound on the error: one can prove that the error is at most 1/r.

Counting Frequent Items

Problem Given a stream, which items are currently popular, i.e., appear more than s times in the window? E.g., popular movie tickets, popular items being sold on Amazon, etc. Possible solution: One stream per item; at each time point, emit “1” if the item appears in the original stream and “0” otherwise. Then use DGIM to estimate the count of 1-s for each item.

Example Original Stream: 1, 2, 1, 1, 3, 2, 4 Stream for 1: 1, 0, 1, 1, 0, 0, 0 Stream for 2: 0, 1, 0, 0, 0, 1, 0 Stream for 3: 0, 0, 0, 0, 1, 0, 0 Stream for 4: 0, 0, 0, 0, 0, 0, 1 Problem with this approach? Too many streams!

Exponentially Decaying Windows A heuristic for selecting frequent items that gives more weight to more recent popular items. Instead of computing a count over the last N elements, compute a smooth aggregation over the entire stream. If the stream is a1, a2, … and we are taking the sum over the stream, the answer at time t is defined as S_t = sum over i = 1..t of a_i * (1-c)^(t-i), where c is a tiny constant, ~ 10^(-6).

Exponentially Decaying Windows (cont) If the stream is a1, a2, … and we are taking the sum over the stream, the answer at time t is S_t = sum over i = 1..t of a_i * (1-c)^(t-i), where c is a tiny constant, ~ 10^(-6). When a new element a_(t+1) arrives: (1) multiply the current sum by (1-c), and (2) add a_(t+1).
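The update rule is one line of code per element; a toy sketch (with an exaggerated c so the decay is visible):

    def decayed_sum(stream, c=1e-6):
        # S_t = sum of a_i * (1-c)^(t-i): multiply by (1-c), then add a_(t+1)
        s = 0.0
        for a in stream:
            s = s * (1 - c) + a
        return s

    print(decayed_sum([1, 0, 1, 1, 0, 0, 0], c=0.5))  # ~0.203 for this toy c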

Example Stream for 1: 1, 0, 1, 1, 0, 0, 0 Stream for 4: 0, 0, 0, 0, 0, 0, 1 Suppose c = 10^(-6). What is the running score for each stream?

Back to Our Problem For each different item x, we compute the running score of the stream defined by the characteristic function of the item: the stream that has a 1 at position i if a_i = x, and a 0 otherwise.

Retaining the Running Scores Each time we see some item x in the stream: Multiply all running scores by (1-c). Add 1 to the running score corresponding to x (create a new running score with value 1 if there is none). How much memory do we need for the running scores?
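A sketch of the bookkeeping in Python (the threshold of 1/2 anticipates the memory bound on the next slides; decaying every live score on each arrival is the literal rule from this slide and costs time proportional to the number of live scores):

    def update_scores(scores, item, c=1e-6, threshold=0.5):
        # scores: dict mapping item -> running (decayed) score
        for x in list(scores):
            scores[x] *= (1 - c)             # decay every running score
            if scores[x] < threshold:        # drop scores that fell below 1/2
                del scores[x]
        scores[item] = scores.get(item, 0.0) + 1.0   # +1 for the item just seen

    scores = {}
    for item in [1, 2, 1, 1, 3, 2, 4]:
        update_scores(scores, item, c=0.1)
    print(scores)    # item 1 ends with the highest decayed score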

Property of Decaying Windows Remember, for each item x we have a running score. Summing over all running scores: each stream position i contributes (1-c)^(t-i) to exactly one item's score, so the total is sum over i = 1..t of (1-c)^(t-i) = (1 - (1-c)^t)/c < 1/c.

Memory Requirements Suppose we want to find items with weight greater than 1/2. Since the sum of all scores is less than 1/c, there can't be more than 2/c items with weight 1/2 or more! So, 2/c is a limit on the number of scores being maintained at any time. For other weight thresholds, we would get a different bound in a similar manner. Think about it: How would you choose c?