Statistics Estimation over Data Streams. Slides modified from Minos Garofalakis (Yahoo! Research) and S. Muthukrishnan (Rutgers University)

2 Outline
- Introduction
- Frequency moment estimation
- Element frequency estimation

3 Data Stream Processing Algorithms
- Generally, algorithms compute approximate answers
  - Provably difficult to compute answers accurately with limited memory
- Approximate answers with deterministic bounds
  - Algorithms compute only an approximate answer, but with guaranteed bounds on the error
- Approximate answers with probabilistic bounds
  - Algorithms compute an approximate answer with high probability
  - With probability at least 1 − δ, the computed answer is within a factor of (1 ± ε) of the actual answer

4 Sampling: Basics
- Idea: a small random sample S of the data often represents all the data well
  - For a fast approximate answer, apply a "modified" query to S
  - Example: select agg from R (n = 12)
  - If agg is avg, return the average of the elements in S
  - How to estimate the number of odd elements? Return the fraction of odd elements in S, scaled by n (see the sketch below)
- Data stream: [values elided in transcript]; Sample S: [elided]; answer: 11.5
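To make this concrete, here is a minimal Python sketch of sample-based approximate answering: reservoir sampling maintains a uniform sample in one pass, and the "modified" query runs on the sample. The stream values below are placeholders (the slide's actual numbers were lost in transcription), so the printed answers will not match the 11.5 above.

```python
import random

def uniform_sample(stream, m):
    """Reservoir sampling: a uniform random sample of m elements in one pass."""
    sample = []
    for t, x in enumerate(stream):
        if t < m:
            sample.append(x)
        else:
            j = random.randrange(t + 1)
            if j < m:
                sample[j] = x
    return sample

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # hypothetical data (n = 12)
S = uniform_sample(stream, 4)

# "Modified" queries run on S instead of the full stream:
avg_estimate = sum(S) / len(S)
odd_estimate = (sum(1 for x in S if x % 2 == 1) / len(S)) * len(stream)
print(avg_estimate, odd_estimate)
```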

5 Probabilistic Guarantees
- Example: actual answer is within 11.5 ± 1 with probability ≥ 0.9
- Randomized algorithms: the answer returned is a specially-built random variable
- Use tail inequalities to give probabilistic bounds on the returned answer
  - Markov inequality
  - Chebyshev's inequality
  - Chernoff/Hoeffding bound

6 Basic Tools: Tail Inequalities
- General bounds on the tail probability of a random variable (that is, the probability that it deviates far from its expectation)
- Basic inequalities: let X be a random variable with expectation μ and variance Var[X]. Then for any ε > 0:
  - Markov (for X ≥ 0): Pr[X ≥ (1 + ε)·μ] ≤ 1 / (1 + ε)
  - Chebyshev: Pr[|X − μ| ≥ ε·μ] ≤ Var[X] / (ε²·μ²)
(figure: probability distribution with the shaded tail probability)
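As a quick sanity check, a small simulation (assuming nothing beyond the Python standard library) compares Chebyshev's bound, here in its standard-deviation form Pr[|X − μ| ≥ k·σ] ≤ 1/k², with the empirical tail probability for a sum of fair coin flips; the bound holds but is loose:

```python
import random

# Empirical check of Chebyshev's bound for X = sum of 100 fair coin flips.
n, trials, k = 100, 100_000, 2.0
mu, var = n * 0.5, n * 0.25
dev = k * var ** 0.5

hits = sum(abs(sum(random.random() < 0.5 for _ in range(n)) - mu) >= dev
           for _ in range(trials))
print("empirical tail:", hits / trials)   # roughly 0.05 in practice
print("Chebyshev bound:", 1 / k ** 2)     # 0.25 -- valid but loose
```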

7 Tail Inequalities for Sums
- Possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials
- Chernoff bound: let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (Pr[Xi = 0] = 1 − p). Let X = Σi Xi and μ = m·p be the expectation of X. Then, for any ε > 0:
  Pr[|X − μ| ≥ ε·μ] ≤ 2·exp(−μ·ε²/3)
- Application to count queries:
  - m is the size of sample S (4 in the example)
  - p is the fraction of odd elements in the stream (2/3 in the example)
- No need to compute Var[X], but the independence assumption is needed! (A sketch applying this bound follows below.)
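The Chernoff bound turns directly into a sample-size calculation. A hedged sketch, using the two-sided form Pr[|X − μ| ≥ ε·μ] ≤ 2·exp(−μ·ε²/3) stated above (constants in this exponent vary slightly across textbook statements):

```python
import math

# Sample size m so that the sampled fraction of odd elements is within
# (1 +/- eps) of the true fraction p with probability >= 1 - delta:
# require 2*exp(-m*p*eps^2 / 3) <= delta.
def chernoff_sample_size(p, eps, delta):
    return math.ceil(3 * math.log(2 / delta) / (p * eps ** 2))

print(chernoff_sample_size(p=2/3, eps=0.1, delta=0.05))  # about 1660 samples
```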

8 The Streaming Model
- Underlying signal: one-dimensional array A[1...N] with all values A[i] initially zero
  - Multi-dimensional arrays as well (e.g., row-major)
- Signal is implicitly represented via a stream of updates
  - The j-th update is <k, c[j]>, implying A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0)
- Goal: compute functions on A[] subject to
  - Small space
  - Fast processing of updates
  - Fast function computation
  - ...

9 Streaming Model: Special Cases
- Time-series model
  - The j-th update can only update A[j] (i.e., A[j] := c[j])
- Cash-register model
  - c[j] is always >= 0 (i.e., increment-only)
  - Typically c[j] = 1, so we see a multi-set of items in one pass
- Turnstile model
  - Most general streaming model
  - c[j] can be >= 0 or < 0 (i.e., increment or decrement)
- Problem difficulty varies depending on the model
  - E.g., MIN/MAX is easy in the time-series model but hard in the turnstile model!

10 Frequency Moment Computation
- Problem: data arrives online (a1, a2, a3, ..., am), where each ai ∈ {1, 2, ..., n}
- Let f(i) = |{ j | aj = i }| (the frequency of value i, also written A[i])
- The k-th frequency moment is Fk = Σ(i=1..n) f(i)^k
- Example: data stream 3, 1, 2, 4, 2, 3, 5, ...
  - F0 = 5 (number of distinct values)
  - F1 = 7 (stream length)
  - F2 = 1·1 + 2·2 + 2·2 + 1·1 + 1·1 = 11 (the "surprise index")
  - What is F∞? (F∞ = max_i f(i), the largest frequency; here 2 — see the check below)
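The moments of the example stream can be verified directly by brute force; this also answers the quiz question about F∞:

```python
from collections import Counter

stream = [3, 1, 2, 4, 2, 3, 5]          # the slide's example stream
f = Counter(stream)                      # f[i] = frequency of item i

F0 = sum(1 for c in f.values() if c > 0)   # number of distinct items: 5
F1 = sum(f.values())                       # stream length: 7
F2 = sum(c ** 2 for c in f.values())       # "surprise index": 11
F_inf = max(f.values())                    # most frequent item's count: 2

print(F0, F1, F2, F_inf)
```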

11 Frequency Moment Computation
- Easy for F1 (just count the updates)
- How about the others?
  - Focus on F2 and F0 first
  - Then estimation of general Fk

12 Linear-Projection (AMS) Sketch Synopses
- Goal: build a small-space summary for the distribution vector f(i) (i = 1, ..., N) seen as a stream of i-values
- Basic construct: randomized linear projection of f() = inner/dot product of the f-vector with a vector ξ of random values from an appropriate distribution: <f, ξ> = Σi f(i)·ξi
  - Simple to compute over the stream: add ξi whenever the i-th value is seen
  - Generate the ξi's in small (O(log N)) space using pseudo-random generators
  - Tunable probabilistic guarantees on approximation error
  - Delete-proof: just subtract ξi to delete an occurrence of the i-th value
- Example: data stream 3, 1, 2, 4, 2, 3, 5, ... gives <f, ξ> = ξ1 + 2·ξ2 + 2·ξ3 + ξ4 + ξ5

13 AMS (Sketch) cont.
- Key intuition: use randomized linear projections of f() to define a random variable X such that
  - X is easily computed over the stream (in small space)
  - E[X] = F2
  - Var[X] is small
- Basic idea:
  - Define a family of 4-wise independent {-1, +1} random variables {ξi}
  - Pr[ξi = +1] = Pr[ξi = -1] = 1/2, so E[ξi] = 0 and E[ξi²] = 1
  - 4-wise independence: the expected value of a product of any 4 distinct ξi's is 0, e.g., E[ξi·ξj·ξk·ξl] = 0
  - The ξi's can be generated by a pseudo-random generator using only O(log N) space (for seeding)!
- Probabilistic error guarantees (e.g., actual answer is 10 ± 1 with probability 0.9)
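A minimal sketch of a single AMS estimator, using a random cubic polynomial over a prime field — the standard construction of a 4-wise independent family; taking the parity of the polynomial's value introduces only negligible bias in the sign:

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1

class FourWiseSign:
    """4-wise independent {-1, +1} variables via a random cubic polynomial mod P."""
    def __init__(self):
        self.coef = [random.randrange(P) for _ in range(4)]

    def xi(self, i):
        a, b, c, d = self.coef
        return 1 if ((((a * i + b) * i + c) * i + d) % P) % 2 == 0 else -1

def ams_f2_single(stream):
    """One copy of X = Z^2, where Z = sum_i xi(i) * f(i); E[X] = F2."""
    sign = FourWiseSign()
    Z = 0
    for item in stream:
        Z += sign.xi(item)   # delete-proof: subtract to remove an occurrence
    return Z * Z

print(ams_f2_single([3, 1, 2, 4, 2, 3, 5]))  # a noisy estimate of F2 = 11
```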

14 AMS (Sketch) cont. — Example
- Data stream R: [values elided in transcript]; recall Z = Σi ξi·f(i)
- Suppose the {ξi} are assigned as follows:
  1) ξ1 = ξ2 = +1 and ξ3 = ξ4 = −1 — then Z = ?
  2) ξ4 = +1 and ξ1 = ξ3 = −1 — then Z = ?

15 AMS (Sketch) cont.
- Expected value of X = Z² is F2: E[X] = E[Z²] = Σi f(i)² = F2, since ξi² = 1 and the cross terms E[ξi·ξj] = 0 for i ≠ j vanish
- Using 4-wise independence, it is possible to show that Var[X] ≤ 2·F2²

16 Boosting Accuracy
- Boost accuracy to ε by averaging over several independent copies of X (averaging reduces variance)
- Let Y = average of s1 = 16/ε² independent copies of X; then E[Y] = F2 and Var[Y] = Var[X]/s1
- By Chebyshev's inequality: Pr[|Y − F2| ≥ ε·F2] ≤ Var[Y] / (ε²·F2²) ≤ 2·F2² / (s1·ε²·F2²) ≤ 1/8

17 Boosting Confidence
- Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
- Each Y is a Bernoulli trial: "FAILURE" = |Y − F2| ≥ ε·F2, which happens with probability ≤ 1/8
- The median fails only if at least half the copies fail:
  Pr[|median(Y) − F2| ≥ ε·F2] = Pr[# failures in 2·log(1/δ) trials ≥ log(1/δ)] ≤ δ (by the Chernoff bound)

18 Summary of AMS Sketching for F2
- Step 1: compute the random variables Z = Σi ξi·f(i)
- Step 2: define X = Z²
- Steps 3 & 4: average independent copies of X; return the median of the averages
- Main theorem: sketching approximates F2 to within a relative error of ε with probability ≥ 1 − δ using O((log(1/δ)/ε²)·(log N + log m)) space
  - Remember: O(log N) space for "seeding" the construction of each X
(A runnable sketch of the whole pipeline follows below.)
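Putting steps 1–4 together, a self-contained sketch of the full median-of-means pipeline; the constants 16/ε² and 2·log(1/δ) follow the analysis on the previous slides:

```python
import math
import random
import statistics

P = 2_147_483_647  # Mersenne prime 2^31 - 1

def make_xi():
    """One 4-wise independent {-1, +1} family (random cubic polynomial mod P)."""
    a, b, c, d = (random.randrange(P) for _ in range(4))
    return lambda i: 1 if ((((a * i + b) * i + c) * i + d) % P) % 2 == 0 else -1

def ams_f2(stream, eps=0.5, delta=0.1):
    """(1 +/- eps)-approximation of F2 with probability >= 1 - delta."""
    s1 = math.ceil(16 / eps ** 2)              # copies averaged (accuracy)
    s2 = 2 * math.ceil(math.log(1 / delta))    # averages medianed (confidence)
    averages = []
    for _ in range(s2):
        total = 0
        for _ in range(s1):
            xi = make_xi()
            Z = sum(xi(item) for item in stream)  # streaming: Z += xi(item)
            total += Z * Z
        averages.append(total / s1)
    return statistics.median(averages)

print(ams_f2([3, 1, 2, 4, 2, 3, 5]), "(exact F2 = 11)")
```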

19 Binary-Join COUNT Query
- Problem: compute the answer for the query COUNT(R ⋈A S) = Σi fR(i)·fS(i), where fR(i) and fS(i) are the frequencies of attribute value i in R and S
- Example: data streams R.A and S.A [values elided in transcript]; COUNT = 10
- Exact solution: too expensive, requires O(N) space!
  - N = sizeof(domain(A))

20 Basic AMS Sketching Technique [AMS96]
- Key intuition: use randomized linear projections of f() to define a random variable X such that
  - X is easily computed over the stream (in small space)
  - E[X] = COUNT(R ⋈A S)
  - Var[X] is small
- Basic idea: define a family of 4-wise independent {-1, +1} random variables {ξi}, as before

21 AMS Sketch Construction
- Compute the random variables XR = Σi fR(i)·ξi and XS = Σi fS(i)·ξi
  - Simply add ξi to XR (resp. XS) whenever the i-th value is observed in the R.A (resp. S.A) stream
- Define X = XR·XS to be the estimate of the COUNT query
- Example: data streams R.A and S.A [values elided in transcript]

22 Binary-Join AMS Sketching Analysis
- Expected value of X is COUNT(R ⋈A S): E[X] = E[XR·XS] = Σi fR(i)·fS(i), since the cross terms E[ξi·ξj] = 0 for i ≠ j vanish
- Using 4-wise independence, it is possible to show that Var[X] ≤ 2·SJ(R)·SJ(S), where SJ(R) = Σi fR(i)² is the self-join size of R (its second/L2 moment)

23 Boosting Accuracy
- Boost accuracy to ε by averaging over several independent copies of X (averaging reduces variance)
- Let Y = average of s1 = 8·Var[X] / (ε²·COUNT²) independent copies; then E[Y] = COUNT and Var[Y] = Var[X]/s1
- By Chebyshev's inequality: Pr[|Y − COUNT| ≥ ε·COUNT] ≤ Var[Y] / (ε²·COUNT²) ≤ 1/8

24 Boosting Confidence
- Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
- Each Y is a Bernoulli trial: "FAILURE" = |Y − COUNT| ≥ ε·COUNT, which happens with probability ≤ 1/8
- Pr[|median(Y) − COUNT| ≥ ε·COUNT] = Pr[# failures in 2·log(1/δ) trials ≥ log(1/δ)] ≤ δ (by the Chernoff bound)

25 Summary of Binary-Join AMS Sketching
- Step 1: compute the random variables XR and XS
- Step 2: define X = XR·XS
- Steps 3 & 4: average independent copies of X; return the median of the averages
- Main theorem (AGMS99): sketching approximates COUNT to within a relative error of ε with probability ≥ 1 − δ using O((log(1/δ)/ε²)·(SJ(R)·SJ(S)/COUNT²)·(log N + log m)) space
  - Remember: O(log N) space for "seeding" the construction of each X
(A sketch of this estimator follows below.)
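A sketch of the binary-join estimator under the same assumptions as the F2 code earlier; the second stream S and the sketch sizes s1, s2 are arbitrary demo choices, not the theorem's constants:

```python
import random
import statistics

P = 2_147_483_647  # Mersenne prime 2^31 - 1

def make_xi():
    """4-wise independent {-1, +1} family (random cubic polynomial mod P)."""
    a, b, c, d = (random.randrange(P) for _ in range(4))
    return lambda i: 1 if ((((a * i + b) * i + c) * i + d) % P) % 2 == 0 else -1

def join_count(stream_R, stream_S, s1=64, s2=5):
    """Median of s2 averages of s1 copies of X = X_R * X_S; E[X] = COUNT."""
    averages = []
    for _ in range(s2):
        total = 0
        for _ in range(s1):
            xi = make_xi()                  # both sketches share the same xi's
            XR = sum(xi(a) for a in stream_R)
            XS = sum(xi(a) for a in stream_S)
            total += XR * XS
        averages.append(total / s1)
    return statistics.median(averages)

R = [3, 1, 2, 4, 2, 3, 5]
S = [1, 2, 2, 4, 4, 4, 5]   # hypothetical second stream
print(join_count(R, S), "(exact COUNT = sum_i fR(i)*fS(i) = 9 here)")
```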

26 Distinct Value Estimation (F0)
- Problem: find the number of distinct values in a stream of values with domain [0, ..., N-1]
  - Zeroth frequency moment F0
  - Statistics: number of species or classes in a population
  - Important for query optimizers
  - Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
- Example (N = 64): data stream [values elided in transcript]; number of distinct values: 5
- Hard problem for random sampling!
  - Must sample almost the entire table to guarantee an estimate within a factor of 10 with probability > 1/2, regardless of the estimator used!

27 Hash (aka FM) Sketches for Distinct Value Estimation [FM85]
- Assume a hash function h(x) that maps incoming values x in [0, ..., N-1] uniformly across [0, ..., 2^L - 1], where L = O(log N)
- Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
  - A value x is mapped to lsb(h(x))
- Maintain a Hash Sketch = BITMAP array of L bits, initialized to 0
  - For each incoming value x, set BITMAP[lsb(h(x))] = 1
- Pr[lsb(h(x)) = i] = 2^-(i+1) (half the hash values have lsb 0, a quarter have lsb 1, ...)
(figure: example of hashing a value x = 5 into the BITMAP)

28 Hash (FM) Sketches for Distinct Value Estimation [FM85]
- By uniformity through h(x): Pr[a given value maps to bit k] = 2^-(k+1)
  - Assuming d distinct values: expect about d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], ...
- Let R = position of the rightmost zero in BITMAP
  - Use R as an indicator of log(d)
- [FM85] prove that E[R] = log(φ·d), where φ ≈ 0.7735
  - Estimate d = 2^R / φ
  - Average several iid instances (different hash functions) to reduce estimator variance
(figure: BITMAP with 1s in positions << log(d), 0s in positions >> log(d), and a fringe of mixed 0/1s around log(d))

29 Accuracy of FM
- Run m independent BITMAP sketches in parallel and average their R values
- Yields an approximation of d with probability at least 1 − δ
(figure: m parallel BITMAP sketches)

30 Hash (FM) Sketches for Distinct Value Estimation
- [FM85] assume "ideal" hash functions h(x) (N-wise independence)
  - In practice, use linear hash functions h(x) = a·x + b, where a, b are random binary vectors in [0, ..., 2^L - 1]
- Composable: component-wise OR/add distributed sketches together
  - Estimate |S1 ∪ S2 ∪ ... ∪ Sk| = set-union cardinality
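A hedged implementation of the FM sketch: the affine hash below is a simple stand-in for the "ideal" hash assumed by [FM85] (not their exact linear scheme), and the estimator averages R over independent copies before applying d ≈ 2^R / φ:

```python
import random

L = 32
PHI = 0.77351  # correction constant from [FM85]

def lsb(y):
    """Position of the least-significant 1 bit of y (lsb(1) = 0)."""
    return (y & -y).bit_length() - 1

def make_hash():
    # Simple stand-in for the "ideal" hash: a random affine map on L bits.
    a = random.randrange(1, 1 << L) | 1      # odd multiplier
    b = random.randrange(1 << L)
    return lambda x: (a * x + b) % (1 << L)

def fm_estimate(stream, copies=64):
    """Average R over iid BITMAP sketches; estimate d ~ 2^R / PHI."""
    total_R = 0
    for _ in range(copies):
        h = make_hash()
        bitmap = 0
        for x in stream:
            bitmap |= 1 << lsb(h(x) | (1 << L))   # | 2^L guards h(x) == 0
        total_R += lsb(~bitmap)   # R = position of the rightmost zero
    return (2 ** (total_R / copies)) / PHI

print(fm_estimate([1, 1, 2, 3, 1, 5, 9, 2, 6]))  # 6 distinct values
```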

31 Cash Register Sketch (AMS) — a more general algorithm for Fk
- Stream sampling: choose a random position p from 1..m, and let r = |{ q : q ≥ p, aq = ap }| (the number of times the sampled value appears from position p onward)
- Estimator: X = m·(r^k − (r−1)^k)
- Example, using F2 (k = 2) on the data stream 3, 1, 2, 4, 2, 3, 5 (m = 7):
  - If we choose the first element a1 = 3: r = 2 and X = 7·(2·2 − 1·1) = 21
  - For a2 = 1: r = 1 and X = 7·(1 − 0) = 7
  - For a5 = 2: r = 1 and X = 7·(1 − 0) = 7

32 Cash Register Sketch (AMS)
- Let Y = the average of A independent copies of X, and Z = the median of B copies of the Y's
- Claim: Z is a (1 ± ε)-approximation to F2, and the space used is O(A·B) words of size O(log n + log m), with probability at least 1 − δ
  - Here A = O(√n / ε²) and B = O(log(1/δ)); see the analysis on the next slides

33 Analysis: Cash Register Sketch
- E[X] = F2
- Var[X] = E[X²] − (E[X])²; using (a² − b²) ≤ 2·(a − b)·a, we have Var[X] ≤ E[X²] ≤ 2·F1·F3
- Also, F1·F3 ≤ √n·F2², so Var[X] ≤ 2·√n·F2²
- Hence E[Yi] = E[X] = F2 and Var[Yi] = Var[X]/A ≤ 2·√n·F2²/A

34 Analysis (cont.)
- Applying Chebyshev's inequality to each Yi: Pr[|Yi − F2| ≥ ε·F2] ≤ Var[Yi] / (ε²·F2²) ≤ 2·√n / (A·ε²), which is at most 1/8 for A = 16·√n/ε²
- Hence, by Chernoff bounds, the probability that more than B/2 of the Yi's deviate that far is at most δ, if we take B = O(log(1/δ)) copies of Y
- Hence, the median gives the correct approximation

35 Computation of Fk
- The same estimator works for general k: r = |{ q : q ≥ p, aq = ap }| and X = m·(r^k − (r−1)^k)
- E[X] = Fk
- When A = O(k·n^(1−1/k) / ε²) and B = O(log(1/δ)), we get a (1 ± ε)-approximation with probability at least 1 − δ
(A runnable sketch of this estimator follows below.)
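A sketch of the cash-register estimator, written offline for clarity (a true one-pass version would pick the position p by reservoir sampling); the averaging parameters A and B are ad hoc demo values, not the theorem's constants:

```python
import random
import statistics

def fk_single(stream, k):
    """One copy of X: pick a random position p, set r = |{q >= p : a_q = a_p}|,
    and return X = m * (r^k - (r-1)^k); then E[X] = F_k."""
    m = len(stream)
    p = random.randrange(m)   # offline; a streaming version uses reservoir sampling
    r = sum(1 for q in range(p, m) if stream[q] == stream[p])
    return m * (r ** k - (r - 1) ** k)

def fk_estimate(stream, k, A=200, B=5):
    """Median of B averages of A copies each (demo values for A and B)."""
    return statistics.median(
        sum(fk_single(stream, k) for _ in range(A)) / A for _ in range(B))

stream = [3, 1, 2, 4, 2, 3, 5]
print(fk_estimate(stream, 2), "(exact F2 = 11)")
print(fk_estimate(stream, 3), "(exact F3 = 8 + 1 + 8 + 1 + 1 = 19)")
```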

36 Estimating Element Frequencies
- Ask for f(1) = ? f(4) = ?
- Two approaches:
  - AMS-based algorithm
  - Count-Min sketch
- Data stream: 3, 1, 2, 4, 2, 3, 5, ...

37 AMS (Sketch) Based Algorithm
- Key intuition: use randomized linear projections of f() to define a random variable Z such that, for a given element i, E[ξi·Z] = f(i) (= A[i]); similarly E[ξj·Z] = f(j)
- Basic idea:
  - Define a family of 4-wise independent {-1, +1} random variables {ξi}, as before: Pr[ξi = +1] = Pr[ξi = -1] = 1/2
  - Let Z = Σj ξj·f(j)
  - Then E[ξi·Z] = ξi²·f(i) + Σ(j≠i) E[ξi·ξj]·f(j) = f(i), since ξi² = 1 and E[ξi·ξj] = 0 for j ≠ i

38 AMS cont.
- Keep an array of d × w counters Z, with one independent ξ-family per row
- Use d hash functions to map each element a to one counter per row: on arrival of a, update Z[i, hi(a)] += ξi(a) for each row i
- Estimate: Est(f(a)) = median over rows i of ξi(a)·Z[i, hi(a)]
(figure: element a hashed by h1, ..., hd into a d × w array; a sketch follows below)
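A minimal version of this structure (essentially the Count Sketch of Charikar, Chen, and Farach-Colton). For brevity the signs below come from affine, i.e. only 2-wise independent, hashes, whereas the analysis assumes 4-wise independence:

```python
import random
import statistics

P = 2_147_483_647

class CountSketch:
    """d x w counters; row i uses hash h_i (column) and a sign function s_i."""
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.Z = [[0] * w for _ in range(d)]
        self.h = [self._affine() for _ in range(d)]
        self.s = [self._affine() for _ in range(d)]

    @staticmethod
    def _affine():
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda x: (a * x + b) % P

    def _sign(self, i, x):
        return 1 if self.s[i](x) % 2 == 0 else -1

    def update(self, x, c=1):
        for i in range(self.d):
            self.Z[i][self.h[i](x) % self.w] += self._sign(i, x) * c

    def estimate(self, x):
        return statistics.median(
            self._sign(i, x) * self.Z[i][self.h[i](x) % self.w]
            for i in range(self.d))

cs = CountSketch(w=16, d=5)
for x in [3, 1, 2, 4, 2, 3, 5]:
    cs.update(x)
print(cs.estimate(2), "(exact f(2) = 2)")
```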

39 The Count-Min (CM) Sketch
- Simple sketch idea, can be used for point queries (f(i)), range queries, quantiles, join size estimation
- Creates a small summary as an array of w × d counters C
- Uses d hash functions to map elements to [1..w]
- Typical parameters: w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉

40 CM Sketch Structure
- Each element xi is mapped to one counter per row
- Update: C[k, hk(xi)] := C[k, hk(xi)] + 1 (or −1 for a deletion; or + c[j] if the update is <xi, c[j]>)
- Estimate A[j] by taking min over k of C[k, hk(j)]
(figure: element xi hashed by h1, ..., hd into a d × w array of counters; a sketch follows below)
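A compact Count-Min implementation using the standard parameter choices w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉; the affine hashes stand in for the pairwise-independent family assumed in the CM paper:

```python
import math
import random

P = 2_147_483_647

class CountMin:
    """Count-Min sketch: w = ceil(e/eps) columns, d = ceil(ln(1/delta)) rows."""
    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.C = [[0] * self.w for _ in range(self.d)]
        self.hash = [(random.randrange(1, P), random.randrange(P))
                     for _ in range(self.d)]

    def _col(self, k, x):
        a, b = self.hash[k]
        return (a * x + b) % P % self.w

    def update(self, x, c=1):   # c = -1 for a deletion, or any c[j] from <x, c[j]>
        for k in range(self.d):
            self.C[k][self._col(k, x)] += c

    def query(self, x):
        """min_k C[k, h_k(x)]: never below f(x); excess <= eps*F1 w.p. >= 1 - delta."""
        return min(self.C[k][self._col(k, x)] for k in range(self.d))

cm = CountMin(eps=0.1, delta=0.01)
for x in [3, 1, 2, 4, 2, 3, 5]:
    cm.update(x)
print(cm.query(3), "(exact f(3) = 2)")
```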

41 CM Sketch Summary
- The CM sketch guarantees that the approximation error on point queries is less than ε·F1, in size O((1/ε)·log(1/δ))
  - The probability of a larger error is less than δ
- Hints
  - Counts are biased (they never underestimate)! Can you limit the expected amount of extra "mass" at each bucket? (Use Markov)