Streaming Algorithms
Piotr Indyk, MIT
Data Streams
A data stream is a sequence of data that is too large to be stored in available memory.
Examples:
– Network traffic
– Sensor networks
– Approximate query optimization and answering in large databases
– Scientific data streams
– … and this talk
Outline
Data stream model, motivations, connections
Streaming problems and techniques:
– Estimating the number of distinct elements
– Other quantities: norms, moments, heavy hitters, …
– Sparse approximations and compressed sensing
Basic Data Stream Model
Single pass over the data: i_1, i_2, …, i_n
– Typically, we assume n is known
Bounded storage (typically n^a for a < 1, or log^c n)
– Units of storage: bits, words, coordinates, or "elements" (e.g., points, nodes/edges)
Fast processing time per element
– Randomness OK (in fact, almost always necessary)
Connections
External memory algorithms:
– Linear scan works best
– Streaming algorithms yield efficient external memory algorithms
Communication complexity / sketching:
– Low-space algorithms yield low-communication protocols
Metric embeddings, sub-linear time algorithms, pseudo-randomness, compressive sensing
Counting Distinct Elements
Stream elements: numbers from {1…m}
Goal: estimate the number of distinct elements DE in the stream
– Up to a factor of 1±ε
– With probability 1−P
Simpler goal: for a given T > 0, provide an algorithm which, with probability 1−P:
– Answers YES, if DE > (1+ε)T
– Answers NO, if DE < (1−ε)T
Run, in parallel, the algorithm with T = 1, (1+ε), (1+ε)^2, …, n
– Total space multiplied by log_{1+ε} n ≈ log(n)/ε (e.g., for n = 10^6 and ε = 0.1, about ln(10^6)/0.1 ≈ 138 copies)
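To make the reduction concrete, here is a minimal Python sketch of the driver, assuming a decision oracle decide(T) with the YES/NO guarantees above (the names decide and estimate_DE are illustrative, not from the slides):

    import math

    def estimate_DE(decide, n, eps):
        # Try thresholds T = 1, (1+eps), (1+eps)^2, ..., up to n, and return
        # the largest T that still answers YES. This uses ~ ln(n)/eps copies
        # of the decision algorithm (run in parallel over a single pass).
        best, T = 1.0, 1.0
        while T <= n:
            if decide(T):
                best = T
            T *= 1.0 + eps
        return best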
Vector Interpretation
Initially, x = 0
Insertion of i is interpreted as x_i = x_i + 1
Want to estimate DE(x) = the number of non-zero elements of x
[Figure: a stream and the corresponding vector x]
Estimating DE(x) (based on [Flajolet-Martin'85])
Choose a random set S of coordinates
– For each i, we have Pr[i ∈ S] = 1/T
Maintain Sum_S(x) = ∑_{i∈S} x_i
Estimation algorithm I:
– YES, if Sum_S(x) > 0
– NO, if Sum_S(x) = 0
Analysis:
– Pr = Pr[Sum_S(x) = 0] = (1 − 1/T)^DE
– For T large enough: (1 − 1/T)^DE ≈ e^(−DE/T)
– Using calculus, for ε small enough:
  If DE > (1+ε)T, then Pr ≈ e^(−(1+ε)) < 1/e − ε/3
  If DE < (1−ε)T, then Pr ≈ e^(−(1−ε)) > 1/e + ε/3
[Figure: vector x, a sampled set S (T = 4), and a plot of Pr as a function of DE]
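A minimal sketch of one such test in Python; the hash-based sampling anticipates the "Comments" slide below, and hash((seed, i)) is a toy stand-in for a proper hash function:

    import random

    class DistinctThresholdTest:
        # Maintains Sum_S(x) for a pseudorandom S with Pr[i in S] = 1/T.
        def __init__(self, T, seed=0):
            self.T, self.seed, self.sum = T, seed, 0

        def in_S(self, i):
            # Membership test: i is in S with probability 1/T.
            return random.Random(hash((self.seed, i))).random() < 1.0 / self.T

        def update(self, i, delta=1):
            if self.in_S(i):
                self.sum += delta

        def answer(self):
            return self.sum > 0   # YES iff Sum_S(x) > 0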
Estimating DE(x), ctd.
We have an algorithm:
– If DE > (1+ε)T, then Pr < 1/e − ε/3
– If DE < (1−ε)T, then Pr > 1/e + ε/3
Need to amplify the probability of correctness:
– Select sets S_1 … S_k, k = O(log(1/P)/ε^2)
– Let Z = number of Sum_{S_j}(x) that are equal to 0
– By the Chernoff bound, with probability > 1−P:
  If DE > (1+ε)T, then Z < k/e
  If DE < (1−ε)T, then Z > k/e
Total space:
– Decision version: O(log(1/P)/ε^2) numbers in the range 0…n
– Estimation: O(log(n)/ε * log(1/P)/ε^2) numbers in the range 0…n (the probability of error is O(P log(n)/ε))
Better algorithms known:
– Theory: O(1/ε^2 + log n) bits [Kane-Nelson-Woodruff'10]
– Practice: 128 bytes suffice for all the works of Shakespeare, ε ≈ 10% [Durand-Flajolet'03]
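Combining the two previous sketches: an amplified decision procedure that compares the number of zero sums against the k/e threshold (illustrative code, not the slides' own; for clarity it makes one pass per call, whereas a real implementation maintains all copies in parallel):

    import math, random

    def decide(stream, T, eps, P):
        k = max(1, round(math.log(1.0 / P) / eps**2))  # k = O(log(1/P)/eps^2)
        sums = [0] * k
        for i in stream:
            for j in range(k):
                # Pr[i in S_j] = 1/T; hash((j, i)) is a toy stand-in for h_j(i)
                if random.Random(hash((j, i))).random() < 1.0 / T:
                    sums[j] += 1
        Z = sum(1 for s in sums if s == 0)   # number of zero sums
        return Z < k / math.e                # YES iff Z is below k/e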
Chernoff bound
Let X_1 … X_n be a sequence of i.i.d. 0-1 random variables such that Pr[X_i = 1] = p. Let X = ∑_i X_i.
Then for any ε < 1 we have:
Pr[ |X − pn| > ε·pn ] < 2e^(−ε^2·pn/3)
Comments
Implementing S:
– Choose a uniform hash function h: {1…m} → {1…T}
– Define S = {i : h(i) = 1}
– We have i ∈ S iff h(i) = 1
Implementing h:
– In practice, to compute h(i), seed a pseudorandom generator with i:

    import random
    def h(i):
        rng = random.Random(i)          # seed = i
        return rng.randrange(1, T + 1)  # pseudorandom value in {1…T}
More comments
The algorithm uses "linear sketches": Sum_{S_j}(x) = ∑_{i∈S_j} x_i
Can implement decrements x_i = x_i − 1
– I.e., the stream can contain deletions of elements (as long as x ≥ 0)
– Other names: dynamic model, turnstile model
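Linearity is what makes deletions free: a sketch of processing a turnstile stream of (i, ±1) updates against the per-set sums (illustrative names):

    def apply_updates(sums, in_S, updates):
        # sums[j] holds Sum_{S_j}(x); in_S(j, i) tests whether i is in S_j.
        # Each update (i, delta), with delta = +1 (insertion) or -1 (deletion),
        # simply adds delta to every sum whose set contains i.
        for i, delta in updates:
            for j in range(len(sums)):
                if in_S(j, i):
                    sums[j] += delta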
More General Problem
What other functions of a vector x can we maintain in small space?
L_p norms: ||x||_p = (∑_i |x_i|^p)^{1/p}
– We also have ||x||_∞ = max_i |x_i|
– … and we can define ||x||_0 = DE(x), since ||x||_p^p = ∑_i |x_i|^p → DE(x) as p → 0
Alternatively: frequency moments F_p = p-th power of L_p norms (exception: F_0 = L_0)
How much space do you need to estimate ||x||_p (for const. ε)?
Theorem:
– For p ∈ [0,2]: polylog n space suffices
– For p > 2: n^{1−2/p} polylog n space suffices and is necessary
[Alon-Matias-Szegedy'96, Feigenbaum-Kannan-Strauss-Viswanathan'99, Indyk'00, Coppersmith-Kumar'04, Ganguly'04, Bar-Yossef-Jayram-Kumar-Sivakumar'02'03, Saks-Sun'03, Indyk-Woodruff'05]
Estimating the L_2 norm: AMS
Alon-Matias-Szegedy'96
Choose r_1 … r_m to be i.i.d. random variables with Pr[r_i = 1] = Pr[r_i = −1] = 1/2
Maintain Z = ∑_i r_i x_i under increments/decrements to x_i
Algorithm I: Y = Z^2
"Claim": Y "approximates" ||x||_2^2 with "good" probability
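A minimal sketch of one AMS counter in Python; the sign function here is a toy pseudorandom stand-in (the "Comments" slide at the end says what independence actually suffices):

    import random

    class AMSCounter:
        # Maintains Z = sum_i r_i * x_i for random signs r_i in {-1, +1}.
        def __init__(self, seed=0):
            self.seed, self.Z = seed, 0

        def sign(self, i):
            return 1 if random.Random(hash((self.seed, i))).random() < 0.5 else -1

        def update(self, i, delta=1):
            self.Z += self.sign(i) * delta   # works under increments/decrements

        def estimate(self):
            return self.Z ** 2               # Y = Z^2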
Analysis
We will use the Chebyshev inequality
– Need expectation, variance
The expectation of Z^2 = (∑_i r_i x_i)^2 is
E[Z^2] = E[∑_{i,j} r_i x_i r_j x_j] = ∑_{i,j} x_i x_j E[r_i r_j]
We have:
– For i ≠ j: E[r_i r_j] = E[r_i] E[r_j] = 0, so the term disappears
– For i = j: E[r_i r_j] = 1
Therefore E[Z^2] = ∑_i x_i^2 = ||x||_2^2 (unbiased estimator)
For example, for x = (a, b): Z^2 = a^2 + b^2 + 2 r_1 r_2 ab, and the cross term vanishes in expectation.
Analysis, ctd.
The second moment of Z^2 = (∑_i r_i x_i)^2 is the expectation of
Z^4 = (∑_i r_i x_i)(∑_i r_i x_i)(∑_i r_i x_i)(∑_i r_i x_i)
This can be decomposed into a sum of:
– ∑_i (r_i x_i)^4 → expectation = ∑_i x_i^4
– 6 ∑_{i<j} (r_i r_j x_i x_j)^2 → expectation = 6 ∑_{i<j} x_i^2 x_j^2 (the coefficient 6 = 4!/(2!·2!) counts the ways to arrange the pattern i,i,j,j among the four factors)
– Terms involving a single multiplier r_i x_i (e.g., r_1 x_1 r_2 x_2 r_3 x_3 r_4 x_4) → expectation = 0
Total: E[Z^4] = ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2
The variance of Z^2 is
Var[Z^2] = E[Z^4] − E^2[Z^2] = ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2 − (∑_i x_i^2)^2
= ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2 − ∑_i x_i^4 − 2 ∑_{i<j} x_i^2 x_j^2
= 4 ∑_{i<j} x_i^2 x_j^2 ≤ 2 (∑_i x_i^2)^2
Analysis, ctd.
We have an estimator Y = Z^2
– E[Y] = ∑_i x_i^2
– σ^2 = Var[Y] ≤ 2 (∑_i x_i^2)^2
Chebyshev inequality: Pr[ |E[Y] − Y| ≥ cσ ] ≤ 1/c^2
Algorithm II:
– Maintain Z_1 … Z_k (and thus Y_1 … Y_k), define Y' = ∑_j Y_j / k
– E[Y'] = k ∑_i x_i^2 / k = ∑_i x_i^2
– σ'^2 = Var[Y'] ≤ 2k (∑_i x_i^2)^2 / k^2 = 2 (∑_i x_i^2)^2 / k
Guarantee: Pr[ |Y' − ∑_i x_i^2| ≥ c (2/k)^{1/2} ∑_i x_i^2 ] ≤ 1/c^2
Setting c to a constant and k = O(1/ε^2) gives a (1±ε)-approximation with constant probability
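A self-contained sketch of Algorithm II (k independent counters, averaged; the names and the insertion-only stream interface are illustrative):

    import random

    def ams_estimate(stream, eps, seed=0):
        k = max(1, round(2.0 / eps**2))    # k = O(1/eps^2) independent copies
        Z = [0] * k
        for i in stream:                   # each stream element is an insertion
            for j in range(k):
                # toy stand-in for the j-th random sign function r_i^(j)
                r = 1 if random.Random(hash((seed, j, i))).random() < 0.5 else -1
                Z[j] += r
        return sum(z * z for z in Z) / k   # Y' = (Z_1^2 + ... + Z_k^2)/k

For example, on the stream [1, 1, 2] (so x = (2, 1)), ams_estimate returns a value concentrated around ||x||_2^2 = 5.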
Comments
Only needed 4-wise independence of r_1 … r_m
– I.e., for any subset S of {1…m} with |S| = 4, and any b ∈ {−1,1}^4, we have Pr[r_S = b] = 1/2^4
– Can generate such variables from O(log m) random bits (see the sketch below)
What we did:
– Maintain a "linear sketch" vector Z = [Z_1 … Z_k] = Rx
– Estimator for ||x||_2^2: (Z_1^2 + … + Z_k^2)/k = ||Rx||_2^2 / k
– "Dimensionality reduction": x → Rx … but the tail is somewhat "heavy"
– Reason: we only used the second moment of the estimator
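For the 4-wise independent signs, one standard construction (not spelled out on the slide) evaluates a random degree-3 polynomial over a prime field; a sketch, assuming the universe size m is below the prime p:

    import random

    p = 2**31 - 1                      # a Mersenne prime; assume m < p

    def make_sign_function(rng=random):
        # A random degree-3 polynomial over GF(p) yields 4-wise independent
        # values; the parity of the value gives (essentially unbiased) +/-1
        # signs. Storing a0..a3 takes O(log m) random bits.
        a = [rng.randrange(p) for _ in range(4)]
        def r(i):
            v = (((a[0] * i + a[1]) * i + a[2]) * i + a[3]) % p
            return 1 if v % 2 == 0 else -1
        return r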