Streaming Algorithms Piotr Indyk MIT

Data Streams

A data stream is a sequence of data that is too large to be stored in available memory. Examples:
– Network traffic
– Sensor networks
– Approximate query optimization and answering in large databases
– Scientific data streams
– ...and this talk

Outline

– Data stream model, motivations, connections
– Streaming problems and techniques:
  – Estimating the number of distinct elements
  – Other quantities: norms, moments, heavy hitters...
  – Sparse approximations and compressed sensing

Basic Data Stream Model

– Single pass over the data: i_1, i_2, ..., i_n
  – Typically, we assume n is known
– Bounded storage (typically n^α for some constant α < 1, or log^c n)
  – Units of storage: bits, words, coordinates or "elements" (e.g., points, nodes/edges)
– Fast processing time per element
– Randomness OK (in fact, almost always necessary)
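As a concrete anchor for the model, here is a minimal Python skeleton of the contract every algorithm below follows; the function name and the trivial counter state are ours, for illustration only:

```python
def process_stream(stream):
    """One pass, bounded state, constant work per element (model skeleton)."""
    state = 0                  # bounded storage: here, a single counter
    for i in stream:           # single pass over i_1, i_2, ..., i_n
        state += 1             # fast, O(1) processing per element
    return state               # a function of the stream (here, its length n)
```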

Connections

– External memory algorithms
  – Linear scan works best
  – Streaming algorithms yield efficient external memory algorithms
– Communication complexity/sketching
  – Low-space algorithms yield low-communication protocols
– Metric embeddings, sub-linear time algorithms, pseudo-randomness, compressive sensing

[Figure: two parties A and B exchanging a sketch]

Counting Distinct Elements

– Stream elements: numbers from {1...m}
– Goal: estimate the number DE of distinct elements in the stream
  – Up to a factor of 1 ± ε
  – With probability 1 - P
– Simpler (decision) goal: for a given T > 0, provide an algorithm which, with probability 1 - P:
  – Answers YES, if DE > (1+ε)T
  – Answers NO, if DE < (1-ε)T
– To estimate DE, run the decision algorithm in parallel with T = 1, (1+ε), (1+ε)^2, ..., n (see the sketch below)
  – Total space is multiplied by log_{1+ε} n ≈ log(n)/ε
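The reduction from estimation to decision can be sketched as follows; `decide` is an assumed black-box oracle for the decision problem above, and all names are illustrative:

```python
def estimate_de(decide, eps, n):
    """Try thresholds T = 1, (1+eps), (1+eps)^2, ..., n and return the
    largest T for which `decide` still answers YES; by the decision
    guarantee, that T is within a 1 +/- O(eps) factor of DE, whp."""
    T, best = 1.0, 1.0
    while T <= n:
        if decide(T):          # whp, YES means DE > (1+eps)*T
            best = T
        T *= 1 + eps           # ~ log(n)/eps thresholds in total
    return best
```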

Vector Interpretation

– Initially, x = 0
– Insertion of i is interpreted as x_i = x_i + 1
– Want to estimate DE(x) = the number of non-zero coordinates of x

[Figure: a stream of elements and the corresponding count vector x]

Estimating DE(x) (based on [Flajolet-Martin'85])

– Choose a random set S of coordinates
  – For each i, we have Pr[i ∈ S] = 1/T
– Maintain Sum_S(x) = ∑_{i∈S} x_i
– Estimation algorithm I (see the sketch below):
  – YES, if Sum_S(x) > 0
  – NO, if Sum_S(x) = 0
– Analysis:
  – Pr := Pr[Sum_S(x) = 0] = (1 - 1/T)^DE
  – For T large enough: (1 - 1/T)^DE ≈ e^(-DE/T)
  – Using calculus, for ε small enough:
    – If DE > (1+ε)T, then Pr ≈ e^(-(1+ε)) < 1/e - ε/3
    – If DE < (1-ε)T, then Pr ≈ e^(-(1-ε)) > 1/e + ε/3

[Figure: vector x with the sampled set S marked (T = 4), and a plot of Pr against DE]
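A minimal sketch of Algorithm I (our names; for clarity the set S is stored explicitly here, which costs Θ(m) space; the "Comments" slide below replaces it with a hash function):

```python
import random

def make_decider(T, m, seed=0):
    """One copy of Algorithm I: sample S with Pr[i in S] = 1/T and
    maintain Sum_S(x) under stream updates."""
    rng = random.Random(seed)
    S = {i for i in range(1, m + 1) if rng.random() < 1.0 / T}
    total = 0                        # Sum_S(x)

    def update(i, delta=1):          # stream element: x_i += delta
        nonlocal total
        if i in S:
            total += delta

    def answer():                    # YES iff Sum_S(x) > 0
        return total > 0

    return update, answer
```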

Estimating DE(x), ctd.

– We have an algorithm:
  – If DE > (1+ε)T, then Pr < 1/e - ε/3
  – If DE < (1-ε)T, then Pr > 1/e + ε/3
– Need to amplify the probability of correctness (see the sketch below):
  – Select sets S_1 … S_k, k = O(log(1/P)/ε^2)
  – Let Z = the number of Sum_{S_j}(x) that are equal to 0
  – By the Chernoff bound, with probability > 1 - P:
    – If DE > (1+ε)T, then Z < k/e
    – If DE < (1-ε)T, then Z > k/e
– Total space:
  – Decision version: O(log(1/P)/ε^2) numbers in range 0…n
  – Estimation: O(log(n)/ε · log(1/P)/ε^2) numbers in range 0…n (the probability of error is O(P log(n)/ε))
– Better algorithms known:
  – Theory: O(1/ε^2 + log n) bits [Kane-Nelson-Woodruff'10]
  – Practice: 128 bytes suffice for all works of Shakespeare, ε ≈ 10% [Durand-Flajolet'03]
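Amplification, continuing the sketch above (`make_decider` is the illustrative helper from the previous block):

```python
import math

def make_amplified_decider(T, m, k, seed=0):
    """k independent copies of Algorithm I; answer by comparing Z, the
    number of zero sums, against the threshold k/e."""
    copies = [make_decider(T, m, seed=seed + j) for j in range(k)]

    def update(i, delta=1):
        for upd, _ in copies:
            upd(i, delta)

    def answer():
        Z = sum(1 for _, ans in copies if not ans())   # sums equal to 0
        return Z < k / math.e        # YES: DE likely exceeds (1+eps)*T

    return update, answer
```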

Chernoff Bound

Let X_1, …, X_n be i.i.d. 0-1 random variables with Pr[X_i = 1] = p, and let X = ∑_i X_i. Then for any ε < 1:

Pr[ |X - pn| > ε·pn ] < 2e^(-ε^2·pn/3)

Comments

– Implementing S:
  – Choose a uniform hash function h: {1..m} → {1..T}
  – Define S = {i : h(i) = 1}
  – We have i ∈ S iff h(i) = 1
– Implementing h, in practice: to compute h(i), seed a pseudorandom generator with i and return its first output (see the sketch below); i.e.:
  Seed = i
  Return random()
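A runnable version of that pseudocode (the `salt` parameter is our addition, used to get the independent copies h_1, …, h_k needed for amplification):

```python
import random

def h(i, T, salt=0):
    """Slide's 'Seed = i; return random()', mapped into {1..T}."""
    rng = random.Random(salt * 1_000_003 + i)   # deterministic per (salt, i)
    return rng.randint(1, T)

def in_S(i, T, salt=0):
    return h(i, T, salt) == 1    # i is in S iff h(i) = 1, so Pr[i in S] = 1/T
```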

More Comments

– The algorithm uses "linear sketches": Sum_{S_j}(x) = ∑_{i∈S_j} x_i
– Can implement decrements x_i = x_i - 1 (see the example below)
  – I.e., the stream can contain deletions of elements (as long as x ≥ 0)
  – Other names: dynamic model, turnstile model
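For example, with the illustrative `make_decider` sketch from before, a deletion is just an update with delta = -1, by linearity:

```python
upd, ans = make_decider(T=4, m=100, seed=1)
upd(7, +1)     # insert element 7:   x_7 += 1
upd(7, +1)     # insert again:       x_7 += 1
upd(7, -1)     # delete one copy:    x_7 -= 1   (turnstile update)
print(ans())   # True iff 7 landed in S (then Sum_S(x) = x_7 = 1 > 0)
```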

More General Problem

– What other functions of a vector x can we maintain in small space?
– L_p norms: ||x||_p = (∑_i |x_i|^p)^(1/p)
  – We also have ||x||_∞ = max_i |x_i|
  – …and we can define ||x||_0 = DE(x), since ||x||_p^p = ∑_i |x_i|^p → DE(x) as p → 0
– Alternatively: frequency moments F_p = the p-th power of the L_p norm (exception: F_0 = L_0)
– How much space is needed to estimate ||x||_p (for constant ε)?
– Theorem:
  – For p ∈ [0,2]: polylog n space suffices
  – For p > 2: n^(1-2/p) polylog n space suffices and is necessary
[Alon-Matias-Szegedy'96, Feigenbaum-Kannan-Strauss-Viswanathan'99, Indyk'00, Coppersmith-Kumar'04, Ganguly'04, Bar-Yossef-Jayram-Kumar-Sivakumar'02'03, Saks-Sun'03, Indyk-Woodruff'05]

Estimating the L_2 Norm: AMS

Alon-Matias-Szegedy'96

– Choose r_1 … r_m to be i.i.d. random variables with Pr[r_i = 1] = Pr[r_i = -1] = 1/2
– Maintain Z = ∑_i r_i x_i under increments/decrements to x_i
– Algorithm I: output Y = Z^2
– "Claim": Y "approximates" ||x||_2^2 with "good" probability
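A minimal single-estimator sketch in Python (class and method names are ours; for simplicity the signs r_i come from a seeded PRNG, though the final slide notes that 4-wise independence suffices):

```python
import random

class AMSSketch:
    """One AMS estimator: maintain Z = sum_i r_i * x_i, output Y = Z^2."""
    def __init__(self, seed=0):
        self.seed = seed
        self.Z = 0
    def sign(self, i):                # r_i, a +/-1 value fixed by (seed, i)
        return 1 if random.Random((self.seed << 32) ^ i).random() < 0.5 else -1
    def update(self, i, delta=1):     # x_i += delta (increments/decrements)
        self.Z += self.sign(i) * delta
    def estimate(self):               # Y = Z^2, approximates ||x||_2^2
        return self.Z ** 2
```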

Analysis

– We will use Chebyshev's inequality, so we need the expectation and variance
– The expectation of Z^2 = (∑_i r_i x_i)^2 is
  E[Z^2] = E[∑_{i,j} r_i x_i r_j x_j] = ∑_{i,j} x_i x_j E[r_i r_j]
– We have:
  – For i ≠ j: E[r_i r_j] = E[r_i]·E[r_j] = 0, so the term disappears
  – For i = j: E[r_i r_j] = 1
– Therefore E[Z^2] = ∑_i x_i^2 = ||x||_2^2 (an unbiased estimator)
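The unbiasedness claim is easy to check empirically with the illustrative AMSSketch class above (a quick test of ours, not part of the slides):

```python
import statistics

x = {1: 1, 2: 2, 3: 3}                 # ||x||_2^2 = 1 + 4 + 9 = 14
ys = []
for s in range(2000):                  # average Y over many independent sign choices
    sk = AMSSketch(seed=s)
    for i, v in x.items():
        sk.update(i, v)
    ys.append(sk.estimate())
print(statistics.mean(ys))             # close to 14, matching E[Z^2] = ||x||_2^2
```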

Analysis, ctd.

– The second moment of Z^2 = (∑_i r_i x_i)^2 equals the expectation of
  Z^4 = (∑_i r_i x_i)(∑_i r_i x_i)(∑_i r_i x_i)(∑_i r_i x_i)
– This can be decomposed into a sum of:
  – ∑_i (r_i x_i)^4 → expectation = ∑_i x_i^4
  – 6 ∑_{i<j} (r_i r_j x_i x_j)^2 → expectation = 6 ∑_{i<j} x_i^2 x_j^2
  – Terms involving a single multiplier r_i x_i (e.g., r_1 x_1 r_2 x_2 r_3 x_3 r_4 x_4) → expectation = 0
– Total: E[Z^4] = ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2
– The variance of Z^2 is
  E[Z^4] - E[Z^2]^2 = ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2 - (∑_i x_i^2)^2
  = ∑_i x_i^4 + 6 ∑_{i<j} x_i^2 x_j^2 - ∑_i x_i^4 - 2 ∑_{i<j} x_i^2 x_j^2
  = 4 ∑_{i<j} x_i^2 x_j^2 ≤ 2 (∑_i x_i^2)^2

Analysis, ctd.

– We have an estimator Y = Z^2 with:
  – E[Y] = ∑_i x_i^2
  – σ^2 = Var[Y] ≤ 2 (∑_i x_i^2)^2
– Chebyshev's inequality: Pr[ |E[Y] - Y| ≥ cσ ] ≤ 1/c^2
– Algorithm II (see the sketch below):
  – Maintain Z_1 … Z_k (and thus Y_1 … Y_k); define Y′ = ∑_j Y_j / k
  – E[Y′] = k ∑_i x_i^2 / k = ∑_i x_i^2
  – σ′^2 = Var[Y′] ≤ 2k (∑_i x_i^2)^2 / k^2 = 2 (∑_i x_i^2)^2 / k
– Guarantee: Pr[ |Y′ - ∑_i x_i^2| ≥ c (2/k)^(1/2) ∑_i x_i^2 ] ≤ 1/c^2
– Setting c to a constant and k = O(1/ε^2) gives a (1 ± ε)-approximation with constant probability
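Algorithm II as a sketch, building on the illustrative AMSSketch class above:

```python
class AMSEstimator:
    """Average k independent AMS copies: Y' = (Y_1 + ... + Y_k) / k."""
    def __init__(self, k, seed=0):
        self.copies = [AMSSketch(seed=seed + j) for j in range(k)]
    def update(self, i, delta=1):
        for c in self.copies:
            c.update(i, delta)
    def estimate(self):
        return sum(c.estimate() for c in self.copies) / len(self.copies)

# Usage: k = O(1/eps^2); here eps ~ 0.05 gives k = 400 (illustrative numbers)
est = AMSEstimator(k=400)
for i in [1, 2, 2, 3, 3, 3]:           # x = (1, 2, 3, 0, ...)
    est.update(i)
print(est.estimate())                  # close to ||x||_2^2 = 14
```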

Comments

– We only needed 4-wise independence of r_1 … r_m
  – I.e., for any subset S of {1…m} with |S| = 4 and any b ∈ {-1,1}^4, we have Pr[r_S = b] = 1/2^4
  – Can generate such variables from O(log m) random bits (see the sketch below)
– What we did:
  – Maintain a "linear sketch" vector Z = [Z_1, ..., Z_k] = Rx
  – Estimator for ||x||_2^2: (Z_1^2 + … + Z_k^2)/k = ||Rx||_2^2 / k
  – "Dimensionality reduction": x → Rx … but the tail is somewhat "heavy"
  – Reason: we only used the second moment of the estimator
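One standard way to get such variables from O(log m) random bits (a hedged sketch of ours; the slides don't specify a construction): evaluate a random degree-3 polynomial over a prime field, which gives 4-wise independent values, and fold them to ±1 with only O(1/p) bias:

```python
import random

class FourWiseSigns:
    """4-wise independent +/-1 variables from 4 random coefficients,
    i.e., O(log m) random bits."""
    def __init__(self, p=2_147_483_647, seed=0):        # p = 2^31 - 1 (prime)
        rng = random.Random(seed)
        self.p = p
        self.a = [rng.randrange(p) for _ in range(4)]   # degree-3 polynomial
    def sign(self, i):
        a0, a1, a2, a3 = self.a
        v = (a0 + a1*i + a2*i*i + a3*i*i*i) % self.p    # 4-wise independent in Z_p
        return 1 if v < self.p // 2 else -1             # fold to +/-1 (bias ~ 1/p)
```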