Estimating Distinct Elements, Optimally


Estimating Distinct Elements, Optimally
David Woodruff, IBM
Based on papers with Piotr Indyk, Daniel Kane, and Jelani Nelson

Problem Description
Given a long string with at most n distinct characters, count the number F0 of distinct characters.
- Characters arrive one at a time, and the algorithm makes one pass over the string
- Algorithms must use small memory and have fast update time: it is too expensive to store the set of distinct characters
- So algorithms must be randomized and settle for an approximate solution: output F ∈ [(1-ε)F0, (1+ε)F0] with, say, good constant probability
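For contrast, here is the naive exact approach that the memory constraint rules out (a minimal sketch, not from the talk): it stores every distinct character and so uses Ω(F0 log n) bits.

```python
def exact_f0(stream):
    """Naive exact distinct-count: remember every character seen.
    Memory grows linearly with F0, which is exactly what the
    streaming model forbids for long streams."""
    seen = set()
    for c in stream:
        seen.add(c)
    return len(seen)

# Example: exact_f0("abracadabra") == 5  (a, b, r, c, d)
```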

Algorithm History
- Flajolet and Martin introduced the problem (FOCS 1983): O(log n) space for fixed ε in the random oracle model
- Alon, Matias and Szegedy: O(log n) space/update time for fixed ε with no oracle
- Gibbons and Tirthapura: O(ε^{-2} log n) space and O(ε^{-2}) update time
- Bar-Yossef et al.: O(ε^{-2} log n) space and O(log 1/ε) update time; also O(ε^{-2} log log n + log n) space and O(ε^{-2}) update time, essentially
- A similar space bound was also obtained by Flajolet et al. in the random oracle model
- Kane, Nelson and W.: O(ε^{-2} + log n) space and O(1) update and reporting time
All time complexities are in the unit-cost RAM model.

Lower Bound History
- Alon, Matias and Szegedy: any algorithm requires Ω(log n) bits of space
- Bar-Yossef: any algorithm requires Ω(ε^{-1}) bits of space
- Indyk and W.: if ε > 1/n^{1/9}, any algorithm needs Ω(ε^{-2}) bits of space
- W.: if ε > 1/n^{1/2}, any algorithm needs Ω(ε^{-2}) bits of space
- Jayram, Kumar and Sivakumar: simpler proof of the Ω(ε^{-2}) bound for any ε > 1/m^{1/2}
- Brody and Chakrabarti: the above lower bounds hold even for multiple passes over the string
Combining upper and lower bounds, the complexity of this problem is Θ(ε^{-2} + log n) bits of space and Θ(1) update and reporting time.

Outline for Remainder of Talk
- Proofs of the Upper Bounds
- Proofs of the Lower Bounds

Hash Functions for Throwing Balls
- Consider a random mapping f of B balls into C containers, and count the number of non-empty containers
- The expected number of non-empty containers is C - C(1 - 1/C)^B
- If instead of the fully random mapping f we use an O(log(C/ε) / log log(C/ε))-wise independent mapping g, then the expected number of non-empty containers under g is the same as that under f, up to a factor of (1 ± ε)
- Proof based on approximate inclusion-exclusion (see the numerical check below):
  - express 1 - (1 - 1/C)^B as a series of binomial coefficients
  - truncate the series at an appropriate place
  - use limited independence to handle the remaining terms
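A quick sanity check of the closed form (an illustrative sketch; the parameters B, C, and the trial count are arbitrary choices, not from the talk):

```python
import random

def expected_nonempty(B, C):
    # Closed form from the slide: E[# non-empty containers] = C - C(1 - 1/C)^B.
    return C - C * (1 - 1 / C) ** B

def simulated_nonempty(B, C, trials=2000):
    # Throw B balls into C containers uniformly at random and average
    # the number of non-empty containers over many trials.
    total = 0
    for _ in range(trials):
        total += len({random.randrange(C) for _ in range(B)})
    return total / trials

# With B = 300 balls and C = 100 containers:
#   expected_nonempty(300, 100)  ->  ~95.1
#   simulated_nonempty(300, 100) ->  close to 95.1
```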

Fast Hash Functions
- Use hash functions g that can be evaluated in O(1) time
- If g must be O(log(C/ε) / log log(C/ε))-wise independent, the natural family of polynomial hash functions doesn't work
- We use theorems due to Pagh, Pagh, and Siegel that construct k-wise independent families for large k and allow O(1) evaluation time
- For example, Siegel shows: let U = [u] and V = [v] with u = v^c for a constant c > 1, and suppose the machine word size is Ω(log v). Let k = v^{o(1)} be arbitrary. For any constant d > 0 there is a randomized procedure that constructs a k-wise independent hash family H from U to V that succeeds with probability 1 - 1/v^d and requires v^d space. Each h ∈ H can be evaluated in O(1) time
- Can show we have sufficiently random hash functions that can be evaluated in O(1) time and represented with O(ε^{-2} + log n) bits of space
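For intuition, here is the classical k-wise independent family of random degree-(k-1) polynomials over a prime field. Its evaluation time is O(k), which is precisely why the slide invokes the Pagh-Pagh/Siegel constructions instead; this sketch only illustrates what k-wise independence means.

```python
import random

class PolynomialKWiseHash:
    """A random degree-(k-1) polynomial over GF(p) is a k-wise
    independent family from [p] to [p].  Evaluation costs O(k) time,
    so this does NOT achieve the O(1) evaluation of the Pagh-Pagh /
    Siegel families cited on the slide."""

    def __init__(self, k, p=2**61 - 1, out_buckets=None):
        self.p = p
        self.coeffs = [random.randrange(p) for _ in range(k)]
        self.out_buckets = out_buckets  # optionally reduce to [out_buckets]

    def __call__(self, x):
        # Horner's rule evaluation of the random polynomial at x.
        acc = 0
        for c in self.coeffs:
            acc = (acc * x + c) % self.p
        return acc % self.out_buckets if self.out_buckets else acc

# Usage: g = PolynomialKWiseHash(k=8, out_buckets=100); g(12345) -> bucket
```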

Algorithm Outline
- Set K = 1/ε^2
- Instantiate a (lg n) x K bit-matrix A, initializing the entries of A to 0
- Pick random hash functions f: [n] -> [n] and g: [n] -> [K]
- Obtain a constant factor approximation R to F0 somehow
- Update(i): set A_{1, g(i)} = 1, A_{2, g(i)} = 1, …, A_{lsb(f(i)), g(i)} = 1
- Estimator: let T = |{j in [K] : A_{log(16R/K), j} = 1}|, and output (32R/K) · ln(1 - T/K) / ln(1 - 1/K)
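A minimal, unoptimized rendering of this outline (a sketch under assumptions: lsb is 1-based; the hashes are stand-in truly random tables rather than the limited-independence families above; the level and the constants 16 and 32 are copied from the slide as stated):

```python
import math, random

class F0Sketch:
    """Sketch of the slide's algorithm.  Since A[i,j] = 1 implies
    A[i',j] = 1 for i' < i, we store only the deepest 1 per column."""

    def __init__(self, n, eps, rng=random.Random(42)):
        self.K = max(1, int(1 / eps**2))
        # Stand-ins for f: [n] -> [n] and g: [n] -> [K].
        self.f = [rng.randrange(1, n + 1) for _ in range(n + 1)]
        self.g = [rng.randrange(self.K) for _ in range(n + 1)]
        self.deepest = [0] * self.K  # deepest row with a 1, per column

    @staticmethod
    def lsb(x):
        return (x & -x).bit_length()  # 1-based least significant set bit

    def update(self, i):
        # "Set A[1..lsb(f(i)), g(i)] = 1": only the deepest 1 matters.
        j = self.g[i]
        self.deepest[j] = max(self.deepest[j], self.lsb(self.f[i]))

    def estimate(self, R):
        # R: a constant-factor approximation to F0, obtained separately.
        level = max(1, round(math.log2(16 * R / self.K)))
        T = sum(1 for d in self.deepest if d >= level)
        T = min(T, self.K - 1)  # guard: the formula needs T < K
        return (32 * R / self.K) * math.log(1 - T / self.K) / math.log(1 - 1 / self.K)
```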

Space Complexity
- Naively, A is a (lg n) x K bit-matrix, so O(ε^{-2} log n) space
- Better: for each column j, store the identity of the largest row i(j) for which A_{i,j} = 1. Note that if A_{i,j} = 1, then A_{i',j} = 1 for all i' < i. This takes O(ε^{-2} log log n) space
- Better yet: maintain a "base level" I, and for each column j store max(i(j) - I, 0) (see the sketch below)
  - Given an O(1)-approximation R to F0 at each point in the stream, set I = log R
  - Don't need to remember i(j) if i(j) < I, since such j won't be used in the estimator
  - For the j with i(j) ≥ I, about half will have i(j) = I, about one fourth will have i(j) = I + 1, etc.
- The total number of bits to store the offsets is now only O(K) = O(ε^{-2}) with good probability at all points in the stream
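The offset trick in miniature (a sketch; `deepest` is the per-column i(j) from the algorithm above):

```python
def offsets_above_base(deepest, base):
    """Store only max(i(j) - I, 0) per column.  Because a column
    survives to row I + t with probability about 2^{-t}, roughly half
    the stored offsets are 0, a quarter are 1, and so on, so the
    expected total number of bits across all K columns is O(K)."""
    return [max(d - base, 0) for d in deepest]
```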

The Constant Factor Approximation
- Previous algorithms guarantee that at each point in the stream, with probability 1 - δ, the output is an O(1)-approximation to F0
- The space of such algorithms is O(log n log 1/δ); union-bounding over a stream of length m gives O(log n log m) total space
- We achieve O(log n) space, and guarantee that the O(1)-approximation R is non-decreasing (a sketch of the reporting rule follows below):
  - apply the previous scheme to a (log n) x (log n / log log n) matrix
  - for each column, maintain the identity of the deepest row with value 1
  - output 2^i, where i is the largest row containing a constant fraction of 1s
  - repeat the procedure O(1) times and output the median of the estimates
- Can show the output is correct with probability 1 - O(1/log n); then use the non-decreasing property to union-bound over O(log n) events
- We only increase the base level each time R increases by a factor of 2; note that the base level never decreases
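A sketch of the reporting rule and the monotonicity wrapper (the "constant fraction" threshold is an illustrative choice; the talk does not pin down this constant):

```python
import statistics

def rough_estimate(matrices, frac=0.1):
    # Each matrix is one repetition of the (log n) x (log n / log log n)
    # scheme; report 2^i for the deepest row with >= frac fraction of 1s,
    # then take the median over the O(1) repetitions.
    ests = []
    for A in matrices:
        cols = len(A[0])
        deep = max((r for r, row in enumerate(A) if sum(row) >= frac * cols),
                   default=0)
        ests.append(2 ** deep)
    return statistics.median(ests)

class MonotoneR:
    """Enforce the non-decreasing guarantee used in the union bound:
    the reported approximation R only ever moves up."""
    def __init__(self):
        self.R = 1
    def observe(self, estimate):
        self.R = max(self.R, estimate)
        return self.R
```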

Running Time
- Blandford and Blelloch. Definition: a variable-length array (VLA) is a data structure implementing an array C1, …, Cn supporting the following operations:
  - Update(i, x) sets the value of Ci to x
  - Read(i) returns Ci
  The Ci are allowed to have bit representations of varying lengths len(Ci)
- Theorem: there is a VLA using O(n + Σ_i len(Ci)) bits of space supporting worst-case O(1) updates and reads, assuming the machine word size is at least log n
- Store our offsets in a VLA, giving O(1) update time for a fixed base level
- Occasionally we need to update the base level and decrement all offsets by 1
  - The base level only increases after Θ(ε^{-2}) updates, so we can spread this work across those updates, giving O(1) worst-case update time
  - Copy the data structure and perform the additional work on the copy, so it doesn't interfere with reporting the correct answer; when the base level changes, switch to the copy
- For O(1) reporting time, maintain a count of the non-empty containers in a level

Outline for Remainder of Talk
- Proofs of the Upper Bounds
- Proofs of the Lower Bounds

1-Round Communication Complexity
- Alice has input x; Bob has input y and wants to know f(x,y)
- Alice sends a single, randomized message M(x) to Bob
- Bob outputs g(M(x), y) for a randomized function g
- g(M(x), y) should equal f(x,y) with constant probability
- The communication cost CC(f) is |M(x)|, maximized over x and the random bits
Reduction from streaming:
- Alice creates a string s(x), runs a randomized streaming algorithm A on s(x), and transmits the state of A(s(x)) to Bob
- Bob creates a string s(y) and continues A on s(y), thus computing A(s(x)∘s(y))
- If A(s(x)∘s(y)) can be used to solve f(x,y), then space(A) ≥ CC(f)
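Schematically (a sketch with a hypothetical `run`/`resume` interface; any one-pass algorithm whose state fits in S bits yields an S-bit one-way protocol):

```python
def one_way_protocol(A, s_x, s_y, decode):
    """A: a one-pass streaming algorithm with a serializable state.
    Alice runs A on s(x) and sends the state; Bob resumes on s(y).
    The message length is exactly A's space usage, so space(A) >= CC(f)
    whenever `decode` recovers f(x, y) from A's final state."""
    state = A.run(s_x)               # Alice's computation
    message = state                  # the only communication
    final = A.resume(message, s_y)   # Bob's computation
    return decode(final)
```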

The Ω(log n) Bound
- Consider the equality function: f(x,y) = 1 if and only if x = y, for x, y ∈ {0,1}^{n/3}
- It is well known that CC(f) = Ω(log n) for (n/3)-bit strings x and y
- Let C: {0,1}^{n/3} -> {0,1}^n be an error-correcting code with all codewords of Hamming weight n/10
  - if x = y, then C(x) = C(y)
  - if x ≠ y, then Δ(C(x), C(y)) = Ω(n)
- Let s(x) be any string over an alphabet of size n whose i-th character appears in s(x) if and only if C(x)_i = 1; similarly define s(y)
- If x = y, then F0(s(x)∘s(y)) = n/10. Else, F0(s(x)∘s(y)) = n/10 + Ω(n)
- So a constant factor approximation to F0 solves f(x,y)
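The heart of this reduction is just a union of supports (a small sketch; `cx` and `cy` stand for the codewords C(x) and C(y) as 0/1 lists):

```python
def f0_of_concatenation(cx, cy):
    """The distinct characters of s(x)∘s(y) are exactly the union of
    the codewords' supports.  With both codewords of weight n/10,
    equal inputs give F0 = n/10, while unequal inputs give
    F0 = n/10 + distance/2 = n/10 + Omega(n)."""
    support_x = {i for i, b in enumerate(cx) if b}
    support_y = {i for i, b in enumerate(cy) if b}
    return len(support_x | support_y)
```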

The Ω(ε^{-2}) Bound
- Let r = 1/ε^2. Gap-Hamming promise problem for x, y ∈ {0,1}^r:
  - f(x,y) = 1 if Δ(x,y) > ε^{-2}/2
  - f(x,y) = 0 if Δ(x,y) < ε^{-2}/2 - 1/ε
- Theorem: CC(f) = Ω(ε^{-2})
- Can prove this from the Indexing function: Alice has w ∈ {0,1}^r, Bob has i ∈ {1, 2, …, r}, and the output is g(w, i) = w_i. It is well known that CC(g) = Ω(r)
- Proof that CC(f) = Ω(r): Alice sends the seed of a pseudorandom generator to Bob, so the parties have common random strings z_1, …, z_r ∈ {0,1}^r
  - Alice sets x = coordinate-wise majority of {z_j : w_j = 1}
  - Bob sets y = z_i
  - Since the z_j are random, if w_i = 1 then, by properties of majority, with good probability Δ(x,y) < ε^{-2}/2 - 1/ε; otherwise it is likely that Δ(x,y) > ε^{-2}/2
  - Repeat a few times to get concentration
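Alice's majority step in code (a sketch; `z` is the list of shared random strings z_1, …, z_r as 0/1 lists, and `w` is Alice's Indexing input):

```python
def majority_string(z, w):
    # x = coordinate-wise majority of { z_j : w_j = 1 }.
    chosen = [z[j] for j in range(len(w)) if w[j] == 1]
    r = len(z[0])
    return [1 if 2 * sum(s[i] for s in chosen) > len(chosen) else 0
            for i in range(r)]
```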

The Ω(ε^{-2}) Bound, Continued
- Need to create strings s(x) and s(y) so that F0(s(x)∘s(y)) decides whether Δ(x,y) > ε^{-2}/2 or Δ(x,y) < ε^{-2}/2 - 1/ε
- Let s(x) be a string over n characters in which character i appears if and only if x_i = 1; similarly define s(y)
- Then F0(s(x)∘s(y)) = (wt(x) + wt(y) + Δ(x,y))/2
- Alice sends wt(x) to Bob
- A calculation shows that a (1+ε)-approximation to F0(s(x)∘s(y)), together with wt(x) and wt(y), solves the Gap-Hamming problem
- Total communication is space(A) + log(1/ε) = Ω(ε^{-2}); it follows that space(A) = Ω(ε^{-2})
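The identity for F0 is easy to verify directly (a quick sketch):

```python
def f0_identity_check(x, y):
    """F0(s(x)∘s(y)) = |support(x) ∪ support(y)|
                     = (wt(x) + wt(y) + Δ(x, y)) / 2,
    since the union counts each shared 1 once, and each position where
    x and y differ contributes exactly one 1 to the union."""
    union = sum(1 for a, b in zip(x, y) if a or b)
    wt_sum = sum(x) + sum(y)
    hamming = sum(1 for a, b in zip(x, y) if a != b)
    assert 2 * union == wt_sum + hamming
    return union

# Example: f0_identity_check([1, 1, 0, 0], [1, 0, 1, 0]) == 3
```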

Conclusion
- Combining upper and lower bounds, the streaming complexity of estimating F0 up to a (1+ε) factor is Θ(ε^{-2} + log n) bits of space and Θ(1) update and reporting time
- The upper bounds are based on a careful combination of efficient hashing, sampling, and various data structures
- The lower bounds come from 1-way communication complexity