Data Stream Algorithms Lower Bounds Graham Cormode



Data Stream Algorithms — Lower Bounds 2
Streaming Lower Bounds
Lower bounds for data streams:
– Communication complexity bounds
– Simple reductions
– Hardness of the Gap-Hamming problem
– Reductions to Gap-Hamming

This Time: Lower Bounds
So far, we have seen many examples of things we can do with a streaming algorithm. What about things we can't do? And what is the best we can achieve for the things we can do? We will show some simple lower bounds for data streams based on communication complexity.

Streaming As Communication
Imagine Alice processing a stream. She then takes the whole working memory and sends it to Bob, who continues processing the remainder of the stream.

Streaming As Communication
Suppose Alice's part of the stream corresponds to a string x, Bob's part corresponds to a string y, and computing the function on the stream corresponds to computing f(x,y). Then if f(x,y) has communication complexity Ω(g(n)), the streaming computation has a space lower bound of Ω(g(n)).
Proof by contradiction: if there were an algorithm with better space usage, we could run it on x, send the memory contents as a message, and hence solve the communication problem.
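The simulation in this proof can be written out directly. Below is a minimal sketch, with all names (`ExactDistinctCount`, `one_way_protocol`) invented for illustration: Alice runs any streaming algorithm on her half, ships its entire working memory as the one-way message, and Bob resumes from that state.

```python
import pickle

class ExactDistinctCount:
    """Toy streaming algorithm whose working memory is just a set of seen items."""
    def __init__(self):
        self.seen = set()
    def process(self, item):
        self.seen.add(item)
    def result(self):
        return len(self.seen)

def one_way_protocol(alg, alice_stream, bob_stream):
    # Alice processes her half of the stream...
    for item in alice_stream:
        alg.process(item)
    # ...then sends her entire working memory as the one-way message.
    message = pickle.dumps(alg)
    # Bob deserializes Alice's memory and finishes the stream.
    resumed = pickle.loads(message)
    for item in bob_stream:
        resumed.process(item)
    return resumed.result(), len(message)

answer, message_bytes = one_way_protocol(ExactDistinctCount(), [1, 3, 4, 6], [4, 5])
```

Any space-s streaming algorithm therefore yields an s-bit one-way protocol, which is exactly why a communication lower bound transfers to a space lower bound.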

Deterministic Equality Testing
Alice has string x, Bob has string y; they want to test whether x = y. Consider a deterministic (one-round, one-way) protocol that sends a message of length m < n. There are only 2^m possible messages, so some pair of distinct strings must generate the same message, which would cause an error. So a deterministic message (sketch) must be Ω(n) bits.
– In contrast, we saw a randomized sketch of size O(log n).
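The last bullet can be illustrated with a sketch of one standard O(log n) randomized equality test: a polynomial fingerprint evaluated at a shared random point modulo a fixed prime. The specific prime and the Horner-style encoding are illustrative choices, not prescribed by the slides.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; two distinct strings collide w.p. <= n/P

def fingerprint(bits, r):
    # Horner evaluation of the polynomial sum_i bits[i] * r^i  (mod P).
    acc = 0
    for b in reversed(bits):
        acc = (acc * r + b) % P
    return acc

def randomized_equal(x, y, rng=random):
    r = rng.randrange(P)  # shared public randomness
    # Alice sends only (r, fingerprint(x, r)): O(log n) bits in total.
    return fingerprint(x, r) == fingerprint(y, r)

x = [1, 0, 1, 1, 0, 1]
y = [1, 0, 1, 0, 0, 1]
```

Equal strings always match; distinct strings disagree except with probability at most n/P, which is how randomization evades the Ω(n) deterministic bound.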

Hard Communication Problems
INDEX: x is a binary string of length n; y is an index in [n]. Goal: output x[y].
Result: the (one-way, randomized) communication complexity of INDEX is Ω(n) bits.
DISJOINTNESS: x and y are both length-n binary strings. Goal: output 1 if ∃i: x[i] = y[i] = 1, else 0.
Result: the (multi-round, randomized) communication complexity of DISJOINTNESS is Ω(n) bits.

Simple Reduction to Disjointness
F∞: output the highest frequency in a stream.
Input: the two strings x and y from DISJOINTNESS.
Stream: if x[i] = 1, then put i in the stream; then do the same for y.
Example: x = 101101 yields 1, 3, 4, 6; y = 000110 yields 4, 5.
Analysis: if F∞ = 2, the sets intersect; if F∞ ≤ 1, they are disjoint.
Conclusion: giving an exact answer to F∞ requires Ω(N) bits.
– Even approximating up to 50% error is hard.
– Even with randomization: the DISJ bound allows randomness.
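The reduction on this slide is mechanical enough to write down. A minimal sketch, using hypothetical example bit-strings chosen to match the stream 1, 3, 4, 6, 4, 5:

```python
from collections import Counter

def stream_from(x, y):
    # Put i in the stream for each set bit of x (1-indexed), then of y.
    return [i + 1 for i, b in enumerate(x) if b] + \
           [i + 1 for i, b in enumerate(y) if b]

def f_infinity(stream):
    # Exact highest frequency of any item in the stream.
    return max(Counter(stream).values()) if stream else 0

x = [1, 0, 1, 1, 0, 1]   # contributes 1, 3, 4, 6
y = [0, 0, 0, 1, 1, 0]   # contributes 4, 5
stream = stream_from(x, y)
intersecting = (f_infinity(stream) == 2)   # F_inf = 2 iff the sets intersect
```

Since x and y share position 4, the item 4 appears twice and F∞ = 2, certifying intersection.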

Simple Reduction to Index
F0: output the number of distinct items in the stream.
Input: the string x and the index y from INDEX.
Stream: if x[i] = 1, put i in the stream; then put y in the stream.
Example: x = 101101 and y = 5 yield the stream 1, 3, 4, 6, 5.
Analysis: if (1−ε)F′0(x∘y) > (1+ε)F′0(x), then x[y] = 0, else x[y] = 1 (appending y raises F0 by one exactly when x[y] = 0).
Conclusion: approximating F0 with ε < 1/N requires Ω(N) bits.
– This implies that the space to approximate must be Ω(1/ε).
– The bound allows randomization.
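With an exact F0 (the approximate version sharpens this into the ε-dependence above), the reduction looks like the following sketch; `index_bit` is an invented name for illustration.

```python
def f0(stream):
    # Exact number of distinct items.
    return len(set(stream))

def index_bit(x, y):
    """Recover x[y] (1-indexed) from two F0 queries, as in the reduction."""
    alice_stream = [i + 1 for i, b in enumerate(x) if b]
    before = f0(alice_stream)
    after = f0(alice_stream + [y])   # Bob appends the index y
    # F0 grows by one exactly when y was absent from the stream, i.e. x[y] = 0.
    return 1 if after == before else 0

x = [1, 0, 1, 1, 0, 1]
```

Bob only needs the F0 value before and after feeding in y, which is exactly what Alice's memory message lets him compute.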

Hardness Reduction Exercises
Use reductions to DISJ or INDEX to show the hardness of:
– Frequent items: find all items in the stream whose frequency exceeds φN, for some φ.
– Sliding window: given a stream of binary (0/1) values, compute the sum of the last N values. Can this be approximated instead?
– Min-dominance: given a stream of pairs (i, x(i)), approximate Σ_j min {x(i) : (i, x(i)) in the stream, i = j}.
– Rank sum: given a stream of (x,y) pairs and a query (p,q) specified after the stream, approximate |{(x,y) : x < p, y < q}|.


Gap Hamming
The GAP-HAMM communication problem: Alice holds x ∈ {0,1}^N, Bob holds y ∈ {0,1}^N.
Promise: H(x,y) is either ≤ N/2 − √N or ≥ N/2 + √N. Which is the case?
Model: one message from Alice to Bob.
Requires Ω(N) bits of one-way randomized communication [Indyk, Woodruff ’03; Woodruff ’04; Jayram, Kumar, Sivakumar ’07].

Hardness of Gap Hamming
Reduction from an instance of INDEX.
Map the string x to u by 1 → +1, 0 → −1 (i.e., u[i] = 2x[i] − 1).
Assume both Alice and Bob have access to public random strings r_j, where each bit of r_j is iid uniform over {−1, +1}.
Assume w.l.o.g. that the string length n is odd (important!).
Alice computes a_j = sign(r_j · u); Bob computes b_j = sign(r_j[y]).
Repeat N times with different random strings, and consider the Hamming distance of a_1 … a_N and b_1 … b_N.

Probability of a Hamming Error
Consider the pair a_j = sign(r_j · u), b_j = sign(r_j[y]), and let w = Σ_{i ≠ y} u[i] r_j[i].
– w is a sum of (n−1) values distributed iid uniform over {−1, +1}.
Case 1: w ≠ 0. Then |w| ≥ 2, since (n−1) is even,
– so a_j = sign(w), independent of x[y].
– Then Pr[a_j ≠ b_j] = Pr[sign(w) ≠ sign(r_j[y])] = ½.
Case 2: w = 0. Then a_j = sign(r_j · u) = sign(w + u[y]r_j[y]) = sign(u[y]r_j[y]).
– Then Pr[a_j = b_j] = Pr[sign(u[y]r_j[y]) = sign(r_j[y])].
– This probability is 1 if u[y] = +1, and 0 if u[y] = −1:
– completely biased by the answer to INDEX.

Finishing the Reduction
So what is Pr[w = 0]?
– w is a sum of (n−1) iid uniform {−1, +1} values.
– Textbook: Pr[w = 0] = c/√n, for some constant c.
Some probability manipulation gives:
– Pr[a_j = b_j] = ½ + c/(2√n) if x[y] = 1
– Pr[a_j = b_j] = ½ − c/(2√n) if x[y] = 0
Amplify this bias by making the strings have length N = 4n/c².
– Apply a Chernoff bound on the N instances.
– With probability > 2/3, either H(a,b) > N/2 + √N or H(a,b) < N/2 − √N.
If we could solve GAP-HAMMING, we could solve INDEX.
– Therefore, GAP-HAMMING needs Ω(N) = Ω(n) bits.
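The two biased cases above can be sanity-checked by Monte Carlo simulation. This is a sketch outside the proof; n = 9, the trial count, and the seed are arbitrary choices for illustration.

```python
import random

def agree_rate(u, y, trials=20000, seed=0):
    """Estimate Pr[a_j == b_j] for a_j = sign(r_j . u), b_j = sign(r_j[y])."""
    rng = random.Random(seed)
    n = len(u)  # n is odd, so the dot product r_j . u is never zero
    agree = 0
    for _ in range(trials):
        r = [rng.choice((-1, 1)) for _ in range(n)]
        a = 1 if sum(ui * ri for ui, ri in zip(u, r)) > 0 else -1
        b = 1 if r[y] > 0 else -1
        agree += (a == b)
    return agree / trials

n = 9
u_plus = [1] * n                  # u[y] = +1 at y = 0, i.e. x[y] = 1
u_minus = [1] * (n - 1) + [-1]    # u[y] = -1 at y = n-1, i.e. x[y] = 0
```

With n−1 = 8, Pr[w = 0] = C(8,4)/2^8 ≈ 0.27, so the agreement rate sits clearly above ½ in the first case and below it in the second, matching the ½ ± c/(2√n) bias.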


Lower Bound for Entropy
Alice: x ∈ {0,1}^N; Bob: y ∈ {0,1}^N; entropy estimation algorithm A.
Alice runs A on enc(x) = ⟨(1,x_1), (2,x_2), …, (N,x_N)⟩, then sends over her memory contents to Bob.
Bob continues A on enc(y) = ⟨(1,y_1), (2,y_2), …, (N,y_N)⟩.
Example: Alice streams (1,0)(2,1)(3,0)(4,0)(5,1)(6,1); Bob streams (1,1)(2,1)(3,0)(4,0)(5,1)(6,0).

Lower Bound for Entropy
Observe that the combined stream S contains:
– 2H(x,y) tokens with frequency 1 each
– N − H(x,y) tokens with frequency 2 each
So H(S) = log N + H(x,y)/N.
An entropy estimate accurate enough to resolve the Gap-Hamming promise on H(x,y) would solve GAP-HAMMING, so the size of Alice's memory contents must be Ω(N).
Set ε = 1/(√N log N) to show a bound of Ω((ε log(1/ε))^−2).
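The identity H(S) = log N + H(x,y)/N (logs base 2) can be checked numerically. The bit-strings below are arbitrary illustrative values, not taken from the slides.

```python
import math
from collections import Counter

def empirical_entropy(stream):
    # H(S) = sum over tokens of (f/n) * log2(n/f), with f the token frequency.
    n = len(stream)
    return sum((c / n) * math.log2(n / c) for c in Counter(stream).values())

def enc(bits):
    # The slide's encoding: position-value pairs (1-indexed).
    return [(i + 1, b) for i, b in enumerate(bits)]

x = [0, 1, 0, 0, 1, 1]
y = [1, 1, 0, 0, 1, 0]
N = len(x)
S = enc(x) + enc(y)                              # the combined stream
ham = sum(xi != yi for xi, yi in zip(x, y))      # Hamming distance H(x,y)
gap = abs(empirical_entropy(S) - (math.log2(N) + ham / N))
```

Each matched position contributes one token of frequency 2 and each mismatched position two tokens of frequency 1, which is exactly where the H(x,y)/N term comes from.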

Lower Bound for F 0
The same encoding works for F0 (Distinct Elements):
– 2H(x,y) tokens with frequency 1 each
– N − H(x,y) tokens with frequency 2 each
So F0(S) = N + H(x,y).
By the promise, either H(x,y) > N/2 + √N or H(x,y) < N/2 − √N.
– If we could approximate F0 with ε < 1/√N, we could separate the two cases.
– But the space bound is Ω(N) = Ω(ε^−2) bits.
So the dependence on ε for F0 is tight.
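The companion identity F0(S) = N + H(x,y) admits the same numerical check, again with arbitrary illustrative strings:

```python
def enc(bits):
    # Position-value pair encoding, as on the entropy slide (1-indexed).
    return [(i + 1, b) for i, b in enumerate(bits)]

x = [0, 1, 0, 0, 1, 1]
y = [1, 1, 0, 0, 1, 0]
N = len(x)
S = enc(x) + enc(y)
distinct = len(set(S))                           # exact F0 of the stream
ham = sum(xi != yi for xi, yi in zip(x, y))      # Hamming distance H(x,y)
# Matched positions contribute one distinct token, mismatched ones two,
# giving (N - H) + 2H = N + H distinct tokens in total.
```

Resolving the ±√N promise on H(x,y) from F0(S) thus needs relative error below 1/√N, which is the source of the Ω(ε^−2) bound.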

Lower Bounds Exercises
1. Formally argue the space lower bound for F2 via Gap-Hamming.
2. Argue space lower bounds for Fk via Gap-Hamming.
3. (Research problem) Extend the lower bounds to the case when the order of the stream is random or near-random.
4. (Research problem) Kumar conjectures that the multi-round communication complexity of Gap-Hamming is Ω(n); this would give lower bounds for multi-pass streaming.
