Lower bounds on data stream computations Seminar in Communication Complexity By Michael Umansky Instructor: Ronitt Rubinfeld.

Previously... We proved three theorems concerning the space complexity of data stream algorithms. Using the streaming model discussed earlier, we derived lower bounds for the MAX, MAXNEIGHBOR, MAXTOTAL and MAXPATH problems. And now, for something completely different.

Today In this lecture, I introduce lower bounds from communication complexity. Trust me, they are correct. Using these bounds and (mostly) reductions, our goal is to prove even more theorems. Theorems are good. I'll prove three of them, starting with “Theorem 4”.

Theorem 4 Setting: A sequence of m numbers in {1,...,n}. – Multiple occurrences are allowed. Claim: Finding the k most frequent items requires Ω(n/k) space. Moreover, random sampling yields an upper bound of O(n (log m + log n) / k). We're going to use a blackbox to prove it.

Theorem 4 blackbox Alon-Matias-Szegedy: Finding the most frequent number in a sequence of length m over the range {1,...,n} takes Ω(n) space. Proof outline: Reduction. Namely, we create a new stream on which we can (ab)use this blackbox. The reduction replaces each number in the sequence with a block of numbers: – Each i in {1,...,n} is replaced with ki+1,...,ki+k. – In total, the new range holds nk numbers.

Reduction example Our data stream is {4,5,3,2,7,3,4,5,1} in range {1,...,10} and we want the 2 most frequent numbers. The reduction creates the numbers: {9,10}, {11,12}, {7,8}, {5,6}, {15,16}, {7,8}, {9,10}, {11,12}, {3,4} The most frequent numbers in the original sequence correspond to the most frequent numbers in the new sequence.
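To make the mapping concrete, here is a minimal Python sketch of the replacement step (the helper name `expand` is mine, for illustration only):

```python
def expand(stream, k):
    """Replace each number i in the stream with the block k*i+1, ..., k*i+k."""
    out = []
    for i in stream:
        out.extend(range(k * i + 1, k * i + k + 1))
    return out

# The example above: 4 -> {9, 10}, 5 -> {11, 12}, and so on.
print(expand([4, 5, 3, 2, 7, 3, 4, 5, 1], 2))
```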

Proof outline If x_i = x_j, then the blocks created by the reduction coincide. Otherwise, they are disjoint. If x_i occurs l times in the stream, each of the k numbers replacing it occurs l times in the new stream. So the k most frequent items of the new stream are exactly the block of the most frequent original item: an algorithm finding even one of them recovers the most frequent original item, and by the AMS blackbox that takes Ω(n) space on a range of size nk, i.e., Ω(n/k) space in terms of the range size. Great success.

As for the upper bound Reminder: a Monte-Carlo algorithm is a randomized algorithm that may return a wrong answer, but only with small probability. So we'll show a Monte-Carlo algorithm that succeeds with high probability to get the desired upper bound.

The Monte-Carlo algorithm Before reading the stream: – Sample each value in {1,...,n} with probability 1/k. – Keep a counter only for the sampled values. Read the stream normally. Output the sampled number with the largest count. With constant probability, one of the k most frequent numbers has been sampled. This requires O(n (log m + log n) / k) space: an expected n/k counters, each holding a value (log n bits) and a count (log m bits). Epic win.
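A minimal sketch of this sampling scheme (the function name and the `seed` parameter are mine, added for reproducibility; a real streaming implementation would store only the sampled counters, never the whole stream):

```python
import random

def sample_and_count(stream, n, k, seed=0):
    """Sample each value in {1,...,n} with probability 1/k up front,
    count only the sampled values while reading the stream, and return
    the sampled value with the largest count (None if none appeared)."""
    rng = random.Random(seed)
    sampled = {v for v in range(1, n + 1) if rng.random() < 1 / k}
    counts = {}
    for x in stream:
        if x in sampled:
            counts[x] = counts.get(x, 0) + 1
    return max(counts, key=counts.get) if counts else None
```

With constant probability the sample hits one of the k most frequent values, and the output is then at least as frequent as that value.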

And now for something completely different Introducing the approximate median problem (AMP). Reminder: The median is the value that separates the higher half of the set from the lower half. We want to approximate it. Why? Because it's cool.

This slide isn't the median problem First, a blackbox from communication complexity. Consider the bit-vector probing problem: – Let A have a bit sequence of length m and B an index i. B needs to know x_i, the i-th input bit. – But the communication is one-way only; B cannot send anything to A. Ideas?

Blackbox cont. It turns out there is no better method than for A to send the entire string to B. – So it takes Ω(m) bits of communication. But what about randomization? – Too bad: any algorithm that guesses x_i with probability better than (1+ε)/2 requires at least εm bits of communication.

Approximate median problem Goal: Find a number whose rank is in the interval [m/2 − εm, m/2 + εm]. It can be solved by a one-pass Monte-Carlo algorithm with error probability 1/10, using O(log n (log 1/ε)^2 / ε) space. I have a truly magnificent proof of this theorem. This slideshow is too small to contain it.

AMP cont. Motivation: We want to prove a corresponding lower bound for this problem. How: We show that any one-pass Las Vegas algorithm that solves ε-AMP requires Ω(1/ε) space, via a reduction from the bit-vector probing problem.

AMP lower bound proof Let b = (b_1,...,b_n) be the bit vector, followed by a query index i. This is translated to a sequence of numbers as follows: – First, output 2j+b_j for each j = 1,...,n. – Then, upon receiving the query, output n−i+1 copies of 0 and i+1 copies of 2(n+1).

Reduction example b = (0,1,0,1,1,0,1,1,0,1), i=5. The reduction maps: – 2j+b_j: [2,5,6,9,11,12,15,17,18,21] – n−i+1 = 6 copies of 0: [0,0,0,0,0,0] – i+1 = 6 copies of 22 = 2(n+1): [22,22,22,22,22,22] The median of this set is 11. Its LSB is 1, which is exactly the value of b_5.
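The whole reduction fits in a few lines of Python, as a sanity check of the example above (helper names are mine; `lower_median` picks the lower of the two middle elements of an even-length stream):

```python
def amp_reduction(bits, i):
    """Map a bit vector b_1..b_n and a 1-based query index i to the
    reduction stream: 2j + b_j for each j, then n-i+1 zeros, then
    i+1 copies of 2(n+1)."""
    n = len(bits)
    return ([2 * j + bits[j - 1] for j in range(1, n + 1)]
            + [0] * (n - i + 1)
            + [2 * (n + 1)] * (i + 1))

def lower_median(stream):
    s = sorted(stream)
    return s[(len(s) - 1) // 2]

bits = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
med = lower_median(amp_reduction(bits, 5))
print(med, med % 2)  # median 11, whose LSB 1 equals b_5
```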

AMP proof cont. It is easily verified that the least significant bit of the median of this sequence is b_i (the bit we seek). Choose ε = 1/2n; since the reduced stream has roughly 2n numbers, the ε-approximate median is the exact median. Therefore any one-pass algorithm that requires fewer than 1/2ε = n bits of memory can be used...

AMP proof cont. … to derive a communication protocol that requires fewer than n bits to be communicated from A to B in solving bit vector probing. But every protocol that solves bit vector probing must communicate n bits. Contradiction. Quod erat demonstratum.

Corollary What's the point I've been trying to make? Randomization can sometimes reduce space complexity significantly, at the cost of a guarantee of output correctness. Moving right along.

Some graph theory A graph can be treated as a stream. – Example: Its adjacency list. This means some graph-theoretic problems can be approximated or solved using data stream and communication complexity techniques. I'll address a small selection of them.

Why is this good? Suppose we can read the stream more than once (we don't have enough memory to store it, but we do have access to it). The number of times we can read the stream is finite, however. What graph-theoretic problems could we approximate with this method?

Theorem 6 In P passes, the following problems on an n-node graph take Ω(n / P) space: – Computing connected components. – Computing k-edge-connected components. – Computing k-vertex-connected components. – Testing graph planarity. – Finding the sinks of a directed graph. I'll prove the connectivity case.

Connected components Proof by reduction from DISJOINT to the graph connectivity problem. Reminder: DISJOINT(x,y) returns 1 iff there exists i such that x_i = y_i = 1. Given bit vectors x (A's) and y (B's), construct a graph on vertices {a,b,1,...,n}. Insert an edge (a,i) iff bit i is set in A's vector, and an edge (i,b) iff it is set in B's vector. Vertices a and b are connected iff there exists a bit set in both vectors.
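A small Python sketch of this construction (helper names are mine): it builds the graph and checks by graph search whether a and b share a component, which is exactly the intersection test.

```python
def a_b_connected(x, y):
    """Edge (a, i) iff x_i = 1, edge (i, b) iff y_i = 1; return whether
    a and b end up in the same component, i.e. whether some index i
    has x_i = y_i = 1."""
    n = len(x)
    adj = {v: [] for v in ['a', 'b'] + list(range(1, n + 1))}
    for i in range(1, n + 1):
        if x[i - 1]:
            adj['a'].append(i)
            adj[i].append('a')
        if y[i - 1]:
            adj[i].append('b')
            adj['b'].append(i)
    seen, stack = {'a'}, ['a']  # depth-first search from a
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return 'b' in seen

print(a_b_connected([1, 0, 1], [0, 0, 1]))  # True: bit 3 is set in both
print(a_b_connected([1, 0, 0], [0, 1, 0]))  # False: no common bit
```

Note that indices with neither bit set are isolated vertices, so strictly speaking it is the a–b connectivity, not connectivity of the whole graph, that encodes DISJOINT.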

Connectivity cont. From communication complexity, we know that every protocol solving DISJOINT sends Ω(n) bits. So if we have P passes over the data, one of the passes must use Ω(n / P) space. This is a total cheating hack, by the way. Blame HRR. QED anyway. That's all folks!