Lower Bounds for Massive Dataset Algorithms T.S. Jayram (IBM Almaden) IIT Kanpur Workshop on Data Streams

Space…the Final Frontier. These are the voyages of the starship Enterprise. Its five-year mission: to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no man has gone before. -Star Trek

Efficient Algorithms for Massive Data Sets
- Traditionally, "efficient" computation is identified with polynomial time. Clearly, this is not adequate for computations over massive data sets.
- We want a similarly robust, single notion of efficient computation over massive data sets.

A Single Theory?
- Modern computing systems are complex and varied: memory architectures, distributed computing, randomization, etc.
- Paradigms of computing: sampling, sketching, data streams, read/write streams, stream-sort, map-reduce, and many more yet to come.

Lower Bounds
- This is fertile ground for proving results, with many successes.
- Certain problems seem to be fundamental.
- Reductions play a big role.

Models with Limited Main Memory
- Sampling: query a small number of data elements.
- Data streams: stream through the data in a one-way fashion, with limited main memory storage.
(Figure: an algorithm with limited memory accessing a large data set, in each model.)

Distributed Computing
- Sketching: compress data chunks into small "sketches"; compute over the sketches.
(Figure: an algorithm combining sketches of the chunks of a data set.)

Sampling: Lower Bounds for Symmetric Functions
- In general, sampling algorithms are adaptive.
Theorem [Bar-Yossef, Kumar, Sivakumar]: When estimating symmetric functions, uniform sampling is the best possible.
Proof idea: Let T be a sampling algorithm for the function. Randomly permute the data elements, then run T. The resulting algorithm estimates the function and uses only uniform samples, as in the sketch below.
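A minimal sketch of this reduction; the sampler T here is a hypothetical black box that queries positions of a list:

    import random

    def uniformized(T, data):
        # Run T on a randomly permuted copy of the data. A symmetric
        # function is unchanged by the shuffle, but every index that the
        # (possibly adaptive) sampler T queries now holds a uniformly
        # random element.
        shuffled = list(data)
        random.shuffle(shuffled)
        return T(shuffled)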

Lower Bounds for Uniform Sampling
Tools:
- Combinatorics: block sensitivity [Nisan]
- Statistics: Hellinger distance, Kullback-Leibler divergence, Jensen-Shannon divergence [Bar-Yossef et al.]
- Information theory [Bar-Yossef]

Example  Find the mean of n numbers in [0,1]  Requires (1/ 2 ) samples to approximate additively within   Lower Bound proof using Hellinger distance

Step 1  Simplify to a decision problem a : ½ + ε 0’s and ½ - ε 1’s b : ½ - ε 0’s and ½ + ε 1’s  Given x 2 {a,b}, any sampling algorithm for mean (with additive error  /4) can distinguish whether x=a or x=b

Step 2  Let P a = distribution on {0,1} by sampling uniformly from a; Similarly P b …  Compute Hellinger distance h 2 (P a,P b ) For discrete distributions P, Q h 2 (P,Q) = k √P - √Q k 2 = 1 - Σ x (P(x) Q(x)) ½ h 2 (P a,P b ) = O( 2 )

Lower Bound via Hellinger Distance
Key idea: the multiplicative property of Hellinger distance:
  1 − h²(P^k, Q^k) = (1 − h²(P, Q))^k
Theorem: Any uniform sampling algorithm needs Ω(1/h²(P_a, P_b)) samples to distinguish input a from input b.
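The multiplicative property can be verified directly for small k by building the k-fold product distributions; this brute-force check is exponential in k and is for illustration only:

    import itertools, math

    def hellinger_sq(p, q):
        return 1.0 - sum(math.sqrt(p[x] * q[x]) for x in p)

    def k_fold(p, k):
        # Product distribution P^k over k-tuples of outcomes.
        return {xs: math.prod(p[x] for x in xs)
                for xs in itertools.product(p, repeat=k)}

    eps, k = 0.1, 5
    p = {0: 0.5 + eps, 1: 0.5 - eps}
    q = {0: 0.5 - eps, 1: 0.5 + eps}
    lhs = 1 - hellinger_sq(k_fold(p, k), k_fold(q, k))
    rhs = (1 - hellinger_sq(p, q)) ** k
    print(lhs, rhs)  # equal: distinguishing needs k = Omega(1/h^2(P,Q)) samples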

Lower Bounds for Data Streams
- Idea: somehow bound the flow of information (this yields space lower bounds).
- The model is too "fine-grained" to prove lower bounds in directly.
- Instead, we consider more powerful models (hopefully simpler to tackle).

Communication Complexity
(Figure: Alice holds x, Bob holds y; they communicate to compute f(x, y).)
- Extensions to multiple parties.
- Resources: # bits, # rounds.

Connection to Data Streams
- A data stream algorithm for f(x ∘ y) with space s and k passes ⇒ an O(ks)-communication, 2k-round protocol for f(x, y).
- A data stream algorithm for f(x_1 ∘ x_2 ∘ … ∘ x_t) with space s and k passes ⇒ an O(tks)-communication protocol for f(x_1, x_2, …, x_t).
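A sketch of the one-pass, two-party case, assuming a hypothetical streaming-algorithm interface with init/update/output methods; the memory state is the only thing that crosses the communication cut:

    def simulate_as_protocol(stream_alg, x, y):
        # Alice runs the streaming algorithm on her half of the stream and
        # sends the memory state (s bits); Bob resumes on his half and
        # outputs. k passes give 2k such messages, i.e., O(ks) bits total.
        state = stream_alg.init()
        for item in x:                       # Alice's part of the stream
            state = stream_alg.update(state, item)
        # --- the state is the only message sent to Bob ---
        for item in y:                       # Bob's part of the stream
            state = stream_alg.update(state, item)
        return stream_alg.output(state)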

Caveat
- Communication complexity usually deals with decision problems.
- Data stream problems involve approximate computations.
- The usual reduction techniques therefore yield promise problems in c.c.

Set Disjointness
- Sets A, B ⊆ [n]; Alice has A and Bob has B. Is A ∩ B = ∅?
- A classical problem in c.c.
- t-party version [Alon, Matias, Szegedy]

C.C. Lower Bounds for Set Disjointness
Theorem [Kalyanasundaram, Schnitger; Razborov]: The randomized c.c. of Disjointness is Ω(n).
Remarks:
- Choose a "hard distribution" on inputs and show a lower bound on communication.
- Unfortunately, the hard distributions here involve correlated inputs.
- The arguments are somewhat tricky.

Direct Sum Methodology
- x and y are characteristic vectors:
  AND(a, b) = a ∧ b
  INT(x, y) = ∨_i (x_i ∧ y_i) = ∨_i AND(x_i, y_i)
- Establish that any protocol for INT must solve n independent copies of AND.
- This is not true for communication itself!

Information Cost P a protocol for a function f [Chakrabarti, Sun, Wirth, Yao] Information cost of P = I(X,Y : P(X,Y)) X,Y are suitably distributed I( : ) denotes Shannon mutual information [Bar-Yossef, J., Kumar, Sivakumar] Conditional Information Cost of P = I(X,Y : P(X,Y) | D)

Information Complexity
- Let μ be a distribution on inputs.
- IC_μ(f) = the minimum information cost of a protocol computing f when the inputs are distributed according to μ.

Information vs. Communication Complexity
Proposition: CC(f) ≥ IC(f).
Proof: Let P compute f. Then
  I(X,Y : P(X,Y) | D) ≤ H(P(X,Y) | D)
                      ≤ H(P(X,Y))   (conditioning reduces entropy)
                      ≤ |P|         (entropy is bounded by the number of bits)

Distribution for Disjointness
For each i = 1, …, n, independently:
- D_i ∈_R {a, b}
- If D_i = a then x_i = 0 and y_i ∈_R {0, 1}
- If D_i = b then x_i ∈_R {0, 1} and y_i = 0
Remarks:
- This always produces disjoint sets!
- Conditioned on D, X and Y are independent.
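A sampler for this distribution, transcribing the slide directly; the assert confirms the first remark:

    import random

    def sample_hard_distribution(n):
        x, y, d = [], [], []
        for _ in range(n):
            d_i = random.choice("ab")
            if d_i == "a":
                x_i, y_i = 0, random.randint(0, 1)   # D_i = a: x_i forced to 0
            else:
                x_i, y_i = random.randint(0, 1), 0   # D_i = b: y_i forced to 0
            x.append(x_i); y.append(y_i); d.append(d_i)
        assert all(a & b == 0 for a, b in zip(x, y))  # sets are always disjoint
        return x, y, d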

Direct Sum Theorem
Theorem: IC(INT) ≥ n · IC(AND).
(Figure: the coordinate pairs (X_1, Y_1), …, (X_n, Y_n), each drawn from the hard distribution according to D_i ∈ {a, b}.)

Information Complexity of AND
- Nice connections to statistical distances.
- In the case of AND, this reduces to lower-bounding the Hellinger distance h²(AND(0,1), AND(1,0)), where AND(a,b) denotes the transcript distribution on input (a,b).

More Thoughts
- Extension to t-party set disjointness: lower bound of Ω(n/t²), improved to Ω(n/(t log t)) [Chakrabarti, Khot, Sun]. Yields optimal space lower bounds for frequency moments F_k, k > 2.
- The method also gives optimal bounds for L_1; [Saks, Sun] proved similar bounds for 1 pass.
- For L_p, p > 2, the space bound is polynomial, with a minor gap between upper and lower bounds in terms of p.

Reductions – Example for F_0
- Indexing: Alice holds a set A of size n/2, Bob holds an element b. Is b ∈ A?
- The one-way c.c. of Indexing is Ω(n); shatter coefficients are useful here [BJKS].
- Reduction: F_0 = n/2 or n/2 + 1; the gap can be amplified by padding. This yields an Ω(1/ε) bound.
- Improved to Ω(1/ε²), but this requires substantial new ideas [Indyk, Woodruff; Woodruff].
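A sketch of the reduction, with f0_estimator standing in for a hypothetical distinct-elements sketch that Alice can compute one-way on her part of the stream:

    def reduce_indexing_to_f0(A, b, f0_estimator):
        # Alice streams the elements of her set A and sends the one-way
        # memory state; Bob appends his element b. The stream has |A|
        # distinct elements if b is in A, and |A| + 1 otherwise, so any
        # estimator with additive error < 1/2 decides Indexing.
        stream = list(A) + [b]
        return f0_estimator(stream) < len(A) + 0.5  # True iff b in A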

Lower Bounds for Sketching
- Simultaneous messages model: each player i sends a sketch A_i(x_i) to a referee, who outputs f(x_1, x_2, …, x_t).
(Figure: players sending sketches A_1(x_1), A_2(x_2), …, A_t(x_t) to a central party.)

Beyond Data Streams: a Peek at External Memory
- Efficient access to external memory is possible in restricted ways: I/O rates for sequential read/write access to disks are as good as random access to main memory.
- New models of I/O-efficient computing:
  - Read/write streams [Grohe, Schweikardt; Grohe, Hernich, Schweikardt]
  - StrSort [Aggarwal, Datar, Rajagopalan, Ruhl]
  - Map-reduce [Dean, Ghemawat]

Read/Write Streams
- Also called reversal Turing machines by [GS].
(Figure: a machine with bounded memory reading and writing t streams.)

Critical Resources
- # tapes t, space s.
- No constraint on the length of the streams, but # reversals is at most r.
- This defines an (r, s, t) read/write stream algorithm.
- Sorting has an (O(log N), O(log N), O(1)) read/write stream algorithm; a pass-counting sketch follows below.
- What happens when # reversals is o(log N)?
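A pass-counting sketch of why O(log N) sequential passes suffice for sorting: each round merges adjacent sorted runs in one pass, doubling run length. This is an in-memory illustration of the access pattern, not an actual (r, s, t) machine:

    from heapq import merge

    def stream_merge_sort(items):
        # Each iteration of the while-loop models one sequential pass that
        # merges adjacent sorted runs, doubling the run length, so the
        # number of passes is ceil(log2 N).
        runs = [[x] for x in items]
        passes = 0
        while len(runs) > 1:
            runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
            passes += 1
        return runs[0], passes

    print(stream_merge_sort([5, 3, 8, 1, 9, 2, 7]))  # ([1, 2, 3, 5, 7, 8, 9], 3)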

Lower Bounds  There is no reduction using c.c. E.g. Equality of strings is easy here So what gives?  The intuition is that it is hard to compare data elements at random locations  Grohe and Schweikardt formalize this and give a nice lower bound technique later extended to 1-sided error [GHS]

Difficult Problems for Read/Write Streams
- A direct-sum type of problem with the inputs moved around: compute h(g(x_1, y_π(1)), g(x_2, y_π(2)), …, g(x_m, y_π(m))).
- Pick a permutation π with small monotonicity.

Previous Results
- Sorting with o(log N) reversals requires Ω(N^(1/5)) space [GS].
- Set Equality with o(log N) reversals requires Ω(N^(1/4)) space [GHS]; this also applies to Sorting.
- The bounds hold for deterministic and randomized 1-sided error models.

Our Results [Beame, J., Rudra]
- Lower bounds for 2-sided error randomized computation.
- Set Disjointness with o(log N / log log N) reversals requires near-linear space.
- We derive our results in a direct-sum framework.

Lower Bound Technique
- 1st step, list machine: records the potential ways in which subsets of input elements can be "compared" at different stages of the computation.
- 2nd step, skeleton: describes the information flow in terms of the locations of elements that are compared.

Key Theorem of [GS, GHS]
- Skeletons resemble transcripts in c.c.
Theorem: The skeletons partition the input domain such that (1) the number of skeletons is "small"; (2) the output depends only on the skeleton; (3) each skeleton satisfies a weak rectangle-like property.

Semi-Rectangle Property of Skeletons
- In c.c., the inputs mapped to a transcript form a rectangle.
- Skeleton: for "most" coordinate pairs (i, π(i)), and for any assignment to x_j and y_π(j) for all j ≠ i, the inputs of the skeleton restricted to this assignment and then projected to (i, π(i)) form a rectangle.

Working with Skeletons
- In [GS, GHS], the proofs use only one coordinate pair.
- For Set Disjointness, the distribution on a single coordinate is skewed towards the 0's of the function, so with 2-sided error we cannot hope for a similar lower bound. Therefore, we keep track of multiple coordinate pairs.
- Tricky part: keeping track of the inputs as we vary the coordinate pairs.

Remarks
- Currently, our direct-sum framework works for primitive functions that have high discrepancy or corruption. It would be nice to have an information complexity based approach.
- We consider two kinds of composition operators: ⊕ and ∨.
- This yields lower bounds for Intersection Size Mod 2 (Inner Product).

Summary
- We have powerful techniques from combinatorics, information theory, and Fourier analysis to tackle problems of "information flow" in massive data set computations.
- These techniques have also influenced complexity theory: e.g., [J., Kumar, Sivakumar] resolved open questions in communication complexity.
- Promise problems still pose a challenge, e.g., Gap-Hamming for multiple passes.