Lower Bounds for Massive Dataset Algorithms T.S. Jayram (IBM Almaden) IIT Kanpur Workshop on Data Streams
Space…the Final Frontier. These are the voyages of the starship Enterprise. Its five-year mission: to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no man has gone before. -Star Trek
Efficient Algorithms for Massive Data Sets. Traditionally, “efficient” computation is identified with polynomial time. Clearly, this is not adequate for computations over massive data sets. Similarly, we want a single notion of efficient computation over massive data sets.
A Single Theory? Modern computing systems are complex and varied: memory architectures, distributed computing, randomization, etc. Paradigms of computing: sampling, sketching, data streams, read/write streams, stream-sort, map-reduce, and many more yet to come.
Lower Bounds. This is fertile ground for proving results. Many successes. Certain problems seem to be fundamental. Reductions play a big role.
Models with Limited Main Memory. Sampling: query a small number of data elements. Data streams: stream through the data in a one-way fashion, with limited main memory storage. [Figures: an algorithm accessing a data set, one per model]
Distributed Computing. Sketching: compress data chunks into small “sketches”; compute over the sketches. [Figure: algorithm and data set]
Sampling: Lower Bounds for Symmetric Functions. In general, sampling algorithms are adaptive. Theorem [Bar-Yossef, Kumar, Sivakumar]: When estimating symmetric functions, uniform sampling is the best possible. Proof idea: Let T be a sampling algorithm for the function. Randomly permute the data elements, then run T. Because the function is symmetric, the resulting algorithm still estimates the function, and it uses uniform samples.
Lower Bounds for Uniform Sampling. Tools: block sensitivity (combinatorics [Nisan]); Hellinger distance (statistics [Bar-Yossef et al.]); Kullback-Leibler divergence and Jensen-Shannon divergence (information theory [Bar-Yossef]).
Example. Find the mean of n numbers in [0,1]. Requires Ω(1/ε²) samples to approximate additively within ε. Lower bound proof using Hellinger distance.
Step 1. Simplify to a decision problem. Input a: a ½ + ε fraction of 0’s and a ½ - ε fraction of 1’s. Input b: a ½ - ε fraction of 0’s and a ½ + ε fraction of 1’s. Given x ∈ {a,b}, any sampling algorithm for the mean (with additive error ε/4) can distinguish whether x = a or x = b.
Step 2. Let P_a = the distribution on {0,1} obtained by sampling uniformly from a; similarly P_b. Compute the Hellinger distance h²(P_a, P_b). For discrete distributions P, Q: h²(P,Q) = ½ ‖√P - √Q‖² = 1 - Σ_x (P(x) Q(x))^½. Here h²(P_a, P_b) = O(ε²).
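The calculation in Step 2 can be checked numerically. The following is a minimal sketch, assuming the normalization h²(P,Q) = 1 - Σ_x (P(x) Q(x))^½; the function names `hellinger_sq` and `sample_distributions` are illustrative and not from the talk.

```python
import math

def hellinger_sq(p, q):
    """Squared Hellinger distance 1 - sum_x sqrt(P(x) * Q(x)).

    p and q are dicts mapping outcomes to probabilities.
    """
    support = set(p) | set(q)
    return 1.0 - sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in support)

def sample_distributions(eps):
    """Distributions of one uniform sample from input a and from input b."""
    p_a = {0: 0.5 + eps, 1: 0.5 - eps}  # a has a 1/2 + eps fraction of 0's
    p_b = {0: 0.5 - eps, 1: 0.5 + eps}  # b has a 1/2 - eps fraction of 0's
    return p_a, p_b

for eps in (0.1, 0.01, 0.001):
    p_a, p_b = sample_distributions(eps)
    h2 = hellinger_sq(p_a, p_b)
    # The ratio h2 / eps^2 stays bounded, consistent with h2 = O(eps^2).
    print(f"eps={eps}: h2={h2:.3e}, h2/eps^2={h2 / eps**2:.4f}")
```

For these two-point distributions the closed form is h²(P_a, P_b) = 1 - √(1 - 4ε²), which is about 2ε² for small ε.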
Lower Bound via Hellinger Distance. Key idea: multiplicative property of Hellinger distance: 1 - h²(P^k, Q^k) = (1 - h²(P,Q))^k, where P^k denotes the distribution of k independent samples from P. Theorem: Any uniform sampling algorithm needs Ω(1/h²(P_a, P_b)) samples to distinguish input a from input b.
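The multiplicative property can be verified by brute force for small k. This sketch builds the k-fold product distribution explicitly and compares both sides of the identity; the helper names and the choice ε = 0.1 are assumptions made for illustration.

```python
import itertools
import math

def hellinger_sq(p, q):
    """Squared Hellinger distance 1 - sum_x sqrt(P(x) * Q(x))."""
    support = set(p) | set(q)
    return 1.0 - sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in support)

def product_distribution(p, k):
    """Distribution of k independent draws from p, keyed by k-tuples of outcomes."""
    return {
        outcome: math.prod(p[x] for x in outcome)
        for outcome in itertools.product(p, repeat=k)
    }

eps = 0.1  # illustrative value
p_a = {0: 0.5 + eps, 1: 0.5 - eps}
p_b = {0: 0.5 - eps, 1: 0.5 + eps}
h2 = hellinger_sq(p_a, p_b)

for k in (1, 2, 5, 10):
    lhs = 1.0 - hellinger_sq(product_distribution(p_a, k), product_distribution(p_b, k))
    rhs = (1.0 - h2) ** k
    print(f"k={k}: 1 - h2(P^k, Q^k) = {lhs:.6f}, (1 - h2(P, Q))^k = {rhs:.6f}")
```

Since (1 - h²)^k stays close to 1 until k is on the order of 1/h², the k-sample distributions remain close in Hellinger distance for smaller k, which is the intuition behind the Ω(1/h²) sample bound.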
Lower Bounds for Data Streams. The idea is to somehow bound the flow of information (this yields space lower bounds). The model is too “fine-grained” to prove lower bounds directly. Instead, we consider more powerful models (hopefully simpler to tackle).
Communication Complexity. [Figure: Alice holds x, Bob holds y; they communicate to compute f(x,y)] Extensions to multiple parties. Resources: number of bits, number of rounds.
Connection to Data Streams. A data stream algorithm for f(x ∘ y) with space s and k passes ⇒ an O(ks)-bit, 2k-round protocol for f(x,y). A data stream algorithm for f(x_1 ∘ x_2 ∘ ⋯ ∘ x_t) with space s and k passes ⇒ an O(tks)-bit protocol for f(x_1, x_2, …, x_t). Here ∘ denotes concatenation of the inputs into a single stream.
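The two-party simulation can be sketched as follows: Alice runs the streaming algorithm on her half, sends the memory contents to Bob, and Bob continues; for multiple passes the state is sent back. The interface (`update`, `run_stream`, `run_protocol`) and the toy two-pass algorithm are hypothetical and serve only to illustrate the reduction.

```python
def run_stream(update, state, stream, passes):
    """Run a multi-pass streaming algorithm directly on a single stream."""
    for p in range(passes):
        for item in stream:
            state = update(state, p, item)
    return state

def run_protocol(update, state, x, y, passes):
    """Simulate the same algorithm when Alice holds x and Bob holds y.

    The only information exchanged is the algorithm's memory state,
    so each message costs at most s bits for a space-s algorithm.
    """
    messages = 0
    for p in range(passes):
        for item in x:            # Alice processes her part of the stream
            state = update(state, p, item)
        messages += 1             # Alice -> Bob: current memory contents
        for item in y:            # Bob continues from that state
            state = update(state, p, item)
        if p < passes - 1:
            messages += 1         # Bob -> Alice: state for the next pass
    return state, messages

# Toy 2-pass algorithm (assumed for illustration): pass 0 finds the maximum,
# pass 1 counts how often it occurs. The state is the pair (best, count).
def update(state, p, item):
    best, count = state
    if p == 0:
        return (item if best is None or item > best else best, count)
    return (best, count + (item == best))

x, y = [3, 7, 1], [7, 2, 7]
direct = run_stream(update, (None, 0), x + y, passes=2)
via_protocol, messages = run_protocol(update, (None, 0), x, y, passes=2)
print(direct, via_protocol, messages)
```

The protocol's answer matches the direct run on x ∘ y, and the number of messages is at most 2k, so a communication lower bound of C bits forces ks = Ω(C).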
Caveat. Communication complexity usually deals with decision problems. Data stream problems involve approximate computations. Usual reduction techniques yield promise problems in c.c.
Set Disjointness. Sets A, B ⊆ [n]. Alice has A and Bob has B. Is A ∩ B = ∅? Classical problem in c.c. t-party version [Alon, Matias, Szegedy].
C.C. Lower Bounds for Set Disjointness. Theorem: Randomized c.c. of Disjointness is Ω(n) [Kalyanasundaram, Schnitger; Razborov]. Remarks: Choose a “hard distribution” on inputs and show a lower bound on communication. Unfortunately, the hard distributions here involve correlated inputs. The arguments are somewhat tricky.
Direct Sum Methodology. x and y are characteristic vectors. AND(a,b) = a ∧ b. INT(x,y) = ⋁_i (x_i ∧ y_i) = ⋁_i AND(x_i, y_i). Establish that any protocol for INT must solve n independent copies of AND. This is not true for communication itself!
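The decomposition of INT into coordinate-wise ANDs is easy to state in code. This is a small illustrative sketch; the helper `characteristic_vector` and the example sets are assumptions, not part of the talk.

```python
def AND(a, b):
    """Two-bit AND primitive."""
    return a & b

def INT(x, y):
    """INT(x, y) = OR over i of AND(x_i, y_i): 1 iff the sets intersect."""
    return int(any(AND(xi, yi) for xi, yi in zip(x, y)))

def characteristic_vector(s, n):
    """0/1 vector of length n for a subset s of {1, ..., n}."""
    return [1 if i in s else 0 for i in range(1, n + 1)]

n = 6
A, B, C = {1, 3, 5}, {2, 4, 6}, {3, 4}
x = characteristic_vector(A, n)
print(INT(x, characteristic_vector(B, n)))  # A and B are disjoint
print(INT(x, characteristic_vector(C, n)))  # A and C share element 3
```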
Information Cost. P is a protocol for a function f. [Chakrabarti, Shi, Wirth, Yao] Information cost of P = I(X,Y : P(X,Y)), where X, Y are suitably distributed and I( : ) denotes Shannon mutual information. [Bar-Yossef, J., Kumar, Sivakumar] Conditional information cost of P = I(X,Y : P(X,Y) | D), conditioned on an auxiliary random variable D.
Information Complexity. Let μ be a distribution on inputs. IC(f) = minimum information cost of a protocol computing f, where the inputs are distributed according to μ.
Information vs Communication Complexity. Proposition: CC(f) ≥ IC(f). Proof: Let P compute f. Then I(X,Y : P(X,Y) | D) ≤ H(P(X,Y) | D) ≤ H(P(X,Y)) (conditioning reduces entropy) ≤ |P| (entropy bounded by bits).
Distribution for Disjointness. For each i = 1, …, n, independently: D_i ∈_R {a,b}. If D_i = a then x_i = 0 and y_i ∈_R {0,1}. If D_i = b then x_i ∈_R {0,1} and y_i = 0. Remarks: This always produces disjoint sets! Conditioned on D, X and Y are independent.
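A sampler for this distribution makes the first remark concrete. This is a minimal sketch; the function name `sample_input`, the seed, and the sample sizes are illustrative choices.

```python
import random

def sample_input(n, rng):
    """Draw (D, X, Y): each coordinate independently picks D_i in {a, b},
    which forces one of x_i, y_i to 0 and leaves the other a uniform bit."""
    d, x, y = [], [], []
    for _ in range(n):
        di = rng.choice("ab")
        if di == "a":
            xi, yi = 0, rng.randint(0, 1)
        else:
            xi, yi = rng.randint(0, 1), 0
        d.append(di)
        x.append(xi)
        y.append(yi)
    return d, x, y

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_input(8, rng) for _ in range(1000)]

# In every coordinate at least one of x_i, y_i is 0, so the sets never intersect.
always_disjoint = all(
    (xi & yi) == 0 for _, x, y in samples for xi, yi in zip(x, y)
)
print(always_disjoint)
```

Once D is fixed, each x_i depends only on Alice's coin flips and each y_i only on Bob's, which is why X and Y are independent conditioned on D.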
Direct Sum Theorem. Theorem: IC(INT) ≥ n · IC(AND). [Figure: coordinate pairs (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n), with labels a, b and entries fixed to 0]
Information Complexity of AND. Nice connections to statistical distances. In the case of AND, this reduces to getting a lower bound on the Hellinger distance h²(AND(0,1), AND(1,0)).
More Thoughts. Extension to t-party set disjointness: lower bound of Ω(n/t²). Can be improved to Ω(n/(t log t)) [Chakrabarti, Khot, Sun]. Yields optimal space lower bounds for frequency moments F_k, k > 2. Method also gives optimal bounds for L_∞; [Saks, Sun] proved similar bounds for 1 pass. For L_p, p > 2, the space bound is polynomial, with a minor gap between u.b. and l.b. in terms of p.
Reductions: Example for F_0. Indexing: Alice holds a set A of size n/2, Bob holds an element b. Is b ∈ A? One-way c.c. of Indexing is Ω(n). Shatter coefficients are useful here [BJKS]. Stream A followed by b: F_0 = n/2 if b ∈ A, or n/2+1 otherwise. Gap can be amplified by padding. Yields an Ω(1/ε) bound. Improved to Ω(1/ε²), but this requires substantial new ideas [Indyk, Woodruff; Woodruff].
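The reduction from Indexing to F_0 can be sketched with an exact distinct-elements counter standing in for the streaming algorithm. The names `f0` and `index_via_f0` are hypothetical, and a real streaming algorithm would only approximate F_0, which is why the padding step is needed to amplify the gap.

```python
def f0(stream):
    """Number of distinct elements in the stream (exact, for illustration only)."""
    return len(set(stream))

def index_via_f0(A, b):
    """Decide whether b is in A using only the F_0 value of the stream A followed by b.

    Alice streams her set A, Bob appends his element b.
    F_0 equals |A| if b is in A and |A| + 1 otherwise.
    """
    stream = list(A) + [b]
    return f0(stream) == len(A)

n = 10
A = {1, 4, 6, 7, 9}          # a set of size n/2
print(index_via_f0(A, 4))    # 4 is in A
print(index_via_f0(A, 5))    # 5 is not in A
```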
Lower Bounds for Sketching. Simultaneous messages. [Figure: the sketches A_1(x_1), A_2(x_2), …, A_t(x_t) are combined to compute f(x_1, x_2, …, x_t)]
Beyond Data Streams: a Peek at External Memory. Efficient access to external memory is possible in restricted ways. I/O rates for sequential read/write access to disks are as good as random access to main memory. New models of I/O-efficient computing: read/write streams [Grohe, Schweikardt; Grohe, Hernich, Schweikardt], StrSort [Aggarwal, Datar, Rajagopalan, Ruhl], map-reduce [Dean, Ghemawat].
Read/Write Streams. Also called Reversal Turing Machines by [GS]. [Figure: input, t streams, machine, memory]
Critical Resources. Number of tapes t, space s. No constraint on the length of streams, but the number of reversals is at most r. This is an (r,s,t) read/write stream algorithm. Sorting has an (O(log N), O(log N), O(1)) read/write stream algorithm. What happens when the number of reversals is o(log N)?
Lower Bounds. There is no reduction using c.c.; e.g., Equality of strings is easy here. So what gives? The intuition is that it is hard to compare data elements at random locations. Grohe and Schweikardt formalize this and give a nice lower bound technique, later extended to 1-sided error [GHS].
Difficult Problems for Read/Write Streams. A direct-sum type of problem with inputs moved around. [Figure: an outer function h applied to g(x_1, y_φ(1)), g(x_2, y_φ(2)), …, g(x_m, y_φ(m)), where the inputs are x_1, …, x_m and y_1, …, y_m] Pick a permutation φ with small monotonicity.
Previous Results. Sorting with o(log N) reversals requires Ω(N^(1/5)) space [GS]. Set Equality with o(log N) reversals requires Ω(N^(1/4)) space [GHS]; this also applies to Sorting. Bounds hold for deterministic and randomized 1-sided error models.
Our Results [Beame, J., Rudra]. Lower bounds for 2-sided error randomized computation. Set Disjointness with o(log N/log log N) reversals requires near-linear space. We derive our results in a direct-sum framework.
Lower Bound Technique. 1st step: List machine. Records the potential ways in which subsets of input elements can be “compared” at different stages of the computation. 2nd step: Skeleton. Describes the information flow in terms of the locations of elements that are compared.
Key Theorem of [GS, GHS]. Skeletons resemble transcripts in c.c. Theorem: The skeletons partition the input domain such that (1) the number of skeletons is “small”, (2) the output depends only on the skeleton, and (3) each skeleton satisfies a weak rectangle-like property.
Semi-Rectangle Property of Skeletons. Transcript in c.c.: the inputs mapped to the transcript form a rectangle. Skeleton: for “most” coordinate pairs (i, φ(i)), where φ is the permutation, and for any assignment to x_j and y_φ(j), ∀ j ≠ i, the inputs of the skeleton restricted to this assignment and then projected to (i, φ(i)) form a rectangle.
Working with Skeletons. In [GS, GHS], the proofs use only one coordinate pair. For Set Disjointness, the distribution on a single coordinate is skewed towards the 0’s of the function. With 2-sided error, we cannot hope for a similar lower bound. Therefore, we keep track of multiple coordinate pairs. Tricky part: keeping track of the inputs as we vary the coordinate pairs.
Remarks. Currently, our direct-sum framework works for primitive functions that have high discrepancy or corruption. It would be nice to have an information complexity based approach. We consider two kinds of composition operators: ⊕ and ∨. Yields lower bounds for Intersection Size Mod 2 (Inner Product).
Summary. We have powerful techniques from combinatorics, information theory, and Fourier analysis to tackle problems of “information flow” in massive data set computations. These techniques have also influenced complexity theory; e.g., [J., Kumar, Sivakumar] resolved open questions in communication complexity. Promise problems still pose a challenge, e.g., Gap-Hamming for multiple passes.