
1 Lower Bounds for Massive Dataset Algorithms T.S. Jayram (IBM Almaden) IIT Kanpur Workshop on Data Streams

2 Space…the Final Frontier. These are the voyages of the starship Enterprise. Its five-year mission: to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no man has gone before. -Star Trek

3 Efficient Algorithms for Massive Data Sets
- Traditionally, "efficient" computation is identified with polynomial time; clearly, this is not adequate for computations over massive data sets
- Similarly, we want a single notion of efficient computation over massive data sets

4 A Single Theory?
- Modern computing systems are complex and varied: memory architectures, distributed computing, randomization, etc.
- Paradigms of computing: sampling, sketching, data stream, read/write streams, stream-sort, map-reduce, and many more yet to come

5 Lower Bounds
- This is a fertile ground for proving results
- Many successes
- Certain problems seem to be fundamental
- Reductions play a big role

6 Models with Limited Main Memory
- Sampling: query a small number of data elements
- Data streams: stream through the data in a one-way fashion; limited main memory storage
[Diagram: an algorithm accessing a data set by sampling vs. by streaming]

7 Distributed Computing
- Sketching: compress data chunks into small "sketches"; compute over the sketches
[Diagram: an algorithm combining sketches of distributed data chunks]

8 Sampling: Lower Bounds for Symmetric Functions
- In general, sampling algorithms are adaptive
Theorem [Bar-Yossef, Kumar, Sivakumar]. When estimating symmetric functions, uniform sampling is the best possible.
Proof idea (see the sketch below): let T be a sampling algorithm for the function; randomly permute the data elements, then run T. The resulting algorithm estimates the function and uses only uniform samples.
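A minimal Python rendering of this proof idea (my own toy example; the adaptive algorithm T and the data are invented for illustration):

```python
import random

def uniformize(T, data, rng):
    """Run a (possibly adaptive) sampling algorithm T on a randomly permuted
    copy of the data. A symmetric function's value is unchanged by the
    permutation, yet every position T probes now holds a uniformly random
    element -- so uniform sampling matches adaptive sampling."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return T(shuffled)

# Toy adaptive T (hypothetical): averages the first three elements it reads.
T = lambda d: sum(d[:3]) / 3
print(uniformize(T, [0.1, 0.9, 0.4, 0.7, 0.2], random.Random(0)))
```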

9 Lower Bounds for Uniform Sampling
Tools: block sensitivity, Hellinger distance, Kullback-Leibler divergence, Jensen-Shannon divergence
Approaches: combinatorics [Nisan], statistics [Bar-Yossef et al.], information theory [Bar-Yossef]

10 Example
- Find the mean of n numbers in [0,1]
- Requires Ω(1/ε²) samples to approximate additively within ε (an empirical sanity check follows)
- Lower bound proof using Hellinger distance
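A quick empirical sanity check (my own toy code, not part of the talk): the additive error of a uniform-sampling mean estimator decays like 1/√k, so roughly quadrupling the sample count halves the error, consistent with Θ(1/ε²) samples for error ε:

```python
import random

def sample_mean(data, k, rng):
    """Estimate the mean of `data` from k uniform samples (with replacement)."""
    return sum(rng.choice(data) for _ in range(k)) / k

rng = random.Random(0)
data = [rng.random() for _ in range(100_000)]  # n numbers in [0,1]
true_mean = sum(data) / len(data)

for k in [100, 400, 1600, 6400]:
    errs = [abs(sample_mean(data, k, rng) - true_mean) for _ in range(50)]
    print(f"k={k:5d}  mean additive error ~ {sum(errs)/len(errs):.4f}")
```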

11 Step 1
- Simplify to a decision problem:
  a: ½ + ε fraction of 0's and ½ - ε fraction of 1's
  b: ½ - ε fraction of 0's and ½ + ε fraction of 1's
- Given x ∈ {a,b}, any sampling algorithm for the mean (with additive error ε/4) can distinguish whether x = a or x = b

12 Step 2
- Let P_a = the distribution on {0,1} induced by sampling uniformly from a; similarly P_b
- Compute the Hellinger distance h²(P_a, P_b). For discrete distributions P, Q:
  h²(P,Q) = ½‖√P - √Q‖₂² = 1 - Σ_x (P(x) Q(x))^½
  h²(P_a, P_b) = O(ε²)
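A small numeric check of this bound (my own code): for the two-point distributions above, h²(P_a, P_b) = 1 - √(1 - 4ε²) ≈ 2ε²:

```python
import math

def hellinger_sq(p, q):
    """h^2(P,Q) = 1 - sum_x sqrt(P(x) Q(x)) for distributions given as dicts."""
    return 1.0 - sum(math.sqrt(p[x] * q[x]) for x in p)

for eps in [0.1, 0.05, 0.025]:
    p_a = {0: 0.5 + eps, 1: 0.5 - eps}  # one uniform sample from input a
    p_b = {0: 0.5 - eps, 1: 0.5 + eps}  # one uniform sample from input b
    h2 = hellinger_sq(p_a, p_b)
    print(f"eps={eps:.3f}  h^2={h2:.6f}  h^2/eps^2={h2/eps**2:.3f}")
```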

13 Lower Bound via Hellinger Distance
Key idea: the multiplicative property of Hellinger distance over product distributions:
  1 - h²(P^k, Q^k) = (1 - h²(P,Q))^k, where P^k is the k-fold product of P
Theorem. Any uniform sampling algorithm needs Ω(1/h²(P_a, P_b)) samples to distinguish input a from input b.
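The multiplicative property is easy to verify numerically; a minimal check (my own code, assumes Python 3.8+ for math.prod):

```python
import math
from itertools import product

def hellinger_affinity(p, q):
    """1 - h^2(P,Q) = sum_x sqrt(P(x) Q(x)) (the Bhattacharyya coefficient)."""
    return sum(math.sqrt(p[x] * q[x]) for x in p)

def product_dist(p, k):
    """The k-fold product distribution P^k over k-tuples."""
    return {xs: math.prod(p[x] for x in xs) for xs in product(p.keys(), repeat=k)}

p = {0: 0.6, 1: 0.4}
q = {0: 0.4, 1: 0.6}
k = 5
lhs = hellinger_affinity(product_dist(p, k), product_dist(q, k))
rhs = hellinger_affinity(p, q) ** k
print(lhs, rhs)  # agree up to floating-point error
```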

14 Lower Bounds for Data Streams
- Idea: somehow bound the flow of information (this yields space lower bounds)
- The model is too "fine-grained" to prove lower bounds directly
- Instead, we consider more powerful models (hopefully simpler to tackle)

15 Communication Complexity
[Diagram: Alice holds x, Bob holds y; they exchange messages to compute f(x,y)]
- Extensions to multiple parties
- Resources: # bits, # rounds

16 Connection to Data Streams
- A data stream algorithm for f(x ∘ y) with space s and k passes ⇒ an O(ks)-bit, 2k-round protocol for f(x,y)
- A data stream algorithm for f(x₁ ∘ x₂ ∘ ⋯ ∘ x_t) with space s and k passes ⇒ an O(tks)-bit protocol for f(x₁, x₂, …, x_t)
(Here ∘ denotes concatenation of the input streams; a simulation sketch follows.)
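To make the simulation concrete, a toy Python sketch (entirely my own; the parity example is just illustrative): Alice runs the streaming algorithm on her half of the stream, ships the s-bit memory state, and Bob resumes it on his half.

```python
class StreamAlgo:
    """Toy one-pass streaming algorithm: tracks a parity bit (1-bit state)."""
    def __init__(self):
        self.state = 0
    def process(self, item):
        self.state ^= item
    def output(self):
        return self.state

def alice(x):
    algo = StreamAlgo()
    for item in x:
        algo.process(item)
    return algo.state        # the message: just the s-bit memory state

def bob(y, state):
    algo = StreamAlgo()
    algo.state = state       # resume from Alice's state
    for item in y:
        algo.process(item)
    return algo.output()

x, y = [1, 0, 1, 1], [1, 1, 0]
print(bob(y, alice(x)))      # parity of x ∘ y, computed with 1 bit of communication
```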

17 Caveat
- Communication complexity usually deals with decision problems
- Data stream problems involve approximate computations
- The usual reduction techniques therefore yield promise problems in c.c.

18 Set Disjointness
- Sets A, B ⊆ [n]; Alice has A and Bob has B. Is A ∩ B ≠ ∅?
- Classical problem in c.c.
- t-party version [Alon, Matias, Szegedy]

19 C.C. Lower Bounds for Set Disjointness
Theorem [Kalyanasundaram, Schnitger; Razborov]. The randomized c.c. of Disjointness is Ω(n).
Remarks:
- Choose a "hard distribution" on inputs and show a lower bound on communication
- Unfortunately, the hard distributions here involve correlated inputs
- The arguments are somewhat tricky

20 Direct Sum Methodology
- x and y are characteristic vectors; AND(a,b) = a ∧ b
  INT(x,y) = ∨_i (x_i ∧ y_i) = ∨_i AND(x_i, y_i)
- Establish that any protocol for INT must solve n independent copies of AND (see the toy code below)
- This is not true for communication itself!
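A direct transcription of these definitions (trivial, but it fixes the notation used in the next slides):

```python
def AND(a, b):
    return a & b

def INT(x, y):
    """Set intersection as a Boolean function: OR_i AND(x_i, y_i)."""
    return any(AND(xi, yi) for xi, yi in zip(x, y))

print(INT([1, 0, 1], [0, 0, 1]))  # True: the sets intersect at coordinate 3
print(INT([1, 0, 1], [0, 1, 0]))  # False: disjoint
```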

21 Information Cost
Let P be a protocol for a function f.
- Information cost of P = I(X,Y : P(X,Y)), where X, Y are suitably distributed and I(· : ·) denotes Shannon mutual information [Chakrabarti, Shi, Wirth, Yao]
- Conditional information cost of P = I(X,Y : P(X,Y) | D) [Bar-Yossef, J., Kumar, Sivakumar]

22 Information Complexity
- Let μ be a distribution on inputs
- IC_μ(f) = the minimum information cost of a protocol computing f when the inputs are distributed according to μ

23 Information vs Communication Complexity
Proposition. CC(f) ≥ IC(f).
Proof: Let P compute f. Then
  I(X,Y : P(X,Y) | D) ≤ H(P(X,Y) | D)
                      ≤ H(P(X,Y))   (conditioning reduces entropy)
                      ≤ |P|         (entropy is bounded by length in bits)

24 Distribution for Disjointness
For each i = 1, …, n, independently:
- D_i ∈_R {a, b}
- If D_i = a then x_i = 0 and y_i ∈_R {0,1}
- If D_i = b then x_i ∈_R {0,1} and y_i = 0
Remarks:
- This always produces disjoint sets!
- Conditioned on D, X and Y are independent
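A few lines of Python (my own) to sample from this distribution and confirm the first remark:

```python
import random

def sample_hard_input(n, rng):
    """Sample (x, y, d) from the hard distribution for Disjointness:
    d_i picks which of x_i, y_i is forced to 0; the other is a random bit."""
    x, y, d = [], [], []
    for _ in range(n):
        di = rng.choice("ab")
        if di == "a":
            xi, yi = 0, rng.randint(0, 1)
        else:
            xi, yi = rng.randint(0, 1), 0
        x.append(xi); y.append(yi); d.append(di)
    return x, y, d

rng = random.Random(1)
for _ in range(1000):
    x, y, _ = sample_hard_input(20, rng)
    assert not any(a & b for a, b in zip(x, y))  # always disjoint
print("all sampled pairs are disjoint")
```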

25 Direct Sum Theorem
Theorem. IC(INT) ≥ n · IC(AND)
[Diagram: the n coordinate pairs (X_1,Y_1), …, (X_n,Y_n), each conditioned on D_i ∈ {a, b}]

26 Information Complexity of AND
- Nice connections to statistical distances
- In the case of AND, this reduces to lower-bounding the Hellinger distance h²(AND(0,1), AND(1,0)) between the transcript distributions on inputs (0,1) and (1,0)
[Diagram: the 2×2 input grid for AND, with the a/b cases marked]

27 More Thoughts
- Extension to t-party set disjointness: lower bound of Ω(n/t²); can be improved to Ω(n/(t log t)) [Chakrabarti, Khot, Sun]
- Yields optimal space lower bounds for frequency moments F_k, k > 2
- The method also gives optimal bounds for L_1; [Saks, Sun] proved similar bounds for 1 pass
- For L_p, p > 2, the space bound is polynomial, with a minor gap between upper and lower bounds in terms of p

28 Reductions – Example for F_0
- Indexing: Alice holds a set A of size n/2, Bob holds an element b. Is b ∈ A?
- One-way c.c. of Indexing is Ω(n); shatter coefficients are useful here [BJKS]
- Reduction: F_0 = n/2 or n/2 + 1 (see the sketch below); the gap can be amplified by padding
- Yields an Ω(1/ε) bound; improved to Ω(1/ε²), but this requires substantial new ideas [Indyk, Woodruff; Woodruff]
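An illustration of the reduction logic (my own toy code, not from the talk). `F0Stream` stands in for any one-pass F_0 algorithm; here it cheats and stores a whole set, since only the shape of the reduction matters for the lower bound:

```python
class F0Stream:
    def __init__(self):
        self.seen = set()
    def process(self, item):
        self.seen.add(item)
    def state(self):
        return self.seen          # a real algorithm would send only s bits here
    def estimate(self):
        return len(self.seen)

def solve_indexing(A, b):
    """Decide b ∈ A via one-way communication of the F0 memory state."""
    algo = F0Stream()
    for a in A:                   # Alice streams her set...
        algo.process(a)
    state = algo.state()          # ...and sends the memory state to Bob
    bob_algo = F0Stream()
    bob_algo.seen = state
    bob_algo.process(b)           # Bob appends his element
    # F0 stays |A| iff b ∈ A; it becomes |A| + 1 iff b ∉ A
    return bob_algo.estimate() == len(A)

print(solve_indexing({1, 4, 7}, 4))   # True:  4 ∈ A
print(solve_indexing({1, 4, 7}, 5))   # False: 5 ∉ A
```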

29 Lower Bounds for Sketching
- Sketching corresponds to the simultaneous messages model: each player i sends a one-shot sketch A_i(x_i) to a referee, who must output f(x₁, x₂, …, x_t)
[Diagram: t players sending A_1(x_1), A_2(x_2), …, A_t(x_t) to a referee computing f]
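For flavor, a minimal simultaneous-messages example (entirely my own; hash-based equality testing is a standard illustration of sketching, not the talk's construction):

```python
import random
import zlib

def sketch(x: bytes, seed: int) -> int:
    """A tiny randomized sketch: a seeded CRC32 fingerprint of the input."""
    return zlib.crc32(seed.to_bytes(4, "big") + x)

seed = random.randrange(2**32)        # public randomness shared by all parties
msg_a = sketch(b"massive dataset", seed)
msg_b = sketch(b"massive dataset", seed)
print(msg_a == msg_b)                 # referee: equal sketches => (whp) equal inputs
```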

30 Beyond Data Streams: a Peek at External Memory
- Efficient access to external memory is possible, but only in restricted ways: I/O rates for sequential read/write access to disks are as good as random access to main memory
- New models of I/O-efficient computing: read/write streams [Grohe, Schweikardt; G., Hernich, S.], StrSort [Aggarwal, Datar, Rajagopalan, Ruhl], map-reduce [Dean, Ghemawat]

31 Read/Write Streams
- Also called Reversal Turing Machines by [GS]
[Diagram: a machine with internal memory and t read/write input streams]

32 Critical Resources
- # tapes t, space s
- No constraint on the length of the streams, but # reversals is at most r
- This defines an (r, s, t) read/write stream algorithm
- Sorting has an (O(log N), O(log N), O(1)) read/write stream algorithm (sketched below)
- What happens when # reversals is o(log N)?
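A rough sketch (my own toy model, with Python lists standing in for tapes) of why sorting fits in O(log N) reversals: bottom-up merge sort doubles the run length on each sequential pass, so O(log N) passes, and hence O(log N) reversals, suffice.

```python
def merge_pass(tape, run):
    """One sequential pass over the tape: merge adjacent runs of length `run`."""
    out = []
    for i in range(0, len(tape), 2 * run):
        a, b = tape[i:i + run], tape[i + run:i + 2 * run]
        j = k = 0
        while j < len(a) or k < len(b):      # standard two-finger merge
            if k == len(b) or (j < len(a) and a[j] <= b[k]):
                out.append(a[j]); j += 1
            else:
                out.append(b[k]); k += 1
    return out

tape = [5, 2, 9, 1, 7, 3, 8, 6]
run, passes = 1, 0
while run < len(tape):
    tape = merge_pass(tape, run)
    run *= 2; passes += 1
print(tape, f"({passes} passes ~ O(log N) reversals)")
```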

33 Lower Bounds
- There is no reduction using c.c.; e.g., Equality of strings is easy here. So what gives?
- The intuition is that it is hard to compare data elements at random locations
- Grohe and Schweikardt formalize this and give a nice lower bound technique, later extended to 1-sided error [GHS]

34 Difficult Problems for Read/Write Streams
- A direct-sum type of problem with the inputs moved around: h(g(x_1, y_π(1)), g(x_2, y_π(2)), …, g(x_m, y_π(m)))
- Pick a permutation π with small monotonicity
[Diagram: an outer function h combining the inner instances g(x_i, y_π(i))]

35 Previous Results
- Sorting with o(log N) reversals requires Ω(N^{1/5}) space [GS]
- Set Equality with o(log N) reversals requires Ω(N^{1/4}) space [GHS]; this also applies to Sorting
- The bounds hold for deterministic and randomized 1-sided error models

36 Our Results [Beame, J., Rudra]
- Lower bounds for 2-sided error randomized computation
- Set Disjointness with o(log N / log log N) reversals requires near-linear space
- We derive our results in a direct-sum framework

37 Lower Bound Technique
- 1st step: list machine: records the potential ways in which subsets of input elements can be "compared" at different stages of the computation
- 2nd step: skeleton: describes the information flow in terms of the locations of elements that are compared

38 Key Theorem of [GS, GHS]
- Skeletons resemble transcripts in c.c.
Theorem. The skeletons partition the input domain such that (1) the number of skeletons is "small", (2) the output depends only on the skeleton, and (3) each skeleton satisfies a weak rectangle-like property.

39 Semi-Rectangle Property of Skeletons
In c.c., the set of inputs mapped to a transcript is a rectangle. For a skeleton:
- For "most" coordinate pairs (i, π(i)), and for any assignment to x_j and y_π(j) for all j ≠ i,
- the inputs of the skeleton restricted to this assignment and then projected to (i, π(i)) form a rectangle

40 Working with Skeletons
- In [GS, GHS], the proofs use only one coordinate pair
- For Set Disjointness, the distribution on a single coordinate is skewed towards the 0's of the function, so with 2-sided error we cannot hope for a similar lower bound; therefore, we keep track of multiple coordinate pairs
- Tricky part: keeping track of the inputs as we vary the coordinate pairs

41 Remarks
- Currently, our direct-sum framework works for primitive functions that have high discrepancy or corruption; it would be nice to have an information-complexity-based approach
- We consider two kinds of composition operators: ⊕ and ∨
- This yields lower bounds for Intersection Size Mod 2 (Inner Product)

42 Summary
- We have powerful techniques from combinatorics, information theory, and Fourier analysis to tackle problems of "information flow" in massive data set computations
- These techniques have also influenced complexity theory; e.g., [J., Kumar, Sivakumar] resolved open questions in communication complexity
- Promise problems still pose a challenge: Gap-Hamming for multiple passes

