1 The Complexity of Massive Data Set Computations Ziv Bar-Yossef Computer Science Division U.C. Berkeley Ph.D. Dissertation Talk May 6, 2002

2 What Are Massive Data Sets?
Examples:
– The Web
– IP packets
– Supermarket transactions
– Telephone call graph
– Astronomical observations
Characterizing properties:
– Huge collections of raw data
– Data is generated and modified continuously
– Distributed over many sites
– Slow storage devices
– Data is not organized / indexed

3 Nontraditional Computational Challenges
Restricted access to the data:
– Random access: expensive
– "Streaming" access: more feasible
– Some data may be unavailable
– Fetching data is expensive
Traditionally, algorithms cope with the difficulty of the problem; with massive data sets, they must cope with the size of the data and the restricted access to it:
– Sub-linear running time (ideally, independent of data size)
– Sub-linear space (ideally, logarithmic in data size)

4 Basic Framework
Massive data set computations are typically:
– Approximate
– Randomized
– Subject to a restricted access regime
Schematically: the algorithm reads the input data through an access regime, uses randomness ($$), and produces an approximate output.

5 Prominent Computational Models for Massive Data Sets
Sampling computations:
– Sub-linear running time & space
– Suitable for "insensitive" functions
Data stream computations:
– Linear running time, sub-linear space
– Can compute sensitive functions
Sketch computations:
– Suitable for distributed data

6 Sampling Computations
The algorithm queries the input x_1,…,x_n at random locations, using randomness ($$), and outputs an approximation of f(x_1,…,x_n). It can choose the query distribution and can query adaptively.
Complexity measure: query complexity.
Applications:
– Statistical parameter estimation
– Computational and statistical learning [Valiant 84, Vapnik 98]
– Property testing [RS96, GGR96]
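For concreteness, here is a minimal Python sketch (mine, not from the talk) of a sampling computation: estimating the mean, the function analyzed on slide 14. The query count comes from Hoeffding's inequality.

```python
import math
import random

def estimate_mean(x, eps, delta):
    """Estimate the mean of x_1,...,x_n (each in [0,1]) to within
    additive error eps, with probability >= 1 - delta, using
    uniformly random queries.  By Hoeffding's inequality,
    k = O((1/eps^2) * log(1/delta)) queries suffice; slide 14
    shows a matching lower bound."""
    n = len(x)
    k = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    return sum(x[random.randrange(n)] for _ in range(k)) / k

# Example: a million numbers, ~0.01 additive error with ~27k queries.
data = [random.random() for _ in range(10 ** 6)]
print(estimate_mean(data, eps=0.01, delta=0.01))
```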

7 Data Stream Computations [HRR98, AMS96, FKSV99]
The input x_1, x_2, x_3, …, x_n arrives in a one-way stream, in arbitrary order; the algorithm maintains a small memory (and randomness $$) and outputs an approximation of f(x_1,…,x_n).
Complexity measures: space and time per data item.
Applications:
– Databases (frequency moments [AMS96])
– Networking (L_p distance [AMS96, FKSV99, FS00, Indyk 00])
– Web information retrieval (web crawling, Google query logs [CCF02])
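As one concrete instance in this model, here is a minimal Python sketch of the AMS F_2 estimator [AMS96] (frequency moments are defined on slide 23). The explicit sign table is a simplification made here for readability; it is not how the actual algorithm achieves its space bound.

```python
import random

class AMSF2:
    """Minimal sketch of the AMS estimator for F_2 [AMS96]:
    maintain Z = sum_j f_j * s(j) for random signs s(j); then
    E[Z^2] = F_2.  For readability this version stores an explicit
    sign table (O(n) space); the real algorithm draws the signs
    from a 4-wise independent family, using only O(log n) bits."""

    def __init__(self, n):
        self.sign = [random.choice((-1, 1)) for _ in range(n)]
        self.z = 0

    def process(self, j):
        # One-way stream access: O(1) time per stream item j in [n].
        self.z += self.sign[j]

    def estimate(self):
        return self.z ** 2  # unbiased: E[Z^2] = F_2

# Averaging independent copies and taking a median of averages
# yields a (1 +/- eps)-approximation with probability 1 - delta.
```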

8 Sketch Computations [GM98, BCFM98, FKSV99]
The input x_11,…,x_tk is distributed over t sites; each site compresses its share into a short "sketch", and the algorithm (with randomness $$) computes an approximation of f(x_11,…,x_tk) from the sketches alone.
Complexity measure: sketch lengths.
Applications:
– Web information retrieval (identifying document similarities [BCFM98])
– Networking (L_p distance [FKSV99])
– Lossy compression, approximate nearest neighbor
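The canonical instance here is the min-hash sketch of [BCFM98] for identifying document similarities. Below is a minimal Python sketch, under the stated (illustrative) assumption that salted built-in hashes stand in for min-wise independent permutations.

```python
# Assumption for illustration: salted built-in hashes stand in for
# min-wise independent permutations of the shingle universe.
NUM_HASHES = 100
hashes = [lambda s, salt=i: hash((salt, s)) for i in range(NUM_HASHES)]

def sketch(shingles):
    """Min-hash sketch [BCFM98]: one minimum per hash function."""
    return [min(h(s) for s in shingles) for h in hashes]

def resemblance(sk_a, sk_b):
    """Fraction of agreeing coordinates; estimates the Jaccard
    coefficient |A∩B| / |A∪B| of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

A = {"the", "complexity", "of", "massive", "data", "sets"}
B = {"algorithms", "for", "massive", "data", "sets"}
print(resemblance(sketch(A), sketch(B)))   # ~ |A∩B|/|A∪B| = 3/8
```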

9 Main Objective
Explore the limitations of the above computational models:
– Develop general lower bound techniques
– Obtain lower bounds for specific functions

10 Thesis Blueprint
Information theory underlies two tracks:
– Statistical decision theory → sampling computations: lower bounds for general functions [BKS01, B02]
– Communication complexity → general CC lower bounds [BJKS02b], and one-way and simultaneous CC lower bounds [BJKS02a]
– Data stream computations: lower bounds by reduction from one-way CC
– Sketch computations: lower bounds by reduction from simultaneous CC

11 Sampling Lower Bounds (with R. Kumar and D. Sivakumar; STOC 2001 and Manuscript, 2002)
Combinatorial lower bound [BKS01]:
– bounds the expected query complexity of every function
– tends to be weak
– based on a generalization of Boolean block sensitivity [Nisan 89]
Statistical lower bounds:
– bound the query complexity of symmetric functions
– via Hellinger distance: worst-case query complexity [BKS01]
– via KL distance: expected query complexity [B02]
– tend to be tight
– work by a reduction from statistical hypothesis testing
Information theory lower bound [B02]:
– bounds the worst-case query complexity of symmetric functions
– has better dependence on the domain size

12 Main Idea
An (ε,δ)-approximation algorithm must, with probability 1 − δ, output a value in the approximation set of its input. Call inputs x, y ε-disjoint if their approximation sets are disjoint.
Main observation: since for every x the output lands in the approximation set of x w.p. 1 − δ, if x, y are ε-disjoint then the output distributions T(x), T(y) are "far" from each other.

13 Main Result
Theorem: For any symmetric f, any ε-disjoint inputs x, y, and any algorithm that (ε,δ)-approximates f:
– Worst-case # of queries ≥ Ω((1/h²(U_x,U_y)) · log(1/δ))
– Expected # of queries ≥ Ω((1/KL(U_x,U_y)) · log(1/δ))
U_x – uniform query distribution on x (induced by: pick i u.a.r., output x_i).
Hellinger: h²(U_x,U_y) = 1 − Σ_a (U_x(a) · U_y(a))^½
KL: KL(U_x,U_y) = Σ_a U_x(a) · log(U_x(a) / U_y(a))

14 Example: Mean
x: a Boolean input with a ½ + ε fraction of 1's; y: one with a ½ − ε fraction of 1's. Then h²(U_x,U_y) = KL(U_x,U_y) = O(ε²).
Theorem (originally [CEG95]): Approximating the mean of n numbers in [0,1] to within ε additive error requires Ω((1/ε²) · log(1/δ)) queries.
Other applications: selection functions, frequency moments, extractors and dispersers.
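A quick numeric check of the O(ε²) claim (helper names are mine): for the inputs above, the induced query distributions U_x, U_y are Bernoulli(½+ε) and Bernoulli(½−ε), and both distances scale like ε².

```python
import math

def hellinger2(p, q):
    """h^2 between Bernoulli(p) and Bernoulli(q)."""
    return 1 - (math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q)))

def kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Both ratios below stay bounded as eps shrinks, i.e. both
# distances are Theta(eps^2) (roughly 2*eps^2 and 8*eps^2).
for eps in (0.1, 0.01, 0.001):
    p, q = 0.5 + eps, 0.5 - eps
    print(eps, hellinger2(p, q) / eps ** 2, kl(p, q) / eps ** 2)
```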

15 Proof Outline
1. For symmetric functions, WLOG all queries are uniform without replacement.
2. If the # of queries is at most n^½, we can further assume queries are uniform with replacement.
3. For any ε-disjoint inputs x, y: an (ε,δ)-approximation of f with k queries yields a hypothesis test of U_x against U_y with error δ and k samples.
4. Hypothesis testing lower bounds: via Hellinger distance (worst-case) and via KL distance (expected) (cf. [Siegmund 85]).

16 Statistical Hypothesis Testing
A black box contains either distribution P or distribution Q. The test receives k i.i.d. samples from the box and has to decide: "P" or "Q", with allowed error probability δ.
Goal: minimize k.

17 Sampling Algorithm ⇒ Hypothesis Test
Let x, y be ε-disjoint inputs, and let the black box contain either U_x or U_y. Run the sampling algorithm on the k i.i.d. samples; output "U_x" if its output is an ε-approximation of f(x), and "U_y" otherwise.

18 Lower Bound via Hellinger Distance
Lemma (cf. [Le Cam, Yang 90]): the Hellinger affinity is multiplicative over product distributions, 1 − h²(P^k, Q^k) = (1 − h²(P, Q))^k; and a hypothesis test for U_x against U_y with error δ and k samples forces h²(U_x^k, U_y^k) ≥ 1 − 2δ^½.
Corollary: k ≥ Ω((1/h²(U_x,U_y)) · log(1/δ)).
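A numeric check of the multiplicativity used here (toy two-point distributions chosen for this writeup): driving the affinity down to δ-ish levels requires k ≥ log(1/δ) / log(1/(1−h²)) ≈ (1/h²) · log(1/δ) for small h².

```python
import math
from itertools import product

def affinity(P, Q):
    """Hellinger affinity 1 - h^2(P,Q) = sum_a sqrt(P(a) * Q(a))."""
    return sum(math.sqrt(P[a] * Q[a]) for a in P)

P = {0: 0.3, 1: 0.7}
Q = {0: 0.6, 1: 0.4}

for k in (1, 2, 5, 10):
    # Distribution of k i.i.d. samples: the product distribution P^k.
    Pk = {s: math.prod(P[a] for a in s) for s in product(P, repeat=k)}
    Qk = {s: math.prod(Q[a] for a in s) for s in product(Q, repeat=k)}
    # Multiplicativity: 1 - h^2(P^k, Q^k) = (1 - h^2(P, Q))^k,
    # so the two printed columns agree.
    print(k, affinity(Pk, Qk), affinity(P, Q) ** k)
```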

19 Communication Complexity [Yao 79]
f: X × Y → Z. Alice gets x ∈ X, Bob gets y ∈ Y; exchanging messages (with randomness $$), they compute f(x,y).
R_δ(f) = randomized CC of f with error δ.

20 Multi-Party Communication
f: X_1 × … × X_t → Z. Player P_i gets x_i; together the players P_1,…,P_t compute f(x_1,…,x_t).

21 Example: Set-Disjointness
t-party set-disjointness: P_i gets S_i ⊆ [n], and Disj_t(S_1,…,S_t) = 1 iff the sets are disjoint.
Theorem [KS87, R90]: R_δ(Disj_2) = Ω(n)
Theorem [AMS96]: R_δ(Disj_t) = Ω(n/t⁴)
Best upper bound: R_δ(Disj_t) = O(n/t)

22 Restricted Communication Models
One-way communication [PS84, Ablayev 93, KNR95]: P_1 → P_2 → … → P_t, and the last player announces f(x_1,…,x_t). Data stream computations reduce to one-way communication, so one-way CC lower bounds yield space lower bounds.
Simultaneous communication [Yao 79]: each player sends a single message to a referee, who outputs f(x_1,…,x_t). Sketch computations reduce to simultaneous communication, so simultaneous CC lower bounds yield sketch-length lower bounds.

23 Example: Disjointness ⇒ Frequency Moments
Input stream: a_1,…,a_m ∈ [n]; for j ∈ [n], f_j = # of occurrences of j in a_1,…,a_m.
k-th frequency moment: F_k(a_1,…,a_m) = Σ_{j∈[n]} (f_j)^k.
Theorem [AMS96]: space lower bounds for F_k in the data stream model follow by reduction from t-party set-disjointness.
Corollary: DS(F_k) = n^Ω(1), k > 5
Best upper bounds: DS(F_k) = n^O(1), k > 2; DS(F_k) = O(log n), k = 0, 1, 2
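For reference, the moments themselves are trivial to compute with unrestricted space; a minimal Python definition makes the target of the lower bound explicit.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum over j of (f_j)^k.  Storing every frequency
    costs Theta(n) space in the worst case; the lower bounds above
    say n^Omega(1) space is in general unavoidable for k > 5
    (slide 24 improves this to k > 2)."""
    return sum(f ** k for f in Counter(stream).values())

stream = [1, 2, 2, 3, 3, 3]
print(frequency_moment(stream, 0))  # F_0 = 3, # distinct elements
print(frequency_moment(stream, 1))  # F_1 = 6, the stream length
print(frequency_moment(stream, 2))  # F_2 = 1 + 4 + 9 = 14
```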

24 Information Statistics Approach to Communication Complexity (with T.S. Jayram, R. Kumar, and D. Sivakumar; Manuscript, 2002)
A novel lower bound technique for randomized CC based on statistics and information theory.
Applications:
General CC lower bounds:
– t-party set-disjointness: Ω(n/t²) (improving on [AMS96])
– L_p (solving an open problem of [Saks-Sun 02])
– Inner product
One-way CC lower bounds:
– t-party set-disjointness: Ω(n/t^{1+ε}) for any ε > 0
Space lower bounds in the data stream model:
– frequency moments: n^Ω(1), k > 2 (proving a conjecture of [AMS96])
– L_p distance

25 Statistical View of Communication Complexity
Π – a δ-error randomized protocol for f: X × Y → Z; Π(x,y) – the induced distribution over transcripts.
Lemma: For any two input pairs (x,y), (x',y') with f(x,y) ≠ f(x',y'): V(Π(x,y), Π(x',y')) ≥ 1 − 2δ (V denotes statistical distance).
Proof: By reduction from hypothesis testing.
Corollary: h²(Π(x,y), Π(x',y')) ≥ 1 − 2δ^½

26 Information Cost [Ablayev 93, Chakrabarti et al. 01, Saks-Sun 02]
For a protocol Π that computes f, how much information does Π(x,y) have to reveal about (x,y)?
μ = (X,Y) – a distribution over inputs of f.
Definition (μ-information cost):
icost_μ(Π) = I(X,Y; Π(X,Y))
icost_{μ,δ}(f) = min over δ-error protocols Π of icost_μ(Π)
Since I(X,Y; Π(X,Y)) ≤ H(Π(X,Y)) ≤ |Π(X,Y)|, an information cost lower bound implies a CC lower bound.

27 Direct Sum for Information Cost
Decomposable functions: f(x,y) = g(h(x_1,y_1),…,h(x_n,y_n)), where h: X_i × Y_i → {0,1} and g: {0,1}^n → {0,1}.
Example: set-disjointness: Disj_2(x,y) = (x_1 ∧ y_1) ∨ … ∨ (x_n ∧ y_n).
Theorem (direct sum): For appropriately chosen μ, μ': icost_{μ,δ}(f) ≥ n · icost_{μ',δ}(h).
Thus a lower bound on icost(h) yields a lower bound on icost(f).

28 Information Cost of Single-Bit Functions
In Disj_2, μ' = ½μ'_1 + ½μ'_2, where μ'_1 = ½(1,0) + ½(0,0) and μ'_2 = ½(0,1) + ½(0,0).
Lemma 1: For any protocol Π for AND: icost_{μ'}(Π) ≥ Ω(h²(Π(0,1), Π(1,0))).
Lemma 2: h²(Π(0,1), Π(1,0)) = h²(Π(1,1), Π(0,0)).
Corollary 1: icost_{μ',δ}(AND) ≥ Ω(1 − 2δ^½)
Corollary 2: icost_{μ,δ}(Disj_2) ≥ Ω(n · (1 − 2δ^½))

29 Proof of Lemma 2
"Rectangle" property of deterministic protocols: for any transcript τ, the set of all (x,y) with Π(x,y) = τ is a combinatorial rectangle S × T, where S ⊆ X and T ⊆ Y.
"Rectangle" property of randomized protocols: for all x ∈ X, y ∈ Y there exist functions p_x: {0,1}* → [0,1] and q_y: {0,1}* → [0,1] such that for any possible transcript τ: Pr(Π(x,y) = τ) = p_x(τ) · q_y(τ).
Hence:
h²(Π(0,1), Π(1,0)) = 1 − Σ_τ (Pr(Π(0,1) = τ) · Pr(Π(1,0) = τ))^½
= 1 − Σ_τ (p_0(τ) · q_1(τ) · p_1(τ) · q_0(τ))^½
= h²(Π(0,0), Π(1,1))
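This "cut-and-paste" identity can be checked numerically. The toy protocol below is invented for illustration (not from the talk): Alice sends a bit a drawn from A[x], Bob replies with a bit b drawn from B[y][a], so the transcript distribution factors exactly as the rectangle property requires.

```python
import random

def hellinger2(P, Q):
    """Squared Hellinger distance between two dicts mapping
    transcripts to probabilities."""
    keys = set(P) | set(Q)
    return 1 - sum((P.get(t, 0.0) * Q.get(t, 0.0)) ** 0.5 for t in keys)

def row(k):
    """A random probability distribution on k outcomes."""
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

# Toy two-round protocol: Alice sends bit a w.p. A[x][a], Bob replies
# with bit b w.p. B[y][a][b]; the transcript is (a, b) and
# Pr(transcript) = A[x][a] * B[y][a][b] = p_x(a,b) * q_y(a,b).
A = {x: row(2) for x in (0, 1)}
B = {y: [row(2), row(2)] for y in (0, 1)}

def transcript_dist(x, y):
    return {(a, b): A[x][a] * B[y][a][b] for a in (0, 1) for b in (0, 1)}

# The two quantities agree (up to floating point) for every random
# choice of A and B, as in Lemma 2.
print(hellinger2(transcript_dist(0, 1), transcript_dist(1, 0)))
print(hellinger2(transcript_dist(0, 0), transcript_dist(1, 1)))
```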

30 Conclusions
Studied limitations of computing on massive data sets:
– Sampling computations
– Data stream computations
– Sketch computations
Lower bound methodologies are based on:
– Information theory
– Statistical decision theory
– Communication complexity
The lower bound techniques:
– Reveal novel aspects of the models
– Present a "template" for obtaining specific lower bounds

31 Open Problems
Sampling:
– Lower bounds for non-symmetric functions
– Property testing lower bounds
Communication complexity:
– Study the communication complexity of approximations
– Tight lower bound for t-party set-disjointness
– Under what circumstances are one-way and simultaneous communication equivalent?

32 Thank You!

33 Yao's Lemma [Yao 83]
Definition: μ-distributional CC, D_{μ,δ}(f) – the complexity of the best deterministic protocol that computes f with error ≤ δ on inputs drawn according to μ.
Yao's Lemma: R_δ(f) ≥ max_μ D_{μ,δ}(f).
This is a convenient technique for proving randomized CC lower bounds.

34 Communication Complexity Lower Bounds via Information Theory (with T.S. Jayram, R. Kumar, and D. Sivakumar; Complexity 2002)
A novel information theory paradigm for proving CC lower bounds.
Applications:
Characterization results (w.r.t. product distributions):
– 1-way ≈ simultaneous
– 2-party 1-way ≈ t-party 1-way
– VC dimension characterization of t-party 1-way CC
Optimal lower bounds for simultaneous CC:
– t-party set-disjointness: Ω(n/t)
– Generalized addressing function

35 Information Theory
A sender transmits a message m ~ M over a noisy channel; the receiver gets r ~ R (M – distribution of transmitted messages, R – distribution of received messages).
Goal of the receiver: reconstruct m from r.
ε_g – error probability of a reconstruction function g.
For a Boolean M:
Fano's inequality: for all g, H_2(ε_g) ≥ H(M | R).
MLE principle: ε_MLE ≤ H(M | R).
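As a sanity check of both inequalities, here is a binary symmetric channel example (my choice of channel, not from the talk).

```python
import math

def binary_entropy(p):
    """H_2(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Binary symmetric channel: M ~ Bernoulli(1/2), R = M xor noise,
# where the noise flips the bit with probability p.  Then
# H(M | R) = H_2(p), and the MLE decoder g(r) = r errs with
# probability exactly p.
for p in (0.05, 0.1, 0.25):
    cond_entropy = binary_entropy(p)   # H(M | R)
    eps_mle = p
    # Fano: H_2(eps_g) >= H(M|R) for every decoder g -- tight here.
    # MLE principle: eps_MLE <= H(M|R), since p <= H_2(p) on [0, 1/2].
    print(p, binary_entropy(eps_mle), cond_entropy, eps_mle <= cond_entropy)
```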

36 Information Theory View of Distributional CC
(x,y) is distributed according to μ = (X,Y). Think of "God" as transmitting f(x,y) to Alice & Bob over a noisy channel, namely the CC protocol; what Alice & Bob actually receive is the transcript Π(x,y).
Fano's inequality: for any δ-error protocol Π for f, H_2(δ) ≥ H(f(X,Y) | Π(X,Y)).

37 Simultaneous CC vs. One-Way CC
Theorem: For every product distribution μ = μ_X × μ_Y and every Boolean f:
D_{μ, 2H_2(δ)}^{sim}(f) ≤ D_{μ,δ}^{A→B}(f) + D_{μ,δ}^{B→A}(f)
Proof: Let A(x) be the message of Alice on x in a δ-error A→B protocol for f, and B(y) the message of Bob on y in a δ-error B→A protocol for f. Construct a simultaneous protocol for f:
A → Referee: A(x); B → Referee: B(y).
The referee outputs MLE(f(X,Y) | A(x), B(y)).

38 Simultaneous CC vs. One-Way CC
Proof (cont.): By the MLE principle,
Pr_μ(MLE(f(X,Y) | A(X),B(Y)) ≠ f(X,Y)) ≤ H(f(X,Y) | A(X),B(Y)).
By Fano, H(f(X,Y) | A(X),Y) ≤ H_2(δ) and H(f(X,Y) | X,B(Y)) ≤ H_2(δ).
Lemma: For independent X,Y: H(f(X,Y) | A(X),B(Y)) ≤ H(f(X,Y) | A(X),Y) + H(f(X,Y) | X,B(Y)).
⇒ The protocol errs with probability at most 2H_2(δ). □