Massive Data Sets and Information Theory. Ziv Bar-Yossef, Department of Electrical Engineering, Technion.

1 Massive Data Sets and Information Theory. Ziv Bar-Yossef, Department of Electrical Engineering, Technion.

2 What are Massive Data Sets? Technology: the World-Wide Web, IP packet flows, phone call logs. Science: astronomical sky surveys, weather data. Business: credit card transactions, billing records, supermarket sales.

3 Nontraditional Challenges. Traditionally, algorithms cope with the complexity of the problem; with massive data sets, they must cope with the complexity of the data. New challenges: restricted access to the data, not enough time to read the whole data, and only a tiny fraction of the data can be held in main memory.

4 Computing over Massive Data Sets. A computer program runs over data x of length n, where n is very large. An approximation of f(x) is sufficient, and the program can be randomized. Examples: mean, parity.

5 Models for Computing over Massive Data Sets: sampling, data streams, sketching.

6 Sampling. The program queries a few items of the data x (n is very large) and outputs an approximation of f(x). Examples: mean, O(1) queries; parity, n queries.
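The gap on this slide can be made concrete with a minimal Python sketch (function names and constants are illustrative, not from the talk): the mean is estimated well from a constant number of uniformly random queries, while parity admits no such shortcut.

```python
import random

def sample_mean(x, num_queries, rng=None):
    """Estimate the mean of x by querying a few uniformly random positions."""
    rng = rng or random.Random()
    total = sum(x[rng.randrange(len(x))] for _ in range(num_queries))
    return total / num_queries

# Parity has no sampling shortcut: flipping any single unqueried bit
# flips the answer, so all n positions must be queried.
```

By Hoeffding's inequality, q queries give an estimate within O(range/√q) of the true mean with high probability, independent of n.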

7 Data Streams. The program streams through the data x (n is very large) using limited memory, and outputs an approximation of f(x). Examples: mean, O(1) memory; parity, 1 bit of memory.
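Both examples on this slide fit in a few lines; this is a hedged sketch (my own naming) showing that each pass keeps only constant state, no matter how long the stream is.

```python
def stream_mean(stream):
    """O(1) memory: a running sum and a count, regardless of stream length."""
    total, count = 0, 0
    for v in stream:
        total += v
        count += 1
    return total / count

def stream_parity(stream):
    """1 bit of memory: the XOR of all bits seen so far."""
    p = 0
    for b in stream:
        p ^= b
    return p
```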

8 Sketching. Compress each data segment (Data1, Data2, each of length n, n very large) into a small "sketch"; compute an approximation of f over the sketches. Examples: equality, O(1)-size sketch; Hamming distance, O(1)-size sketch; L_p distance (p > 2), Θ(n^(1-2/p))-size sketch.
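For the equality example, a standard O(1)-size sketch is a polynomial fingerprint under shared randomness; this is a sketch of that idea (the prime and seed are arbitrary choices of mine, not from the talk).

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; each sketch is a single residue mod P

def sketch(data, r):
    """Evaluate the data's polynomial at a shared random point r (mod P)."""
    h = 0
    for byte in data:
        h = (h * r + byte) % P
    return h

r = random.Random(7).randrange(1, P)   # shared randomness between the parties
s1 = sketch(b"massive data set", r)
s2 = sketch(b"massive data set", r)
s3 = sketch(b"massive data sets", r)
```

Equal inputs always produce equal sketches; unequal inputs of length n collide with probability at most n/P over the choice of r, so a single residue suffices.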

9 Algorithms for Massive Data Sets.
Sampling: mean and other moments; median and other quantiles; volume estimations; histograms; graph problems; low-rank matrix approximations.
Data Streams: frequency moments; distinct elements; functional approximations; geometric problems; graph problems; database problems.
Sketching: equality; Hamming distance; edit distance; L_p distance.

10 Our Goal. Study the limits of computing over massive data sets: query complexity lower bounds, data stream memory lower bounds, sketch size lower bounds. Main tools: communication complexity, information theory, statistics.

11 Communication Complexity [Yao 79]. Alice and Bob exchange messages m_1, m_2, m_3, m_4, …; the "transcript" Π(a, b) is the full message sequence, read by a Referee. cost(Π) = Σ_i |m_i|, and CC(f) = min over protocols Π that compute f of cost(Π).

12 Communication Complexity View of Sampling. Alice holds the data x; Bob issues queries i_1, i_2, … and receives answers X[i_1], X[i_2], …; the "transcript" Π(x) goes to the Referee, who outputs an approximation of f(x). cost(Π) = number of queries, and QC(f) = min over protocols Π that compute f of cost(Π).

13 Information Complexity [Chakrabarti, Shi, Wirth, Yao 01]. Let μ be a distribution on inputs to f, and let X be a random variable with distribution μ. Then icost_μ(Π) = I(X; Π(X)), and IC_μ(f) = min over protocols Π that compute f of icost_μ(Π). Information complexity is the minimum amount of information a protocol that computes f has to reveal about its inputs. Note: for some functions, any protocol must reveal much more information about X than just f(X).
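The quantity icost_μ(Π) = I(X; Π(X)) can be computed exactly for toy cases. Below is a small numerical illustration of my own (all names are mine, not from the talk): X is uniform over two inputs differing in one bit, and the "protocol" queries a uniform position i and sends (i, X[i]). Only the one distinguishing query is revealing, so the information cost is 1/8 bit.

```python
import math

def mutual_information(joint):
    """I(X; M) in bits, from a dict {(x, m): probability}."""
    px, pm = {}, {}
    for (x, m), p in joint.items():
        px[x] = px.get(x, 0) + p
        pm[m] = pm.get(m, 0) + p
    return sum(p * math.log2(p / (px[x] * pm[m]))
               for (x, m), p in joint.items() if p > 0)

# X uniform over two inputs that differ only in the last of n = 8 bits;
# the protocol queries a uniform position i and sends (i, X[i]).
x0, x1 = (0,) * 7 + (1,), (0,) * 8
joint = {}
for label, x in enumerate([x0, x1]):
    for i in range(8):
        key = (label, (i, x[i]))
        joint[key] = joint.get(key, 0) + 1 / 16
icost = mutual_information(joint)   # 1/8 bit: only the query i = 7 reveals X
```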

14 CC Lower Bounds via IC Lower Bounds. Useful properties of information complexity: it lower-bounds communication complexity, and it is amenable to "direct sum" decompositions. Framework for bounding CC via IC: (1) find an appropriate "hard input distribution" μ; (2) prove a lower bound on IC_μ(f), via (a) decomposition: decompose IC_μ(f) into "simple" information quantities, and (b) basis: prove a lower bound on the simple quantities.

15 Applications of Information Complexity. Data streams [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03]; sampling [Bar-Yossef 03]; communication complexity and decision tree complexity [Jayram, Kumar, Sivakumar 03]; quantum communication complexity [Jain, Radhakrishnan, Sen 03]; cell probe [Sen 03], [Chakrabarti, Regev 04]; simultaneous messages [Chakrabarti, Shi, Wirth, Yao 01].

16 The "Election Problem". Input: a sequence x of n votes to k parties; f(x) is the vote distribution (the slide's figure shows shares 7/18, 4/18, 3/18, 2/18, 1/18 for n = 18, k = 6). Want to output D s.t. ||D − f(x)|| < ε. Theorem: QC(f) = Θ(k/ε²).
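The upper-bound side of the theorem is just the empirical distribution of O(k/ε²) sampled votes. Below is a hedged sketch of mine: it assumes ||·|| is the L1 norm, the constant 4 is an arbitrary choice, and the 18-vote input is a hypothetical instance in the spirit of the slide's figure, not data from the talk.

```python
import random
from collections import Counter

def estimate_vote_distribution(x, k, eps, rng):
    """Empirical distribution from O(k / eps^2) uniform sample queries."""
    q = int(4 * k / eps ** 2)          # the constant 4 is an arbitrary choice
    counts = Counter(x[rng.randrange(len(x))] for _ in range(q))
    return [counts[party] / q for party in range(k)]

# Hypothetical 18-vote, 6-party input (party labels 0..5).
votes = [0] * 7 + [1] * 4 + [2] * 3 + [3] * 2 + [4] * 1 + [5] * 1
D = estimate_vote_distribution(votes, k=6, eps=0.2, rng=random.Random(3))
```

The lower bound on the slide says this sample complexity cannot be improved beyond constants.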

17 Sampling Lower Bound. Lemma 1 (normal form lemma): WLOG, in any protocol Π that computes the election problem, the queries are uniformly distributed and independent. Π(x): transcript of the full protocol; π(x): transcript of a single random query "protocol". If cost(Π) = q, then Π(x) = (π(x), …, π(x)) (q independent copies).

18 Sampling Lower Bound (cont.). Lemma 2 (decomposition lemma): for any X, I(X; Π(X)) ≤ q · I(X; π(X)), where I(X; Π(X)) is the information cost of Π w.r.t. X and I(X; π(X)) is the information cost of π w.r.t. X. Therefore, q ≥ I(X; Π(X)) / I(X; π(X)).
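Lemma 2 can be checked numerically on a toy instance (my own construction, not from the talk): X is uniform over two 8-bit inputs differing in one position, π is one uniform query, and Π repeats π independently q = 3 times. Here I(X; Π(X)) = 1 − (7/8)³ bits, which is indeed below q · I(X; π(X)) = 3/8.

```python
import math
from itertools import product

def mutual_information(joint):
    """I(X; M) in bits, from a dict {(x, m): probability}."""
    px, pm = {}, {}
    for (x, m), p in joint.items():
        px[x] = px.get(x, 0) + p
        pm[m] = pm.get(m, 0) + p
    return sum(p * math.log2(p / (px[x] * pm[m]))
               for (x, m), p in joint.items() if p > 0)

# X uniform over two 8-bit inputs differing in one position; pi queries one
# uniform index and sends (i, X[i]); Pi repeats pi independently q times.
x0, x1 = (0,) * 7 + (1,), (0,) * 8
n, q = 8, 3

def joint_for(num_queries):
    joint, w = {}, 1 / (2 * n ** num_queries)
    for label, x in enumerate([x0, x1]):
        for idx in product(range(n), repeat=num_queries):
            msg = tuple((i, x[i]) for i in idx)
            joint[(label, msg)] = joint.get((label, msg), 0) + w
    return joint

I_full = mutual_information(joint_for(q))   # I(X; Pi(X)) = 1 - (7/8)**3
I_one = mutual_information(joint_for(1))    # I(X; pi(X)) = 1/8
```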

19 Combinatorial Designs. A family of subsets B_1, …, B_m of a universe U s.t. (1) each of them constitutes half of U, and (2) the intersection of any two of them is relatively small. Fact: there exist designs of size exponential in |U| (e.g., from constant-rate, constant relative minimum distance binary error-correcting codes).
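A quick way to see such families exist: uniformly random half-size subsets already have pairwise intersections concentrated near |U|/4, well below the |U|/2 set size, with high probability. This is an illustrative experiment of mine; the explicit exponential-size constructions come from the error-correcting codes the slide mentions.

```python
import random
from itertools import combinations

def random_design(universe_size, m, rng):
    """m uniformly random half-size subsets of U = {0, ..., universe_size - 1}."""
    U = list(range(universe_size))
    return [frozenset(rng.sample(U, universe_size // 2)) for _ in range(m)]

sets = random_design(64, 20, random.Random(1))
max_intersection = max(len(a & b) for a, b in combinations(sets, 2))
# Each set has 32 elements; pairwise intersections concentrate near 16.
```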

20 Hard Input Distribution for the Election Problem. Let B_1, …, B_m be a family of subsets of {1, …, k} that form a design of size m = 2^Ω(k). X is uniformly chosen among x_1, …, x_m, where in x_i: ½ + ε of the votes are split among parties in B_i, and ½ − ε of the votes among parties in B_i^c. (1) Unique decoding: for every i ≠ j, ||f(x_i) − f(x_j)|| > 2ε, so any protocol that solves the election problem identifies X; therefore I(X; Π(X)) = Ω(H(X)) = Ω(k). (2) Low diameter: I(X; π(X)) = O(ε²). Combining the two via Lemma 2 gives q = Ω(k/ε²).

21 Conclusions. Information theory plays an increasingly central role in complexity theory. More applications of information complexity? Can we use deeper information theory in complexity theory?

22 Thank You

