Massive Data Sets and Information Theory Ziv Bar-Yossef Department of Electrical Engineering Technion.

What are Massive Data Sets?
- Technology: the World-Wide Web, IP packet flows, phone call logs
- Science: astronomical sky surveys, weather data
- Business: credit card transactions, billing records, supermarket sales

Nontraditional Challenges
- Traditionally: cope with the complexity of the problem.
- Massive data sets: cope with the complexity of the data.
New challenges:
- Restricted access to the data
- Not enough time to read the whole data
- Only a tiny fraction of the data can be held in main memory

Computing over Massive Data Sets
A computer program reads data x (n is very large) and outputs an approximation of f(x).
- An approximation of f(x) is sufficient.
- The program can be randomized.
Examples: mean, parity.

Models for Computing over Massive Data Sets
- Sampling
- Data Streams
- Sketching

Sampling
The program queries a few items of the data x (n is very large) and outputs an approximation of f(x).
Examples:
- Mean: O(1) queries
- Parity: n queries
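To make the sampling model concrete, here is a minimal Python sketch (not from the talk; the function and parameter names are illustrative) of estimating the mean of a 0/1 data set from a few uniform random queries. By a Hoeffding bound, O(1/ε²) queries give an additive-ε estimate regardless of n, whereas parity can flip if any single bit flips, so every item must be read.

```python
import random

def sample_mean(data, num_queries, rng):
    # Query a few uniformly random positions and average the answers.
    picks = [data[rng.randrange(len(data))] for _ in range(num_queries)]
    return sum(picks) / num_queries

rng = random.Random(0)
data = [i % 2 for i in range(1_000_000)]  # true mean is 0.5
est = sample_mean(data, 10_000, rng)      # close to 0.5 with high probability
```

Note that the number of queries is independent of the data size n; this is the sense in which the mean costs "O(1) queries" on the slide.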

Data Streams
The program streams through the data x (n is very large), using limited memory, and outputs an approximation of f(x).
Examples:
- Mean: O(1) memory
- Parity: 1 bit of memory
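A minimal sketch of both streaming examples (illustrative code, not from the talk): the mean needs only a running sum and counter, and parity needs a single bit, no matter how long the stream is.

```python
def stream_mean_and_parity(stream):
    # O(1) memory overall: a running sum and count for the mean,
    # and one bit (XOR accumulator) for the parity.
    total, count, parity = 0, 0, 0
    for bit in stream:
        total += bit
        count += 1
        parity ^= bit
    return total / count, parity

mean, parity = stream_mean_and_parity([1, 0, 1, 1, 0, 1])
```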

Sketching
Compress each data segment (Data1, Data2) into a small "sketch" (Sketch1, Sketch2); compute an approximation of f over the sketches.
Examples:
- Equality: O(1)-size sketch
- Hamming distance: O(1)-size sketch
- L_p distance (p > 2): Θ(n^(1-2/p))-size sketch
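As one concrete instance of an O(1)-size equality sketch (a standard construction, not necessarily the talk's; names are illustrative): both parties evaluate a polynomial fingerprint of their string at a random point shared via a common seed. Equal strings always produce equal sketches; distinct strings of length L collide with probability at most (L-1)/(P-1) over the choice of evaluation point.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime modulus

def equality_sketch(s, seed):
    # Constant-size fingerprint: evaluate sum_i ord(s[i]) * x^i mod P
    # at a random x derived from the shared seed (public randomness).
    x = random.Random(seed).randrange(1, P)
    h = 0
    for ch in reversed(s):  # Horner's rule
        h = (h * x + ord(ch)) % P
    return h

a = equality_sketch("massive data", seed=42)
b = equality_sketch("massive data", seed=42)  # same string -> same sketch
c = equality_sketch("massive dana", seed=42)  # differs in one character
```

The sketch is a single number mod P, so its size is independent of the string length.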

Algorithms for Massive Data Sets
- Sampling: mean and other moments; median and other quantiles; volume estimations; histograms; graph problems; low-rank matrix approximations
- Data streams: frequency moments; distinct elements; functional approximations; geometric problems; graph problems; database problems
- Sketching: equality; Hamming distance; edit distance; L_p distance

Our Goal
Study the limits of computing over massive data sets:
- Query complexity lower bounds
- Data stream memory lower bounds
- Sketch size lower bounds
Main tools: communication complexity, information theory, statistics.

Communication Complexity [Yao 79]
Alice and Bob exchange messages m_1, m_2, m_3, m_4, ...; the "transcript" Π(a, b) is what the referee sees.
cost(Π) = Σ_i |m_i|
CC(f) = min over protocols Π that compute f of cost(Π)

Communication Complexity View of Sampling
Alice sends queries i_1, i_2, ...; Bob answers x[i_1], x[i_2], ...; from the "transcript" Π(x), the referee outputs an approximation of f(x).
cost(Π) = number of queries
QC(f) = min over protocols Π that compute f of cost(Π)

Information Complexity [Chakrabarti, Shi, Wirth, Yao 01]
Let μ be a distribution on inputs to f, and let X be a random variable with distribution μ.
icost_μ(Π) = I(X; Π(X));  IC_μ(f) = min over protocols Π that compute f of icost_μ(Π)
Information complexity: the minimum amount of information a protocol that computes f has to reveal about its inputs.
Note: for some functions, any protocol must reveal much more information about X than just f(X).

CC Lower Bounds via IC Lower Bounds
Useful properties of information complexity:
- It lower-bounds communication complexity.
- It is amenable to "direct sum" decompositions.
Framework for bounding CC via IC:
1. Find an appropriate "hard input distribution" μ.
2. Prove a lower bound on IC_μ(f):
   a. Decomposition: decompose IC_μ(f) into "simple" information quantities.
   b. Basis: prove a lower bound on the simple quantities.

Applications of Information Complexity
- Data streams [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03]
- Sampling [Bar-Yossef 03]
- Communication complexity and decision tree complexity [Jayram, Kumar, Sivakumar 03]
- Quantum communication complexity [Jain, Radhakrishnan, Sen 03]
- Cell probe [Sen 03], [Chakrabarti, Regev 04]
- Simultaneous messages [Chakrabarti, Shi, Wirth, Yao 01]

The "Election Problem"
Input: a sequence x of n votes to k parties.
f(x) = the vote distribution of x (example bar chart: 7/18, 4/18, 3/18, 2/18, 1/18, ... for n = 18, k = 6).
Goal: output D such that ||D - f(x)|| < ε.
Theorem: QC(f) = Θ(k/ε²)
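The matching upper bound is simple: sample O(k/ε²) votes uniformly and output the empirical distribution. A minimal Python sketch (illustrative names and an arbitrary example vote sequence, not the talk's data):

```python
import random
from collections import Counter

def estimate_vote_distribution(votes, k, eps, rng):
    # Empirical distribution from O(k / eps^2) uniform samples.
    q = int(4 * k / eps**2)
    counts = Counter(votes[rng.randrange(len(votes))] for _ in range(q))
    return [counts[party] / q for party in range(k)]

rng = random.Random(1)
# An illustrative election: n = 18 votes over k = 6 parties.
votes = [0]*6 + [1]*4 + [2]*3 + [3]*2 + [4]*2 + [5]*1
est = estimate_vote_distribution(votes, k=6, eps=0.1, rng=rng)
```

The lower-bound argument on the next slides shows this sample complexity cannot be improved beyond constant factors.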

Sampling Lower Bound
Lemma 1 (normal-form lemma): WLOG, in any protocol Π that computes the election problem, the queries are uniformly distributed and independent.
Π(x): transcript of the full protocol. Λ(x): transcript of a single-random-query "protocol".
If cost(Π) = q, then Π(x) = (Λ(x), ..., Λ(x)) (q independent copies).

Sampling Lower Bound (cont.)
Lemma 2 (decomposition lemma): for any X, I(X; Π(X)) ≤ q · I(X; Λ(X)).
I(X; Π(X)): information cost of Π w.r.t. X. I(X; Λ(X)): information cost of Λ w.r.t. X.
Therefore, q ≥ I(X; Π(X)) / I(X; Λ(X)).
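The decomposition lemma is an instance of subadditivity of mutual information for observations that are independent given X. Writing Π(X) = (Λ_1(X), ..., Λ_q(X)) for the q i.i.d. single-query transcripts (my notation), a sketch of the step:

```latex
\begin{aligned}
I(X;\Pi(X)) &= H(\Lambda_1,\dots,\Lambda_q) - H(\Lambda_1,\dots,\Lambda_q \mid X) \\
&\le \sum_{j=1}^{q} H(\Lambda_j) - \sum_{j=1}^{q} H(\Lambda_j \mid X)
  \qquad \text{(subadditivity of entropy; the } \Lambda_j \text{ are independent given } X\text{)} \\
&= \sum_{j=1}^{q} I(X;\Lambda_j(X)) \;=\; q \cdot I(X;\Lambda(X)).
\end{aligned}
```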

Combinatorial Designs
A family of subsets B_1, ..., B_m of a universe U such that:
1. Each of them constitutes half of U.
2. The intersection of any two of them is relatively small.
Fact: there exist designs of size exponential in |U| (from constant-rate, constant relative minimum distance binary error-correcting codes).

Hard Input Distribution for the Election Problem
Let B_1, ..., B_m be a family of subsets of {1, ..., k} that form a design of size m = 2^Ω(k).
X is chosen uniformly among x_1, ..., x_m, where in x_i:
- ½ + ε of the votes are split among the parties in B_i.
- ½ - ε of the votes are split among the parties in B_i^c.
1. Unique decoding: for every i ≠ j, ||f(x_i) - f(x_j)|| > 2ε. Therefore, I(X; Π(X)) ≥ Ω(H(X)) = Ω(k).
2. Low diameter: I(X; Λ(X)) = O(ε²).
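Combining the two properties with the decomposition bound from the previous slide gives the theorem (writing, as before, Π for the full transcript and Λ for a single query):

```latex
q \;\ge\; \frac{I(X;\Pi(X))}{I(X;\Lambda(X))}
  \;\ge\; \frac{\Omega(k)}{O(\varepsilon^2)}
  \;=\; \Omega\!\left(\frac{k}{\varepsilon^2}\right).
```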

Conclusions
- Information theory plays an increasingly major role in complexity theory.
- More applications of information complexity?
- Can we use deeper information theory in complexity theory?

Thank You