Information Theory for Data Streams David P. Woodruff IBM Almaden.

Talk Outline 1. Information Theory Concepts 2. Distances Between Distributions 3. An Example Communication Lower Bound: Randomized 1-way Communication Complexity of the INDEX Problem

Discrete Distributions

Entropy (symmetric)

Conditional and Joint Entropy

Chain Rule for Entropy
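The chain rule H(X, Y) = H(X) + H(Y | X) can be checked numerically on a small joint distribution. A minimal sketch (the joint table is an illustrative choice, not taken from the slides):

```python
import math

def H(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Toy joint distribution of (X, Y) as a dict {(x, y): probability}.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
ys = sorted({y for _, y in joint})

# Marginal of X.
px = {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p

# H(Y | X) = sum_x Pr[X = x] * H(Y | X = x).
H_Y_given_X = sum(p * H([joint.get((x, y), 0.0) / p for y in ys])
                  for x, p in px.items())

H_joint = H(list(joint.values()))
chain = H(list(px.values())) + H_Y_given_X
print(H_joint, chain)  # both sides of the chain rule agree
```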

Conditioning Cannot Increase Entropy

Conditioning Cannot Increase Entropy

Mutual Information
(Mutual Information) I(X ; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = I(Y ; X)
Note: I(X ; X) = H(X) − H(X | X) = H(X)
(Conditional Mutual Information) I(X ; Y | Z) = H(X | Z) − H(X | Y, Z)
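The two expressions for I(X ; Y) are equal because both reduce to H(X) + H(Y) − H(X, Y), and I(X ; Y) ≥ 0 is exactly the statement that conditioning cannot increase entropy. A numeric check on an arbitrary illustrative joint distribution:

```python
import math

def H(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})
px = [sum(joint[(x, y)] for y in ys) for x in xs]
py = [sum(joint[(x, y)] for x in xs) for y in ys]
Hxy = H(list(joint.values()))

# By the chain rule, H(X | Y) = H(X, Y) - H(Y) and H(Y | X) = H(X, Y) - H(X).
I1 = H(px) - (Hxy - H(py))   # H(X) - H(X | Y)
I2 = H(py) - (Hxy - H(px))   # H(Y) - H(Y | X)
print(I1, I2)  # equal: mutual information is symmetric
# I1 >= 0, i.e. H(X | Y) <= H(X): conditioning cannot increase entropy.
```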

Chain Rule for Mutual Information

Fano’s Inequality: if X → Y → X’ and P_e = Pr[X’ ≠ X], then H(X | X’) ≤ H(P_e) + P_e log(|X| − 1), where |X| is the support size of X. Here X → Y → X’ is a Markov chain, meaning X’ and X are independent given Y. “Past and future are conditionally independent given the present.” To prove Fano’s inequality, we need the data processing inequality.
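For a symmetric error pattern Fano’s inequality holds with equality, which is the tightness phenomenon discussed on the later slides. A small numeric check, assuming (as an illustration) a uniform X on k = 4 symbols and a predictor X’ that errs with probability p, spreading its errors uniformly over the other k − 1 symbols:

```python
import math

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

k, p = 4, 0.2  # 4 symbols, error probability p

# Symmetric errors: given X' = x', X equals x' w.p. 1-p, else uniform over the
# other k-1 symbols, so H(X | X') is the entropy of (1-p, p/(k-1), ..., p/(k-1)).
H_X_given_Xp = -((1 - p) * math.log2(1 - p) + p * math.log2(p / (k - 1)))

fano = Hb(p) + p * math.log2(k - 1)  # right-hand side of Fano's inequality
print(H_X_given_Xp, fano)  # equal: the symmetric error pattern makes Fano tight
```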

Data Processing Inequality

Proof of Fano’s Inequality

Tightness of Fano’s Inequality

Tightness of Fano’s Inequality

Talk Outline 1.Information Theory Concepts 2.Distances Between Distributions 3.An Example Communication Lower Bound – Randomized 1-way Communication Complexity of the INDEX problem

Distances Between Distributions
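The formulas on this slide were lost in the transcript. As a reference point, here is one common set of conventions (total variation as half the ℓ1 distance, Hellinger with the 1/2 normalization, Jensen-Shannon in bits), computed for two illustrative distributions:

```python
import math

P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]

# Total variation distance: half the L1 distance.
tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

# Hellinger distance: h(P,Q)^2 = (1/2) * sum (sqrt(p) - sqrt(q))^2.
hell = math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q)))

def kl(a, b):
    """KL divergence in bits."""
    return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

# Jensen-Shannon divergence: average KL to the midpoint distribution.
M = [(p + q) / 2 for p, q in zip(P, Q)]
js = 0.5 * kl(P, M) + 0.5 * kl(Q, M)
print(tv, hell, js)
```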

Why Hellinger Distance?

Product Property of Hellinger Distance
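The product property is usually stated via the Bhattacharyya coefficient BC(P, Q) = Σ √(pq): with the normalization above, 1 − h²(P, Q) = BC(P, Q), and BC factors over product distributions. A quick numeric check (the distributions are arbitrary illustrative choices):

```python
import math, itertools

def bc(P, Q):
    """Bhattacharyya coefficient; equals 1 - h(P,Q)^2."""
    return sum(math.sqrt(p * q) for p, q in zip(P, Q))

P, Q = [0.7, 0.3], [0.4, 0.6]
P2, Q2 = [0.5, 0.5], [0.9, 0.1]

# Product distributions over the pair space, enumerated in matching order.
PP = [p1 * p2 for p1, p2 in itertools.product(P, P2)]
QQ = [q1 * q2 for q1, q2 in itertools.product(Q, Q2)]

print(bc(PP, QQ), bc(P, Q) * bc(P2, Q2))  # equal: BC factors over products
```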

Jensen-Shannon Distance

Relations Between Distance Measures (a distinguisher’s best success probability is ½ + δ/2 when the total variation distance is δ)
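The diagram on this slide is lost in the transcript. One standard set of relations, with total variation δ and Hellinger distance h as defined earlier, is h²(P, Q) ≤ δ(P, Q) ≤ h(P, Q)·√(2 − h²(P, Q)). A randomized numeric check over arbitrary distributions:

```python
import math, random

random.seed(0)
for _ in range(100):
    # Two random distributions on 5 points.
    P = [random.random() for _ in range(5)]
    Q = [random.random() for _ in range(5)]
    sP, sQ = sum(P), sum(Q)
    P = [p / sP for p in P]
    Q = [q / sQ for q in Q]

    tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))
    h2 = 1 - sum(math.sqrt(p * q) for p, q in zip(P, Q))  # squared Hellinger
    h = math.sqrt(h2)

    # h^2 <= TV <= h * sqrt(2 - h^2)
    assert h2 <= tv + 1e-12
    assert tv <= h * math.sqrt(2 - h2) + 1e-12

print("relations hold on 100 random pairs")
```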

Talk Outline 1. Information Theory Concepts 2. Distances Between Distributions 3. An Example Communication Lower Bound: Randomized 1-way Communication Complexity of the INDEX Problem

Randomized 1-Way Communication Complexity. The INDEX problem: Alice has x ∈ {0,1}^n, Bob has j ∈ {1, 2, 3, …, n}, and Bob must output x_j.

1-Way Communication Complexity of Index

1-Way Communication of Index Continued

Typical Communication Reduction. Alice has a ∈ {0,1}^n and creates a stream s(a); Bob has b ∈ {0,1}^n and creates a stream s(b). Lower Bound Technique: 1. Run the streaming algorithm Alg on s(a) and transmit the state of Alg(s(a)) to Bob. 2. Bob continues the run to compute Alg(s(a), s(b)). 3. If Bob can then solve g(a,b), the space complexity of Alg is at least the 1-way communication complexity of g.

Example: Distinct Elements. Given a_1, …, a_m in [n], how many distinct numbers are there? Index problem: Alice has a bit string x ∈ {0, 1}^n, Bob has an index i ∈ [n], and Bob wants to know if x_i = 1. Reduction: s(a) = i_1, …, i_r, where i_j appears if and only if x_{i_j} = 1; s(b) = i. If Alg(s(a), s(b)) = Alg(s(a)) + 1 then x_i = 0; otherwise x_i = 1. So the space complexity of Alg is at least the 1-way communication complexity of Index.
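The reduction can be simulated directly. In this sketch an exact distinct-elements counter stands in for the streaming algorithm (the function names and the exact counter are illustrative assumptions, not part of the slides):

```python
def reduction_answer(x, i):
    """Recover bit x[i] (0-indexed here) via a distinct-elements counter."""
    s_a = [j for j, bit in enumerate(x) if bit == 1]  # Alice's stream: indices j with x_j = 1
    before = len(set(s_a))                            # Alg(s(a)): distinct count of Alice's stream
    after = len(set(s_a + [i]))                       # Alg(s(a), s(b)): Bob appends s(b) = i
    return 0 if after == before + 1 else 1            # count grew by one -> x_i was 0

x = [1, 0, 1, 1, 0]
recovered = [reduction_answer(x, i) for i in range(len(x))]
print(recovered)  # recovers x exactly
```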

Strengthening Index: Augmented Indexing. The Augmented-Index problem: Alice has x ∈ {0, 1}^n; Bob has i ∈ [n] together with x_1, …, x_{i-1}; Bob wants to learn x_i. A similar proof shows an Ω(n) bound. Letting M be Alice's message: I(M ; X) = Σ_i I(M ; X_i | X_{<i}) = n − Σ_i H(X_i | M, X_{<i}). By Fano’s inequality, H(X_i | M, X_{<i}) ≤ H(δ), since Bob predicts X_i with probability ≥ 1 − δ from M and X_{<i}. Hence CC_δ(Augmented-Index) ≥ I(M ; X) ≥ n(1 − H(δ)).

Lower Bounds for Counting with Deletions

Gap-Hamming Problem. Alice has x ∈ {0,1}^n, Bob has y ∈ {0,1}^n. Promise: the Hamming distance satisfies Δ(x,y) > n/2 + εn or Δ(x,y) < n/2 − εn. Lower bound of Ω(ε^{-2}) for randomized 1-way communication [Indyk, W], [W], [Jayram, Kumar, Sivakumar]. Gives an Ω(ε^{-2}) bit lower bound for approximating the number of distinct elements. Same for 2-way communication [Chakrabarti, Regev].

Gap-Hamming From Index [JKS]. Alice has x ∈ {0,1}^t, Bob has i ∈ [t], with t = ε^{-2}. Public coin: r^1, …, r^t, each uniform in {0,1}^t. Alice sets a_k = Majority of r^k_j over j such that x_j = 1; Bob sets b_k = r^k_i. Then E[Δ(a,b)] deviates from t/2 by Θ(x_i · t^{1/2}), so solving Gap-Hamming on (a, b) reveals x_i.
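Under this convention, when x_i = 1 each a_k agrees with b_k = r^k_i slightly more often than not, so the expected Hamming distance falls below t/2 by about √t. A Monte Carlo sketch with illustrative parameters (odd t and an odd number of ones so majorities are well defined):

```python
import random
random.seed(1)

t = 101                       # t = eps^{-2}
w = 51                        # number of ones in x
ones = random.sample(range(t), w)
x = [1 if j in set(ones) else 0 for j in range(t)]
i = ones[0]                   # Bob's index, chosen here so that x_i = 1

def run_once():
    # Public coins: r^1, ..., r^t, each a uniform string in {0,1}^t.
    r = [[random.randint(0, 1) for _ in range(t)] for _ in range(t)]
    a = [1 if 2 * sum(r[k][j] for j in ones) > w else 0 for k in range(t)]  # a_k = Maj_{j: x_j=1} r^k_j
    b = [r[k][i] for k in range(t)]                                         # b_k = r^k_i
    return sum(ak != bk for ak, bk in zip(a, b))                            # Hamming distance

trials = 200
avg = sum(run_once() for _ in range(trials)) / trials
print(avg)  # noticeably below t/2, since the majority correlates with r^k_i when x_i = 1
```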