Sketching and Streaming Entropy via Approximation Theory
Nick Harvey (MSR/Waterloo), Jelani Nelson (MIT), Krzysztof Onak (MIT)

Streaming Model

A vector x ∈ ℤ^n starts as x = (0, 0, 0, 0, …, 0) and receives m updates, each incrementing one coordinate: "increment x_1" gives x = (1, 0, 0, 0, …, 0), "increment x_4" gives x = (1, 0, 0, 1, …, 0), and so on until, say, x = (9, 2, 0, 5, …, 12).
Goal: compute statistics of x, e.g. ||x||_1, ||x||_2, …
Trivial solution: store x (or store all updates), using O(n·log(m)) space.
Goal: compute using O(polylog(nm)) space.
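
As a concrete baseline, here is a minimal sketch of the trivial solution (the class and method names are hypothetical, not from the talk): store x explicitly and answer queries exactly. Its O(n·log m) space is exactly what the algorithms below avoid.

```python
from collections import defaultdict

class TrivialStream:
    """Trivial solution: store x explicitly, using O(n log m) space."""
    def __init__(self):
        self.x = defaultdict(int)

    def increment(self, i):
        self.x[i] += 1

    def l1(self):
        """||x||_1, computed exactly from the stored vector."""
        return sum(abs(v) for v in self.x.values())

s = TrivialStream()
for i in [1, 4, 1, 7]:      # the update stream: increment x_i
    s.increment(i)
print(s.l1())               # 4
```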

Streaming Algorithms (a very brief introduction)

Fact: [Alon-Matias-Szegedy '99], [Bar-Yossef et al. '02], [Indyk-Woodruff '05], [Bhuvanagiri et al. '06], [Indyk '06], [Li '08], [Li '09]
Can compute an estimate F̃_p = (1±ε)F_p, where F_p = Σ_i |x_i|^p, using:
– O(ε^{-2} log^c n) bits of space (if 0 ≤ p ≤ 2)
– O(ε^{-O(1)} n^{1-2/p} · log^{O(1)}(n)) bits (if 2 < p)
Another Fact: these bounds are mostly optimal: [Alon-Matias-Szegedy '99], [Bar-Yossef et al. '02], [Saks-Sun '02], [Chakrabarti-Khot-Sun '03], [Indyk-Woodruff '03], [Woodruff '04]
– Proofs using communication complexity and information theory

Practical Motivation

General goal: dealing with massive data sets (internet traffic, large databases, …)
Network monitoring & anomaly detection:
– Stream consists of internet packets; x_i = # packets sent to port i
– Under typical conditions, x is very concentrated
– Under a "port scan attack", x is less concentrated
– Can detect by estimating the empirical entropy [Lakhina et al. '05], [Xu et al. '05], [Zhao et al. '07]

Entropy

Probability distribution a = (a_1, a_2, …, a_n)
Entropy: H(a) = −Σ_i a_i lg(a_i)
Examples:
– a = (1/n, 1/n, …, 1/n): H(a) = lg(n)
– a = (0, …, 0, 1, 0, …, 0): H(a) = 0
Entropy is small when the distribution is concentrated, LARGE when it is not.
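
In code, the definition and both examples look as follows (a direct transcription, with the usual convention 0·lg(0) = 0):

```python
import math

def shannon_entropy(a):
    """H(a) = -sum_i a_i * lg(a_i), with the convention 0 * lg(0) = 0."""
    return -sum(p * math.log2(p) for p in a if p > 0)

n = 8
print(shannon_entropy([1.0 / n] * n))   # uniform distribution: lg(n) = 3.0
print(shannon_entropy([0, 0, 1, 0]))    # point mass: 0.0
```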

Streaming Algorithms for Entropy

How much space to estimate H(x)?
– [Guha-McGregor-Venkatasubramanian '06], [Chakrabarti-Do Ba-Muthu '06], [Bhuvanagiri-Ganguly '06]
– [Chakrabarti-Cormode-McGregor '07]:
  multiplicative (1±ε) approx: O(ε^{-2} log^2 m) bits
  additive ε approx: O(ε^{-2} log^4 m) bits
  Ω(ε^{-2}) lower bound for both
Our contributions:
– Additive ε or multiplicative (1±ε) approximation
– Õ(ε^{-2} log^3 m) bits, and can handle deletions
– Can sketch entropy in the same space

First Idea

If you can estimate F_p for p ≈ 1, then you can estimate H(x). Why? Rényi entropy.

Review of Rényi Entropy

Definition: H_p(x) = lg(Σ_i a_i^p) / (1 − p), where a_i = x_i / ||x||_1.
Convergence to Shannon: H_p(x) → H(x) as p → 1.
[Plot: H_p(x) as a function of p, approaching H(x) at p = 1. Photos: Alfred Rényi, Claude Shannon.]
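
A quick numerical check of the convergence, using the normalized definition above:

```python
import math

def renyi_entropy(a, p):
    """Renyi entropy H_p(a) = lg(sum_i a_i^p) / (1 - p), for p != 1."""
    return math.log2(sum(q ** p for q in a if q > 0)) / (1.0 - p)

a = [0.5, 0.25, 0.125, 0.125]
for p in [2.0, 1.5, 1.1, 1.01, 1.001]:
    print(p, renyi_entropy(a, p))                     # approaches H(a) as p -> 1
print("Shannon:", -sum(q * math.log2(q) for q in a))  # 1.75
```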

Overview of Algorithm

Set p = 1.01. Compute F̃_p = (1±ε)F_p (using Li's "compressed counting"), and from it the estimate H̃_p(x) = lg(F̃_p / F̃_1^p) / (1 − p). Output H̃_p(x) as the estimate of H(x).
Analysis: as p → 1, the approximation H_p(x) ≈ H(x) gets better, but the 1/(1−p) factor amplifies the (1±ε) error in F̃_p, so the estimate gets worse!
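
A sketch of this single-Rényi estimator, with exact moments plus simulated (1±ε) noise standing in for the compressed-counting sketch (the sketch itself is not implemented here):

```python
import math, random

def renyi_from_moments(F_p, F_1, p):
    """H_p(x) = lg(F_p / F_1^p) / (1 - p): the Renyi entropy of x / ||x||_1
    written in terms of the moments F_p and F_1."""
    return math.log2(F_p / F_1 ** p) / (1.0 - p)

x = [9, 2, 5, 12, 1, 1]                             # frequency vector
p, eps = 1.01, 0.01
F_1 = sum(x)
F_p = sum(v ** p for v in x)
F_p_noisy = F_p * (1 + random.uniform(-eps, eps))   # simulated sketch error
print(renyi_from_moments(F_p_noisy, F_1, p))        # estimate of H(x)
print(-sum(v / F_1 * math.log2(v / F_1) for v in x))  # exact H(x)
```

With p = 1.01, a 1% error in F̃_p shifts the output by up to lg(1.01)/|1−p| ≈ 1.4 bits: exactly the blow-up the analysis above warns about.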

Making the Tradeoff

How quickly does H_p(x) converge to H(x)?
Theorem: Let x be a distribution with min_i x_i ≥ 1/m. Then taking p sufficiently close to 1 (as a function of ε and lg m, with separate settings for the multiplicative and additive cases) makes H_p(x) an additive-ε (resp. multiplicative (1±ε)) approximation of H(x).
Plugging in: O(ε^{-3} log^4 m) bits of space suffice for an additive ε approximation.

Proof: A Trick Worth Remembering

Let f : ℝ → ℝ and g : ℝ → ℝ be such that f(1) = g(1) = 0. l'Hôpital's rule says that lim_{p→1} f(p)/g(p) = lim_{p→1} f′(p)/g′(p). It actually says more! It says f(p)/g(p) converges to that limit at least as fast as f′(p)/g′(p) does.

Improvements

Status: additive ε approximation using O(ε^{-3} log^4 m) bits. How to reduce the space further?
– Interpolate with multiple points: H_{p_1}(x), H_{p_2}(x), …
[Plot: H_p(x) vs. p. Legend: Shannon (the limit at p = 1), multiple Rényis (interpolation points), single Rényi.]

Analyzing Interpolation

Let f(z) be a C^{k+1} function. Interpolate f with the polynomial q satisfying q(z_i) = f(z_i), 0 ≤ i ≤ k.
Fact: f(y) − q(y) = (f^{(k+1)}(ξ) / (k+1)!) · Π_{i=0}^{k} (y − z_i) for some ξ, where y, z_i, ξ ∈ [a, b].
Our case: set f(z) = H_{1+z}(x). Goal: analyze f^{(k+1)}(z).
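
A small, self-contained Lagrange interpolation helper of the kind this step needs (lagrange_eval is a hypothetical name; the example interpolates exp at nodes left of 0, then extrapolates to 0):

```python
import math

def lagrange_eval(zs, fs, y):
    """Evaluate at y the unique degree-k polynomial q with q(zs[i]) = fs[i]."""
    total = 0.0
    for i, zi in enumerate(zs):
        w = fs[i]
        for j, zj in enumerate(zs):
            if j != i:
                w *= (y - zj) / (zi - zj)   # Lagrange basis polynomial at y
        total += w
    return total

zs = [-0.3, -0.2, -0.1, -0.05]
print(lagrange_eval(zs, [math.exp(z) for z in zs], 0.0))   # ~ exp(0) = 1
```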

Bounding Derivatives

Rényi derivatives are messy to analyze. Switch to Tsallis entropy: define f(z) = S_{1+z}(x), where S_{1+z}(x) = (1 − F_{1+z}/F_1^{1+z}) / z. Can prove Tsallis also converges to Shannon as z → 0.
Fact: on the interval [a, b] with a = −O(1/(k·log m)) and b = 0, the derivatives f^{(k+1)} are bounded well enough that one can set k = log(1/ε) + loglog m.
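
Tsallis entropy in code, checking the convergence (note this form converges to Shannon entropy in nats; divide by ln 2 for bits):

```python
import math

def tsallis(a, q):
    """Tsallis entropy S_q(a) = (1 - sum_i a_i^q) / (q - 1)."""
    return (1.0 - sum(p ** q for p in a if p > 0)) / (q - 1.0)

a = [0.5, 0.25, 0.125, 0.125]
for z in [0.1, 0.01, 0.001]:
    print(1 + z, tsallis(a, 1 + z))                       # q = 1 + z -> 1
print("Shannon (nats):", -sum(p * math.log(p) for p in a))
```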

Key Ingredient: Noisy Interpolation

We don't have f(z_i); we have f(z_i) ± ε. How to interpolate in the presence of noise? Idea: we pick our z_i very carefully.

Chebyshev Polynomials

Rogosinski's Theorem: if q(x) has degree k and |q(β_j)| ≤ 1 at the Chebyshev extreme points β_j = cos(jπ/k) (0 ≤ j ≤ k), then |q(x)| ≤ |T_k(x)| for |x| > 1.
Map [−1, 1] onto the interpolation interval [z_0, z_k], and choose z_j to be the image of β_j, j = 0, …, k. Let q̃(z) interpolate the noisy values f(z_j) ± ε and let q(z) interpolate the exact values f(z_j). Then r(z) = (q̃(z) − q(z)) / ε satisfies Rogosinski's conditions!
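
A numerical illustration of why Chebyshev-spaced nodes tame the noise: alternating ±ε noise at the nodes (essentially the worst case) is amplified at the evaluation point 0 by roughly |T_k(preimage(0))|. The parameters k, z_0, z_k here are illustrative, not the paper's.

```python
import math
import numpy as np

k, z0, zk = 6, -0.5, -0.01
# z_j = images of the Chebyshev extreme points beta_j = cos(j*pi/k)
# under the affine map sending [-1, 1] onto [z0, zk].
zs = [z0 + (zk - z0) * (math.cos(j * math.pi / k) + 1) / 2 for j in range(k + 1)]

eps = 1e-3
noise = [eps * (-1) ** j for j in range(k + 1)]            # worst-case-ish noise
r = np.polynomial.polynomial.Polynomial.fit(zs, noise, k)  # exact interpolation

pre0 = 2 * (0 - z0) / (zk - z0) - 1        # preimage of 0 in [-1, 1] coordinates
Tk = np.polynomial.chebyshev.Chebyshev.basis(k)
print(abs(r(0.0)) / eps)                   # actual noise amplification at 0
print(abs(Tk(pre0)))                       # the Chebyshev/Rogosinski bound
```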

Tradeoff in Choosing z_k

z_k close to 0 ⇒ |T_k(preimage(0))| still small, …but z_k close to 0 ⇒ high space complexity (estimating F_{1+z} gets more expensive as z → 0). Just how close do we need 0 and z_k to be? T_k grows quickly once its argument leaves [z_0, z_k].
[Figure: the interpolation interval [z_0, z_k] on the real line, with the evaluation point 0 just outside it.]

The Magic of Chebyshev

[Paturi '92]: T_k(1 + 1/k^c) ≤ e^{4k^{1−c/2}}. Set c = 2: then T_k(1 + 1/k^2) ≤ e^4 = O(1). It suffices to set z_k = −O(1/(k^3 log m)). Translates to Õ(ε^{-2} log^3 m) space.

The Final Algorithm (additive approximation)

– Set k = lg(1/ε) + lglg(m) and z_j = (k^2 cos(jπ/k) − (k^2+1)) / (9k^3 lg(m)) for 0 ≤ j ≤ k
– Estimate S̃_{1+z_j} = (1 − F̃_{1+z_j}/(F̃_1)^{1+z_j}) / z_j for 0 ≤ j ≤ k
– Interpolate the degree-k polynomial q̃ with q̃(z_j) = S̃_{1+z_j}
– Output q̃(0)
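
Putting the slides together as a simulation: exact moments F_p = Σ_i x_i^p stand in for the (1±ε) compressed-counting sketches, so this is an illustration of the recipe, not a small-space implementation. Like the Tsallis form above, it returns entropy in nats.

```python
import math

def estimate_entropy(x, eps):
    """The final algorithm's recipe, with exact moments in place of sketches."""
    m = sum(x)                                    # F_1 (total stream length)
    k = max(2, round(math.log2(1 / eps) + math.log2(math.log2(m))))
    zs = [(k**2 * math.cos(j * math.pi / k) - (k**2 + 1)) / (9 * k**3 * math.log2(m))
          for j in range(k + 1)]
    # Tsallis estimates S_{1+z_j} = (1 - F_{1+z_j} / F_1^{1+z_j}) / z_j.
    S = [(1 - sum(v ** (1 + z) for v in x) / m ** (1 + z)) / z for z in zs]
    # Interpolate the degree-k polynomial q(z_j) = S_j and output q(0).
    q0 = 0.0
    for i, zi in enumerate(zs):
        w = S[i]
        for j, zj in enumerate(zs):
            if j != i:
                w *= (0.0 - zj) / (zi - zj)
        q0 += w
    return q0

x = [9, 2, 5, 12, 1, 1, 40, 3]
print(estimate_entropy(x, 0.01))
a = [v / sum(x) for v in x]
print(-sum(p * math.log(p) for p in a))           # exact H(x) in nats
```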

Multiplicative Approximation

How to get a multiplicative approximation?
– An additive ε approximation is already multiplicative, unless H(x) is small
– H(x) small ⇒ one frequency is large [CCM '07]
Suppose i* = argmax_i x_i and define the residual moments RF_p = Σ_{i≠i*} x_i^p. We combine (1±ε)RF_1 and (1±ε)RF_{1+z_j} to get (1±ε)f(z_j).
Question: How do we get (1±ε)RF_p? Two different approaches:
– A general approach (for any p, and negative frequencies)
– An approach exploiting p ≈ 1, only for nonnegative frequencies (better by a log(m) factor)

Questions / Thoughts

For what other problems can we use this "generalize-then-interpolate" strategy?
– Some non-streaming problems too?
The power of moments? The power of residual moments?
CountMin (CM '05) + CountSketch (CCF '02) → HSS (Ganguly et al.)
WANTED: faster moment estimation (some progress in [Cormode-Ganguly '07])