Lecture 4: CountSketch High Frequencies


COMS E6998-9 F15, Lecture 4: CountSketch, High Frequencies.

Plan
- Scribe?
- CountMin / CountSketch (continuing from last time)
- High frequency moments via Precision Sampling

Part 1: CountMin/CountSketch
- Let $f_i$ be the frequency of item $i$ (running example: a table of IP addresses and their frequencies).
- Last lecture: 2nd moment $\sum_i f_i^2$ (Tug of War); max / heavy hitters (CountMin).

CountMin: overall

    Algorithm CountMin:
      Initialize(L, w):
        array S[L][w], all zeros
        L hash functions h_1 ... h_L, into {1, ..., w}
      Process(int i):
        for(j=0; j<L; j++)
          S[j][h_j(i)] += 1;
      Estimator:
        foreach i in PossibleIP
          f_hat[i] = min_j( S[j][h_j(i)] );

Guarantee, for heavy hitters at threshold $\phi$ (items with $f_i / \sum_j f_j \ge \phi$):
- if $f_i / \sum_j f_j \le \phi(1-\epsilon)$: not reported;
- if $f_i / \sum_j f_j \ge \phi(1+\epsilon)$: reported as a heavy hitter.
- Space: $O\!\left(\frac{\log n}{\epsilon\phi}\right)$ cells.
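A minimal Python sketch of this algorithm; the class name CountMin and the use of Python's salted tuple hash in place of proper pairwise-independent hash functions are illustrative choices, not part of the lecture.

    import random

    class CountMin:
        """Minimal CountMin sketch; estimates never underestimate a frequency."""

        def __init__(self, L, w):
            self.L, self.w = L, w
            self.S = [[0] * w for _ in range(L)]                  # array S[L][w]
            self.salts = [random.getrandbits(64) for _ in range(L)]

        def _h(self, j, i):
            return hash((self.salts[j], i)) % self.w              # h_j(i) in {0,...,w-1}

        def process(self, i):
            for j in range(self.L):                               # S[j][h_j(i)] += 1
                self.S[j][self._h(j, i)] += 1

        def estimate(self, i):
            return min(self.S[j][self._h(j, i)] for j in range(self.L))

Usage: cm = CountMin(L=8, w=512); call cm.process(i) per stream element and cm.estimate(i) to query. With $w = O(1/(\epsilon\phi))$ and $L = O(\log n)$, the overestimate is small enough to give the $\phi(1\pm\epsilon)$ reporting guarantee above with high probability.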

Improving the time
- Can improve the query time; space degrades to $O\!\left(\frac{\log^2 n}{\epsilon\phi}\right)$.
- Idea: dyadic intervals. Level $j$ is a virtual stream with $2^j$ elements, where each virtual element aggregates $f$ over one dyadic interval of length $n/2^j$: level 0 is $[1,n]$ with frequency $\sum_{i=1}^{n} f_i$; level 1 has $[1,n/2]$ (frequency $\sum_{i=1}^{n/2} f_i$) and $[n/2+1,n]$ (frequency $\sum_{i=n/2+1}^{n} f_i$); ...; level $\log n$ is the real stream with $n$ elements $f_1, f_2, \dots, f_n$.
- Each level: one CountMin sketch on the virtual stream.
- Find heavy hitters by following the heavy intervals down the tree (a heavy element's interval is heavy at every level); a sketch of this search appears below.
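A hypothetical search routine over the dyadic tree, assuming a list sketches with one CountMin per level (as in the class above), where sketches[j].estimate(k) estimates the total frequency of the k-th interval at level j, and n is a power of two:

    import math

    def dyadic_heavy_hitters(sketches, n, threshold):
        """Walk the dyadic tree from the root, expanding only intervals whose
        estimated total frequency reaches the threshold; surviving leaves
        are the candidate heavy hitters."""
        depth = int(math.log2(n))
        stack, heavy = [(0, 0)], []          # (level, interval index); root = [1, n]
        while stack:
            level, k = stack.pop()
            if sketches[level].estimate(k) < threshold:
                continue                     # light interval: prune its whole subtree
            if level == depth:
                heavy.append(k + 1)          # leaf = the single element k+1
            else:
                stack += [(level + 1, 2 * k), (level + 1, 2 * k + 1)]
        return heavy

Since at most $O(1/\phi)$ intervals per level can be heavy, the walk touches $O(\log(n)/\phi)$ nodes instead of scanning all $n$ elements.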

CountMin: linearity
- Is CountMin linear? I.e., can we get CountMin($f' + f''$) from CountMin($f'$) and CountMin($f''$)?
- Yes: just sum the two arrays, assuming we use the same hash functions $h_j$.
- Used a lot in practice: https://sites.google.com/site/countminsketch/
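A sketch of the merge, reusing the hypothetical CountMin class above; the key assumption is that both sketches were built with the same hash functions (here, the same salts):

    def merge_countmin(a, b):
        """CountMin(f' + f'') from CountMin(f') and CountMin(f''):
        entrywise sum of the tables."""
        assert (a.L, a.w, a.salts) == (b.L, b.w, b.salts)
        merged = CountMin(a.L, a.w)
        merged.salts = a.salts               # reuse the shared hash functions
        merged.S = [[x + y for x, y in zip(ra, rb)]
                    for ra, rb in zip(a.S, b.S)]
        return merged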

CountSketch
- How about $f = f' - f''$? Or general streaming, with positive and negative updates $\delta_i$?
- "Heavy hitter" now means $f_i \ge \phi \sum_j f_j = \phi \cdot \|f\|_1$.
- The "min" estimator is an issue once counts can decrease, but the median is still OK.
- Idea to improve it further: use a Tug of War sign $r$ in each bucket ⇒ CountSketch. Better in a certain sense: contributions of other items cancel within a cell.

    Algorithm CountSketch:
      Initialize(L, w):
        array S[L][w]
        L hash functions h_1 ... h_L, into [w]
        L hash functions r_1, ..., r_L, into {±1}
      Process(int i, real delta_i):
        for(j=0; j<L; j++)
          S[j][h_j(i)] += r_j(i) * delta_i;
      Estimator:
        foreach i in PossibleIP
          f_hat[i] = median_j( r_j(i) * S[j][h_j(i)] );   // the sign r_j(i) unbiases the cell
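A minimal Python sketch, under the same illustrative hashing assumptions as the CountMin class above; note the estimator multiplies each cell by $r_j(i)$ before taking the median:

    import random
    import statistics

    class CountSketch:
        """Like CountMin, but updates carry a random sign and the
        estimator is a median instead of a min."""

        def __init__(self, L, w):
            self.L, self.w = L, w
            self.S = [[0] * w for _ in range(L)]
            self.hsalt = [random.getrandbits(64) for _ in range(L)]
            self.rsalt = [random.getrandbits(64) for _ in range(L)]

        def _h(self, j, i):
            return hash((self.hsalt[j], i)) % self.w

        def _r(self, j, i):
            return 1 if hash((self.rsalt[j], i)) & 1 else -1

        def process(self, i, delta=1):
            for j in range(self.L):          # S[j][h_j(i)] += r_j(i) * delta_i
                self.S[j][self._h(j, i)] += self._r(j, i) * delta

        def estimate(self, i):
            return statistics.median(self._r(j, i) * self.S[j][self._h(j, i)]
                                     for j in range(self.L))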

CountSketch ⇒ Compressed Sensing
- Sparse approximations: given $f \in \Re^n$, the $k$-sparse approximation $f^*$ minimizes $\|f^* - f\|$ over $k$-sparse vectors. Solution: $f^*$ = the $k$ heaviest elements of $f$.
- Compressed Sensing [Candes-Romberg-Tao'04, Donoho'04]: want to acquire a signal $f$.
- Acquisition: linear measurements (a sketch) $S(f) = Sf$.
- Goal: recover a $k$-sparse approximation $\hat f$ from $Sf$, with error guarantee $\|\hat f - f\| \le \min_{k\text{-sparse } f^*} \|f^* - f\|$.
- Theorem: a sketch of size only $O(k \cdot \log n)$ is needed!
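As a toy illustration of the recovery step (not the lecture's algorithm; the brute-force scan over all $n$ coordinates is for clarity only, and the dyadic-tree search above avoids it), one can query a CountSketch at every coordinate and keep the $k$ largest estimates:

    def k_sparse_recover(cs, n, k):
        """Hypothetical recovery: estimate every coordinate from a
        CountSketch cs and keep the k largest in magnitude."""
        est = {i: cs.estimate(i) for i in range(n)}
        top = sorted(est, key=lambda i: abs(est[i]), reverse=True)[:k]
        return {i: est[i] for i in top}      # a k-sparse approximation of f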

Signal acquisition for CS
- Single-pixel camera [Takhar-Laska-Wakin-Duarte-Baron-Sarvotham-Kelly-Baraniuk'06]: one linear measurement = one row of $S$. (Source: http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/cscam-SPIEJan06.pdf)
- CountSketch is a version of compressed sensing: set $\phi = 1/2k$; let $\hat f$ = all the heavy hitters (or the $k$ largest). Space: $O(k \log n)$.

Back to moments
- General moments: the $p$-th moment is $\sum_i f_i^p$; normalized, $\left(\sum_i f_i^p\right)^{1/p}$.
- $p = 2$: $\sum f_i^2$ in $O(\log n)$ space via Tug of War (Lecture 3).
- $p = 0$: the number of distinct elements; $O(\log n)$ [Flajolet-Martin] from Lecture 2.
- $p = 1$: $\sum |f_i|$; $O(\log n)$, will see later (for all $p \in (0,2)$).
- $p = \infty$ (normalized): $\max_i f_i$; impossible to approximate in small space, but heavy hitters are possible (Lecture 3).
- Remains: $2 < p < \infty$? Space $\Theta(n^{1-2/p} \log^2 n)$ ⇒ Precision Sampling (next).

A task: estimate a sum
- Given: $n$ quantities $a_1, a_2, \dots, a_n$ in the range $[0,1]$.
- Goal: estimate $S = a_1 + a_2 + \dots + a_n$ "cheaply".
- Standard sampling: pick a random set $J = \{j_1, \dots, j_m\}$ of size $m$; estimator $\tilde S = \frac{n}{m}\left(a_{j_1} + a_{j_2} + \dots + a_{j_m}\right)$.
- Chebyshev bound: with 90% success probability, $\tilde S - O(n/m) < S < \tilde S + O(n/m)$.
- For constant additive error, need $m = \Omega(n)$.
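A toy simulation of the standard sampling estimator (sampling with replacement, for simplicity; the function name is illustrative):

    import random

    def sampled_sum(a, m):
        """Estimate S = sum(a) from m random terms, rescaled by n/m."""
        n = len(a)
        return n / m * sum(a[random.randrange(n)] for _ in range(m))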

Precision Sampling: framework
- Alternative "access" to the $a_i$'s: for each term $a_i$, we get a (rough) estimate $\tilde a_i$ up to some precision $u_i$, chosen in advance: $|a_i - \tilde a_i| < u_i$.
- We then compute an estimate $\tilde S$ from $\tilde a_1, \tilde a_2, \dots, \tilde a_n$.
- Challenge: achieve a good trade-off between
  - the quality of the approximation to $S$, and
  - using only weak precisions $u_i$ (minimize the "cost" of estimating the $\tilde a_i$).

Formalization: a game between a sum estimator and an adversary.
1. The estimator fixes the precisions $u_i$.
2. The adversary fixes $a_1, a_2, \dots, a_n$, then fixes $\tilde a_1, \tilde a_2, \dots, \tilde a_n$ s.t. $|\tilde a_i - a_i| < u_i$.
3. Given $\tilde a_1, \tilde a_2, \dots, \tilde a_n$, the estimator outputs $\tilde S$ s.t. $\left|\sum_i a_i - \gamma \tilde S\right| < 1$ (for an approximation factor $\gamma \approx 1$).

What is the cost?
- To achieve precision $u_i$, use $1/u_i$ "resources": e.g., if $a_i$ is itself a sum $a_i = \sum_j a_{ij}$ computed by subsampling, then one needs $\Theta(1/u_i)$ samples.
- Here, average cost $= \frac{1}{n} \sum_i 1/u_i$.
- For example, one can choose all $u_i = 1/n$: average cost $\approx n$.
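In code, the cost model is just the following (a minimal helper, name illustrative):

    def average_cost(u):
        """Average cost of a precision vector: (1/n) * sum_i 1/u_i."""
        return sum(1.0 / ui for ui in u) / len(u)

    n = 1000
    print(average_cost([1.0 / n] * n))   # all u_i = 1/n gives average cost ~ n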

Precision Sampling Lemma
- Goal: estimate $S = \sum a_i$ from $\{\tilde a_i\}$ satisfying $|a_i - \tilde a_i| < u_i$.
- Precision Sampling Lemma: with 90% success probability, can get $O(1)$ additive error and $1.5$ multiplicative error, $S/1.5 - O(1) < \tilde S < 1.5 \cdot S + O(1)$, with average cost equal to $O(\log n)$.
- Example: distinguish $\sum a_i = 3$ vs $\sum a_i = 0$. Consider two extreme cases:
  - if three $a_i = 1$: enough to have a crude approximation for all of them ($u_i = 0.1$);
  - if all $a_i = 3/n$: only a few need a good approximation $u_i = 1/n$, and the rest can have $u_i = 1$.

Precision Sampling: algorithm
- Algorithm: choose each $u_i \sim \mathrm{Exp}(1)$ i.i.d. (density $e^{-x}$); estimator $\tilde S = \max_i \tilde a_i / u_i$.
- Proof of correctness:
  - Claim 1: $\max_i a_i/u_i \sim \sum_i a_i / \mathrm{Exp}(1)$; hence $\max_i \tilde a_i/u_i = \frac{\sum_i a_i}{\mathrm{Exp}(1)} \pm 1$.
  - Claim 2: the average cost is $O(\log n)$.
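Claim 1 follows from a standard fact about exponentials; filling in the step (not spelled out on the slide):

    \[
      \frac{u_i}{a_i} \sim \mathrm{Exp}(a_i), \qquad
      \min_i \frac{u_i}{a_i} \sim \mathrm{Exp}\Big(\sum_i a_i\Big), \qquad
      \max_i \frac{a_i}{u_i} = \Big(\min_i \frac{u_i}{a_i}\Big)^{-1}
        \sim \frac{\sum_i a_i}{\mathrm{Exp}(1)},
    \]

using that the minimum of independent exponentials with rates $\lambda_i$ is exponential with rate $\sum_i \lambda_i$, and that $\mathrm{Exp}(\lambda) = \mathrm{Exp}(1)/\lambda$ in distribution.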

$p$-moments via Precision Sampling
- Theorem: there is a linear sketch for the $p$-moment with $O(1)$ approximation and $O(n^{1-2/p} \log^{O(1)} n)$ space (with 90% success probability).
- Sketch: pick random $r_i \in \{\pm 1\}$ and $u_i \sim \mathrm{Exp}(1)$ (density $e^{-u}$); let $y_i = f_i \cdot r_i / u_i^{1/p}$.
- Hash the $y_i$ into a hash table $S$ with $w = O(n^{1-2/p} \log^{O(1)} n)$ cells; each cell stores the sum of the $y_i$ hashed to it (e.g., $S = [\, y_1 + y_3 \mid y_4 \mid y_2 + y_5 + y_6 \,]$ for $f = (f_1, \dots, f_6)$).
- Estimator: $\max_j |S[j]|^p$.
- The sketch is linear in $f$.

Correctness of estimation

    Algorithm PrecisionSamplingFp:
      Initialize(w):
        array S[w]
        hash function h, into [w]
        hash function r, into {±1}
        reals u_i, drawn from the Exp distribution
      Process(vector f in R^n):
        for(i=0; i<n; i++)
          S[h(i)] += f_i * r(i) / u_i^(1/p);
      Estimator:
        return max_j |S[j]|^p;

- Theorem: $\max_j |S[j]|^p$ is an $O(1)$ approximation with 90% probability, with $w = O(n^{1-2/p} \log^{O(1)} n)$ cells.
- Proof: use the Precision Sampling Lemma with $a_i = f_i^p$, so $\sum_i a_i = \sum_i f_i^p = F_p$, and $\tilde a_i / u_i = |S[h(i)]|^p$.
- Need to show $|\tilde a_i - a_i|$ is small; more precisely, that $\left| \frac{\tilde a_i}{u_i} - \frac{a_i}{u_i} \right| \le \epsilon F_p$.
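A one-shot Python simulation of this sketch (illustrative; a fresh random bucket and sign per coordinate play the role of $h(i)$ and $r(i)$, which is fine when the vector is processed exactly once):

    import random

    def fp_sketch_estimate(f, p, w):
        """Simulate the F_p estimator for p > 2: form y_i = f_i * r_i / u_i^(1/p),
        hash the y_i into w cells, and output max_j |S[j]|^p."""
        S = [0.0] * w
        for fi in f:
            r = random.choice((-1, 1))                 # r(i) in {+1, -1}
            u = random.expovariate(1.0)                # u_i ~ Exp(1)
            S[random.randrange(w)] += fi * r / u ** (1.0 / p)
        return max(abs(s) for s in S) ** p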

Correctness, part 2

Recall $y_i = f_i \cdot r_i / u_i^{1/p}$, where $r_i \in \{\pm 1\}$ and $u_i$ is an exponential r.v.

- Claim: $\left|\, |S[h(i)]|^p - f_i^p / u_i \,\right| \le O(\epsilon F_p)$.
- Consider the cell $z = h(i)$: $S[z] = f_i / u_i^{1/p} + C$, where the "chaff" is $C = \sum_{j \ne i} y_j \cdot \chi[h(j) = z]$.
- How much chaff is there? $E[C^2] \le \|y\|^2 / w$.
- What is $\|y\|^2$? $E_u\!\left[\|y\|^2\right] \le \|f\|^2 \cdot E\!\left[1/u^{2/p}\right] = \|f\|^2 \cdot O(\log n)$, and $\|f\|_2^2 \le n^{1-2/p} \|f\|_p^2$ (by Hölder's inequality over the $n$ terms).
- By Markov: $C^2 \le \|f\|_p^2 \cdot n^{1-2/p} \cdot O(\log n) / w$ with probability $> 90\%$.
- Set $w = \frac{1}{\epsilon^{2/p}} \cdot n^{1-2/p} \cdot O(\log n)$; then $|C|^p \le \|f\|_p^p \cdot \epsilon = \epsilon F_p$.

Correctness, final
- Claim: $\left|\, |S[h(i)]|^p - f_i^p / u_i \,\right| \le O(\epsilon F_p)$.
- Proved: $|S[h(i)]|^p = \left| f_i / u_i^{1/p} + C \right|^p$, where $C = \sum_{j \ne i} y_j \cdot \chi[h(j) = h(i)]$.
- Proved: $E[C^2] \le \|y\|^2 / w$; this implies $|C|^p \le \epsilon F_p$ with 90% probability for a fixed $i$.
- But we need it for all $i$! Want: $C^2 \le \beta \|y\|^2 / w$ with high probability, for some smallish $\beta$.
- Can indeed prove this for $\beta = O(\log^2 n)$ via a strong concentration inequality (Bernstein).

Recap
- CountSketch: a linear sketch for general streaming.
- $p$-moment for $p > 2$ via Precision Sampling:
  - estimating a sum from poor estimates of its terms;
  - sketch: exponential variables + CountSketch.