Mergeable Summaries
Ke Yi (HKUST), Pankaj Agarwal (Duke), Graham Cormode (Warwick), Zengfeng Huang (HKUST), Jeff Phillips (Utah), Zhewei Wei (Aarhus)



Small summaries for BIG data
- Allow approximate computation with guarantees and small space: save space, time, and communication
- Tradeoff between error and size

Summaries
Summaries allow approximate computations:
– Random sampling
– Sketches (JL transform, AMS, Count-Min, etc.)
– Frequent items
– Quantiles & histograms (ε-approximation)
– Geometric coresets
– ...

Mergeability
- Ideally, summaries are algebraic: associative, commutative
  – Allows arbitrary computation trees (shape and size unknown to the algorithm)
  – Quality remains the same
  – Similar to the MUD model [Feldman et al. SODA 08]
- Summaries should have bounded size
  – Ideally, independent of base data size
  – Sublinear in base data (logarithmic, square root)
  – Rules out the "trivial" solution of keeping the union of the inputs
- Generalizes the streaming model
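As a concrete illustration (not part of the original slides), a mergeable summary can be thought of as exposing the following minimal interface; the class and method names are illustrative only:

```python
from abc import ABC, abstractmethod

class MergeableSummary(ABC):
    """Illustrative interface: a bounded-size summary supporting merge.

    Mergeability means merge() can be applied along an arbitrary tree of
    unions without the error guarantee degrading and without the size
    growing beyond the stated bound.
    """

    @abstractmethod
    def update(self, item) -> None:
        """Absorb a single item (streaming is the special case of
        repeatedly merging with a size-1 input)."""

    @abstractmethod
    def merge(self, other: "MergeableSummary") -> "MergeableSummary":
        """Combine summaries of two data sets into a summary of their
        union, keeping the same size and error bounds."""

    @abstractmethod
    def query(self, *args):
        """Answer the approximate query the summary supports."""
```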

Application of mergeability: large-scale distributed computation, where programmers have no control over how things are merged.

Example platforms (figure slides): MapReduce; Dremel; Pregel combiners; sensor networks.

Summaries to be merged
- Random samples
- Sketches
- MinHash
- Heavy hitters
- ε-approximations (quantiles, equi-height histograms)
(Slide annotations on difficulty: easy; easy and cute; easy algorithm, analysis requires work.)

Merging random samples (animation slides): S1 is a random sample of a data set of size N1, S2 of a data set of size N2. To draw each element of the merged sample: with probability N1/(N1+N2), take a random element from S1 (and decrement N1); otherwise take one from S2 (and decrement N2). Repeat until the merged sample has the desired size.
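A minimal sketch of this merge procedure in Python, assuming S1 and S2 are uniform random samples without replacement and N1, N2 are the sizes of the data sets they were drawn from (function and variable names are illustrative):

```python
import random

def merge_samples(s1, n1, s2, n2, k):
    """Merge two size-k uniform random samples (without replacement).

    s1, s2 : lists of sampled items; n1, n2 : sizes of the data sets
    they were drawn from. Returns a size-k uniform sample of the union.
    """
    s1, s2 = list(s1), list(s2)
    merged = []
    for _ in range(k):
        # Pick the next element from S1 with probability N1/(N1+N2),
        # otherwise from S2; decrement the corresponding population size.
        if s1 and (not s2 or random.random() < n1 / (n1 + n2)):
            merged.append(s1.pop(random.randrange(len(s1))))
            n1 -= 1
        else:
            merged.append(s2.pop(random.randrange(len(s2))))
            n2 -= 1
    return merged
```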

Merging sketches
- Linear sketches (random projections) are easily mergeable
- Data is a multiset over domain [1..U], represented as a vector x[1..U]
- Count-Min sketch:
  – Creates a small summary as an array CM[i,j] of size w × d
  – Uses d hash functions to map vector entries to [1..w]
- Trivially mergeable: CM(x + y) = CM(x) + CM(y)
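A simplified Count-Min sketch illustrating the entry-wise merge CM(x + y) = CM(x) + CM(y). The hashing here (Python's built-in hash with per-row salts) is a placeholder for a proper hash family, and both sketches must share w, d, and the hash functions:

```python
import random

class CountMin:
    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        rng = random.Random(seed)                  # shared seed => shared hash functions
        self.salts = [rng.randrange(1 << 30) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]   # the w x d counter array CM[i,j]

    def update(self, item, count=1):
        for i, salt in enumerate(self.salts):
            self.table[i][hash((salt, item)) % self.w] += count

    def estimate(self, item):
        # Count-Min returns the minimum over the d rows (an overestimate).
        return min(self.table[i][hash((salt, item)) % self.w]
                   for i, salt in enumerate(self.salts))

    def merge(self, other):
        # CM(x + y) = CM(x) + CM(y): add the arrays entry-wise.
        assert (self.w, self.d, self.salts) == (other.w, other.d, other.salts)
        for i in range(self.d):
            for j in range(self.w):
                self.table[i][j] += other.table[i][j]
        return self
```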

MinHash
- Hash function h; retain the k elements with the smallest hash values
- Trivially mergeable
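A sketch of bottom-k MinHash and its merge, assuming all summaries share one hash function h into [0, 1); the SHA-1-based h below is purely illustrative:

```python
import hashlib

def h(item):
    # Illustrative hash into [0, 1); a real implementation would use a
    # chosen hash family shared by all summaries.
    digest = hashlib.sha1(str(item).encode()).hexdigest()
    return int(digest, 16) / 16**40

def minhash(items, k):
    """Bottom-k MinHash: keep the k smallest hash values."""
    return sorted(h(x) for x in set(items))[:k]

def merge_minhash(mh1, mh2, k):
    # Trivially mergeable: keep the k smallest values of the union.
    return sorted(set(mh1) | set(mh2))[:k]
```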

Summaries to be merged (outline recap: random samples, sketches, MinHash, heavy hitters, ε-approximations); next: heavy hitters.

Heavy hitters: the Misra-Gries (MG) summary with k = 5 counters (animation slides illustrating counter updates on a stream).

Streaming MG analysis
- N = total input size
- Previous analyses show that the error is at most:
  – N/(k+1) [MG'82] – standard bound, but too weak
  – F_1^{res(k)}/(k+1) [Berinde et al. TODS'10] – too strong
- M = sum of counters in the data structure
- Error in any estimated count is at most (N − M)/(k+1)
  – Estimated count is a lower bound on the true count
  – Each decrement is spread over (k+1) items: 1 new one and k in MG
  – Equivalent to deleting (k+1) distinct items from the stream
  – At most (N − M)/(k+1) decrement operations
  – Hence, at most (N − M)/(k+1) copies of any item can have been "deleted"
  – So estimated counts have at most this much error
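For concreteness, a minimal sketch (not from the slides) of the MG streaming update that the decrement argument above refers to:

```python
def mg_update(counters, item, k):
    """Misra-Gries update with at most k counters (a sketch).

    counters : dict mapping item -> counter value.
    """
    if item in counters:
        counters[item] += 1
    elif len(counters) < k:
        counters[item] = 1
    else:
        # Decrement all k counters (and implicitly the new item):
        # one unit of count is removed from k+1 distinct items.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]
```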

Merging two MG summaries (k = 5)
Merging algorithm:
– Merge the two sets of k counters in the obvious way
– Take the (k+1)-th largest counter C_{k+1} and subtract it from all counters
– Delete non-positive counters
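A sketch of this merge step in code (dictionary-based counters; names are illustrative):

```python
def mg_merge(c1, c2, k):
    """Merge two MG summaries with at most k counters each (a sketch)."""
    merged = dict(c1)
    for item, cnt in c2.items():
        merged[item] = merged.get(item, 0) + cnt   # add counters item-wise
    if len(merged) <= k:
        return merged
    # Subtract the (k+1)-th largest counter from every counter ...
    threshold = sorted(merged.values(), reverse=True)[k]
    # ... and delete counters that become non-positive.
    return {item: cnt - threshold
            for item, cnt in merged.items() if cnt - threshold > 0}
```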

Merging two MG summaries: error analysis
This algorithm gives mergeability:
– The merge subtracts at least (k+1)·C_{k+1} from the counter sums
– So (k+1)·C_{k+1} ≤ M_1 + M_2 − M_12, where M_12 is the sum of the remaining (at most k) counters
– By induction, the error is ((N_1 − M_1) + (N_2 − M_2) + (M_1 + M_2 − M_12))/(k+1) = ((N_1 + N_2) − M_12)/(k+1), as claimed; the first two terms are the prior errors, the third is the error from the merge

Compare with previous merging algorithms
Two previous merging algorithms for MG [Manjhi et al. SIGMOD'05, ICDE'05]:
– No guarantee on size
– Error increases after each merge
– Need to know the size or the height of the merging tree in advance for provisioning

Compare with previous merging algorithms: experiment on a BFS routing tree over 1024 randomly deployed sensor nodes; data drawn from a Zipf distribution (results figure).

Compare with previous merging algorithms: on a contrived example (results figure).

SpaceSaving: another heavy hitter summary
- 10+ papers on this problem
- The SpaceSaving (SS) summary also keeps k counters [Metwally et al. TODS'06]
  – If a stream item is not in the summary, overwrite the item with the least count
  – SS seems to perform better in practice than MG
- Surprising observation: SS is actually isomorphic to MG!
  – An SS summary with k+1 counters has the same information as MG with k
  – SS outputs an upper bound on the count, which tends to be tighter than the MG lower bound
- The isomorphism is proved inductively: show that every update maintains it
- Immediate corollary: SS is mergeable
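A minimal sketch of the SS update rule described above (overwrite the item with the least count, inheriting its counter):

```python
def ss_update(counters, item, k):
    """SpaceSaving update with at most k counters (a sketch).

    counters : dict mapping item -> (over)estimate of its count.
    """
    if item in counters:
        counters[item] += 1
    elif len(counters) < k:
        counters[item] = 1
    else:
        # Overwrite the item with the least count, inheriting its counter.
        loser = min(counters, key=counters.get)
        count = counters.pop(loser)
        counters[item] = count + 1
```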

Summaries to be merged (outline recap: random samples, sketches, MinHash, heavy hitters, ε-approximations); next: ε-approximations (quantiles).

ε-approximations: a more "uniform" sample
- A "uniform" sample needs 1/ε sample points
- A random sample needs Θ(1/ε^2) sample points (with constant probability)
(Figure: random sample vs. "uniform" sample.)

Quantiles (order statistics)
- Quantiles generalize the median:
  – Exact answer: CDF^{-1}(φ) for 0 < φ < 1
  – Approximate version: tolerate an answer in CDF^{-1}(φ − ε) ... CDF^{-1}(φ + ε)
  – An ε-approximation solves the dual problem: estimate CDF(x) ± ε
- Binary search to find quantiles (a sketch follows below)
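A minimal sketch of answering quantile queries from such a summary, assuming (for simplicity) the ε-approximation is given as a sorted list of equal-weight sample points; real summaries may carry weights:

```python
import bisect

def approx_cdf(sample, x):
    """Estimate CDF(x) from a sorted list of equal-weight sample points."""
    return bisect.bisect_right(sample, x) / len(sample)

def approx_quantile(sample, phi):
    """Return a sample point whose estimated CDF is at least phi."""
    # Binary search over the sample for the smallest point with CDF >= phi.
    lo, hi = 0, len(sample) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if approx_cdf(sample, sample[mid]) >= phi:
            hi = mid
        else:
            lo = mid + 1
    return sample[lo]
```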

Quantiles give an equi-height histogram
- Automatically adapts to skewed data distributions
- Equi-width histograms (fixed binning) are trivially mergeable but do not adapt to the data distribution

Previous quantile summaries
- Streaming
  – 10+ papers on this problem
  – GK algorithm: O(1/ε log n) [Greenwald and Khanna, SIGMOD'01]
  – Randomized algorithm: O(1/ε log^3(1/ε)) [Suri et al. DCG'06]
- Mergeable
  – Q-digest: O(1/ε log U) [Shrivastava et al. SenSys'04]; requires a fixed universe of size U
  – [Greenwald and Khanna, PODS'04]: error increases after each merge (not truly mergeable!)
  – New: O(1/ε log^{1.5}(1/ε)); works in the comparison model

Equal-weight merges
- A classic result (Munro-Paterson '80):
  – Base case: fill the summary with k input points
  – Input: two summaries of size k, built from data sets of the same size
  – Merge: sort the two summaries together to get size 2k, then take every other element
- Error grows proportionally to the height of the merge tree
- Randomized twist: randomly pick whether to take the odd or the even elements (a sketch follows below)
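A minimal sketch of this equal-weight merge with the randomized odd/even twist:

```python
import random

def equal_weight_merge(s1, s2):
    """Merge two equal-weight summaries of size k each (a sketch).

    Sort the 2k points together, then keep either the even- or the
    odd-indexed points (chosen at random) to return to size k.
    """
    merged = sorted(s1 + s2)
    offset = random.randint(0, 1)      # randomized odd/even twist
    return merged[offset::2]
```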

Equal-weight merge analysis: base case
- Let the resulting sample be S and the underlying data be D. Consider any interval I
- The estimate 2|I ∩ S| is unbiased and has error at most 1:
  – |I ∩ D| is even: 2|I ∩ S| has no error
  – |I ∩ D| is odd: 2|I ∩ S| has error ±1 with equal probability

Equal-weight merge analysis: multiple levels (levels i = 1, 2, 3, 4 in the figure)
- Consider the j-th merge at level i, combining L^(i-1) and R^(i-1) into S^(i)
  – The estimate is 2^i |I ∩ S^(i)|
  – The error introduced by replacing L, R with S is
    X_{i,j} = 2^i |I ∩ S^(i)| − 2^{i-1} |I ∩ (L^(i-1) ∪ R^(i-1))|   (new estimate minus old estimate)
  – Absolute error |X_{i,j}| ≤ 2^{i-1} by the previous argument
- Bound the total error over all m levels by summing the errors:
  – M = Σ_{i,j} X_{i,j} = Σ_{1≤i≤m} Σ_{1≤j≤2^{m-i}} X_{i,j}
  – max |M| grows with the number of levels
  – Var[M] doesn't! (dominated by the highest level)

Equal-sized merge analysis: Chernoff bound
- Chernoff-Hoeffding: given unbiased variables Y_j s.t. |Y_j| ≤ y_j:
    Pr[|Σ_{1≤j≤t} Y_j| > α] ≤ 2 exp(−2α^2 / Σ_{1≤j≤t} (2y_j)^2)
- Set α = h·2^m for our variables:
    2α^2 / (Σ_i Σ_j (2 max X_{i,j})^2) = 2(h·2^m)^2 / (Σ_i 2^{m-i} · 2^{2i}) = 2h^2·2^{2m} / Σ_i 2^{m+i} = 2h^2 / Σ_i 2^{i-m} = 2h^2 / Σ_i 2^{-i} ≥ h^2
- From the Chernoff bound, the error probability is at most 2 exp(−2h^2)
  – Set h = O(log^{1/2} δ^{-1}) to obtain success probability 1 − δ

Equal-sized merge analysis: finishing up
- The Chernoff bound ensures absolute error at most α = h·2^m
  – m is the number of merges = log(n/k) for summary size k
  – So the error is at most hn/k
- Set the size of each summary k to O(h/ε) = O(1/ε log^{1/2}(1/δ))
  – Guarantees εn error with probability 1 − δ for any one range
- There are O(1/ε) different ranges to consider
  – Set δ = Θ(ε) to ensure all ranges are correct with constant probability
  – Summary size: O(1/ε log^{1/2}(1/ε))

Fully mergeable ε-approximation
- Use equal-size merging in a standard logarithmic trick: merge two summaries as binary addition (sketched below; figure shows summaries of weights 1, 2, 4, 8, 16, 32)
- Fully mergeable quantiles, in O(1/ε log n log^{1/2}(1/ε))
  – n = number of items summarized, not known a priori
- But can we do better?
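A minimal sketch of the logarithmic trick, reusing equal_weight_merge from the earlier sketch: one summary is kept per weight class, and two summaries of the same weight are merged and carried to the next class, like binary addition (names are illustrative):

```python
def add_summary(levels, summary, weight_class=0):
    """Insert a summary into a dict of per-weight-class summaries,
    carrying like binary addition (a sketch).

    levels : dict mapping weight class i -> summary of weight 2^i.
    Uses equal_weight_merge from the earlier sketch.
    """
    while weight_class in levels:
        # Two summaries of the same weight: merge into one of double
        # the weight and carry it to the next class.
        summary = equal_weight_merge(levels.pop(weight_class), summary)
        weight_class += 1
    levels[weight_class] = summary

def merge_structures(levels1, levels2):
    """Fully mergeable structure: add every weight class of one into the other."""
    for weight_class, summary in levels2.items():
        add_summary(levels1, summary, weight_class)
    return levels1
```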

Hybrid summary
- Classical result: it suffices to build the summary on a random sample of size Θ(1/ε^2)
  – Problem: we don't know n in advance
- Hybrid structure:
  – Keep the top O(log(1/ε)) levels: summary size O(1/ε log^{1.5}(1/ε))
  – Also keep a "buffer" sample of O(1/ε) items
  – When the buffer is "full", extract its points as a sample of the lowest weight
(Figure: top weight classes 8, 16, 32 plus the buffer.)

ε-approximations in higher dimensions
- ε-approximations generalize to range spaces with bounded VC-dimension
  – Generalize the "odd-even" trick to low-discrepancy colorings
  – An ε-approximation for constant VC-dimension d has size Õ(ε^{-2d/(d+1)})

Other mergeable summaries: ε-kernels
- ε-kernels in d-dimensional space approximately preserve the projected extent in any direction
  – An ε-kernel has size O(1/ε^{(d-1)/2})
  – A streaming ε-kernel has size O(1/ε^{(d-1)/2} log(1/ε))
  – A mergeable ε-kernel has size O(1/ε^{(d-1)/2} log^d n)

Summary of summary sizes:

  Summary                                      | Static          | Streaming               | Mergeable
  Heavy hitters                                | 1/ε             | 1/ε                     | 1/ε
  ε-approximation (quantiles), deterministic   | 1/ε             | 1/ε log n               | 1/ε log U
  ε-approximation (quantiles), randomized      | –               | 1/ε log^{1.5}(1/ε)      | 1/ε log^{1.5}(1/ε)
  ε-kernel                                     | 1/ε^{(d-1)/2}   | 1/ε^{(d-1)/2} log(1/ε)  | 1/ε^{(d-1)/2} log^d n

Open problems
- Better bounds on mergeable ε-kernels
  – Match the streaming bound?
- Lower bounds for mergeable summaries
  – Separation from the streaming model?
- Other streaming algorithms (summaries)
  – L_p sampling
  – Coresets for minimum enclosing balls (MEB)

Thank you!

Hybrid analysis (sketch)
- Keep the buffer (sample) size at O(1/ε)
  – Accuracy is only √ε·n
  – If the buffer only summarizes O(√ε·n) points, this is OK
- The analysis is rather delicate:
  – Points go into and out of the buffer, but always move "up"
  – The number of "buffer promotions" is bounded
  – A Chernoff bound similar to before on the probability of a large error
  – Gives constant probability of accuracy in O(1/ε log^{1.5}(1/ε)) space

Models of summary construction
- Offline computation: e.g., sort the data, take percentiles
- Streaming: the summary is merged with one new item at each step
- One-way merge
  – Caterpillar graph of merges
- Equal-weight merges: can only merge summaries of the same weight
- Full mergeability (algebraic): allows arbitrary merging trees
  – Our main interest