Published by Kenna Westbury. Modified over 9 years ago.
1
Summarizing Distributed Data
Ke Yi, HKUST
[figure: + = ?]
2
Small summaries for BIG data
– Allow approximate computation with guarantees and small space
– Save space, time, and communication
– Tradeoff between error and size
3
Summarization vs (Lossy) Compression
Summarization:
– No need to decompress before making queries
– Aims at particular properties of the data
– (Usually) provides guarantees on query results
Compression:
– Need to decompress before making queries
– Aims at generic approximation of all data
– Best-effort approach; does not provide guarantees
4
Summaries
Summaries allow approximate computations:
– Random sampling
– Frequent items
– Sketches (JL transform, AMS, Count-Min, etc.)
– Quantiles & histograms
– Geometric coresets
– …
5
Large-scale distributed computation
Programmers have no control over how things are merged
6
MapReduce
7
Dremel
8
Pregel: Combiners
9
Sensor networks
10
“A major technical challenge for big data is to push summarization to edge devices.”
Jagadish et al, “Technical Challenges for Big Data”, Communications of the ACM, August 2014.
11
Two models of summary computation
Mergeability: summarization behaves like a semigroup operator
– Allows arbitrary computation trees (shape and size unknown to algorithm)
– Quality remains the same
– Any intermediate summary is valid
– Resulting summary can be further merged
– Generalizes the streaming model
Multi-party communication
– Simultaneous message passing model
– Message passing model
– Blackboard model
12
Mergeable summaries
– Sketches
– Random samples
– MinHash
– Heavy hitters
– ε-approximations (quantiles, equi-height histograms)
Difficulty labels on the slide: easy; easy and cute; easy algorithm, analysis requires work
Agarwal, Cormode, Huang, Phillips, Wei, and Yi, “Mergeable Summaries”, TODS, Nov 2013
13
Merging random samples
14
Merging random samples
15
Merging random samples
16
Merging random samples
17
Merging random samples
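The slides for this step survive only as titles, so the following is a minimal sketch of one standard way to merge two uniform random samples, not necessarily the exact procedure shown in the talk. It assumes each input is a uniform sample without replacement of size min(k, n_i) drawn from n_i items. The function name `merge_samples` and its parameters are hypothetical.

```python
import random

def merge_samples(s1, n1, s2, n2, k, rng=random):
    """Merge two uniform samples into one uniform sample of the union.

    s1 is a sample of size min(k, n1) from n1 items, s2 likewise for n2.
    Each output slot is taken from s1 with probability n1 / (n1 + n2),
    after which the corresponding population count is decremented.
    """
    a, b = list(s1), list(s2)
    assert len(a) == min(k, n1) and len(b) == min(k, n2)
    rng.shuffle(a)  # popping from a shuffled list picks a uniform element
    rng.shuffle(b)
    merged = []
    for _ in range(min(k, n1 + n2)):
        if rng.random() * (n1 + n2) < n1:
            merged.append(a.pop())
            n1 -= 1
        else:
            merged.append(b.pop())
            n2 -= 1
    return merged

rng = random.Random(0)
left = rng.sample(range(100), 5)        # sample of 5 from 100 items
right = rng.sample(range(100, 400), 5)  # sample of 5 from 300 items
print(merge_samples(left, 100, right, 300, 5, rng))
```

Because the merged sample carries its own population size (n1 + n2), it can be merged again in the same way, which is what mergeability requires.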
18
Merging sketches
– return the min
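The phrase "return the min" matches how a Count-Min sketch answers a query, so the sketch below assumes that is the structure on this slide. Because each counter is a sum over the input, two sketches built with the same hash functions merge by entrywise addition. The class `CountMin` and the hashing scheme are illustrative assumptions.

```python
import hashlib

class CountMin:
    """Tiny Count-Min sketch: depth rows of width counters."""

    def __init__(self, depth=4, width=64):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        # Deterministic per-row hash, so separate sketches agree on columns.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._col(r, item)] += count

    def query(self, item):
        # Every row overestimates, so the minimum is the best estimate.
        return min(self.rows[r][self._col(r, item)] for r in range(self.depth))

    def merge(self, other):
        assert (self.depth, self.width) == (other.depth, other.width)
        out = CountMin(self.depth, self.width)
        out.rows = [[x + y for x, y in zip(ra, rb)]
                    for ra, rb in zip(self.rows, other.rows)]
        return out

a, b = CountMin(), CountMin()
for x in "aaabbc":
    a.add(x)
for x in "aacdd":
    b.add(x)
print(a.merge(b).query("a"))  # an upper bound on the true count of "a"
```

The merged sketch is identical to the one that would have been built over the concatenated input, so the error guarantee does not degrade with the number of merges.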
19
MinHash
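Only the title of this slide survives, so the following is a minimal sketch of the usual MinHash construction under assumed names (`minhash`, `merge_minhash`, `jaccard_estimate`). A signature stores, for each hash function, the minimum hash value over the set; the signature of a union is then the coordinate-wise minimum of the two signatures.

```python
import hashlib

def _h(seed, item):
    # Deterministic 64-bit hash of item under the given seed.
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash(items, num_hashes=16):
    """Signature of a non-empty set: the minimum hash value per seed."""
    return [min(_h(seed, x) for x in items) for seed in range(num_hashes)]

def merge_minhash(sig_a, sig_b):
    """Signature of the union is the coordinate-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of coordinates on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
print(merge_minhash(minhash(A), minhash(B)) == minhash(A | B))
```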
20
Mergeable summaries
– Random samples
– Sketches
– MinHash
– Heavy hitters
– ε-approximations (quantiles, equi-height histograms)
Difficulty labels on the slide: easy; easy and cute; easy algorithm, analysis requires work
21
Heavy hitters [figure: items 1 to 9]
22
Heavy hitters [figure: items 1 to 9]
23
Heavy hitters [figure: items 1 to 9]
24
Streaming MG analysis
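The analysis on this slide did not survive extraction. As a hedged illustration, the sketch below assumes "MG" refers to the Misra-Gries summary with k counters and checks the standard guarantee: every estimate undercounts by at most (n − n̂)/(k + 1), where n is the stream length and n̂ is the sum of the stored counters. The function names are hypothetical.

```python
from collections import Counter

def mg_update(counters, item, k):
    """Process one stream item with at most k counters."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < k:
        counters[item] = 1
    else:
        # No free counter: decrement everything, drop counters that hit zero.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

def mg_summary(stream, k):
    counters = {}
    for x in stream:
        mg_update(counters, x, k)
    return counters

stream = list("abracadabra_alakazam")
k = 3
summary = mg_summary(stream, k)
true = Counter(stream)
# Each decrement step discards k + 1 units of weight, which gives the bound.
max_error = (len(stream) - sum(summary.values())) / (k + 1)
for item, f in true.items():
    assert f - max_error <= summary.get(item, 0) <= f
print(summary, max_error)
```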
25
Merging two MG summaries [figure: items 1 to 9]
26
Merging two MG summaries [figure: items 1 to 9; error terms labelled "(prior error)" and "(from merge)"]
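The merge procedure itself is only shown as a figure, so this is a sketch of the merge described in the "Mergeable Summaries" paper cited earlier, under assumed names: add the two counter sets, and if more than k counters remain, subtract the (k+1)-th largest value from every counter and drop those that are no longer positive. The subtraction is the "from merge" error term; it adds to the "prior error" already in the inputs.

```python
def merge_mg(c1, c2, k):
    """Merge two Misra-Gries summaries (dicts item -> count) into k counters."""
    merged = dict(c1)
    for item, cnt in c2.items():
        merged[item] = merged.get(item, 0) + cnt
    if len(merged) <= k:
        return merged
    # (k+1)-th largest counter value; at most k counters are strictly larger.
    threshold = sorted(merged.values(), reverse=True)[k]
    return {x: c - threshold for x, c in merged.items() if c > threshold}

left = {"a": 5, "b": 3}
right = {"c": 4, "a": 1}
print(merge_mg(left, right, k=2))
```

In this example the combined counters are a: 6, c: 4, b: 3; the third largest value is 3, so the result keeps a: 3 and c: 1.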
27
SpaceSaving: Another heavy hitter summary
28
Mergeable summaries
– Random samples
– Sketches
– MinHash
– Heavy hitters
– ε-approximations (quantiles, equi-height histograms)
Difficulty labels on the slide: easy; easy and cute; easy algorithm, analysis requires work
29
ε-approximations: a more “uniform” sample
Random sample:
30
Quantiles (order statistics)
31
Quantiles give equi-height histograms
– Automatically adapt to skewed data distributions
– Equi-width histograms (fixed binning) are trivially mergeable but do not adapt to the data distribution
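A small illustration of the contrast above, using exact quantiles on a skewed toy data set (a summary would supply approximate quantiles instead). The helper names and the data are made up for this example, and `equi_width_counts` assumes the data has at least two distinct values.

```python
def equi_height_boundaries(data, buckets):
    """Bucket boundaries at the i/buckets quantiles of the data."""
    xs = sorted(data)
    n = len(xs)
    return [xs[(i * n) // buckets] for i in range(1, buckets)]

def equi_width_counts(data, buckets):
    """Counts per fixed-width bin between min(data) and max(data)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for x in data:
        idx = min(int((x - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts

data = [1, 1, 1, 2, 2, 3, 4, 100]
print(equi_height_boundaries(data, 4))  # boundaries follow the data
print(equi_width_counts(data, 4))       # one outlier stretches the bins
```

With fixed-width bins the single outlier leaves almost all points in the first bin, while the quantile boundaries stay where the data is dense.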
32
Previous quantile summaries
33
Equal-weight merges
[figure: {1, 5, 6, 7, 8} + {2, 3, 4, 9, 10} → {1, 3, 5, 7, 9}]
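The numbers on this slide are consistent with the following procedure, sketched here under assumed names: merge-sort two equal-size summaries, then keep either the odd or the even positions, chosen by a fair coin. Each kept item then represents twice as many input items as before. The `offset` parameter is a hypothetical hook for forcing the coin in tests.

```python
import random
from heapq import merge as sorted_merge

def equal_weight_merge(s1, s2, offset=None, rng=random):
    """Merge two sorted summaries of equal size into one of the same size.

    offset=0 keeps positions 1, 3, 5, ... of the merged order (1-based),
    offset=1 keeps positions 2, 4, 6, ...; by default a fair coin decides.
    """
    assert len(s1) == len(s2)
    combined = list(sorted_merge(s1, s2))
    if offset is None:
        offset = rng.randrange(2)
    return combined[offset::2]

print(equal_weight_merge([1, 5, 6, 7, 8], [2, 3, 4, 9, 10], offset=0))
# [1, 3, 5, 7, 9]
```

The random choice makes the rank error of each merge unbiased, which is what the analysis on the following slides relies on.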
34
Equal-weight merge analysis: Base case
35
Equal-weight merge analysis: Multiple levels
[figure: merge tree with Level i=1, Level i=2, Level i=3, Level i=4]
36
Equal-sized merge analysis: Chernoff bound
Chernoff-Hoeffding: given unbiased variables $Y_j$ such that $|Y_j| \le y_j$:
$\Pr\left[\left|\sum_{1 \le j \le t} Y_j\right| > \alpha\right] \le 2\exp\left(-2\alpha^2 \Big/ \sum_{1 \le j \le t} (2y_j)^2\right)$
Set $\alpha = h 2^m$ for our variables:
– $\dfrac{2\alpha^2}{\sum_i \sum_j (2\max(X_{i,j}))^2} = \dfrac{2(h 2^m)^2}{\sum_i 2^{m-i} \cdot 2^{2i}} = \dfrac{2h^2 2^{2m}}{\sum_i 2^{m+i}} = \dfrac{2h^2}{\sum_i 2^{i-m}} = \dfrac{2h^2}{\sum_i 2^{-i}} \ge h^2$
From the Chernoff bound, the error probability is at most $2\exp(-h^2)$
– Set $h = O(\log^{1/2} \delta^{-1})$ to obtain $1-\delta$ probability of success
[figure: merge tree with Level i=1, Level i=2, Level i=3, Level i=4]
37
Equal-sized merge analysis: finishing up
Chernoff bound ensures absolute error at most $\alpha = h 2^m$
– $m$ is the number of merges $= \log(n/k)$ for summary size $k$
– So error is at most $hn/k$
Set size of each summary $k$ to be $O(h/\varepsilon) = O(\frac{1}{\varepsilon} \log^{1/2} \frac{1}{\delta})$
– Guarantees give $\varepsilon n$ error with probability $1-\delta$ for any one range
There are $O(1/\varepsilon)$ different ranges to consider
– Set $\delta = \Theta(\varepsilon)$ to ensure all ranges are correct with constant probability
– Summary size: $O(\frac{1}{\varepsilon} \log^{1/2} \frac{1}{\varepsilon})$
38
Fully mergeable ε-approximation
Use equal-size merging in a standard logarithmic trick:
[figure: three rows of blocks with weights Wt 32, Wt 16, Wt 8, Wt 4, Wt 2, Wt 1]
Merge two summaries as binary addition
Fully mergeable quantiles, in $O(\frac{1}{\varepsilon} \log n \log^{1/2} \frac{1}{\varepsilon})$
– $n$ = number of items summarized, not known a priori
But can we do better?
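The "binary addition" idea can be sketched as follows, under assumptions of my own about the representation: a summary is a dict mapping a level to one sorted block of k items, where each item at level i stands for 2^i input items. Two blocks at the same level are combined by an equal-weight merge and carried to the next level, exactly like adding two binary numbers. The names `merge_levels` and `_halve` are hypothetical.

```python
import random
from heapq import merge as sorted_merge

def _halve(a, b, rng):
    """Equal-weight merge: sort together, keep odd or even positions."""
    combined = list(sorted_merge(a, b))
    return combined[rng.randrange(2)::2]

def merge_levels(x, y, rng=random):
    """Merge two multi-level summaries (dict: level -> sorted block)."""
    out, carry = {}, None
    top = max(list(x) + list(y), default=-1)
    level = 0
    while level <= top or carry is not None:
        blocks = [b for b in (x.get(level), y.get(level), carry) if b is not None]
        carry = None
        if len(blocks) == 1:
            out[level] = blocks[0]
        elif len(blocks) == 2:
            carry = _halve(blocks[0], blocks[1], rng)
        elif len(blocks) == 3:
            out[level] = blocks[0]
            carry = _halve(blocks[1], blocks[2], rng)
        level += 1
    return out

# Weights 1 + 2 (binary 11) plus weight 1 (binary 01) gives weight 4 (binary 100).
x = {0: [1, 2, 3, 4], 1: [5, 6, 7, 8]}
y = {0: [9, 10, 11, 12]}
print(merge_levels(x, y, random.Random(0)))
```

Only one block per level is ever stored, so a summary of n items holds about log(n/k) blocks, which is where the log n factor in the size bound comes from.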
39
Hybrid summary
Classical result: It’s sufficient to build the summary on a random sample of size $\Theta(1/\varepsilon^2)$
– Problem: Don’t know $n$ in advance
Hybrid structure:
– Keep top $O(\log 1/\varepsilon)$ levels: summary size $O(\frac{1}{\varepsilon} \log^{1.5}(1/\varepsilon))$
– Also keep a “buffer” sample of $O(1/\varepsilon)$ items
– When buffer is “full”, extract points as a sample of lowest weight
[figure: blocks Wt 32, Wt 16, Wt 8, and Buffer]
40
ε-approximations in higher dimensions
ε-approximations generalize to range spaces with bounded VC-dimension
– Generalize the “odd-even” trick to low-discrepancy colorings
– ε-approximation for constant VC-dimension $d$ has size $\tilde{O}(\varepsilon^{-2d/(d+1)})$
41
Other mergeable summaries: ε-kernels
ε-kernels in d-dimensional space approximately preserve the projected extent in any direction
– ε-kernel has size $O(1/\varepsilon^{(d-1)/2})$
– Streaming ε-kernel has size $O(1/\varepsilon^{(d-1)/2} \log(1/\varepsilon))$
– Mergeable ε-kernel has size $O(1/\varepsilon^{(d-1)/2} \log^d n)$
42
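Summary
Summary sizes by setting (columns on the slide: Static, Streaming, Mergeable):
– Heavy hitters: $1/\varepsilon$
– ε-approximation (quantiles), deterministic: static $1/\varepsilon$; streaming $\frac{1}{\varepsilon} \log n$; mergeable $\frac{1}{\varepsilon} \log U$
– ε-approximation (quantiles), randomized: static -; $\frac{1}{\varepsilon} \log^{1.5}(1/\varepsilon)$
– ε-kernel: static $1/\varepsilon^{(d-1)/2}$; streaming $1/\varepsilon^{(d-1)/2} \log(1/\varepsilon)$; mergeable $1/\varepsilon^{(d-1)/2} \log^d n$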
43
Mergeability vs k-party communication
Mergeability is a property of a summary itself
– Makes the summary behave like a simple commutative and associative aggregate
– Is one way to summarize distributed data
– Total communication cost: O(k × summary size)
Can we do better in the k-party communication model?
– Size and/or shape of merging tree known in advance
– Only the final summary is valid
– Resulting summary may not be further merged
44
Random sample
45
Some negative results
46
Heavy hitters
47
Heavy hitters
48
Heavy hitters
49
Huang and Yi, “The Communication Complexity of Distributed ε-Approximations”, FOCS’14
50
Thank you!