Summarizing Distributed Data
Ke Yi, HKUST

Small summaries for BIG data
 • Allow approximate computation with guarantees and small space – save space, time, and communication
 • Tradeoff between error and size

Summarization vs. (Lossy) Compression
Summarization:
 • No need to decompress before making queries
 • Aims at particular properties of the data
 • (Usually) provides guarantees on query results
Compression:
 • Need to decompress before making queries
 • Aims at generic approximation of all data
 • Best-effort approach; does not provide guarantees

Summaries
 • Summaries allow approximate computations:
 – Random sampling
 – Frequent items
 – Sketches (JL transform, AMS, Count-Min, etc.)
 – Quantiles & histograms
 – Geometric coresets
 – …

Large-scale distributed computation
 • Programmers have no control over how things are merged

MapReduce

Dremel

Pregel: Combiners

Sensor networks

“A major technical challenge for big data is to push summarization to edge devices.”
Jagadish et al., “Technical Challenges for Big Data”, Communications of the ACM, 2014

Two models of summary computation
 • Mergeability: summarization behaves like a semigroup operator (see the interface sketch below)
 – Allows arbitrary computation trees (shape and size unknown to the algorithm)
 – Quality remains the same
 – Any intermediate summary is valid
 – Resulting summary can be further merged
 – Generalizes the streaming model
 • Multi-party communication
 – Simultaneous message passing model
 – Message passing model
 – Blackboard model
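To make the semigroup view concrete, here is a minimal interface sketch (illustrative only; the names `Summary`, `update`, `merge`, `query`, and `summarize_tree` are not from the talk): any summary supporting an order-insensitive `merge` can be computed over an arbitrary, unknown computation tree.

```python
from abc import ABC, abstractmethod

class Summary(ABC):
    """Contract for a mergeable summary (sample, sketch, MG counters, ...)."""

    @abstractmethod
    def update(self, item):
        """Streaming insertion of a single item."""

    @abstractmethod
    def merge(self, other: "Summary") -> "Summary":
        """Semigroup operator: combine two summaries of the same type."""

    @abstractmethod
    def query(self, *args):
        """Approximate answer; valid at any intermediate point."""

def summarize_tree(node, make_summary):
    """Summarize distributed data over an arbitrary computation tree.

    Leaves hold raw data; internal nodes only merge child summaries.
    `node` is assumed to expose is_leaf(), data, and children (illustrative)."""
    if node.is_leaf():
        s = make_summary()
        for x in node.data:
            s.update(x)
        return s
    child_summaries = [summarize_tree(c, make_summary) for c in node.children]
    result = child_summaries[0]
    for s in child_summaries[1:]:
        result = result.merge(s)
    return result
```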

Mergeable summaries
 • Sketches
 • Random samples
 • MinHash
 • Heavy hitters
 • ε-approximations (quantiles, equi-height histograms)
(slide labels the items “easy”, “easy and cute”, and “easy algorithm, analysis requires work”)
Agarwal, Cormode, Huang, Phillips, Wei, and Yi, “Mergeable Summaries”, TODS, Nov 2013

Merging random samples
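Only the slide title survives above; as a hedged sketch of one standard way to merge two uniform without-replacement samples (illustrative code, not from the talk): pick each output item from one of the two samples with probability proportional to how much of that sample's original data set is still unaccounted for.

```python
import random

def merge_samples(s1, n1, s2, n2, k):
    """Merge two uniform without-replacement samples of size k.

    s1 is a size-k sample of a data set of size n1, s2 of a data set of
    size n2 (n1, n2 >= k).  Returns a size-k uniform sample of the union."""
    s1, s2 = list(s1), list(s2)
    random.shuffle(s1)
    random.shuffle(s2)
    out = []
    r1, r2 = n1, n2                      # "remaining" weights of the two sets
    while len(out) < k:
        if random.random() < r1 / (r1 + r2):
            out.append(s1.pop())
            r1 -= 1
        else:
            out.append(s2.pop())
            r2 -= 1
    return out
```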

Merging sketches (slide figure: “return the min”)
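Since Count-Min is one of the sketches listed earlier, a minimal illustrative Count-Min implementation (not the talk's code; the salted-hash construction is a simplification of a proper hash family) shows why sketch merging is easy: counter arrays add entrywise, and a point query still returns the min over its hashed counters.

```python
import hashlib

class CountMin:
    """Minimal Count-Min sketch: d rows of w counters."""

    def __init__(self, w=1000, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, x):
        h = hashlib.blake2b(f"{row}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") % self.w

    def update(self, x, count=1):
        for r in range(self.d):
            self.table[r][self._bucket(r, x)] += count

    def query(self, x):
        # estimate = min over the d counters that x hashes to
        return min(self.table[r][self._bucket(r, x)] for r in range(self.d))

    def merge(self, other):
        # sketches are linear: merging = entrywise addition of counters
        assert (self.w, self.d) == (other.w, other.d)
        for r in range(self.d):
            for c in range(self.w):
                self.table[r][c] += other.table[r][c]
        return self
```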

MinHash
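A hedged sketch of why MinHash merges trivially (illustrative code, not from the talk): each signature entry is a minimum over hash values, so the signature of a union is the componentwise min of the two signatures, and the merged signature still estimates Jaccard similarity.

```python
import hashlib

def _h(seed, x):
    d = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def minhash(items, k=128):
    """k-value MinHash signature of a non-empty set of items."""
    return [min(_h(seed, x) for x in items) for seed in range(k)]

def merge_minhash(sig_a, sig_b):
    """Signature of the union of the underlying sets: componentwise min."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of positions where the minima coincide."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```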

Mergeable summaries
 • Random samples
 • Sketches
 • MinHash
 • Heavy hitters
 • ε-approximations (quantiles, equi-height histograms)

Heavy hitters

Streaming MG (Misra–Gries) analysis
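For reference, a minimal sketch of the streaming MG update (assuming MG is the Misra–Gries summary with at most k counters; illustrative code, not from the talk):

```python
def mg_update(counters, x, k):
    """Misra–Gries update: maintain at most k (item, count) pairs.

    counters: dict item -> count.  Each stored count underestimates the
    item's true frequency by at most n/(k+1) over a stream of length n."""
    if x in counters:
        counters[x] += 1
    elif len(counters) < k:
        counters[x] = 1
    else:
        # decrement all k counters (and, implicitly, the counter of x)
        for y in list(counters):
            counters[y] -= 1
            if counters[y] == 0:
                del counters[y]
    return counters
```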

Merging two MG summaries – error = (prior error) + (from merge)
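A hedged sketch of the merge step as described in the Mergeable Summaries paper (illustrative code): add the two counter sets, then subtract the (k+1)-st largest count from every counter and drop the non-positive ones; that subtraction is the "(from merge)" error term.

```python
def mg_merge(c1, c2, k):
    """Merge two Misra–Gries summaries with at most k counters each."""
    merged = dict(c1)
    for item, cnt in c2.items():
        merged[item] = merged.get(item, 0) + cnt
    if len(merged) <= k:
        return merged
    # subtract the (k+1)-st largest count, keep only positive counters
    counts = sorted(merged.values(), reverse=True)
    cut = counts[k]                              # (k+1)-st largest
    return {item: cnt - cut for item, cnt in merged.items() if cnt - cut > 0}
```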

SpaceSaving: Another heavy hitter summary
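Only the title survives here; for contrast with MG, a minimal sketch of the usual SpaceSaving update rule (illustrative code, not from the talk): an unmonitored item takes over the smallest counter and inherits its count, so stored counts are overestimates.

```python
def spacesaving_update(counters, x, k):
    """SpaceSaving update with at most k (item, count) pairs."""
    if x in counters:
        counters[x] += 1
    elif len(counters) < k:
        counters[x] = 1
    else:
        # replace the item with the minimum count; inherit that count (+1)
        victim = min(counters, key=counters.get)
        counters[x] = counters.pop(victim) + 1
    return counters
```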

Mergeable summaries
 • Random samples
 • Sketches
 • MinHash
 • Heavy hitters
 • ε-approximations (quantiles, equi-height histograms)

ε-approximations: a more “uniform” sample (slide figure compares with a random sample)

Quantiles (order statistics)

Quantiles give an equi-height histogram
 • Automatically adapts to skewed data distributions
 • Equi-width histograms (fixed binning) are trivially mergeable but do not adapt to the data distribution
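A small worked illustration (not from the slides): the boundaries of a b-bucket equi-height histogram are exactly the i/b-quantiles, so any quantile summary immediately yields such a histogram; with an ε-approximate summary each bucket count is off by about εn.

```python
def equi_height_boundaries(sorted_data, b):
    """Boundaries of a b-bucket equi-height histogram from exact quantiles.

    With an ε-approximate quantile summary, replace the rank lookup by a
    summary query; each bucket's count is then off by at most about ε·n."""
    n = len(sorted_data)
    return [sorted_data[min(n - 1, (i * n) // b)] for i in range(1, b)]

# Example: skewed data gets narrow buckets where it is dense.
data = sorted([1] * 50 + list(range(2, 52)))
print(equi_height_boundaries(data, 4))   # -> [1, 2, 27]
```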

Previous quantile summaries

Equal-weight merges
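A hedged sketch of the equal-weight merge as described in the Mergeable Summaries paper (illustrative code): merge the two sorted summaries and keep every other element, flipping a fair coin to decide between odd and even positions so the rank error stays unbiased.

```python
import random

def equal_weight_merge(q1, q2):
    """Merge two equal-weight quantile summaries of the same size k.

    Each summary is a sorted list of k representative points.  The result
    again has k points and represents twice the weight; the random offset
    is the randomized "odd/even" trick."""
    merged = sorted(q1 + q2)
    offset = random.randint(0, 1)        # fair coin: keep odd or even positions
    return merged[offset::2]
```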

Equal-weight merge analysis: Base case

Equal-weight merge analysis: Multiple levels (figure: merge tree with levels i = 1, 2, 3, 4)

Equal-sized merge analysis: Chernoff bound
 • Chernoff–Hoeffding: given independent, zero-mean variables Y_j with |Y_j| ≤ y_j:
   Pr[ |Σ_{1≤j≤t} Y_j| > α ] ≤ 2·exp( −2α² / Σ_{1≤j≤t} (2y_j)² )
 • Set α = h·2^m for our variables:
   2α² / (Σ_i Σ_j (2·max X_{i,j})²) = 2(h·2^m)² / (Σ_i 2^(m−i)·2^(2i))
   = 2h²·2^(2m) / Σ_i 2^(m+i) = 2h² / Σ_i 2^(i−m) = 2h² / Σ_i 2^(−i) ≥ h²
 • From the Chernoff bound, the error probability is at most 2·exp(−2h²)
 – Set h = O(log^(1/2)(1/δ)) to obtain success probability 1 − δ

Equal-sized merge analysis: finishing up
 • Chernoff bound ensures absolute error at most α = h·2^m
 – m is the number of merges = log(n/k) for summary size k
 – So the error is at most hn/k
 • Set the size of each summary k to be O(h/ε) = O((1/ε) log^(1/2)(1/δ))
 – Guarantees εn error with probability 1 − δ for any one range
 • There are O(1/ε) different ranges to consider
 – Set δ = Θ(ε) to ensure all ranges are correct with constant probability
 – Summary size: O((1/ε) log^(1/2)(1/ε))

Fully mergeable ε-approximation
 • Use equal-size merging in a standard logarithmic trick
 • Merge two summaries as in binary addition
 • Fully mergeable quantiles, in O((1/ε) log n · log^(1/2)(1/ε))
 – n = number of items summarized, not known a priori
 • But can we do better?
(figure: summaries of weights 32, 16, 8, 4, 2, 1 combined like binary counters)

Hybrid summary
 • Classical result: it’s sufficient to build the summary on a random sample of size Θ(1/ε²)
 – Problem: don’t know n in advance
 • Hybrid structure:
 – Keep the top O(log 1/ε) levels: summary size O((1/ε) log^(1.5)(1/ε))
 – Also keep a “buffer” sample of O(1/ε) items
 – When the buffer is “full”, extract points as a sample of lowest weight
(figure: summaries of weights 32, 16, 8, plus a buffer)

ε-approximations in higher dimensions
 • ε-approximations generalize to range spaces with bounded VC-dimension
 – Generalize the “odd–even” trick to low-discrepancy colorings
 – An ε-approximation for constant VC-dimension d has size Õ(ε^(−2d/(d+1)))

Other mergeable summaries: ε-kernels
 • ε-kernels in d-dimensional space approximately preserve the projected extent in any direction
 – An ε-kernel has size O((1/ε)^((d−1)/2))
 – A streaming ε-kernel has size O((1/ε)^((d−1)/2) log(1/ε))
 – A mergeable ε-kernel has size O((1/ε)^((d−1)/2) log^d n)

Summary (sizes: Static / Streaming / Mergeable)
 • Heavy hitters: 1/ε / 1/ε / 1/ε
 • ε-approximation (quantiles), deterministic: 1/ε / (1/ε) log n / (1/ε) log U
 • ε-approximation (quantiles), randomized: – / – / (1/ε) log^(1.5)(1/ε)
 • ε-kernel: (1/ε)^((d−1)/2) / (1/ε)^((d−1)/2) log(1/ε) / (1/ε)^((d−1)/2) log^d n

Mergeability vs. k-party communication
 • Mergeability is a property of the summary itself
 – Makes the summary behave like a simple commutative and associative aggregate
 – Is one way to summarize distributed data
 – Total communication cost: O(k · summary size)
 • Can we do better in the k-party communication model?
 – Size and/or shape of the merging tree known in advance
 – Only the final summary is valid
 – Resulting summary may not be further merged

Random sample

Some negative results

Heavy hitters

Huang and Yi, “The Communication Complexity of Distributed ε-Approximations”, FOCS’14

Thank you!