Approximation and Load Shedding for QoS in DSMS*


Approximation and Load Shedding for QoS in DSMS*
Carlo Zaniolo, CSD—UCLA
________________________________________
* Notes based on a VLDB'02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

DSMS Approximation and Load Shedding
A DSMS must provide online responses over boundless and bursty data streams. How?
- By using approximations and synopses, and even shedding load when arrival rates become unsustainable.
- Approximations and synopses are often used under normal load too.
- Shedding is used for bursty streams and overload situations.

Synopses and Approximation
Synopsis: a bounded-memory approximation of the stream history—a succinct summary of old stream tuples.
Examples:
- Sliding windows
- Samples
- Histograms
- Wavelet representations
- Sketching techniques
- Approximate algorithms, e.g., median, quantiles, …
- Fast and light data mining algorithms

Synopses
- Windows: logical, physical (already discussed)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction & maintenance
- Sketches

Sampling: Basics
Idea: a small random sample S of the data often represents all the data well. For a fast approximate answer, apply a "modified" query to S.
Example: select agg from R where odd(R.e)   (n = 12)
- If agg is avg, return the average of the odd elements in S.
- If agg is count, return n times the fraction of elements of S that are odd (i.e., the average over all e in S of 1 if e is odd, 0 if e is even, scaled by n).
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
avg answer: (9 + 5 + 1) / 3 = 5
count answer: 3 * 12/4 = 9
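A minimal Python sketch of these two sample-based estimators, using the stream, sample, and scaling factor from the slide (variable names are illustrative, not from the original):

```python
# Sample-based approximate aggregation (illustrative sketch).
stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # full data, n = 12
sample = [9, 5, 1, 8]                            # random sample S of size 4
n = len(stream)

odd_in_sample = [e for e in sample if e % 2 == 1]

# avg over odd elements: estimate directly from the sample
approx_avg = sum(odd_in_sample) / len(odd_in_sample)     # (9+5+1)/3 = 5.0

# count of odd elements: scale the sample count by n / |S|
approx_count = len(odd_in_sample) * n / len(sample)      # 3 * 12/4 = 9.0

print(approx_avg, approx_count)
```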

Sampling—some background
- Reservoir sampling [Vit85]: maintains a sample S of a pre-assigned size M over a stream of arbitrary size (a minimal sketch follows below).
- Concise sampling [GM98]: duplicates in the sample S are stored as <value, count> pairs, potentially boosting the effective sample size.
- Window sampling [BDM02, BOZ09]: maintains a sample S of a pre-assigned size M over a window on the stream—reservoir sampling with expiring tuples. More later …
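A minimal sketch of classic reservoir sampling [Vit85], assuming a fixed reservoir size M; this simple form does not handle expiring tuples (window sampling):

```python
import random

def reservoir_sample(stream, M):
    """Maintain a uniform random sample of size M over a stream of unknown length."""
    reservoir = []
    for t, item in enumerate(stream):      # t items have been seen before this one
        if t < M:
            reservoir.append(item)         # fill the reservoir first
        else:
            j = random.randint(0, t)       # new item kept with probability M/(t+1)
            if j < M:
                reservoir[j] = item        # evict a uniformly chosen resident
    return reservoir

print(reservoir_sample(range(10000), 5))
```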

Probabilistic Guarantees
For all approximation methods we need probabilistic guarantees.
Example: the actual answer is within 5 ± 1 with probability ≥ 0.9.
Tail inequalities give probabilistic bounds on the returned answer:
- Markov inequality
- Chebyshev's inequality
- Hoeffding's inequality
- Chernoff bound
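For reference, the standard forms of these tail inequalities (as usually stated; they are not spelled out on the original slide):

```latex
\begin{align*}
&\text{Markov } (X \ge 0,\ a > 0): && \Pr[X \ge a] \le \frac{\mathrm{E}[X]}{a} \\
&\text{Chebyshev:} && \Pr\big[\,|X - \mathrm{E}[X]| \ge a\,\big] \le \frac{\mathrm{Var}[X]}{a^2} \\
&\text{Hoeffding } (X_i \in [a_i,b_i] \text{ indep.},\ S = \textstyle\sum_i X_i): && \Pr\big[\,|S - \mathrm{E}[S]| \ge t\,\big] \le 2\exp\!\Big(\tfrac{-2t^2}{\sum_i (b_i-a_i)^2}\Big) \\
&\text{Chernoff } (\text{indep. Bernoulli},\ \mu = \mathrm{E}[S],\ 0 < \delta \le 1): && \Pr\big[\,|S - \mu| \ge \delta\mu\,\big] \le 2\exp\!\big(-\mu\delta^2/3\big)
\end{align*}
```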

Load Shedding & Sampling
Given a complex query graph, how do we place and manage the sampling process? [BDM04] [LawZ02] More about this later.

Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms
- Wavelets: Haar wavelets
- Sketches

Histograms
Histograms approximate the frequency distribution of element values in a stream.
A histogram (typically) consists of:
- a partitioning of the element domain values into buckets, and
- a count per bucket B (the number of elements falling in B).
Widely used in DBMS query optimization. Many types have been proposed, e.g.:
- Equi-depth histograms: select buckets such that counts per bucket are equal
- V-optimal histograms: select buckets to minimize the frequency variance within buckets
- Wavelet-based histograms

Types of Histograms: Equi-Depth Histograms
Idea: select buckets such that the counts per bucket are equal.
[Figure: bucket counts over domain values 1–20, with all buckets holding the same count.]

Types of Histograms: V-Optimal Histograms
V-optimal histograms [IP95] [JKM98]. Idea: select buckets to minimize the frequency variance within buckets.
[Figure: bucket counts over domain values 1–20, with boundaries chosen to minimize within-bucket variance.]
Minimize: Σ_{j=1..J} n_j · V_j, where the histogram consists of J bins (buckets), n_j is the number of items in the j-th bin, and V_j is the variance of the values associated with the items in the j-th bin.

Equi-Depth Histogram Construction
For a histogram with b buckets, compute the elements with rank n/b, 2n/b, ..., (b-1)n/b.
Example (n = 12, b = 4):
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
After sort: 1 1 2 3 4 5 5 6 7 8 9 9
rank = 3 (.25-quantile), rank = 6 (.5-quantile), rank = 9 (.75-quantile)
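A minimal sketch of this offline construction (sort, then pick the rank-n/b, 2n/b, ... elements as bucket boundaries); one-pass streaming variants are the subject of the quantile papers cited in the references:

```python
def equi_depth_boundaries(data, b):
    """Bucket boundaries for an equi-depth histogram with b buckets (offline)."""
    s = sorted(data)
    n = len(s)
    # elements with rank n/b, 2n/b, ..., (b-1)n/b  (1-based ranks, as on the slide)
    return [s[(k * n) // b - 1] for k in range(1, b)]

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(equi_depth_boundaries(stream, 4))   # -> [2, 5, 7]: the .25, .5, .75 quantiles
```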

Answering Queries with Histograms [IP99]
(Implicitly) map the histogram back to an approximate relation, and apply the query to that approximate relation.
Example: select count(*) from R where 4 <= R.e <= 15
Within each bucket, the count is assumed to be spread evenly among the bucket's domain values: fully covered buckets contribute their whole count, partially covered buckets a proportional fraction.
[Figure: equi-depth histogram over domain values 1–20 with the range 4 ≤ R.e ≤ 15 highlighted; estimated answer: 3.5 × the per-bucket count.]
For equi-depth histograms, the maximum error is bounded by the counts of the (at most two) partially covered end buckets.
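A minimal sketch of this estimation rule for an equi-depth histogram over an integer domain. The bucket layout below is a toy example, not necessarily the one in the slide's figure; each bucket stores its domain range [lo, hi] and its count:

```python
def estimate_range_count(buckets, q_lo, q_hi):
    """Estimate count(*) for q_lo <= e <= q_hi from a histogram.

    buckets: list of (lo, hi, count) over an integer domain, non-overlapping.
    Counts are assumed spread evenly over each bucket's domain values.
    """
    total = 0.0
    for lo, hi, count in buckets:
        overlap = min(hi, q_hi) - max(lo, q_lo) + 1     # domain values in common
        if overlap > 0:
            total += count * overlap / (hi - lo + 1)    # proportional contribution
    return total

# 4 equi-depth buckets over domain 1..20, 3 elements each (toy example)
hist = [(1, 5, 3), (6, 10, 3), (11, 15, 3), (16, 20, 3)]
print(estimate_range_count(hist, 4, 15))   # 3*2/5 + 3 + 3 = 7.2
```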

Approximate Algorithms
- Quantiles using samples
- Quantiles from synopses
- One-pass algorithms for approximate samples …
Much work in this area; e.g., see [MZ11].

Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms
- Wavelets: Haar-wavelet histograms
- Sketches

One-Dimensional Haar Wavelets
Wavelets: a mathematical tool for hierarchical decomposition of functions/signals.
Haar wavelets: the simplest wavelet basis, easy to understand and implement—recursive pairwise averaging and differencing at different resolutions.

Resolution | Averages                 | Detail coefficients
3          | [2, 2, 0, 2, 3, 5, 4, 4] | ----
2          | [2, 1, 4, 4]             | [0, -1, -1, 0]
1          | [1.5, 4]                 | [0.5, 0]
0          | [2.75]                   | [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
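A minimal sketch of this pairwise averaging and differencing (detail = (left − right) / 2), reproducing the coefficients on the slide:

```python
def haar_decompose(data):
    """Full Haar wavelet decomposition of a signal whose length is a power of 2."""
    coeffs = []
    avgs = list(data)
    while len(avgs) > 1:
        pairs = zip(avgs[0::2], avgs[1::2])
        avgs, details = zip(*[((a + b) / 2, (a - b) / 2) for a, b in pairs])
        coeffs = list(details) + coeffs    # coarser-level details go to the front
        avgs = list(avgs)
    return avgs + coeffs                   # [overall average, details coarse -> fine]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```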

Haar Wavelet Coefficients
Hierarchical decomposition structure (a.k.a. "error tree") and coefficient "supports".
[Figure: error tree with the coefficients 2.75, -1.25, 0.5, 0, 0, -1, -1, 0 as internal nodes over the original frequency distribution 2 2 0 2 3 5 4 4; the +/− signs mark each coefficient's support.]

Compressed Wavelet Representations
Key idea: use a compact subset of Haar/linear wavelet coefficients to approximate the frequency distribution.
Steps:
- Compute the cumulative frequency distribution C
- Compute the linear wavelet transform of C
- Greedy heuristic methods: retain coefficients leading to a large error reduction, throw away coefficients that give only a small increase in error (a simple baseline is sketched below)
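A common baseline (not the specific greedy heuristics referenced on the slide) simply retains the k coefficients of largest magnitude and zeroes out the rest; a minimal sketch on the coefficients from the earlier example:

```python
def threshold_coefficients(coeffs, k):
    """Keep the k largest-magnitude wavelet coefficients, zero out the rest."""
    keep = set(sorted(range(len(coeffs)),
                      key=lambda i: abs(coeffs[i]), reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(threshold_coefficients(coeffs, 4))
# -> [2.75, -1.25, 0.0, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Note that for a true L2-optimal selection the Haar coefficients should first be normalized by level; the unnormalized version above is only an illustration of the thresholding idea.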

Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction & maintenance
- Sketches (next)

Sketches
Conventional data summaries fall short:
- Hard to count distinct items by sampling: infrequent ones will be missed
- Samples (e.g., obtained via reservoir sampling) perform poorly for joins
- Multi-dimensional histograms/wavelets: construction requires multiple passes over the data
Different approach: randomized sketch synopses
- Only logarithmic space
- Probabilistic guarantees on the quality of the approximate answer
- Can handle extreme cases

Synopses structures: sketches
- A synopsis structure that takes advantage of high data volumes
- Provides an approximate result with probabilistic bounds
- Based on random projections onto smaller spaces (hash functions)
- Many sketch structures exist, usually dedicated to a specialized task

Synopses structures: sketches
E.g., a hash-based method: COUNT (Flajolet 85) [FM85]
Goal: estimate the number N of distinct values in a stream (for large N), e.g., the number of distinct IP addresses going through a router.
Sketch structure:
- SK: L bits, initialized to 0
- H: a hash function transforming an element of the stream into L bits
- H distributes the elements of the stream uniformly over the 2^L possibilities
[Figure: an IP address such as 18.6.7.1 hashed into an L-bit string.]

Synopses structures: a count-distinct method
Maintenance and update of SK. For each new element e:
- Compute H(e)
- Select the position of the rightmost 1 in H(e)
- Set the bit of SK at that position to 1, so SK remembers every position hit so far
[Figure: SK before the update, H(18.6.7.1) with its rightmost 1 marked, and the new SK after the update.]

A count-distinct method: the result
Let R (0 … L-1) be the position of the leftmost 0 in SK. Then
E(R) = log2(φ·N) with φ = 0.77351…, and σ(R) ≈ 1.12,
so N can be estimated as 2^R / φ.
Intuition: after N distinct elements have been seen, we expect
- SK[0] to be forced to 1 by about N/2 of them
- SK[1] to be forced to 1 by about N/4 of them
- SK[k] to be forced to 1 by about N/2^(k+1) of them
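A minimal single-hash sketch of this count-distinct estimator. In practice R is averaged over many independent hash functions as in [FM85]; the hash choice and constants below are illustrative:

```python
import hashlib

L = 32
PHI = 0.77351

def rho(x):
    """Position of the rightmost 1 bit in x (0-based); L-1 if x == 0."""
    return (x & -x).bit_length() - 1 if x else L - 1

def fm_estimate(stream):
    sk = 0
    for e in stream:
        h = int(hashlib.sha1(str(e).encode()).hexdigest(), 16) & ((1 << L) - 1)
        sk |= 1 << rho(h)              # set the bit hit by this element
    r = 0
    while sk & (1 << r):               # leftmost 0 = lowest unset bit position
        r += 1
    return (2 ** r) / PHI

# roughly 50000 distinct "IP addresses"; a single hash gives only a coarse estimate
print(fm_estimate(f"10.0.{i // 256}.{i % 256}" for i in range(50000)))
```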

Linear-Projection Sketches (a.k.a. AMS)

Linear-Projection Sketches (a.k.a. AMS)
Goal: build a small-space summary for the distribution vector f(i) (i = 1, ..., N) seen as a stream of i-values.
Basic construct: a randomized linear projection of f() = the inner/dot product ⟨f, ξ⟩ = Σ_i f(i)·ξ_i, where ξ is a vector of random values from an appropriate distribution.
- Simple to compute over the stream: add ξ_i whenever the i-th value is seen
- Tunable probabilistic guarantees on the approximation error
[Figure: frequency vector f(1) ... f(5) built from the data stream 3, 1, 2, 4, 2, 3, 5, ...]

Estimating the Size of Binary Joins
Problem: compute the answer to the query COUNT(R ⋈_A S).
Example (domain of A = {1, 2, 3, 4}):
Data stream R.A: 4 1 2 4 1 4   →  frequencies f_R = (2, 1, 0, 3)
Data stream S.A: 3 1 2 4 2 4   →  frequencies f_S = (1, 2, 1, 2)
COUNT(R ⋈_A S) = Σ_i f_R(i)·f_S(i) = 2 + 2 + 0 + 6 = 10
Exact solution: too expensive, requires O(N) space, where N = sizeof(domain(A)).

Basic AMS Sketching Technique [AMS96]
Key intuition: use randomized linear projections of f() to define a random variable X such that
- X is easily computed over the stream (in small space)
- E[X] = COUNT(R ⋈_A S)
- Var[X] is small
Basic idea: define a family of 4-wise independent {-1, +1} random variables {ξ_i : i = 1, ..., N}:
- Pr[ξ_i = +1] = Pr[ξ_i = -1] = 1/2, so the expected value of each is E[ξ_i] = 0
- The variables are 4-wise independent: the expected value of the product of any 4 distinct ξ_i's is 0
- The variables can be generated from a pseudo-random generator using only O(log N) space (for seeding)!
Probabilistic error guarantees (e.g., the actual answer is 10 ± 1 with probability 0.9)

AMS Sketch Construction
Let X = X_R · X_S be the estimate of the COUNT query.
Compute the random variables X_R = Σ_i f_R(i)·ξ_i and X_S = Σ_i f_S(i)·ξ_i.
Simply add ξ_i to X_R whenever the i-th value is observed in the R.A stream (and likewise for X_S and S.A).
Example:
Data stream R.A: 4 1 2 4 1 4   →  f_R = (2, 1, 0, 3), so X_R = 2ξ_1 + ξ_2 + 3ξ_4
Data stream S.A: 3 1 2 4 2 4   →  f_S = (1, 2, 1, 2), so X_S = ξ_1 + 2ξ_2 + ξ_3 + 2ξ_4
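A minimal sketch of AMS join-size estimation on the slide's example. For simplicity it draws fully random ±1 variables per domain value instead of a 4-wise independent pseudo-random family, and applies the median-of-averages boosting mentioned on the next slide:

```python
import random
from statistics import median

def ams_join_size(stream_r, stream_s, domain, copies=9, per_avg=50):
    """Estimate sum_i f_R(i)*f_S(i) as a median of averages of X_R * X_S."""
    averages = []
    for _ in range(copies):
        products = []
        for _ in range(per_avg):
            xi = {v: random.choice((-1, 1)) for v in domain}   # one random projection
            x_r = sum(xi[v] for v in stream_r)                 # X_R = sum_i f_R(i)*xi_i
            x_s = sum(xi[v] for v in stream_s)                 # X_S = sum_i f_S(i)*xi_i
            products.append(x_r * x_s)                         # unbiased estimate of COUNT
        averages.append(sum(products) / per_avg)               # average reduces variance
    return median(averages)                                    # median boosts confidence

R = [4, 1, 2, 4, 1, 4]
S = [3, 1, 2, 4, 2, 4]
print(ams_join_size(R, S, domain=range(1, 5)))   # close to the exact answer, 10
```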

Sketches: Applications
Because of the four-wise independence, the product X_R·X_S is an unbiased estimate of the join count; thus it can be used for semantic load shedding.
In practice, accuracy is improved by multiple independent runs of the process: take the average within groups of runs, and then the median of those averages.
Many special-purpose sketch techniques have been proposed for different applications. Here we have seen (i) estimating the number of distinct IP addresses and (ii) estimating the size of equi-joins.

Dropping Tuples (for load shedding)
- Random load shedding: tuples are dropped without paying attention to their actual values.
- Semantic load shedding: based on tuple values—some tuple values contribute more to the utility of the result (are more useful) than others.
Example: window joins on streams, with a one-hour window on each stream. What do we do when there is insufficient memory to keep the entire state needed to produce the exact result of the sliding-window join?

Load Shedding for Window Joins over Multiple Data Streams
Compute continuous sliding-window joins between r streams S1, …, Sr with windows W1, …, Wr.
[Figure: a join operator over the windows W1, …, Wr of streams S1, …, Sr, producing its output within a memory budget M.]
1. A simple solution is to drop the oldest tuples.
2. Another is to drop the tuples with the least productivity, which can be estimated by sketches (a toy eviction sketch follows).
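A toy sketch of such a productivity-based drop policy, assuming a per-window memory budget M and an externally supplied productivity estimate (e.g., the frequency of the join key in the other streams, as a sketch might provide). Names and structure are illustrative, not from [LawZ06]:

```python
import heapq

class ShedWindow:
    """Keep at most M tuples; when full, evict the lowest-productivity tuple."""

    def __init__(self, M, productivity):
        self.M = M
        self.productivity = productivity   # callable: tuple -> estimated output it will produce
        self.heap = []                     # min-heap of (productivity, seq, tuple)
        self.seq = 0

    def insert(self, t):
        heapq.heappush(self.heap, (self.productivity(t), self.seq, t))
        self.seq += 1
        if len(self.heap) > self.M:
            heapq.heappop(self.heap)       # shed the least productive tuple

    def tuples(self):
        return [t for _, _, t in self.heap]

# e.g., productivity estimated as the frequency of the join key in the other stream
other_freq = {1: 5, 2: 1, 3: 0, 4: 7}
w = ShedWindow(M=3, productivity=lambda t: other_freq.get(t, 0))
for t in [4, 1, 2, 4, 1, 4, 3]:
    w.insert(t)
print(w.tuples())                          # only high-productivity tuples survive
```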

Which Tuples Should Be Dropped?
It depends on the objective:
- Max-subset: produce the largest possible subset of the exact join result.
- Unbiased sample: generate a result that is an unbiased sample of the actual join result—this is what is needed to estimate aggregates.
Dropping tuples at random accomplishes neither objective, but sketches can be very effective.

Three-Relation Join Experiments [LawZ06]
Query: a three-relation sliding-window join (query graph shown in the original slide).
Synthetic data sets: 10 dense regions with different Zipfian factors (data distribution plot omitted).
Several techniques tested.

Experiments and Results
- Rand: random drop—worst.
- MSketch: drop the lowest-productivity tuples, estimated using sketches on multi-joins—best for max-subset.
- BJoin: convert to a multi-binary join—2nd best.
- Aging: drop the oldest tuple. ‡
- MSketch*Aging: scale each tuple's priority by its remaining lifetime. ‡
- MSketch_RS: drop the tuples with the largest fraction of their output already produced—good for random sampling.
‡ Poor performance: this shows that remaining lifetime is not important when optimizing load shedding for max-subset.

References—Sketches, Histograms, Quantiles
[AMS96] N. Alon, Y. Matias, M. Szegedy. "The Space Complexity of Approximating the Frequency Moments". ACM STOC, 1996.
[AGM99] N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy. "Tracking Join and Self-Join Sizes in Limited Storage". ACM PODS, 1999.
[CMN98] S. Chaudhuri, R. Motwani, V. Narasayya. "Random Sampling for Histogram Construction: How Much is Enough?". ACM SIGMOD, 1998.
[DGG02] A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. "Processing Complex Aggregate Queries over Data Streams". ACM SIGMOD, 2002.
[FM85] P. Flajolet, G.N. Martin. "Probabilistic Counting Algorithms for Data Base Applications". JCSS 31(2), 1985.
[Gang07] S. Ganguly. "Counting Distinct Items over Update Streams". Theor. Comput. Sci. 378(3): 211-222, 2007.
[GGI02] A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss. "Fast, Small-Space Algorithms for Approximate Histogram Maintenance". ACM STOC, 2002.
[GK01] M. Greenwald, S. Khanna. "Space-Efficient Online Computation of Quantile Summaries". ACM SIGMOD, 2001.
[GKM01] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. "Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries". VLDB, 2001.
[GKM02] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. "How to Summarize the Universe: Dynamic Maintenance of Quantiles". VLDB, 2002.
[GKS01b] S. Guha, N. Koudas, K. Shim. "Data Streams and Histograms". ACM STOC, 2001.
[GMP97] P.B. Gibbons, Y. Matias, V. Poosala. "Fast Incremental Maintenance of Approximate Histograms". VLDB, 1997.

References—Sketches, Histograms … (cont.)
[IKM00] P. Indyk, N. Koudas, S. Muthukrishnan. "Identifying Representative Trends in Massive Time Series Data Sets Using Sketches". VLDB, 2000.
[IP99] Y.E. Ioannidis, V. Poosala. "Histogram-Based Approximation of Set-Valued Query Answers". VLDB, 1999.
[JKM98] H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, T. Suel. "Optimal Histograms with Quality Guarantees". VLDB, 1998.
[MRL98] G.S. Manku, S. Rajagopalan, B.G. Lindsay. "Approximate Medians and Other Quantiles in One Pass and with Limited Memory". ACM SIGMOD, 1998.
[MRL99] G.S. Manku, S. Rajagopalan, B.G. Lindsay. "Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets". ACM SIGMOD, 1999.
[MVW00] Y. Matias, J.S. Vitter, M. Wang. "Dynamic Maintenance of Wavelet-Based Histograms". VLDB, 2000.
[LawZ06] Y.-N. Law, C. Zaniolo. "Load Shedding for Window Joins on Multiple Data Streams". ICDE Workshops, 2007: 674-683.
[PIH96] V. Poosala, Y. Ioannidis, P. Haas, E. Shekita. "Improved Histograms for Selectivity Estimation of Range Predicates". ACM SIGMOD, 1996.
[PSC84] G. Piatetsky-Shapiro, C. Connell. "Accurate Estimation of the Number of Tuples Satisfying a Condition". ACM SIGMOD, 1984.
[TGI02] N. Thaper, S. Guha, P. Indyk, N. Koudas. "Dynamic Multidimensional Histograms". ACM SIGMOD, 2002.
[MZ11] H. Mousavi, C. Zaniolo. "Fast and Accurate Computation of Equi-Depth Histograms over Data Streams". EDBT/ICDT, 2011.