1 Approximation and Load Shedding for QoS in DSMS*
CS240B Notes
By Carlo Zaniolo, CSD, UCLA
* Notes based on a VLDB'02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

2 Synopses and Approximation
- Synopsis: a bounded-memory approximation of the stream history
  - Succinct summary of old stream tuples
  - Like indexes or materialized views, but the base data is unavailable
  - Examples
    - Sliding windows
    - Samples
    - Histograms
    - Wavelet representations
    - Sketching techniques
- Approximate algorithms: e.g., median, quantiles, ...
- Fast and light data mining algorithms

3 Overview of Stream Synopses
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance

4 Sampling: Basics [Garofalakis, Gehrke, Rastogi, VLDB'02]
Idea: a small random sample S of the data often represents all the data well
- For a fast approximate answer, apply a "modified" query to S
- Example: select agg from R where odd(R.e), over a stream of n = 12 elements
  - If agg is avg, return the average of the odd elements in S
  - If agg is count, return n times the average, over all elements e in S, of 1 if e is odd and 0 if e is even
- Unbiased: for expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer
- Worked example
  Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
  Sample S: 9 5 1 8
  avg answer: 5
  count answer: 12 * 3/4 = 9
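
The two estimators on this slide can be sketched in a few lines. This is a minimal illustration, not code from the tutorial; the function names `estimate_avg` and `estimate_count` are hypothetical, and the sample is taken as given rather than drawn at random.

```python
def estimate_avg(sample, pred):
    """Average of the sample elements that satisfy pred (None if there are none)."""
    hits = [e for e in sample if pred(e)]
    return sum(hits) / len(hits) if hits else None

def estimate_count(sample, pred, n):
    """Scale the fraction of sample elements satisfying pred up to the stream size n."""
    return n * sum(1 for e in sample if pred(e)) / len(sample)

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = [9, 5, 1, 8]                      # the sample S from the slide
is_odd = lambda e: e % 2 == 1

print(estimate_avg(sample, is_odd))                   # 5.0
print(estimate_count(sample, is_odd, len(stream)))    # 9.0
```

For comparison, the exact answers on the full stream are avg = 5 and count = 8, so the count estimate is off by one for this particular sample.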

5 Probabilistic Guarantees
- Example: the actual answer is within 5 ± 1 with probability ≥ 0.9
- Use tail inequalities to give probabilistic bounds on the returned answer
  - Markov inequality
  - Chebyshev's inequality
  - Hoeffding's inequality
  - Chernoff bound
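
As one concrete instance, Hoeffding's inequality bounds the error of a sample mean: for m independent draws with values in [lo, hi], P(|sample mean - true mean| >= eps) <= 2 exp(-2 m eps^2 / (hi - lo)^2). The sketch below turns that into a guarantee of the "± eps with probability ≥ 1 - delta" form. The helper names and the value range [1, 9] are assumptions made for illustration, and the bound presumes independent, identically distributed draws.

```python
import math

def hoeffding_failure_prob(m, eps, lo, hi):
    """Upper bound on P(|sample mean - true mean| >= eps) for m i.i.d. draws in [lo, hi]."""
    return min(1.0, 2 * math.exp(-2 * m * eps ** 2 / (hi - lo) ** 2))

def hoeffding_sample_size(eps, delta, lo, hi):
    """Smallest m for which the bound above is at most delta."""
    return math.ceil((hi - lo) ** 2 * math.log(2 / delta) / (2 * eps ** 2))

# Sample size sufficient for "within ± 1 with probability >= 0.9" when values lie in [1, 9].
m = hoeffding_sample_size(eps=1.0, delta=0.1, lo=1, hi=9)
print(m, hoeffding_failure_prob(m, 1.0, 1, 9))
```

The bound is distribution-free and therefore conservative: far fewer samples often suffice in practice.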

6 Sampling: Some Background
- Reservoir sampling [Vit85]: maintains a sample S of pre-assigned size M over a stream of arbitrary size
  - The first M elements fill S; each later element is added to S with probability M/n, where n is the current number of stream elements
  - When an element is added, a randomly chosen element is evicted from S
  - Optimization: instead of flipping a coin for each element, determine the number of elements to skip before the next one is added to S
- Concise sampling [GM98]: duplicates in sample S are stored as (value, count) pairs (thus potentially boosting the actual sample size)
- Counting samples [GM98]: for answering hot-list queries (k most frequent values)
- Window sampling [BDM02, BOZ08]: maintains a sample S of pre-assigned size M over a window on a stream, i.e., reservoir sampling with expiring tuples
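
The basic reservoir scheme can be sketched as follows. This is the simple coin-flip-per-element variant, not the skip-based optimization mentioned on the slide; the function name and the optional `rng` parameter are choices made for this illustration.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Keep a uniform random sample of size M from a stream of unknown length."""
    rng = rng or random.Random()
    S = []
    for n, e in enumerate(stream, start=1):
        if n <= M:
            S.append(e)                  # the first M elements fill the reservoir
        elif rng.random() < M / n:       # admit element n with probability M/n
            S[rng.randrange(M)] = e      # and evict a randomly chosen resident
    return S

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(reservoir_sample(stream, 4, random.Random(0)))
```

The stream is consumed once and only M elements are held at any time, which is what makes the scheme usable on unbounded streams.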

7 Load Shedding Using Samples
- Given a complex query graph, how should the sampling process be used and managed? [BDM04]
- More about this later [LawZ08]

8 Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance
- Sketches

9 Histograms
- Histograms approximate the frequency distribution of element values in a stream
- A histogram (typically) consists of
  - A partitioning of the element domain values into buckets
  - A count per bucket B (the number of elements in B)
- Widely used in DBMS query optimization
- Many types have been proposed:
  - Equi-depth histograms: select buckets such that counts per bucket are equal
  - V-optimal histograms: select buckets to minimize frequency variance within buckets
  - Wavelet-based histograms

10 Types of Histograms [Garofalakis, Gehrke, Rastogi, VLDB'02]
- Equi-depth histograms
  - Idea: select buckets such that counts per bucket are equal
- V-optimal histograms [IP95] [JKM98]
  - Idea: select buckets to minimize frequency variance within buckets
[Figures: one bar chart per histogram type, count for bucket (y-axis) against domain values 1 to 20 (x-axis)]

11 Equi-Depth Histogram Construction
- For a histogram with b buckets, compute the elements with rank n/b, 2n/b, ..., (b-1)n/b
- Example (n = 12, b = 4)
  Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
  After sort:  1 1 2 3 4 5 5 6 7 8 9 9
  - rank = 3 (.25-quantile): value 2
  - rank = 6 (.5-quantile): value 5
  - rank = 9 (.75-quantile): value 7
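
The construction on this slide can be sketched directly by sorting. This is an offline illustration that holds all the data in memory; on a real stream the ranks would have to come from an approximate one-pass quantile algorithm. The function name is hypothetical.

```python
def equi_depth_boundaries(values, b):
    """Return the elements of rank n/b, 2n/b, ..., (b-1)n/b (ranks are 1-based)."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // b - 1] for i in range(1, b)]

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(equi_depth_boundaries(stream, 4))   # [2, 5, 7]
```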

12 Answering Queries Using Histograms [IP99] [Garofalakis, Gehrke, Rastogi, VLDB'02]
- (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation
- Example: select count(*) from R where 4 <= R.e <= 15
  - The count of each bucket is spread evenly among the bucket's values
  - answer: 3.5 * ... (rest lost in extraction)
- For equi-depth histograms, maximum error: (value lost in extraction)
[Figure: histogram over domain values 1 to 20, with the query range 4 ≤ R.e ≤ 15 marked]
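
The "spread the count evenly" rule can be sketched as below. The bucket boundaries and counts are hypothetical (the slide's own figure did not survive), and buckets are assumed to cover inclusive integer ranges.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) for lo <= e <= hi from (low, high, count) buckets.

    Each bucket's count is assumed to be spread evenly over its integer values.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1   # number of bucket values inside the range
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total

# Hypothetical equi-depth histogram over the domain 1..20: four buckets of 10 elements each.
buckets = [(1, 5, 10), (6, 10, 10), (11, 15, 10), (16, 20, 10)]
print(estimate_range_count(buckets, 4, 15))   # 24.0
```

Only the buckets that the range cuts partway (here the first one) contribute any error; fully covered buckets are counted exactly.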

13 Approximate Algorithms
- Quantiles using samples
- Quantiles from synopses
- One-pass algorithms for approximate samples ...
- Much work in this area ... omitted

15 One-Dimensional Haar Wavelets [Garofalakis, Gehrke, Rastogi, VLDB'02]
- Wavelets: mathematical tool for hierarchical decomposition of functions/signals
- Haar wavelets: simplest wavelet basis, easy to understand and implement
  - Recursive pairwise averaging and differencing at different resolutions
  Resolution   Averages                    Detail Coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
  2            [2, 1, 4, 4]                [0, -1, -1, 0]
  1            [1.5, 4]                    [0.5, 0]
  0            [2.75]                      [-1.25]
- Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
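
The recursive averaging and differencing can be sketched as follows. The function name is hypothetical, and the sketch assumes the input length is a power of two. Each detail coefficient is half the difference between the left and right element of a pair, which matches the numbers in the table.

```python
def haar_decompose(values):
    """Haar decomposition by repeated pairwise averaging and differencing."""
    avgs, details = list(values), []
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        # Coarser levels go in front of the finer ones collected so far.
        details = [(a - b) / 2 for a, b in pairs] + details
        avgs = [(a + b) / 2 for a, b in pairs]
    return avgs + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```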

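16 Haar Wavelet Coefficients [Garofalakis, Gehrke, Rastogi, VLDB'02]
- Hierarchical decomposition structure (a.k.a. "error tree")
- Coefficient "supports": the range of original values that each coefficient contributes to, with sign + or -
[Figure: error tree with root coefficient 2.75, below it -1.25, then 0.5 and 0, then 0, -1, -1, 0, over the original frequency distribution 2 2 0 2 3 5 4 4; next to it, the +/- support of each coefficient]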

17 Compressed Wavelet Representations
Key idea: use a compact subset of Haar/linear wavelet coefficients to approximate the frequency distribution
Steps
- Compute the cumulative frequency distribution C
- Compute the linear wavelet transform of C
- Apply greedy heuristic methods
  - Retain coefficients leading to large error reduction
  - Throw away coefficients that give a small increase in error
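
The steps above can be sketched as follows, under two stated assumptions: the plain Haar transform stands in for the linear wavelet transform named on the slide, and "error" is taken to be the sum of squared differences on the reconstructed C. All function names are hypothetical, and the input length is assumed to be a power of two.

```python
def haar_decompose(values):
    """Haar transform by pairwise averaging and differencing."""
    avgs, details = list(values), []
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        avgs = [(a + b) / 2 for a, b in pairs]
    return avgs + details

def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    avgs, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(avgs)]
        pos += len(avgs)
        avgs = [x for a, d in zip(avgs, details) for x in (a + d, a - d)]
    return avgs

def compress(coeffs, k):
    """Greedily zero out coefficients until at most k nonzero ones remain."""
    target = haar_reconstruct(coeffs)
    kept = list(coeffs)
    alive = [i for i, c in enumerate(kept) if c != 0]
    while len(alive) > k:
        def error_without(i):
            trial = kept.copy()
            trial[i] = 0
            return sum((x - y) ** 2 for x, y in zip(haar_reconstruct(trial), target))
        victim = min(alive, key=error_without)   # smallest increase in squared error
        kept[victim] = 0
        alive.remove(victim)
    return kept

freqs = [2, 2, 0, 2, 3, 5, 4, 4]
C = [sum(freqs[:i + 1]) for i in range(len(freqs))]   # cumulative frequency distribution
coeffs = haar_decompose(C)
approx = haar_reconstruct(compress(coeffs, 3))
print(C)
print(approx)
```

With only 3 of the 8 coefficients retained, the reconstruction is a coarse step function that follows the overall shape of C.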

19 Sketches
- Conventional data summaries fall short:
  - Quantiles and 1-d histograms: cannot capture attribute correlations
  - Samples (e.g., using reservoir sampling) perform poorly for joins
  - Multi-d histograms/wavelets: construction requires multiple passes over the data
- Different approach: randomized sketch synopses
  - Only logarithmic space
  - Probabilistic guarantees on the quality of the approximate answer
  - Can handle extreme cases

20 Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance
- Sketches
- QoS by load shedding

21 QoS and Load Shedding
- When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)
- Load shedding affects queries and their answers: drop the tasks and tuples that will cause the least loss
- Introducing load shedding in a data stream manager is a challenging problem
- Two approaches: random load shedding or semantic load shedding

22 Load Shedding in Aurora
- QoS for each application is a function relating its output to its utility
  - Delay based, drop based, value based
- Techniques for introducing load shedding operators into a plan such that QoS is disrupted the least
  - Determining when, where, and how much load to shed

23 Load Shedding in STREAM
- Formulate load shedding as an optimization problem for multiple sliding-window aggregate queries
  - Minimize inaccuracy in answers subject to the output rate matching or exceeding the arrival rate
- Consider placement of load shedding operators in the query plan
  - Each operator sheds load uniformly with probability p_i
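
A single random load shedding operator can be sketched as follows. This is an illustration only, not the STREAM implementation: it shows one operator in isolation, drops each tuple independently with probability `p_drop`, and compensates by scaling count and sum by 1 / (1 - p_drop), a common correction that is assumed here. The function name and parameters are hypothetical.

```python
import random

def shed_and_estimate(stream, p_drop, rng=None):
    """Drop each tuple with probability p_drop; return scaled (count, sum) estimates."""
    if not 0 <= p_drop < 1:
        raise ValueError("p_drop must be in [0, 1)")
    rng = rng or random.Random()
    kept_count, kept_sum = 0, 0.0
    for v in stream:
        if rng.random() < p_drop:
            continue                     # tuple is shed before reaching the aggregate
        kept_count += 1
        kept_sum += v
    scale = 1 / (1 - p_drop)             # compensate for the tuples that were dropped
    return kept_count * scale, kept_sum * scale

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(shed_and_estimate(stream, 0.5, random.Random(0)))
```

With p_drop = 0 the estimates are exact; as p_drop grows the operator does less work downstream at the price of higher variance in the answer, which is the trade-off the optimization problem balances across operators.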

24 References
[BDM02] B. Babcock, M. Datar, R. Motwani. "Sampling from a moving window over streaming data". Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633-634, 2002.
[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. "Succinct Sampling on Streams". Submitted for publication.
[Vit85] J. S. Vitter. "Random Sampling with a Reservoir". ACM TOMS, 1985.
[GM98] P. B. Gibbons, Y. Matias. "New Sampling-Based Summary Statistics for Improving Approximate Query Answers". ACM SIGMOD 1998.
[BDM04] B. Babcock, M. Datar, R. Motwani. "Load Shedding for Aggregation Queries over Data Streams". ICDE 2004, pp. 350-361.
[LawZ08] Y.-N. Law, C. Zaniolo. "Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding". International Journal of Business Intelligence and Data Mining, 2008.
