
1 Approximation and Load Shedding for QoS in DSMS*
CS240B Notes, by Carlo Zaniolo, CSD, UCLA
* Notes based on a VLDB’02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

2 Synopses and Approximation
- Synopsis: bounded-memory history approximation
  - Succinct summary of old stream tuples
  - Like indexes/materialized views, but base data is unavailable
  - Examples
    - Sliding windows
    - Samples
    - Histograms
    - Wavelet representation
    - Sketching techniques
- Approximate algorithms: e.g., median, quantiles, …
- Fast and light data mining algorithms

3 Overview of Stream Synopses
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance

4 Sampling: Basics (Garofalakis, Gehrke, Rastogi, VLDB’02)
- Idea: a small random sample S of the data often represents all the data well
  - For a fast approximate answer, apply a “modified” query to S
  - Example: select agg from R where odd(R.e) (n=12)
    - Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
    - Sample S: 9 5 1 8
  - If agg is avg, return the average of the odd elements in S (answer: 5)
  - If agg is count, return the average over all elements e in S of n if e is odd, 0 if e is even (answer: 12*3/4 = 9)
- Unbiased: for expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer
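
The two estimators on this slide can be sketched in a few lines of Python. This is only an illustration using the slide's own stream and sample; the function names `estimate_avg` and `estimate_count` are made up for this example and do not come from the tutorial.

```python
def estimate_avg(sample, predicate):
    """Estimate avg over qualifying tuples: average the qualifying sample elements."""
    matching = [e for e in sample if predicate(e)]
    if not matching:
        raise ValueError("no sample element satisfies the predicate")
    return sum(matching) / len(matching)


def estimate_count(sample, predicate, n):
    """Estimate count: each sample element contributes n if it qualifies, else 0."""
    if not sample:
        raise ValueError("empty sample")
    return sum(n if predicate(e) else 0 for e in sample) / len(sample)


def is_odd(e):
    return e % 2 == 1


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # n = 12
sample = [9, 5, 1, 8]                            # sample S from the slide

print(estimate_avg(sample, is_odd))                  # 5.0
print(estimate_count(sample, is_odd, len(stream)))   # 9.0
```

On this stream the exact answers are avg = 5 and count = 8, so the count estimate of 9 shows the kind of error a four-element sample can introduce.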

5 Probabilistic Guarantees
- Example: the actual answer is within 5 ± 1 with probability ≥ 0.9
- Use tail inequalities to give probabilistic bounds on the returned answer
  - Markov inequality
  - Chebyshev’s inequality
  - Hoeffding’s inequality
  - Chernoff bound
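
As one concrete case, Hoeffding's inequality bounds the error of a sample mean: for m independent draws, each lying in a range of width R, P(|sample mean - true mean| ≥ eps) ≤ 2·exp(-2·m·eps² / R²). The sketch below turns that into a sample-size calculation. The function names and the numbers in the usage lines (eps = 1, failure probability 0.1, values in 1..9) are assumptions chosen to echo the "5 ± 1 with probability ≥ 0.9" example; they are not from the slides.

```python
import math


def hoeffding_failure_bound(m, eps, value_range):
    """Upper bound on P(|sample mean - true mean| >= eps) for m independent draws."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * eps ** 2 / value_range ** 2))


def hoeffding_sample_size(eps, delta, value_range):
    """Smallest m for which the Hoeffding bound drops to delta or below."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


# Values between 1 and 9 (range width 8), error at most 1, confidence at least 0.9.
m = hoeffding_sample_size(eps=1.0, delta=0.1, value_range=8.0)
print(m)                                         # 96
print(hoeffding_failure_bound(m, 1.0, 8.0))      # just under 0.1
```

The bound is distribution-free, so it is often loose; Chebyshev or Chernoff bounds may give tighter answers when more is known about the data.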

6 Sampling: Some Background
- Reservoir sampling [Vit85]: maintains a sample S of pre-assigned size M over a stream of arbitrary size
  - Add each new element to S with probability M/n, where n is the current number of stream elements
  - If an element is added, evict a random element from S
  - Instead of flipping a coin for each element, determine the number of elements to skip before the next one is added to S
- Concise sampling [GM98]: duplicates in sample S are stored as <value, count> pairs (thus potentially boosting the actual sample size)
- Counting samples [GM98]: for answering hot-list queries (the k most frequent values)
- Window sampling [BDM02, BOZ08]: maintains a sample S of pre-assigned size M over a window on a stream (reservoir sampling with expiring tuples)
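
A minimal sketch of the basic reservoir scheme described above, flipping a coin per element rather than using the skip-count optimization. The function name and the optional `rng` parameter are choices made for this illustration.

```python
import random


def reservoir_sample(stream, M, rng=None):
    """Return a uniform random sample of up to M elements from a one-pass stream."""
    rng = rng or random.Random()
    sample = []
    for n, element in enumerate(stream, start=1):
        if n <= M:
            # The first M elements fill the reservoir.
            sample.append(element)
        elif rng.random() < M / n:
            # Admit the n-th element with probability M/n, evicting a random victim.
            sample[rng.randrange(M)] = element
    return sample


print(reservoir_sample(range(1000), 10, random.Random(42)))
```

Memory stays at M elements however long the stream runs, which is what makes the scheme usable as a stream synopsis.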

7 Load Shedding Using Samples
- Given a complex query graph, how should the sampling process be used and managed? [BDM04]
- More about this later [LawZ08]

8 Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance
- Sketches

9 Histograms
- Histograms approximate the frequency distribution of element values in a stream
- A histogram (typically) consists of
  - A partitioning of element domain values into buckets
  - A count per bucket B (of the number of elements in B)
- Widely used in DBMS query optimization
- Many types have been proposed:
  - Equi-depth histograms: select buckets such that counts per bucket are equal
  - V-optimal histograms: select buckets to minimize frequency variance within buckets
  - Wavelet-based histograms

10 Types of Histograms
- Equi-depth histograms
  - Idea: select buckets such that counts per bucket are equal
- V-optimal histograms [IP95] [JKM98]
  - Idea: select buckets to minimize frequency variance within buckets
[Figures: count per bucket plotted against domain values 1 to 20, one plot for each histogram type]

11 Equi-Depth Histogram Construction
- For a histogram with b buckets, compute the elements with rank n/b, 2n/b, ..., (b-1)n/b
- Example (n=12, b=4)
  - Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
  - After sort: 1 1 2 3 4 5 5 6 7 8 9 9
  - rank = 3 (.25-quantile), rank = 6 (.5-quantile), rank = 9 (.75-quantile)
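
The construction above can be sketched directly by sorting and picking the elements at the stated ranks. Sorting the whole input is an offline simplification made for this illustration; a streaming system would use an approximate quantile algorithm instead. The function name is hypothetical.

```python
def equi_depth_boundaries(data, b):
    """Return the elements of rank n/b, 2n/b, ..., (b-1)n/b (ranks are 1-based)."""
    if b < 1:
        raise ValueError("need at least one bucket")
    ordered = sorted(data)
    n = len(ordered)
    # Rank r corresponds to index r - 1 in the sorted list.
    return [ordered[(i * n) // b - 1] for i in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
print(equi_depth_boundaries(stream, 4))   # [2, 5, 7]
```

The three boundaries are the elements at ranks 3, 6 and 9 of the sorted stream, so each of the four buckets holds three elements.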

12 Answering Queries Using Histograms [IP99]
- (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation
- Example: select count(*) from R where 4 <= R.e <= 15
- Count is spread evenly among bucket values
[Figure: histogram over domain values 1 to 20 with the query range 4 ≤ R.e ≤ 15 marked; the maximum-error expression for equi-depth histograms and the answer (3.5 * …) are cut off in this transcript]
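
A minimal sketch of range-count estimation under the "count spread evenly among bucket values" assumption. The bucket layout used here (four buckets of five integer values, ten tuples each) is hypothetical, since the slide's own histogram did not survive in the transcript, and the function name is made up for this example.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) for lo <= value <= hi over an integer domain.

    buckets is a list of (bucket_lo, bucket_hi, count) with inclusive bounds;
    each bucket's count is assumed to be spread evenly over its values.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1   # number of shared values
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total


# Hypothetical equi-depth histogram: 10 tuples in every bucket.
histogram = [(1, 5, 10), (6, 10, 10), (11, 15, 10), (16, 20, 10)]
print(estimate_range_count(histogram, 4, 15))   # 24.0
```

Only the partially covered bucket (values 1 to 5) contributes any uncertainty; buckets fully inside the query range are counted exactly.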

13 Approximate Algorithms
- Quantiles using samples
- Quantiles from synopses
- One-pass algorithms for approximate samples …
- Much work in this area … omitted

14 Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance
- Sketches

15 One-Dimensional Haar Wavelets
- Wavelets: mathematical tool for hierarchical decomposition of functions/signals
- Haar wavelets: simplest wavelet basis, easy to understand and implement
  - Recursive pairwise averaging and differencing at different resolutions
- Example:
    Resolution   Averages                    Detail Coefficients
    3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
    2            [2, 1, 4, 4]                [0, -1, -1, 0]
    1            [1.5, 4]                    [0.5, 0]
    0            [2.75]                      [-1.25]
- Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
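
The recursive pairwise averaging and differencing can be sketched as follows. The function name is chosen for this illustration, and the input length is assumed to be a power of two, as in the slide's example.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition by repeated pairwise averaging/differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones already collected.
        details = level_details + details
    return averages + details


print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Each pass halves the number of averages, so the whole decomposition takes time linear in the input length.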

16 Haar Wavelet Coefficients
[Figure: hierarchical decomposition structure (a.k.a. “error tree”) with the coefficients 2.75, -1.25, 0.5, 0, ... at its nodes and the original frequency distribution 2 2 0 2 3 5 4 4 at its leaves, shown next to the coefficient “supports” with their + and - signs]

17 Compressed Wavelet Representations
- Key idea: use a compact subset of Haar/linear wavelet coefficients to approximate the frequency distribution
- Steps
  - Compute the cumulative frequency distribution C
  - Compute the linear wavelet transform of C
  - Greedy heuristic methods
    - Retain coefficients leading to large error reduction
    - Throw away coefficients that give a small increase in error
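
To illustrate the "keep a compact subset of coefficients" idea, the sketch below uses a simpler stand-in for the greedy heuristics named on the slide: it keeps the k Haar coefficients with the largest magnitude weighted by the square root of their support size, zeroes the rest, and reconstructs. It works on the Haar coefficients of the raw frequencies from the earlier example rather than on a linear wavelet transform of the cumulative distribution, and both function names are hypothetical.

```python
import math


def haar_reconstruct(coeffs):
    """Invert the unnormalized averaging/differencing Haar transform."""
    averages = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        m = len(averages)
        details = coeffs[pos:pos + m]
        expanded = []
        for a, d in zip(averages, details):
            expanded.extend([a + d, a - d])   # undo (a+b)/2 and (a-b)/2
        averages = expanded
        pos += m
    return averages


def keep_top_k(coeffs, k):
    """Zero all but the k coefficients with the largest support-weighted magnitude."""
    n = len(coeffs)

    def weight(i):
        # Index 0 (average) and index 1 span all n cells; indices 2-3 span n/2, etc.
        level = 0 if i == 0 else i.bit_length() - 1
        return abs(coeffs[i]) * math.sqrt(n / 2 ** level)

    keep = set(sorted(range(n), key=weight, reverse=True)[:k])
    return [c if i in keep else 0 for i, c in enumerate(coeffs)]


coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]    # decomposition of 2 2 0 2 3 5 4 4
compressed = keep_top_k(coeffs, 2)
print(compressed)                     # [2.75, -1.25, 0, 0, 0, 0, 0, 0]
print(haar_reconstruct(compressed))   # [1.5, 1.5, 1.5, 1.5, 4.0, 4.0, 4.0, 4.0]
```

With only two of eight coefficients retained, the reconstruction already captures the low-then-high shape of the original distribution.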

18 Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance
- Sketches

19 Sketches
- Conventional data summaries fall short:
  - Quantiles and 1-d histograms: cannot capture attribute correlations
  - Samples (e.g., using reservoir sampling) perform poorly for joins
  - Multi-d histograms/wavelets: construction requires multiple passes over the data
- Different approach: randomized sketch synopses
  - Only logarithmic space
  - Probabilistic guarantees on the quality of the approximate answer
  - Can handle extreme cases

20 Overview
- Windows: logical, physical (covered)
- Samples: answering queries using samples
- Histograms: equi-depth histograms, on-line quantile computation
- Wavelets: Haar-wavelet histogram construction and maintenance
- Sketches
- QoS by load shedding

21 QoS and Load Shedding
- When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)
- Load shedding affects queries and their answers: drop the tasks and the tuples that will cause the least loss
- Introducing load shedding in a data stream manager is a challenging problem
- Random load shedding or semantic load shedding

22 Load Shedding in Aurora
- QoS for each application as a function relating output to its utility
  - Delay-based, drop-based, value-based
- Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least
  - Determining when, where, and how much load to shed

23 Load Shedding in STREAM
- Formulate load shedding as an optimization problem for multiple sliding-window aggregate queries
  - Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate
- Consider placement of load shedding operators in the query plan
  - Each operator sheds load uniformly with probability p_i
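
A minimal sketch of a single uniform load-shedding operator feeding a SUM aggregate. Two assumptions are made here that the slide does not spell out: the probability is read as the chance of dropping a tuple (`drop_prob`), and the aggregate is compensated by dividing by the keep probability. The function name is hypothetical, and this is not presented as the STREAM implementation.

```python
import random


def shed_and_sum(stream, drop_prob, rng):
    """Drop each tuple independently with drop_prob; return (scaled sum, tuples kept)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    kept_total = 0
    kept = 0
    for value in stream:
        if rng.random() >= drop_prob:     # tuple survives the shedder
            kept_total += value
            kept += 1
    # Scale up so the estimate is unbiased for the true sum (assumed compensation).
    return kept_total / keep_prob, kept


rng = random.Random(7)
estimate, kept = shed_and_sum(range(100), 0.5, rng)
print(kept, estimate)                     # true sum is 4950
```

Raising `drop_prob` reduces the work done downstream at the price of a noisier estimate, which is the trade-off the optimization problem above has to balance across operators.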

24 References
[BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a Moving Window over Streaming Data”. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634, 2002.
[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”. Submitted for publication.
[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.
[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving Approximate Query Answers”. ACM SIGMOD 1998.
[BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries over Data Streams”. ICDE 2004: 350–361.
[LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding”. International Journal of Business Intelligence and Data Mining, 2008.

