Load Shedding Techniques for Data Stream Systems
Brian Babcock, Mayur Datar, Rajeev Motwani
Stanford University

Differences from Previous Talk
- Our focus: aggregation queries
- No quality-of-service specifications; instead, focus on the accuracy of query answers
- Compensate for dropped data by scaling answers
- Random drops only (no semantic drops)

Problem Setting
[Diagram: queries Q1, Q2, Q3, each ending in an aggregate operator Σ, over inputs R, S1, S2]
- Sliding-window aggregate queries (SUM and COUNT)
- Filters, UDFs, and joins with relations
- Operator sharing

Inputs to the Problem
[Diagram: queries Q1, Q2, Q3 over inputs R, S1, S2, annotated with the quantities below]
- Standard deviation σ
- Mean μ
- Processing time t
- Selectivity s
- Stream rate r
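
Load Shedding via Random Drops
[Diagram: stream S with rate r flows through operators 1 and 2 into aggregate Σ3; each operator is labeled (time, selectivity): (t1, s1), (t2, s2), (t3, s3); a load shedder with sampling rate p is inserted into the chain]
- Without load shedding: Load = r·t1 + r·s1·t2 + r·s1·s2·t3
- With sampling rate p: Load = r·t1 + p·(r·s1·t2 + r·s1·s2·t3)
- Need Load ≤ 1
- Scale answer by 1/p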
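
The two load formulas above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the names `pipeline_load` and `max_sampling_rate`, the `shed_after` parameter, and the example numbers are hypothetical. It assumes a single linear chain of operators with one random-drop operator placed after the first operator, which is what the second formula implies (t1 is not multiplied by p).

```python
def pipeline_load(rate, ops, p=1.0, shed_after=1):
    """Load of a linear operator chain with one random-drop operator.

    ops is a list of (time, selectivity) pairs in pipeline order.
    The shedder keeps each tuple with probability p and sits after the
    first `shed_after` operators (hypothetical parameter; 1 matches the slide).
    """
    load = 0.0
    flow = rate  # tuples per unit time reaching the current operator
    for i, (t, s) in enumerate(ops):
        if i == shed_after:
            flow *= p          # the random drop thins the flow here
        load += flow * t       # cost of processing what arrives
        flow *= s              # only a fraction s passes on
    return load


def max_sampling_rate(rate, ops, shed_after=1):
    """Largest p in [0, 1] with load <= 1, or None if no such p exists."""
    fixed = pipeline_load(rate, ops, p=0.0, shed_after=shed_after)
    full = pipeline_load(rate, ops, p=1.0, shed_after=shed_after)
    if full <= 1.0:
        return 1.0             # no shedding needed
    if fixed > 1.0:
        return None            # operators before the shedder already overload
    # Load is linear in p: load(p) = fixed + p * (full - fixed)
    return (1.0 - fixed) / (full - fixed)


# Illustrative numbers only (not from the talk)
ops = [(0.002, 0.5), (0.01, 0.5), (0.02, 1.0)]
p = max_sampling_rate(100, ops)
print(pipeline_load(100, ops), p, pipeline_load(100, ops, p=p))
```

Because the load is linear in p, the largest feasible sampling rate has a closed form and no search is needed in this single-shedder case.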

Problem Statement
- Relative error is the metric of choice: |Estimate - Actual| / Actual
- Goal: minimize the maximum relative error across queries, subject to Load ≤ 1
- Want low error with high probability

Relating Load Shedding and Error
[Equation not preserved in this text; its labeled terms are: relative error for query i, sampling rate for query i, query-dependent constant Ci]
- Equation derived from Hoeffding bounds
- Constant Ci depends on:
  - Variance of aggregated attribute
  - Sliding window size

Calculate Ratio of Sampling Rates
- Minimize maximum relative error → equal relative error across queries
- Express all sampling rates in terms of a common variable λ
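
To illustrate why equal relative error fixes the ratio between sampling rates, here is a small sketch. The exact equation from the talk is not preserved in this text, so the code ASSUMES the simple form error_i = C_i / P_i (C_i the query-dependent constant, P_i the sampling rate for query i). Under that assumption P_i = C_i · λ with λ = 1 / error. The function name and the constants are hypothetical.

```python
def rates_for_equal_error(constants, error):
    """Sampling rate per query so that all queries share one relative error.

    ASSUMPTION (not confirmed by the slides): error_i = C_i / P_i,
    hence P_i = C_i / error. Rates are capped at 1.0 because a sampling
    rate cannot exceed 1; a capped query simply does better than `error`.
    """
    return {q: min(1.0, c / error) for q, c in constants.items()}


# Hypothetical constants C_i for two queries
constants = {"Q1": 0.08, "Q2": 0.06}
for error in (0.1, 0.2):
    rates = rates_for_equal_error(constants, error)
    # The ratio between the two rates does not depend on the chosen error
    print(error, rates, rates["Q2"] / rates["Q1"])
```

The point of the sketch is that only the ratio C_2 / C_1 matters for relating the rates; the common variable then scales all of them together.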
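
Placing Load Shedders
[Diagram: a shared segment feeds two aggregate queries Σ with target sampling rates 0.8λ and 0.6λ]
- Shared segment: sampling rate 0.8λ
- Branch with target 0.6λ: sampling rate 0.75 = 0.6λ / 0.8λ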
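
The placement in this example can be sketched as a small recursive procedure. This is an illustration under stated assumptions, not the authors' algorithm: it assumes a shared segment is sampled at the largest target among the queries it feeds (inferred from the 0.8λ / 0.6λ example), and that each branch then applies target divided by parent target. The tree encoding and the names `subtree_target` and `place_shedders` are hypothetical.

```python
def subtree_target(node):
    """Largest target sampling rate among the queries below this node."""
    if isinstance(node, (int, float)):
        return float(node)     # a leaf is a query's target rate
    return max(subtree_target(child) for child in node.values())


def place_shedders(node, name="shared", parent_target=1.0, out=None):
    """Map each segment name to the local sampling rate of its shedder.

    A node is either a number (a query's target rate) or a dict of
    named child segments. A local rate of 1.0 means no shedder is needed.
    """
    if out is None:
        out = {}
    target = subtree_target(node)
    out[name] = target / parent_target
    if isinstance(node, dict):
        for child_name, child in node.items():
            place_shedders(child, child_name, target, out)
    return out


# Targets from the slide, evaluated at lambda = 1
plan = place_shedders({"Q1": 0.8, "Q2": 0.6})
print(plan)
```

Multiplying the local rates along the path from the shared segment to a query gives back that query's target rate, so the query with the larger target is never sampled more aggressively than it needs to be.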

Conclusion
- Load shedding helps cope with bursts
- Minimizing relative error is a natural objective for aggregate queries
- Algorithm for load shedding:
  1. Relate target sampling rates for all queries
  2. Place random drop operators based on target sampling rates
  3. Adjust sampling rates to achieve desired load

Thanks for listening! Questions?

Choosing Target Sampling Rates
[Equation not preserved in this text; its labeled terms are: relative error, sampling rate for query, variance of aggregated attribute, sliding window size]

Measuring Inaccuracy
[Diagram: operators 1 and 2 feed aggregate Σ3, with two load shedders along the path, one with sampling rate p1 and one with sampling rate p2]
- Scale answer by 1/(p1·p2)
- A tuple with value x contributes x / (p1·p2) with probability p1·p2, and 0 with probability 1 - p1·p2
- Key point: the product of sampling rates determines the quality of the approximate answer
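
A short simulation can make the key point concrete. This is a minimal sketch with hypothetical names and made-up data: two independent random drops in series keep a tuple with probability p1·p2, and scaling each surviving value by 1/(p1·p2) gives an estimate of SUM whose expected value equals the true sum. Operators between the shedders are omitted for simplicity.

```python
import random


def estimate_sum(values, p1, p2, rng):
    """Approximate SUM when two random-drop operators sit in series."""
    scale = 1.0 / (p1 * p2)
    total = 0.0
    for x in values:
        # The tuple must survive both independent drops
        if rng.random() < p1 and rng.random() < p2:
            total += x * scale
    return total


rng = random.Random(0)            # fixed seed so runs are repeatable
values = [1.0] * 50000            # illustrative data only
actual = sum(values)
estimate = estimate_sum(values, 0.8, 0.5, rng)
relative_error = abs(estimate - actual) / actual
print(estimate, relative_error)
```

Only the product p1·p2 enters the scale factor and the survival probability, so in this sketch any pair of rates with the same product gives an estimate with the same distribution.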