
1 Adaptive Monitoring of Bursty Data Streams
Brian Babcock, Shivnath Babu, Mayur Datar, and Rajeev Motwani

2 Monitoring Data Streams
- Lots of data arrives as continuous data streams: network traffic, web clickstreams, financial data feeds, sensor data, etc.
- We could load it into a database and query it, but processing streaming data has advantages:
  - Timeliness: detect interesting events in real time and take appropriate action immediately
  - Performance: avoid use of (slow) secondary storage; process higher volumes of data more cheaply

3 Network Traffic Monitoring
- Security (e.g. intrusion detection)
- Network performance troubleshooting
- Traffic management (e.g. routing policy)

4 Data Streams are Bursty
- Data stream arrival rates are often fast and irregular
- Examples: network traffic (IP, telephony, etc.), e-mail messages, web page access patterns
- Peak rate is much higher than the average rate (1-2 orders of magnitude)
- Impractical to provision the system for the peak rate

5 Bursts Create Backlogs
- Arrival rate temporarily exceeds throughput; queues of unprocessed elements build up
- Two options when memory fills up:
  - Page to disk: slows the system, lowers throughput
  - Admission control (i.e. drop packets): data is lost, answer quality suffers
- Neither option is very appealing

6 Two Approaches to Bursts
1) Minimize memory usage
   - Reduce the memory used to buffer the data backlog → avoid running out of memory
   - Schedule query operators so as to release memory quickly during bursts
   - Sometimes this is not enough...
2) Shed load intelligently to minimize inaccuracy
   - Use approximate query answering techniques
   - Some queries are harder than others to approximate
   - Give hard queries more data and easy queries less

7 Outline
- Operator Scheduling
  - Problem Formalization
  - Intuition Behind the Solution
  - Chain Scheduling Algorithm
  - Near-Optimality of Chain Scheduling
  - Experimental Results
- Load Shedding

8 Problem Formalization
Inputs:
- Data flow path(s) consisting of sequences of operators
- For each operator we know: execution time (per block) and selectivity
(Figure: two queries over a shared stream; each operator i is annotated with time t_i and selectivity s_i)

9 Progress Charts
(Figure: a progress chart with time on the x-axis and block size on the y-axis; a block starts at (0,1) and moves through (1,0.5) and (4,0.25) to (6,0) as operators Op1, Op2, Op3 process it)
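A progress chart follows directly from the per-operator (time, selectivity) pairs of slide 8: each operator moves a block right by its execution time and down by its selectivity. A minimal sketch (function name is illustrative, not from the talk):

```python
def progress_chart(ops):
    """ops: list of (t_i, s_i) per operator.

    Returns the chart as (cumulative time, remaining block size) points,
    starting from a full block of normalized size 1 at time 0.
    """
    pts = [(0.0, 1.0)]
    time, size = 0.0, 1.0
    for t, s in ops:
        time += t       # operator i costs t_i units of processing time
        size *= s       # and keeps a fraction s_i of the block
        pts.append((time, size))
    return pts

# Reproduces the chart on slide 9: (0,1), (1,0.5), (4,0.25), (6,0)
print(progress_chart([(1, 0.5), (3, 0.5), (2, 0.0)]))
```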

10 Problem Formalization
Inputs:
- Data flow path(s) consisting of sequences of operators
- For each operator we know: execution time (per block) and selectivity
At each time step:
- Blocks of tuples may arrive at the initial input queue(s)
- The scheduler selects one block of tuples
- The selected block moves one step on its progress chart
Objective: minimize peak memory usage (sum of queue sizes)

11 Main Solution Idea
- Fast, selective operators release memory quickly
- Therefore, to minimize memory:
  - Give preference to fast, selective operators
  - Postpone slow, unselective operators
- Greedy algorithm:
  - Operator priority = selectivity per unit time (s_i/t_i)
  - Always schedule the highest-priority available operator
- Greedy doesn't quite work...
  - A "good" operator that follows a "bad" operator rarely runs: the "bad" operator doesn't get scheduled, so no input is available for the "good" operator
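A minimal sketch of the greedy rule above, using the slide's priority definition (selectivity per unit time, s_i/t_i). An operator is "available" only if blocks are waiting in its input queue; the names and queue representation are illustrative assumptions:

```python
def pick_greedy(ops, queues):
    """ops: list of (t_i, s_i) per operator; queues[i]: blocks queued
    before operator i. Returns the index of the operator to schedule,
    or None if nothing is available."""
    available = [i for i in range(len(ops)) if queues[i] > 0]
    if not available:
        return None
    # Priority = s_i / t_i, as defined on the slide.
    return max(available, key=lambda i: ops[i][1] / ops[i][0])

ops = [(1.0, 0.5), (3.0, 0.5), (2.0, 0.25)]   # (time, selectivity)
print(pick_greedy(ops, [2, 0, 1]))            # operator 1 has no input yet
```

Note the failure mode the slide describes: an operator can only run when its queue is nonempty, so a "good" operator downstream of a rarely scheduled "bad" one is starved of input.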

12 Bad Example for Greedy
(Figure: progress chart for Op1, Op2, Op3; tuples build up in the queue before the slow middle operator)

13 Chain Scheduling Algorithm
(Figure: progress chart for Op1, Op2, Op3 together with its lower envelope)

14 Chain Scheduling Algorithm
- Calculate the lower envelope of the progress chart
- Priority = slope of the lower envelope segment
- Always schedule the highest-priority available operator
- Break ties using operator order in the pipeline, favoring later operators
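A sketch of the envelope step, assuming the progress chart is given as (cumulative time, block size) points as on slide 9. Starting from the first point, repeatedly jump to the point reachable with the steepest downward slope; the resulting segments are the lower envelope, and each segment's slope is the priority of the operators it covers:

```python
def lower_envelope(points):
    """points: progress chart as (time, size) pairs with increasing time.

    Returns the indices of the points where the lower envelope turns.
    """
    env = [0]
    i = 0
    while i < len(points) - 1:
        best_j, best_slope = None, float("inf")
        for j in range(i + 1, len(points)):
            slope = (points[j][1] - points[i][1]) / (points[j][0] - points[i][0])
            if slope < best_slope:      # steepest (most negative) slope wins
                best_slope, best_j = slope, j
        env.append(best_j)
        i = best_j
    return env

# Chart from slide 9: Op1 forms one envelope segment on its own,
# Op2 and Op3 are covered by a single second segment.
chart = [(0, 1.0), (1, 0.5), (4, 0.25), (6, 0.0)]
print(lower_envelope(chart))  # [0, 1, 3]
```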

15 FIFO: Example
(Figure: FIFO scheduling on the progress chart with points (0,1), (1,0.5), (4,0.25), (6,0) for Op1, Op2, Op3)

16 Chain: Example
(Figure: Chain scheduling on the same progress chart, following its lower envelope)

17 Memory Usage
(Figure: memory usage over time)

18 Chain is Near-Optimal
- Memory usage is within a small constant of an optimal algorithm that knows the future
- Proof sketch:
  - Greedy scheduling is optimal for convex progress charts ("best" operators are immediately available)
  - The lower envelope is convex
  - The lower envelope closely approximates the actual progress chart (details on the next slide)
Theorem: Given a system with k queries and all operator selectivities ≤ 1, let C(t) = number of blocks of memory used by Chain at time t. At every time t, any algorithm must use ≥ C(t) - k memory.

19 Lemma: Lower Envelope is Close to Actual Progress Chart
- At most one block sits in the middle of each lower envelope segment, due to the tie-breaking rule
- (Lower envelope + 1) gives an upper bound on actual memory usage
- Additive error of 1 block per query

20 Performance Comparison
(Figure: memory usage over time; the annotation marks a spike in memory due to a burst)

21 Outline
- Operator Scheduling
- Load Shedding
  - Motivation for Load Shedding
  - Problem Formalization
  - Load Shedding Algorithm
  - Experimental Results

22 Why Load Shedding?
- The data rate during a burst can be too fast for the system to keep up
- Chain scheduling helps minimize memory usage, but the CPU may be the bottleneck
- Timely, approximate answers are often more useful than delayed, exact answers
- Solution: when there is too much data to handle, process as much as possible and drop the rest
- Goal: minimize inaccuracy in answers while keeping up with the data

23 Related Approaches
- Our focus: sliding window aggregation queries, with the goal of minimizing inaccuracy in answers
- Previous work considered related questions:
  - Maximize the output rate from sliding window joins (Kang, Naughton, and Viglas, ICDE 2003)
  - Maximize a quality-of-service function for selection queries (Tatbul, Cetintemel, Zdonik, Cherniack, and Stonebraker, VLDB 2003)

24 Problem Setting
- Sliding window aggregate queries (SUM and COUNT)
- Operator sharing
- Filters, UDFs, and joins with relations
(Figure: queries Q1, Q2, Q3 over streams S1 and S2, with shared operators and a join against relation R)

25 Inputs to the Problem
- For each stream: arrival rate r
- For each operator: processing time t and selectivity s
- For each aggregated attribute: mean μ and standard deviation σ
(Figure: the query plan from slide 24, annotated with these parameters)

26 Load Shedding via Random Drops
- A stream with rate r feeds operators with (time, selectivity) pairs (t_1, s_1), (t_2, s_2), (t_3, s_3)
- Without shedding: Load = r·t_1 + r·s_1·t_2 + r·s_1·s_2·t_3
- Inserting a random-drop operator with sampling rate p after operator 1: Load = r·t_1 + p·(r·s_1·t_2 + r·s_1·s_2·t_3)
- Scale the answer by 1/p
- Need Load ≤ 1
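The slide's load formula can be sketched directly: each operator contributes rate × upstream selectivity × time, and every term downstream of a random-drop operator is scaled by the sampling rate p. Function and parameter names are illustrative:

```python
def load(r, ops, p=1.0, drop_after=0):
    """CPU load of a pipeline, per the slide's formula.

    r: stream arrival rate; ops: list of (t_i, s_i) per operator;
    p: sampling rate of a random-drop operator placed after operator
    number `drop_after` (0 means no drop operator).
    """
    total, sel = 0.0, 1.0
    for i, (t, s) in enumerate(ops, start=1):
        term = r * sel * t          # r * s_1*...*s_{i-1} * t_i
        if drop_after and i > drop_after:
            term *= p               # downstream of the drop: scaled by p
        total += term
        sel *= s
    return total

ops = [(1, 0.5), (2, 0.5), (4, 0.25)]
print(load(1.0, ops))                         # no shedding
print(load(1.0, ops, p=0.5, drop_after=1))    # drop half after operator 1
```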

27 Problem Statement
- Relative error is the metric of choice: |Estimate - Actual| / Actual
- Goal: minimize the maximum relative error across queries, subject to Load ≤ 1
- Want low error with high probability

28 Quantifying Effects of Load Shedding
- Two load shedders in sequence with sampling rates p_1 and p_2 (answer scaled by 1/(p_1·p_2)) are equivalent to a single shedder with sampling rate p_1·p_2
- The product of the sampling rates along a query's path determines its answer quality

29 Relating Load Shedding and Error
- An equation derived from the Hoeffding bound relates the relative error of query i to its sampling rate through a query-dependent constant C_i
- The constant C_i depends on:
  - The variance of the aggregated attribute
  - The sliding window size
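The specific equation is not recoverable from this transcript, but for reference, the Hoeffding bound it is derived from states, for n i.i.d. samples bounded in [a, b]:

```latex
% Hoeffding's inequality for the sample mean of n i.i.d. samples in [a,b]:
\Pr\left[\,\lvert \bar{X} - \mu \rvert \ge \epsilon\,\right]
  \le 2\exp\!\left(\frac{-2\,n\,\epsilon^{2}}{(b-a)^{2}}\right)
```

Under uniform sampling at rate p_i over a window, n is proportional to p_i times the window size, so solving for ε gives an error that shrinks as the inverse square root of that product; the window size and the spread of the aggregated attribute are what the slide folds into the constant C_i.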

30 Choosing Target Sampling Rates
(Equation: the target sampling rate for each query is expressed in terms of the variance of the aggregated attribute, the sliding window size, and the relative error)

31 Calculate Ratio of Sampling Rates
- Minimizing the maximum relative error → equalize the relative error across queries
- Express all sampling rates in terms of a common variable λ
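Once every target rate is written as p_i = x_i·λ and each load term carries at most one shedder factor, the total load is linear in λ: Load(λ) = base + coef·λ. A hedged sketch of the final "adjust λ to fill the load budget" step (base and coef are assumed inputs, not quantities named on the slides):

```python
def solve_lambda(base, coef):
    """Largest feasible lambda under the Load <= 1 constraint,
    assuming Load(lambda) = base + coef * lambda.

    base: load of work unaffected by shedding (upstream of all shedders);
    coef: sum of the lambda-proportional load terms.
    """
    return (1.0 - base) / coef

# e.g. fixed load 0.2 plus lambda-proportional load 1.6*lambda:
lam = solve_lambda(0.2, 1.6)
print(lam)  # 0.5
```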

32 Placing Load Shedders
- Example targets from the previous step: 0.8λ for one query, 0.6λ for the other
- A shedder on the shared prefix gets sampling rate 0.8λ
- The shedder on the 0.6λ branch then gets local rate 0.6λ / 0.8λ = 0.75
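The placement rule above can be sketched in one line: each shedder's local drop rate is its subtree's target sampling rate divided by the product of sampling rates already applied upstream, so the end-to-end rate seen by every query equals its target. The function name is illustrative:

```python
def shedder_rate(target, upstream_product):
    """Local sampling rate so that upstream_product * local == target."""
    return target / upstream_product

# Slide's example: shared prefix sampled at 0.8*lambda; the query whose
# target is 0.6*lambda needs a local shedder of 0.6/0.8 = 0.75
# (lambda cancels, so its value does not matter here).
lam = 1.0
rate = shedder_rate(0.6 * lam, 0.8 * lam)
```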

33 Experimental Results

34 (Figure)

35 Conclusion
- Fluctuating data stream arrival rates create challenges: temporary system overload during bursts
- Chain scheduling helps minimize memory usage; main idea: give priority to fast, selective operators
- Careful load shedding preserves answer quality:
  - Relate the target sampling rates for all queries
  - Place random drop operators based on the target sampling rates
  - Adjust the sampling rates to achieve the desired load

36 Thanks for Listening! http://www-db.stanford.edu/stream

