Adaptive Processing in Data Stream Systems Shivnath Babu stanfordstreamdatamanager Stanford University.


1 Adaptive Processing in Data Stream Systems
Shivnath Babu, Stanford University

2 Data Streams
New applications: data arrives as continuous, rapid, time-varying data streams
– Sensor networks, RFID tags
– Network monitoring and traffic engineering
– Financial applications
– Telecom call records
– Web logs and click-streams
– Manufacturing processes
Traditional databases: data is stored in finite, persistent data sets

3 Using Traditional Database
[Diagram labels: User/Application, Loader, Query, Result, Table R, Table S]

4 New Approach for Data Streams
[Diagram labels: User/Application, Register Continuous Query, Stream Query Processor, Result, Input streams]

5 Example Continuous Queries
Web
– Amazon’s best sellers over the last hour
Network Intrusion Detection
– Track HTTP packets with destination address matching a prefix in a given table and content matching “*\.ida”
Finance
– Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes

6 Data Stream Management System (DSMS)
[Diagram labels: Input Streams, Register Continuous Query, Streamed Result, Stored Result, Archive, Stored Tables]

7 Primer on Database Query Processing
[Diagram: inside the database system, a declarative query goes through preprocessing (yielding a canonical form), query optimization (yielding the best query execution plan), and query execution over the data (yielding results)]

8 Traditional Query Optimization
Optimizer: Finds “best” query plan to process this query
Executor: Runs chosen plan to completion
Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms
[Diagram labels: Query; Chosen query plan; Which statistics are required; Estimated statistics; Data, auxiliary structures, statistics]

9 Optimizing Continuous Queries is Challenging
Continuous queries are long-running
Stream properties can change while the query runs
– Data properties: value distributions
– Arrival properties: bursts, delays
System conditions can change
Performance of a fixed plan can change significantly over time
⇒ Adaptive processing: use the plan that is best for current conditions

10 Roadmap
StreaMon: Our adaptive query processing engine
Adaptive ordering of commutative filters
Adaptive caching for multiway joins
Current and future work
– Similar techniques apply to conventional databases

11 Traditional Optimization → StreaMon
Traditional optimization:
– Optimizer: Finds “best” query plan to process this query
– Executor: Runs chosen plan to completion
– Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms
StreaMon:
– Re-optimizer: Ensures that plan is efficient for current characteristics
– Executor: Executes current plan on incoming stream tuples
– Profiler: Monitors current stream and system characteristics
[Diagram labels: Query; Chosen query plan; Which statistics are required; Estimated statistics; Decisions to adapt; Combined in part for efficiency]

12 Pipelined Filters
Commutative filters over a stream
Example: Track HTTP packets with destination address matching a prefix in a given table and content matching “*\.ida”
Simple to complex filters
– Boolean predicates
– Table lookups
– Pattern matching
– User-defined functions
[Diagram: Packets → Filter1 → Filter2 → Filter3 → Bad packets]

13 Pipelined Filters: Problem Definition
Continuous Query: F_1 ∧ F_2 ∧ … ∧ F_n
Plan: Tuples → F_σ(1) → F_σ(2) → … → F_σ(n)
Goal: Minimize expected cost to process a tuple
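The goal on this slide can be written out as a formula. This is a sketch under assumed notation that the slide does not give: σ is the permutation of the n filters that the plan applies, and c_k is the per-tuple cost of evaluating filter F_k. A filter is only charged for tuples that reach it, so each cost is weighted by the probability that no earlier filter dropped the tuple.

```latex
\mathbb{E}[\mathrm{cost}(\sigma)]
  = \sum_{i=1}^{n} c_{\sigma(i)} \cdot
    \Pr\bigl[\text{tuple is not dropped by } F_{\sigma(1)}, \dots, F_{\sigma(i-1)}\bigr]
```

For i = 1 the probability is 1, since every tuple reaches the first filter. Only when the filters are independent does the probability factor into a product of per-filter pass rates; with correlated filters it has to be treated as a joint probability.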

14 Pipelined Filters: Example
[Figure: input tuples pass through filters F1, F2, F3, F4 in sequence; tuples not dropped by any filter become output tuples]
Informal Goal: If a tuple will be dropped, then drop it as cheaply as possible

15 Why is Our Problem Hard?
Filter drop-rates and costs can change over time
Filters can be correlated
– E.g., Protocol = HTTP and DestPort = 80

16 Metrics for an Adaptive Algorithm
Speed of adaptivity
– Detecting changes and finding new plan
Run-time overhead
– Re-optimization, collecting statistics, plan switching
Convergence properties
– Plan properties under stable statistics
[Diagram: StreaMon components: Profiler, Re-optimizer, Executor]

17 Pipelined Filters: Stable Statistics
Assume statistics are not changing
– Order filters by decreasing drop-rate/cost [MS79,IK84,KBZ86,H94]
– Correlations ⇒ NP-Hard
Greedy algorithm: Use conditional statistics
– F_σ(1) has maximum drop-rate/cost ratio
– F_σ(2) has maximum drop-rate/cost ratio for tuples not dropped by F_σ(1)
– And so on
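The Greedy rule on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the StreaMon implementation. It assumes the statistics are given as a sample of tuples, where `samples[t][f]` is truthy if filter `f` would drop tuple `t`, and that `costs[f]` is a per-tuple cost. The function names are hypothetical.

```python
def greedy_order(samples, costs):
    """Order filters by conditional drop-rate/cost ratio (Greedy rule).

    samples[t][f] is truthy if filter f drops sample tuple t (assumed input format).
    costs[f] is the assumed per-tuple cost of filter f.
    Ties are broken in favour of the lowest filter index.
    """
    remaining = list(range(len(costs)))
    alive = list(samples)  # sample tuples not dropped by the filters chosen so far
    order = []
    while remaining:
        def ratio(f):
            if not alive:
                return 0.0
            drop_rate = sum(1 for s in alive if s[f]) / len(alive)
            return drop_rate / costs[f]
        best = max(remaining, key=ratio)
        order.append(best)
        remaining.remove(best)
        alive = [s for s in alive if not s[best]]
    return order


def expected_cost(order, samples, costs):
    """Average cost per sample tuple when filters run in the given order."""
    total = 0.0
    for s in samples:
        for f in order:
            total += costs[f]
            if s[f]:  # tuple dropped, later filters are not evaluated
                break
    return total / len(samples)


# Filters 0 and 1 are perfectly correlated; filter 2 drops different tuples.
samples = [(1, 1, 0)] * 3 + [(0, 0, 1)] * 2 + [(0, 0, 0)]
costs = [1, 1, 1]
print(greedy_order(samples, costs))
print(expected_cost([0, 2, 1], samples, costs), expected_cost([0, 1, 2], samples, costs))
```

In this sample, ordering by unconditional drop-rate alone would place the two correlated filters first, although the second of them drops nothing that the first has not already dropped. The conditional ratio moves filter 2 ahead of filter 1 and lowers the average cost per tuple.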

18 Adaptive Version of Greedy
Greedy gives strong guarantees
– 4-approximation, best poly-time approx. possible assuming P ≠ NP [MBM+05]
– For arbitrary (correlated) characteristics
– Usually optimal in experiments
Challenge:
– Online algorithm
– Fast adaptivity to Greedy ordering
– Low run-time overhead
⇒ A-Greedy: Adaptive Greedy

19 A-Greedy
Profiler: Maintains conditional filter drop-rates and costs over recent tuples
Executor: Processes tuples with current Greedy ordering
Re-optimizer: Ensures that filter ordering is Greedy for current statistics
[Diagram labels: Estimated statistics; Which statistics are required; Changes in filter ordering; Combined in part for efficiency]

20 A-Greedy’s Profiler
Responsible for maintaining current statistics
– Filter costs
– Conditional filter drop-rates: exponential!
Profile Window: Sampled statistics from which required conditional drop-rates can be estimated
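One way to picture a profile window is as a bounded buffer of per-filter outcomes for a random sample of recent tuples. The sketch below assumes that design; the slide does not spell out the mechanism. The class name `ProfileWindow`, the `sample_prob` parameter, and the convention that a filter is a predicate returning True when the tuple passes are all assumptions made for illustration.

```python
import random
from collections import deque


class ProfileWindow:
    """Hypothetical sketch: keep drop/pass bits for the most recent sampled tuples."""

    def __init__(self, filters, size, sample_prob, rng=None):
        self.filters = filters          # predicates: True means the tuple passes
        self.rows = deque(maxlen=size)  # oldest sampled row is evicted automatically
        self.sample_prob = sample_prob
        self.rng = rng or random.Random()

    def maybe_profile(self, tup):
        """With probability sample_prob, evaluate every filter on tup and record
        one row of bits (1 = filter drops the tuple). Returns True if sampled."""
        if self.rng.random() >= self.sample_prob:
            return False
        self.rows.append(tuple(0 if f(tup) else 1 for f in self.filters))
        return True


# Usage: two toy filters over integers, profiling every tuple.
window = ProfileWindow([lambda x: x % 2 == 0, lambda x: x > 2], size=3, sample_prob=1.0)
for value in range(1, 6):
    window.maybe_profile(value)
print(list(window.rows))
```

Storing whole rows rather than per-filter counters is what makes any conditional drop-rate recoverable later: the rate for a filter, given that certain other filters did not drop the tuple, is a count over the rows that have zeros in those columns.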

21 Profile Window
[Figure: a Profile Window with one 4-bit row per sampled tuple and one column per filter (F1, F2, F3, F4); rows shown: 0110, 0011, 1001, 1001]

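22 Greedy Ordering Using Profile Window
Profile Window (columns F1 F2 F3 F4): 1010, 0001, 1010, 0100, 0100, 0011
[Figure: the Matrix View is derived from the Profile Window step by step. Column sums for F1 F2 F3 F4 are 2 2 3 2; reordering to F3 F1 F2 F4 gives 3 2 2 2; the final Matrix View ⇒ Greedy ordering is F3 F2 F4 F1.]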

23 A-Greedy’s Re-optimizer
Maintains Matrix View over Profile Window
– Easy to incorporate filter costs
– Efficient incremental update
– Fast detection/correction of changes in Greedy order
⇒ Details in [BMM+04]: “Adaptive Processing of Pipelined Stream Filters”, SIGMOD 2004
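A small sketch can make the Matrix View concrete, using the six-row profile window shown on slide 22. It rests on two assumptions that the slides only imply: row i of the view counts, for each filter at position i or later in the current ordering, the window tuples it drops among those not dropped by the first i filters, and all filters have unit cost. The view is recomputed from scratch here for clarity; the incremental update mentioned above is not shown, and the function names are hypothetical.

```python
def matrix_view(window, order):
    """Upper-triangular counts over a profile window (assumed definition).

    window: rows of bits, window[t][f] == 1 if filter f drops sampled tuple t.
    order:  current filter ordering as a list of filter indices.
    """
    rows = []
    alive = list(window)  # tuples not dropped by the filters placed so far
    for i, f in enumerate(order):
        rows.append([sum(t[g] for t in alive) for g in order[i:]])
        alive = [t for t in alive if not t[f]]
    return rows


def is_greedy(rows):
    """With unit costs, the ordering is Greedy if each row's first entry is its maximum."""
    return all(row[0] == max(row) for row in rows)


# Profile window from slide 22; columns are F1 F2 F3 F4 (indices 0..3).
WINDOW = [tuple(int(b) for b in r)
          for r in ["1010", "0001", "1010", "0100", "0100", "0011"]]

print(matrix_view(WINDOW, [0, 1, 2, 3]), is_greedy(matrix_view(WINDOW, [0, 1, 2, 3])))
print(matrix_view(WINDOW, [2, 1, 3, 0]), is_greedy(matrix_view(WINDOW, [2, 1, 3, 0])))
```

A violation in row i means that some later filter now drops more of the surviving tuples than the filter currently at position i, which is the signal for the re-optimizer to reorder.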

24 Next
Tradeoffs and variations of A-Greedy
Experimental results for A-Greedy

25 Tradeoffs
Suppose:
– Changes are infrequent
– Slower adaptivity is okay
– Want best plans at very low run-time overhead
Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties
Spectrum of A-Greedy variants

26 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
[Figure: Profile Window and the Matrix View derived from it]

27 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
Sweep | 4-approx. | Less work per sampling step | Slow
Local-Swaps | May get caught in local optima | Less work per sampling step | Slow
Independent | Misses correlations | Lower sampling rate | Fast

28 Experimental Setup
Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon
Studied convergence properties, run-time overhead, and adaptivity
Synthetic testbed
– Can control stream data and arrival properties
DSMS server running on 700 MHz Linux machine, 1 MB L2 cache, 2 GB memory

29 Converged Processing Rate
[Plot legend: Optimal-Fixed, Sweep, A-Greedy, Independent, Local-Swaps]

30 Effect of Filter Drop-Rate
[Plot legend: Optimal-Fixed, Sweep, A-Greedy, Independent, Local-Swaps]

31 Effect of Correlation
[Plot legend: Optimal-Fixed, Sweep, A-Greedy, Independent, Local-Swaps]

32 Run-time Overhead

33 Adaptivity
[Plot: x-axis is progress of time (x1000 tuples processed); annotation: “Permute selectivities here”]

34 Roadmap
StreaMon: Our adaptive processing engine
Adaptive ordering of commutative filters
Adaptive caching for multiway joins
Current and future work

35 Stream Joins
[Diagram labels: Sensor R, Sensor S, Sensor T, DSMS, observations in the last minute, join results]

36 MJoins [VNB04]
[Diagram: windows on R, S, and T, with one pipeline of join operators per stream: ⋈R ⋈T, ⋈S ⋈T, and ⋈S ⋈R]

37 Excessive Recomputation in MJoins
[Diagram labels: Window on R, Window on S, Window on T, pipeline ⋈R ⋈T]

38 Materializing Join Subexpressions
[Diagram labels: Window on R, Window on S, Window on T, ⋈, Fully-materialized join subexpression]

39 Tree Joins: Trees of Binary Joins
[Diagram labels: R, S, T, ⋈, Fully-materialized join subexpression, Window on R, Window on T, Window on S]

40 Hard State Hinders Adaptivity
[Diagram: a plan switch between two tree-join plans over R, S, and T; labels on the first plan include W_R and W_T, labels on the second include W_S and W_T]

41 Can we get best of both worlds?
MJoin drawback: Recomputation
Tree Join drawbacks: Less adaptive; Higher memory use
[Diagram: an MJoin plan and a Tree Join plan over R, S, and T, side by side]

42 MJoins + Caches
[Diagram: windows on R, S, and T; an S tuple probes a cache placed before the pipeline ⋈R ⋈T; labels: W_R, W_T, Cache, Probe, Bypass pipeline segment]

43 MJoins + Caches (contd.)
Caches are soft state
– Adaptive
– Flexible with respect to memory usage
Captures whole spectrum from MJoins to Tree Joins and plans in between
Challenge: adaptive algorithm to choose join operator orders and caches in pipelines
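The cache idea can be sketched for a three-way equijoin on a single shared key. Everything in the sketch is an assumption made for illustration, not the design from [BMW+05]: windows are modelled as insert-only dictionaries from key to values, the cache maps a key to the joined (R, T) pairs for that key, and consistency is kept by simply invalidating a key's cache entry whenever either window changes for that key.

```python
from collections import defaultdict


class CachedPipeline:
    """Hypothetical sketch of the S-tuple pipeline (join with R, then T) with a cache."""

    def __init__(self):
        self.win_r = defaultdict(list)  # window on R: key -> R values
        self.win_t = defaultdict(list)  # window on T: key -> T values
        self.cache = {}                 # soft state: key -> list of (r, t) pairs
        self.hits = 0

    def insert_r(self, key, value):
        self.win_r[key].append(value)
        self.cache.pop(key, None)       # invalidate a possibly stale entry

    def insert_t(self, key, value):
        self.win_t[key].append(value)
        self.cache.pop(key, None)

    def process_s(self, key, value):
        """Join one S tuple with the R and T windows."""
        if key in self.cache:
            self.hits += 1              # cache hit: bypass the pipeline segment
            pairs = self.cache[key]
        else:
            pairs = [(r, t)
                     for r in self.win_r.get(key, [])
                     for t in self.win_t.get(key, [])]
            self.cache[key] = pairs
        return [(r, value, t) for r, t in pairs]

    def drop_cache(self):
        """Caches are soft state: discarding them never changes query results."""
        self.cache.clear()


p = CachedPipeline()
p.insert_r(1, "r1")
p.insert_t(1, "t1")
print(p.process_s(1, "s1"), p.process_s(1, "s2"), p.hits)
```

Because `drop_cache` can be called at any time without affecting results, memory can be reclaimed or the join order changed without the state migration that a fully materialized subexpression would need.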

44 Adaptive Caching (A-Caching)
Adaptive join ordering with A-Greedy or variant
– Join operator orders ⇒ candidate caches
Adaptive selection from candidate caches
Adaptive memory allocation to chosen caches

45 A-Caching (caching part only)
Profiler: Estimates costs and benefits of candidate caches
Executor: MJoins with caches
Re-optimizer: Ensures that maximum-benefit subset of candidate caches is used
[Diagram labels: List of candidate caches; Estimated statistics; Add/remove caches; Combined in part for efficiency]
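To give a feel for what choosing a subset of candidate caches might involve, here is a deliberately simple stand-in. It is not the selection algorithm of A-Caching, which the slides do not describe. It is a plain knapsack-style greedy heuristic under assumed inputs: each candidate carries an estimated benefit, an estimated maintenance cost, and a memory requirement, and there is a single memory budget. The function name and tuple layout are hypothetical.

```python
def select_caches(candidates, memory_budget):
    """Greedy heuristic (illustrative only): pick caches by net benefit per memory unit.

    candidates: list of (name, benefit, cost, memory) tuples with memory > 0.
    Returns the names of the chosen caches in the order they were picked.
    """
    # Only caches whose estimated benefit exceeds their cost are worth keeping.
    useful = [c for c in candidates if c[1] > c[2] and c[3] > 0]
    useful.sort(key=lambda c: (c[1] - c[2]) / c[3], reverse=True)

    chosen, used = [], 0
    for name, benefit, cost, memory in useful:
        if used + memory <= memory_budget:
            chosen.append(name)
            used += memory
    return chosen


candidates = [("RT", 10, 2, 4), ("RS", 5, 6, 1), ("ST", 6, 1, 5)]
print(select_caches(candidates, memory_budget=6))
print(select_caches(candidates, memory_budget=9))
```

Re-running such a selection whenever the profiler's estimates change is what would make the choice adaptive; since caches are soft state, adding or removing one is cheap.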

46 Performance of Stream-Join Plans (1)
Arrival rates of streams are in the ratio 1:1:1:10; other details of input are given in [BMW+05]
[Diagram: join plan over streams R, T, S, and U]

47 Performance of Stream-Join Plans (2)
Arrival rates of streams are in the ratio 15:10:5:1; other details of input are given in [BMW+05]

48 A-Caching: Results at a glance
Capture whole spectrum from fully-pipelined MJoins to tree-based joins adaptively
Approximation algorithms ⇒ scalable
Different types of caches
Up to 7x improvement with respect to MJoin and 2x improvement with respect to TreeJoin
Details in [BMW+05]: “Adaptive Caching for Continuous Queries”, ICDE 2005 (To appear)

49 Current and Future Work
Broadening StreaMon’s scope, e.g.,
– Shared computation among multiple queries
– Parallelism
Rio: Adaptive query processing in conventional database systems
Plan logging: A new overall approach to address certain “meta issues” in adaptive processing

50 Related Work
Adaptive processing of continuous queries
– E.g., Eddies [AH00], NiagaraCQ [CDT+00]
Adaptive processing in conventional databases
– Inter-query adaptivity, e.g., Leo [SLM+01], [BC03]
– Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04]
New approaches to query optimization
– E.g., parametric [GW89,INS+92,HS03], expected-cost based [CHS99,CHG02], error-aware [VN03]

51 Summary
New trends demand adaptive query processing
– New applications, e.g., continuous queries, data streams
– Increasing data size and query complexity
CS-wide push towards autonomic computing
Our goal: Adaptive Data Management System
– StreaMon: Adaptive Data Stream Engine
– Rio: Adaptive Processing in Conventional DBMS
Google keywords: shivnath, stanford stream

52 Performance of Stream-Join Plans

53 Adaptivity to Memory Availability

54 Plan Logging
Log the profiling and re-optimization history
– Query is long-running
– Example view over log for R ⋈ S ⋈ T
Rate(R) | … | σ(R,S) | Plan | Cost
1024 | … | 0.75 | P1 | 12762
5642 | … | 0.72 | P2 | 72332
934 | … | 0.76 | P1 | 12003
Plans lying in a high-dimensional space of statistics
[Figure: plans P1 and P2 plotted against Rate(R) and σ(R,S)]

55 Uses of Plan Logging
Reducing re-optimization overhead
– Create a cache of plans
Reducing profiling overhead
– Track how changes in a statistic contribute to changes in best plan
[Figure: plans P1 and P2 plotted against Rate(R) and σ(R,S)]
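The "cache of plans" use can be illustrated with a lookup over the logged entries. This is a guess at one possible mechanism, not the method from the talk: it assumes each log entry records a statistics vector, the plan chosen, and its cost, and that a logged plan may be reused when every current statistic is within a relative tolerance of the logged value. Only the two statistics visible in the example log on slide 54 are used; the elided columns are left out. The function name and the tolerance parameter are hypothetical.

```python
def lookup_plan(log, stats, rel_tol=0.1):
    """Return the plan of the first logged entry whose statistics are all within
    rel_tol (relative) of the current statistics, or None if there is no match."""
    for logged_stats, plan, _cost in log:
        if all(abs(cur - old) <= rel_tol * abs(old)
               for cur, old in zip(stats, logged_stats)):
            return plan
    return None  # no nearby entry: fall back to full re-optimization


# (Rate(R), selectivity(R,S)) -> plan, cost; values copied from the example log.
plan_log = [
    ((1024, 0.75), "P1", 12762),
    ((5642, 0.72), "P2", 72332),
    ((934, 0.76), "P1", 12003),
]

print(lookup_plan(plan_log, (1000, 0.74)))
print(lookup_plan(plan_log, (5500, 0.70)))
print(lookup_plan(plan_log, (3000, 0.50)))
```

A hit avoids invoking the re-optimizer at all; a miss costs one scan of the log before falling back to the normal path.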

56 Uses of Plan Logging (contd.)
Tracking “Return on Investment” for adaptive processing
– Track cost versus benefit of adaptivity
– Is there a single plan that would have good overall performance?
Avoiding thrashing
– Which statistics have transient changes?

57 Adaptive Processing in Traditional DBMS
Optimizer: Finds “best” query plan to process this query
Executor: Runs chosen plan to completion
Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms
[Diagram labels: Query; Chosen query plan; Which statistics are required; Estimated statistics; Errors]

58 Proactive Re-optimization with Rio
Traditional components:
– Optimizer: Finds “best” query plan to process this query
– Executor: Runs chosen plan to completion
– Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms
Rio components:
– (Re-)optimizer: Considers pairs of (estimate, uncertainty) during optimization
– Executor: Executes current plan
– Stats. Mgr. + Profiler: Collects statistics at run-time based on random samples
[Diagram labels: Query; Which statistics are required; Estimates + uncertainty; “Robust” plans; Combined for efficiency]

