Slide 1: Adaptive Processing in Data Stream Systems
Shivnath Babu, Stanford University (Stanford Stream Data Manager project)
Slide 2: Data Streams
- New applications: data arrives as continuous, rapid, time-varying data streams
  - Sensor networks, RFID tags
  - Network monitoring and traffic engineering
  - Financial applications
  - Telecom call records
  - Web logs and click-streams
  - Manufacturing processes
- Traditional databases: data stored in finite, persistent data sets
Slide 3: Using a Traditional Database
[Diagram: a loader populates Tables R and S; the user/application issues queries against the database system and receives results.]
Slide 4: New Approach for Data Streams
[Diagram: the user/application registers a continuous query with a stream query processor, which produces results continuously from the input streams.]
Slide 5: Example Continuous Queries
- Web: Amazon's best sellers over the last hour
- Network intrusion detection: track HTTP packets whose destination address matches a prefix in a given table and whose content matches "*\.ida"
- Finance: monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes
Slide 6: Data Stream Management System (DSMS)
[Diagram: input streams and stored tables feed the DSMS; a registered continuous query produces both streamed and stored results, with an archive for history.]
Slide 7: Primer on Database Query Processing
[Diagram: a declarative query is preprocessed into canonical form, query optimization chooses the best execution plan, and query execution runs that plan over the data to produce results.]
Slide 8: Traditional Query Optimization
- Statistics manager: periodically collects statistics, e.g., data sizes and histograms
- Optimizer: requests the statistics it needs, receives estimates, and finds the "best" plan to process the query
- Executor: runs the chosen plan to completion over the data, auxiliary structures, and statistics
Slide 9: Optimizing Continuous Queries Is Challenging
- Continuous queries are long-running
- Stream properties can change while the query runs
  - Data properties: value distributions
  - Arrival properties: bursts, delays
- System conditions can change
- The performance of a fixed plan can change significantly over time
- Adaptive processing: use the plan that is best for current conditions
Slide 10: Roadmap
- StreaMon: our adaptive query processing engine
- Adaptive ordering of commutative filters
- Adaptive caching for multiway joins
- Current and future work
  - Similar techniques apply to conventional databases
Slide 11: Traditional Optimization vs. StreaMon
- Traditional: the statistics manager periodically collects statistics (e.g., table sizes, histograms); the optimizer finds the "best" plan; the executor runs it to completion
- StreaMon: a profiler monitors current stream and system characteristics; a re-optimizer ensures the plan stays efficient for those characteristics and issues decisions to adapt; the executor runs the current plan on incoming stream tuples
- Profiler, re-optimizer, and executor are combined in part for efficiency
Slide 12: Pipelined Filters
- Commutative filters over a stream
- Example: track HTTP packets whose destination address matches a prefix in a given table and whose content matches "*\.ida"
- Filters range from simple to complex:
  - Boolean predicates
  - Table lookups
  - Pattern matching
  - User-defined functions
[Diagram: packets flow through Filter1, Filter2, and Filter3 in a pipeline; bad packets are dropped along the way.]
Slide 13: Pipelined Filters: Problem Definition
- Continuous query: F_1 ∧ F_2 ∧ ... ∧ F_n
- Plan: tuples flow through the filters in some order F_(1), F_(2), ..., F_(n)
- Goal: minimize the expected cost to process a tuple
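The objective on this slide can be written out explicitly. Writing c_(i) for the cost of the i-th filter in the ordering and d_(j) for its conditional drop-rate (notation mine, following the slide's ordering convention), filter F_(i) is evaluated only when the tuple survives all earlier filters:

```latex
% Expected per-tuple cost of the ordering F_{(1)}, \dots, F_{(n)}:
% d_{(j)} is conditioned on the tuple having passed F_{(1)}, \dots, F_{(j-1)}.
E[\text{cost}] \;=\; \sum_{i=1}^{n} c_{(i)} \prod_{j=1}^{i-1} \bigl(1 - d_{(j)}\bigr)
```

Because each d_(j) is conditioned on the earlier filters, correlated filters change these terms, which is exactly what makes the ordering problem hard (see Slide 15).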
Slide 14: Pipelined Filters: Example
[Diagram: input tuples pass through F1, F2, F3, F4; most tuples are dropped along the way and only one reaches the output.]
- Informal goal: if a tuple will be dropped, drop it as cheaply as possible
Slide 15: Why Is Our Problem Hard?
- Filter drop-rates and costs can change over time
- Filters can be correlated, e.g., Protocol = HTTP and DestPort = 80
Slide 16: Metrics for an Adaptive Algorithm
- Speed of adaptivity: detecting changes and finding a new plan
- Run-time overhead: re-optimization, collecting statistics, plan switching
- Convergence properties: plan properties under stable statistics
Slide 17: Pipelined Filters: Stable Statistics
- Assume statistics are not changing
  - Order filters by decreasing drop-rate/cost [MS79, IK84, KBZ86, H94]
  - With correlations, the problem is NP-hard
- Greedy algorithm: use conditional statistics
  - F_(1) has the maximum drop-rate/cost ratio
  - F_(2) has the maximum drop-rate/cost ratio for tuples not dropped by F_(1)
  - And so on
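The greedy conditional ordering above can be sketched in a few lines. This is a minimal illustration, not StreaMon's implementation; the filter names, costs, and sample data in the usage below are assumptions:

```python
# Greedy ordering of commutative filters by conditional drop-rate/cost.
def greedy_order(filters, costs, sample):
    """filters: dict name -> predicate (True = tuple passes);
    costs: dict name -> per-tuple evaluation cost;
    sample: tuples used to estimate conditional drop-rates."""
    remaining = dict(filters)
    survivors = list(sample)
    order = []
    while remaining:
        # Score each unplaced filter by drop-rate/cost measured only on
        # tuples that survived every previously chosen filter.
        def score(name):
            pred = remaining[name]
            drops = sum(1 for t in survivors if not pred(t))
            return (drops / max(len(survivors), 1)) / costs[name]
        best = max(remaining, key=score)
        order.append(best)
        survivors = [t for t in survivors if remaining[best](t)]
        del remaining[best]
    return order
```

For example, with two unit-cost filters over the integers 0..99, one dropping 90% of tuples and one dropping 50%, the stronger filter is placed first.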
Slide 18: Adaptive Version of Greedy
- Greedy gives strong guarantees
  - 4-approximation; the best polynomial-time approximation possible assuming P ≠ NP [MBM+05]
  - Holds for arbitrary (correlated) characteristics
  - Usually optimal in experiments
- Challenge:
  - Online algorithm
  - Fast adaptivity to the Greedy ordering
  - Low run-time overhead
- A-Greedy: Adaptive Greedy
Slide 19: A-Greedy
- Profiler: maintains conditional filter drop-rates and costs over recent tuples
- Re-optimizer: ensures the filter ordering is Greedy for current statistics, issuing changes in filter ordering
- Executor: processes tuples with the current Greedy ordering
- The components are combined in part for efficiency
Slide 20: A-Greedy's Profiler
- Responsible for maintaining current statistics
  - Filter costs
  - Conditional filter drop-rates: exponentially many combinations to track directly
- Profile window: sampled statistics from which the required conditional drop-rates can be estimated
Slide 21: Profile Window
[Diagram: dropped tuples are sampled into a profile window of bit-vectors over F1-F4 (e.g., 0110, 0011, 1001), recording which filters would drop each sampled tuple.]
Slide 22: Greedy Ordering Using the Profile Window
[Diagram: the profile window's bit-vectors are summarized as a matrix view of drop counts per filter, from which the Greedy ordering (here F3, F2, F4, F1) is derived.]
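The profile-window idea on Slides 21-22 can be sketched as follows: keep a sliding window of sampled drop bit-vectors, then derive the Greedy ordering by repeatedly picking the filter with the most drops among tuples that passed all filters placed so far. Window size and unit filter costs are assumptions here; A-Greedy's matrix view additionally maintains these counts incrementally:

```python
from collections import deque

class ProfileWindow:
    """Window of sampled drop bit-vectors (1 = filter would drop the tuple)."""

    def __init__(self, size=1000):
        self.rows = deque(maxlen=size)  # oldest samples fall off automatically

    def add(self, bits):
        """bits: tuple of 0/1 flags, one entry per filter."""
        self.rows.append(bits)

    def greedy_order(self):
        n = len(self.rows[0])
        rows = list(self.rows)
        order, remaining = [], set(range(n))
        while remaining:
            # Conditional drop count: drops among tuples that passed
            # every filter already placed in the ordering.
            best = max(remaining, key=lambda f: sum(r[f] for r in rows))
            order.append(best)
            rows = [r for r in rows if not r[best]]  # keep survivors only
            remaining.remove(best)
        return order
```

Recomputing the ordering from scratch like this is simple but costly; the point of the matrix view in the actual system is to update the same counts incrementally.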
Slide 23: A-Greedy's Re-optimizer
- Maintains the matrix view over the profile window
  - Easy to incorporate filter costs
  - Efficient incremental update
  - Fast detection and correction of changes in the Greedy order
- Details in [BMM+04]: "Adaptive Processing of Pipelined Stream Filters", SIGMOD 2004
Slide 24: Next
- Tradeoffs and variations of A-Greedy
- Experimental results for A-Greedy
Slide 25: Tradeoffs
- Suppose:
  - Changes are infrequent
  - Slower adaptivity is okay
  - We want the best plans at very low run-time overhead
- Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties
- A spectrum of A-Greedy variants
Slide 26: Variants of A-Greedy
  Algorithm   Convergence properties   Run-time overhead           Adaptivity
  A-Greedy    4-approx.                High (relative to others)   Fast
[Diagram: the profile window and its matrix view, as maintained by A-Greedy.]
Slide 27: Variants of A-Greedy (contd.)
  Algorithm     Convergence properties           Run-time overhead            Adaptivity
  A-Greedy      4-approx.                        High (relative to others)    Fast
  Sweep         4-approx.                        Less work per sampling step  Slow
  Local-Swaps   May get caught in local optima   Less work per sampling step  Slow
  Independent   Misses correlations              Lower sampling rate          Fast
Slide 28: Experimental Setup
- Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon
- Studied convergence properties, run-time overhead, and adaptivity
- Synthetic testbed: can control stream data and arrival properties
- DSMS server running on a 700 MHz Linux machine with 1 MB L2 cache and 2 GB memory
Slide 29: Converged Processing Rate
[Chart comparing Optimal-Fixed, Sweep, A-Greedy, Independent, and Local-Swaps.]

Slide 30: Effect of Filter Drop-Rate
[Chart comparing the same five algorithms.]

Slide 31: Effect of Correlation
[Chart comparing the same five algorithms.]

Slide 32: Run-time Overhead
[Chart.]

Slide 33: Adaptivity
[Chart: performance over time (x1000 tuples processed); filter selectivities are permuted mid-run to test adaptivity.]
Slide 34: Roadmap
- StreaMon: our adaptive processing engine
- Adaptive ordering of commutative filters
- Adaptive caching for multiway joins
- Current and future work
Slide 35: Stream Joins
[Diagram: sensors R, S, and T stream observations to the DSMS, which joins the observations in the last minute to produce join results.]
Slide 36: MJoins [VNB04]
[Diagram: windows on R, S, and T; each new tuple probes the other streams' windows in a per-stream pipeline, e.g., an R tuple probes ⋈S then ⋈T, while an S or T tuple follows its own probe order.]
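The MJoin processing pattern on this slide can be sketched as follows: a new tuple is inserted into its own stream's window, then probes the other windows in that stream's probe order, dropping out as soon as a probe finds no matches. This is a simplified illustration assuming a single-attribute equality join; function and variable names are mine:

```python
# Sketch of MJoin-style tuple processing (equality join on one key).
def mjoin_insert(windows, probe_orders, stream, tuple_, key):
    """windows: dict stream -> list of (key, value) pairs currently in window;
    probe_orders: dict stream -> list of other streams to probe, in order."""
    windows[stream].append((key, tuple_))
    partials = [(key, (tuple_,))]
    for other in probe_orders[stream]:
        # Extend each partial result with every matching tuple in the
        # probed window.
        partials = [
            (k, acc + (val,))
            for k, acc in partials
            for k2, val in windows[other]
            if k2 == k
        ]
        if not partials:          # no match: drop as early as possible
            return []
    return [acc for _, acc in partials]
```

Note that the probe order per stream is itself an ordering problem, which is why the filter-ordering techniques from the first half of the talk carry over. (Window expiration is omitted here.)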
Slide 37: Excessive Recomputation in MJoins
[Diagram: successive S tuples probing ⋈R and ⋈T recompute the same intermediate join results over the windows on R and T.]
Slide 38: Materializing Join Subexpressions
[Diagram: windows on R, S, and T with a fully-materialized join subexpression that avoids the recomputation.]
Slide 39: Tree Joins: Trees of Binary Joins
[Diagram: a tree of binary joins over the windows on R, S, and T; each internal join node is a fully-materialized join subexpression.]
Slide 40: Hard State Hinders Adaptivity
[Diagram: a plan switch between two tree-join shapes, one materializing W_R ⋈ W_T and the other W_S ⋈ W_T, requires rebuilding the materialized (hard) state.]
Slide 41: Can We Get the Best of Both Worlds?
- MJoin drawback: recomputation
- Tree-join drawbacks: less adaptive; higher memory use
[Diagram: the MJoin pipelines and the corresponding tree join over R, S, and T.]
Slide 42: MJoins + Caches
[Diagram: an S tuple first probes a cache of W_R ⋈ W_T; on a hit it bypasses the pipeline segment that would probe the windows on R and T.]
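The probe/bypass idea on this slide can be sketched directly: on a cache hit the probing tuple skips the pipeline segment that would recompute the subresult; on a miss the subresult is computed, cached as soft state, and then used. Names and the single-key equality join are illustrative assumptions:

```python
# Sketch of an MJoin pipeline segment with a cache of the R-join-T subresult.
def probe_with_cache(s_tuple, key, cache, window_r, window_t):
    """window_r, window_t: lists of (key, value) pairs; cache: dict of
    key -> list of (r, t) matches, i.e. cached W_R join W_T subresults."""
    if key in cache:                      # hit: bypass the pipeline segment
        rt_matches = cache[key]
    else:                                 # miss: recompute, then cache
        rt_matches = [
            (r, t)
            for k_r, r in window_r if k_r == key
            for k_t, t in window_t if k_t == key
        ]
        cache[key] = rt_matches           # soft state: safe to evict anytime
    return [(r, s_tuple, t) for r, t in rt_matches]
```

Because the cache is soft state, evicting an entry only costs recomputation later; correctness never depends on it, which is what makes plan switches cheap. (Invalidation on window updates is omitted here.)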
Slide 43: MJoins + Caches (contd.)
- Caches are soft state
  - Adaptive
  - Flexible with respect to memory usage
- Captures the whole spectrum from MJoins to tree joins, and plans in between
- Challenge: an adaptive algorithm to choose join operator orders and caches in pipelines
Slide 44: Adaptive Caching (A-Caching)
- Adaptive join ordering with A-Greedy or a variant
  - The join operator orders determine the candidate caches
- Adaptive selection from the candidate caches
- Adaptive memory allocation to the chosen caches
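The cache-selection step can be illustrated with a simple heuristic: rank the profiler's candidate caches by estimated benefit per unit of memory and take them greedily until the memory budget is exhausted. This is a minimal sketch under assumed profiler numbers, not A-Caching's actual policy, which is described in [BMW+05]:

```python
# Greedy benefit-per-memory selection of candidate caches under a budget.
def choose_caches(candidates, budget):
    """candidates: list of (name, estimated_benefit, memory_needed)
    as reported by the profiler; budget: total memory available."""
    chosen, used = [], 0
    for name, benefit, memory in sorted(
            candidates, key=lambda c: c[1] / c[2], reverse=True):
        # Skip caches that no longer pay for themselves or do not fit.
        if benefit > 0 and used + memory <= budget:
            chosen.append(name)
            used += memory
    return chosen
```

Re-running this selection as the profiler's estimates drift is what lets the system add and remove caches adaptively, since caches are soft state and can be dropped at any time.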
Slide 45: A-Caching (caching part only)
- Profiler: estimates costs and benefits of the candidate caches
- Re-optimizer: ensures the maximum-benefit subset of candidate caches is used, adding and removing caches
- Executor: runs the MJoins with caches
- The components are combined in part for efficiency
Slide 46: Performance of Stream-Join Plans (1)
[Chart: a four-way join of R, S, T, and U; arrival rates of the streams are in the ratio 1:1:1:10. Other input details are given in [BMW+05].]
Slide 47: Performance of Stream-Join Plans (2)
[Chart: arrival rates of the streams are in the ratio 15:10:5:1. Other input details are given in [BMW+05].]
Slide 48: A-Caching: Results at a Glance
- Adaptively captures the whole spectrum from fully-pipelined MJoins to tree-based joins
- Scalable approximation algorithms
- Different types of caches
- Up to 7x improvement over MJoin and 2x over tree join
- Details in [BMW+05]: "Adaptive Caching for Continuous Queries", ICDE 2005 (to appear)
Slide 49: Current and Future Work
- Broadening StreaMon's scope, e.g.:
  - Shared computation among multiple queries
  - Parallelism
- Rio: adaptive query processing in conventional database systems
- Plan logging: a new overall approach to certain "meta issues" in adaptive processing
Slide 50: Related Work
- Adaptive processing of continuous queries, e.g., Eddies [AH00], NiagaraCQ [CDT+00]
- Adaptive processing in conventional databases
  - Inter-query adaptivity, e.g., Leo [SLM+01], [BC03]
  - Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04]
- New approaches to query optimization, e.g., parametric [GW89, INS+92, HS03], expected-cost based [CHS99, CHG02], error-aware [VN03]
Slide 51: Summary
- New trends demand adaptive query processing
  - New applications, e.g., continuous queries over data streams
  - Increasing data size and query complexity
  - A CS-wide push towards autonomic computing
- Our goal: an adaptive data management system
  - StreaMon: adaptive data stream engine
  - Rio: adaptive processing in a conventional DBMS
- Google keywords: shivnath, stanford stream
Slide 52: Performance of Stream-Join Plans
[Chart.]

Slide 53: Adaptivity to Memory Availability
[Chart.]
Slide 54: Plan Logging
- Log the profiling and re-optimization history (the query is long-running)
- Example view over the log for R ⋈ S ⋈ T:

  Rate(R)   ...   (R,S)   Plan   Cost
  1024      ...   0.75    P1     12762
  5642      ...   0.72    P2     72332
  934       ...   0.76    P1     12003

[Diagram: plans P1 and P2 occupy regions in a high-dimensional space of statistics, e.g., Rate(R) versus the (R,S) statistic.]
Slide 55: Uses of Plan Logging
- Reducing re-optimization overhead: create a cache of plans
- Reducing profiling overhead: track how changes in a statistic contribute to changes in the best plan
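The plan-cache idea above can be sketched as a nearest-neighbor lookup in the statistics space of Slide 54: if the current statistics fall close enough to a logged point, reuse the plan that was chosen there instead of re-optimizing. The distance metric, the threshold, and the normalized statistics vectors are all assumptions of this sketch:

```python
# Sketch of a plan cache over a plan log keyed by statistics vectors.
def lookup_plan(plan_log, stats, threshold=0.1):
    """plan_log: list of (stats_vector, plan) entries from the log;
    stats: current (normalized) statistics vector.
    Returns a cached plan, or None to signal a full re-optimization."""
    best_plan, best_dist = None, threshold
    for logged_stats, plan in plan_log:
        # Chebyshev distance: worst single-statistic deviation.
        dist = max(abs(a - b) for a, b in zip(logged_stats, stats))
        if dist <= best_dist:
            best_plan, best_dist = plan, dist
    return best_plan
```

A lookup that returns None falls back to the optimizer, and the resulting (statistics, plan) pair is appended to the log, so the cache grows exactly where the statistics actually wander.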
Slide 56: Uses of Plan Logging (contd.)
- Tracking "return on investment" for adaptive processing
  - Track the cost versus the benefit of adaptivity
  - Is there a single plan that would have good overall performance?
- Avoiding thrashing: which statistics have only transient changes?
Slide 57: Adaptive Processing in a Traditional DBMS
- Statistics manager: periodically collects statistics, e.g., data sizes and histograms
- Optimizer: finds the "best" plan from estimated statistics, which may contain errors
- Executor: runs the chosen plan to completion
Slide 58: Proactive Re-optimization with Rio
- Statistics manager + profiler: collects statistics at run-time based on random samples
- (Re-)optimizer: considers (estimate, uncertainty) pairs during optimization and produces "robust" plans
- Executor: executes the current plan
- The components are combined for efficiency