Adaptive Processing in Data Stream Systems
Shivnath Babu, Stanford University
Data Streams
New applications: data arrives as continuous, rapid, time-varying data streams
– Sensor networks, RFID tags
– Network monitoring and traffic engineering
– Financial applications
– Telecom call records
– Web logs and click-streams
– Manufacturing processes
Traditional databases: data is stored in finite, persistent data sets
Using a Traditional Database
[Diagram: a user/application loads data into stored tables R and S and issues one query at a time, receiving one result per query.]
New Approach for Data Streams
[Diagram: a user/application registers a continuous query with a stream query processor, which continuously produces results over the input streams.]
Example Continuous Queries
– Web: Amazon's best sellers over the last hour
– Network intrusion detection: track HTTP packets with a destination address matching a prefix in a given table and content matching "*\.ida"
– Finance: monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes
Data Stream Management System (DSMS)
[Diagram: a DSMS takes input streams and registered continuous queries, produces streamed and stored results, and can archive data and use stored tables.]
Primer on Database Query Processing
[Diagram: a declarative query is preprocessed into a canonical form, query optimization chooses the best query execution plan, and query execution runs that plan over the data to produce results.]
Traditional Query Optimization
[Diagram: the Optimizer finds the "best" plan for a query using estimated statistics from the Statistics Manager, which periodically collects statistics such as data sizes and histograms; the Executor runs the chosen plan to completion over the data, auxiliary structures, and statistics.]
Optimizing Continuous Queries is Challenging
– Continuous queries are long-running
– Stream properties can change while the query runs
  – Data properties: value distributions
  – Arrival properties: bursts, delays
– System conditions can change
– Performance of a fixed plan can change significantly over time
Adaptive processing: use the plan that is best for current conditions
Roadmap
– StreaMon: our adaptive query processing engine
– Adaptive ordering of commutative filters
– Adaptive caching for multiway joins
– Current and future work
  – Similar techniques apply to conventional databases
Traditional Optimization vs. StreaMon
[Diagram: traditional optimization pairs an Optimizer (which finds the "best" plan using estimated statistics from the Statistics Manager) with an Executor that runs the chosen plan to completion. StreaMon instead runs a Profiler that monitors current stream and system characteristics, a Re-optimizer that ensures the plan stays efficient for those characteristics and issues decisions to adapt, and an Executor that executes the current plan on incoming stream tuples; the components are combined in part for efficiency.]
Pipelined Filters
Commutative filters over a stream.
Example: track HTTP packets with a destination address matching a prefix in a given table and content matching "*\.ida"
Filters range from simple to complex:
– Boolean predicates
– Table lookups
– Pattern matching
– User-defined functions
[Diagram: packets flow through Filter1, Filter2, Filter3; bad packets are dropped along the way.]
Pipelined Filters: Problem Definition
Continuous query: F_1 ∧ F_2 ∧ … ∧ F_n
Plan: tuples flow through an ordering F_(1), F_(2), …, F_(n) of the filters
Goal: minimize the expected cost to process a tuple
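To make the objective concrete, here is a minimal sketch (Python, with illustrative names not taken from the talk) that computes the expected per-tuple cost of one ordering, assuming we are given each filter's cost and its drop-rate conditioned on the tuple having survived the earlier filters.

```python
# Minimal sketch: expected per-tuple cost of a filter ordering.
# costs[f] is the cost of evaluating filter f; cond_drop(f, prefix) is the
# probability that f drops a tuple that survived every filter in `prefix`.

def expected_cost(ordering, costs, cond_drop):
    total = 0.0
    p_reach = 1.0  # probability that a tuple reaches the next filter
    prefix = []
    for f in ordering:
        total += p_reach * costs[f]                   # pay for f only if the tuple gets there
        p_reach *= 1.0 - cond_drop(f, tuple(prefix))  # only survivors continue
        prefix.append(f)
    return total

# Toy example with two independent filters: putting the cheap filter first
# costs 5.0 in expectation, putting the expensive filter first costs 5.1.
drop = {0: 0.2, 1: 0.9}
costs = {0: 1.0, 1: 5.0}
print(expected_cost([0, 1], costs, lambda f, prefix: drop[f]))  # 1.0 + 0.8 * 5.0 = 5.0
print(expected_cost([1, 0], costs, lambda f, prefix: drop[f]))  # 5.0 + 0.1 * 1.0 = 5.1
```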
Pipelined Filters: Example
[Diagram: input tuples flow through filters F1, F2, F3, F4 to the output; most tuples are dropped along the way.]
Informal goal: if a tuple will be dropped, then drop it as cheaply as possible.
Why is Our Problem Hard?
– Filter drop-rates and costs can change over time
– Filters can be correlated, e.g., Protocol = HTTP and DestPort = 80
Metrics for an Adaptive Algorithm
– Speed of adaptivity: detecting changes and finding the new plan
– Run-time overhead: re-optimization, collecting statistics, plan switching
– Convergence properties: plan properties under stable statistics
[Diagram: StreaMon's Profiler, Re-optimizer, and Executor each contribute to these metrics.]
Pipelined Filters: Stable Statistics
Assume statistics are not changing:
– For independent filters, order them by decreasing drop-rate/cost ratio [MS79, IK84, KBZ86, H94]
– With correlated filters, finding the optimal ordering is NP-hard
Greedy algorithm using conditional statistics:
– F_(1) has the maximum drop-rate/cost ratio
– F_(2) has the maximum drop-rate/cost ratio over tuples not dropped by F_(1)
– And so on
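A minimal sketch of this greedy rule, assuming a helper cond_drop_rate(f, prefix) that returns filter f's drop-rate over tuples not dropped by the already-ordered prefix (the names are illustrative, not StreaMon's API):

```python
def greedy_ordering(filters, costs, cond_drop_rate):
    """Greedy rule: repeatedly pick the remaining filter with the highest
    conditional drop-rate/cost ratio over tuples that survive the chosen prefix."""
    remaining = list(filters)
    ordering = []
    while remaining:
        best = max(remaining, key=lambda f: cond_drop_rate(f, ordering) / costs[f])
        ordering.append(best)
        remaining.remove(best)
    return ordering
```

Under stable statistics this reproduces the Greedy ordering described above; the challenge addressed next is keeping the ordering consistent with this rule as the conditional statistics drift.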
Adaptive Version of Greedy
Greedy gives strong guarantees:
– 4-approximation, the best polynomial-time approximation possible assuming P ≠ NP [MBM+05]
– Holds for arbitrary (correlated) characteristics
– Usually optimal in experiments
Challenge:
– Online algorithm
– Fast adaptivity to the Greedy ordering
– Low run-time overhead
A-Greedy: Adaptive Greedy
A-Greedy
[Diagram: the Profiler maintains conditional filter drop-rates and costs over recent tuples; the Re-optimizer ensures that the filter ordering is Greedy for the current statistics and pushes changes in the filter ordering to the Executor, which processes tuples with the current Greedy ordering. The components are combined in part for efficiency.]
A-Greedy's Profiler
Responsible for maintaining current statistics:
– Filter costs
– Conditional filter drop-rates: exponentially many combinations!
Profile window: sampled statistics from which the required conditional drop-rates can be estimated
Profile Window
[Diagram: for a sample of recent tuples, the profile window records which of the filters F1-F4 would drop each tuple.]
Greedy Ordering Using Profile Window
[Diagram: the profile window is summarized as a matrix view over the filters, from which the current Greedy ordering is derived.]
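One way to read this picture: the profile window lets the conditional drop-rates be estimated by counting, and the Greedy ordering can be recomputed from those counts. The sketch below assumes each profiled tuple is stored as the set of filters that would drop it; it is illustrative only and does not show the incremental matrix-view maintenance used in A-Greedy.

```python
def greedy_from_profile(num_filters, costs, profile):
    """profile: list of sets; each set holds the filters that would drop that sampled tuple.
    Repeatedly pick the filter with the highest (conditional drop count / cost),
    then keep only the sampled tuples that filter would NOT drop."""
    remaining = set(range(num_filters))
    surviving = list(profile)
    ordering = []
    while remaining:
        best = max(remaining,
                   key=lambda f: sum(f in drops for drops in surviving) / costs[f])
        ordering.append(best)
        remaining.remove(best)
        surviving = [drops for drops in surviving if best not in drops]
    return ordering

# Example: filter 2 drops the most sampled tuples, so it comes first under unit costs.
profile = [{0, 2}, {2}, {1, 2}, {0}, set()]
print(greedy_from_profile(3, {0: 1.0, 1: 1.0, 2: 1.0}, profile))  # [2, 0, 1]
```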
A-Greedy's Re-optimizer
Maintains the matrix view over the profile window:
– Easy to incorporate filter costs
– Efficient incremental update
– Fast detection and correction of changes in the Greedy order
Details in [BMM+04]: "Adaptive Ordering of Pipelined Stream Filters", SIGMOD 2004
Next
– Tradeoffs and variants of A-Greedy
– Experimental results for A-Greedy
Tradeoffs
Suppose:
– Changes are infrequent
– Slower adaptivity is okay
– We want the best plans at very low run-time overhead
There is a three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties, giving a spectrum of A-Greedy variants.
Variants of A-Greedy

Algorithm     | Convergence properties          | Run-time overhead           | Adaptivity
A-Greedy      | 4-approx.                       | High (relative to others)   | Fast
Sweep         | 4-approx.                       | Less work per sampling step | Slow
Local-Swaps   | May get caught in local optima  | Less work per sampling step | Slow
Independent   | Misses correlations             | Lower sampling rate         | Fast
Experimental Setup
– Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon
– Studied convergence properties, run-time overhead, and adaptivity
– Synthetic testbed: can control stream data and arrival properties
– DSMS server running on a 700 MHz Linux machine with 1 MB L2 cache and 2 GB memory
Converged Processing Rate
[Chart: converged processing rate of Optimal-Fixed, A-Greedy, Sweep, Local-Swaps, and Independent.]
Effect of Filter Drop-Rate
[Chart: processing rate of the same algorithms as the filter drop-rate varies.]
Effect of Correlation
[Chart: processing rate of the same algorithms as filter correlation varies.]
Run-time Overhead
Adaptivity
[Chart: behavior over time (x1000 tuples processed); filter selectivities are permuted partway through and the algorithms adapt.]
Roadmap
– StreaMon: our adaptive query processing engine
– Adaptive ordering of commutative filters
– Adaptive caching for multiway joins
– Current and future work
Stream Joins
[Diagram: sensors R, S, and T stream observations to the DSMS, which joins the observations from the last minute and outputs join results.]
MJoins [VNB04]
[Diagram: each newly arrived tuple is joined against the windows on the other streams through a pipeline of join operators; for example, a new S tuple probes the window on R and then the window on T, and similarly for new R and T tuples.]
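As a rough illustration of the MJoin idea (a hand-rolled sketch over hash-indexed windows sharing a single join key, not the [VNB04] implementation), a new tuple is joined against the other windows one probe at a time, with no intermediate state kept between tuples:

```python
# Hash-indexed windows: windows[stream] maps a join-key value to the list of
# window tuples with that key; key_of extracts the (shared) join key from a tuple.

def mjoin_probe(new_tuple, probe_order, windows, key_of):
    """Join a newly arrived tuple against the other windows, one probe at a time.
    Nothing is materialized between tuples, so intermediate results are
    recomputed for every arrival."""
    key = key_of(new_tuple)
    partial = [(new_tuple,)]
    for stream in probe_order:
        matches = windows[stream].get(key, [])
        partial = [combo + (m,) for combo in partial for m in matches]
        if not partial:
            break  # no matches so far, hence no join results for this tuple
    return partial

# Toy usage: a new S tuple with join key 1 probes the windows on R and then T.
windows = {"R": {1: [("r1", 1)]}, "T": {1: [("t1", 1), ("t2", 1)]}}
print(mjoin_probe(("s1", 1), ["R", "T"], windows, key_of=lambda t: t[1]))
```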
Excessive Recomputation in MJoins
[Diagram: every new S tuple re-probes the windows on R and T from scratch, so intermediate join results are recomputed over and over.]
Materializing Join Subexpressions
[Diagram: a join subexpression over two of the windows is fully materialized, so new tuples can probe the materialized result instead of recomputing it.]
Tree Joins: Trees of Binary Joins
[Diagram: the windows on R, S, and T are combined by a tree of binary join operators; each internal node is a fully materialized join subexpression.]
Hard State Hinders Adaptivity
[Diagram: switching between tree-join plans that materialize different join subexpressions (e.g., W_R ⋈ W_T versus W_S ⋈ W_T) requires discarding and rebuilding hard state, making plan switches expensive.]
Can We Get the Best of Both Worlds?
– MJoin drawback: recomputation of intermediate results
– Tree join drawbacks: less adaptive, higher memory use
[Diagram: an MJoin pipeline over R, S, T shown side by side with a tree of binary joins.]
MJoins + Caches
[Diagram: a cache of W_R ⋈ W_T results is added to the MJoin; a new S tuple first probes the cache and, on a hit, bypasses the corresponding pipeline segment.]
MJoins + Caches (contd.)
Caches are soft state:
– Adaptive
– Flexible with respect to memory usage
This captures the whole spectrum from MJoins to tree joins, including plans in between.
Challenge: an adaptive algorithm to choose join operator orders and caches in the pipelines
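A minimal sketch of the probe/bypass step from the previous figure, assuming the cache is a plain dictionary from join key to the matching R-T combinations (illustrative only; invalidating entries as the R and T windows change is omitted):

```python
def probe_with_cache(s_tuple, cache, key_of, probe_segment, max_entries=10_000):
    """Probe the cache for this S tuple's pre-joined R-T matches; on a miss,
    fall through to the normal pipeline segment and remember its result.
    The cache is soft state: it can be dropped or shrunk at any time."""
    key = key_of(s_tuple)
    matches = cache.get(key)
    if matches is None:
        matches = probe_segment(s_tuple)     # the segment is bypassed only on a cache hit
        if len(cache) < max_entries:         # crude memory bound
            cache[key] = matches
    return [(s_tuple,) + m for m in matches]
```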
Adaptive Caching (A-Caching)
– Adaptive join ordering with A-Greedy or a variant
  – The join operator orders determine the candidate caches
– Adaptive selection from the candidate caches
– Adaptive memory allocation to the chosen caches
A-Caching (caching part only)
[Diagram: the Profiler estimates costs and benefits of the candidate caches; the Re-optimizer ensures that the maximum-benefit subset of candidate caches is used and tells the Executor to add or remove caches; the Executor runs MJoins with caches. The components are combined in part for efficiency.]
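One simple reading of "maximum-benefit subset" is a knapsack-style greedy choice over the profiler's estimates. The sketch below is an assumption-laden illustration, not the algorithm of [BMW+05]: it ranks candidate caches by estimated benefit per unit of memory and adds them until a memory budget is exhausted, ignoring interactions between caches.

```python
def choose_caches(candidates, memory_budget):
    """candidates: (cache_id, estimated_benefit, memory_cost) triples from the profiler.
    Greedily pick caches in decreasing benefit-per-unit-memory order until the
    memory budget is used up; interactions between caches are ignored."""
    chosen, used = [], 0.0
    for cache_id, benefit, mem in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if benefit > 0 and used + mem <= memory_budget:
            chosen.append(cache_id)
            used += mem
    return chosen

# Example: with a budget of 25 units, the R-T cache fits, the S-T cache does not,
# and the small R-S cache still fits.
print(choose_caches([("RT", 50.0, 10.0), ("ST", 30.0, 20.0), ("RS", 5.0, 15.0)], 25.0))
# -> ['RT', 'RS']
```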
Performance of Stream-Join Plans (1)
[Chart: a four-way join over streams R, S, T, U; stream arrival rates are in the ratio 1:1:1:10, and the other details of the input are given in [BMW+05].]
Performance of Stream-Join Plans (2)
[Chart: stream arrival rates are in the ratio 15:10:5:1; the other details of the input are given in [BMW+05].]
A-Caching: Results at a Glance
– Adaptively captures the whole spectrum from fully pipelined MJoins to tree-based joins
– Scalable approximation algorithms
– Different types of caches
– Up to 7x improvement over MJoin and 2x improvement over TreeJoin
Details in [BMW+05]: "Adaptive Caching for Continuous Queries", ICDE 2005 (to appear)
Current and Future Work
– Broadening StreaMon's scope, e.g.:
  – Shared computation among multiple queries
  – Parallelism
– Rio: adaptive query processing in conventional database systems
– Plan logging: a new overall approach to address certain "meta issues" in adaptive processing
Related Work
– Adaptive processing of continuous queries, e.g., Eddies [AH00], NiagaraCQ [CDT+00]
– Adaptive processing in conventional databases
  – Inter-query adaptivity, e.g., Leo [SLM+01], [BC03]
  – Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04]
– New approaches to query optimization, e.g., parametric [GW89, INS+92, HS03], expected-cost based [CHS99, CHG02], error-aware [VN03]
Summary
New trends demand adaptive query processing:
– New applications, e.g., continuous queries over data streams
– Increasing data size and query complexity
– A CS-wide push towards autonomic computing
Our goal: an adaptive data management system
– StreaMon: adaptive data stream engine
– Rio: adaptive processing in conventional DBMSs
Google keywords: shivnath, stanford stream
Performance of Stream-Join Plans
Adaptivity to Memory Availability
Plan Logging
Log the profiling and re-optimization history:
– The query is long-running
– Example: a view over the log for R ⋈ S ⋈ T, with columns such as Rate(R), …, selectivity(R,S), the plan chosen, and its cost
[Diagram: the logged plans (P1, P2) lie in a high-dimensional space of statistics such as Rate(R) and selectivity(R,S).]
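As a sketch of how such a log could be used (hypothetical structure, not part of the talk), a lookup can return a previously chosen plan whenever the current statistics are close to statistics recorded at an earlier re-optimization, avoiding a fresh optimizer call:

```python
def lookup_logged_plan(plan_log, current_stats, tolerance=0.2):
    """plan_log: list of (stats_dict, plan_id) entries recorded at past re-optimizations.
    Return a previously chosen plan whose logged statistics are all within
    `tolerance` (relative) of the current statistics, else None (re-optimize)."""
    for logged_stats, plan_id in reversed(plan_log):   # prefer recent entries
        if all(abs(logged_stats[k] - v) <= tolerance * max(abs(v), 1e-9)
               for k, v in current_stats.items()):
            return plan_id
    return None

plan_log = [({"rate_R": 1000.0, "sel_RS": 0.01}, "P1"),
            ({"rate_R": 5000.0, "sel_RS": 0.20}, "P2")]
print(lookup_logged_plan(plan_log, {"rate_R": 4800.0, "sel_RS": 0.22}))  # 'P2'
```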
Uses of Plan Logging
– Reducing re-optimization overhead: create a cache of plans
– Reducing profiling overhead: track how changes in a statistic contribute to changes in the best plan
[Diagram: plans P1 and P2 in the space of Rate(R) and selectivity(R,S).]
Uses of Plan Logging (contd.)
– Tracking the "return on investment" of adaptive processing
  – Track the cost versus benefit of adaptivity
  – Is there a single plan that would have good overall performance?
– Avoiding thrashing
  – Which statistics have only transient changes?
Adaptive Processing in Traditional DBMS
[Diagram: the traditional optimize-then-execute loop (Statistics Manager, Optimizer, Executor), with errors in the estimated statistics propagating into the chosen query plan.]
Proactive Re-optimization with Rio
[Diagram: the Statistics Manager and Profiler are combined for efficiency and collect statistics at run time from random samples; the (re-)optimizer considers (estimate, uncertainty) pairs during optimization to produce "robust" plans; the Executor executes the current plan instead of running one chosen plan to completion.]