Adaptive Processing in Data Stream Systems Shivnath Babu stanfordstreamdatamanager Stanford University.

stanfordstreamdatamanager 2 Data Streams
New applications -- data as continuous, rapid, time-varying data streams
–Network monitoring and traffic engineering
–Sensor networks, RFID tags
–Financial applications
–Telecom call records
–Web logs and click-streams
–Manufacturing processes
Traditional databases -- data stored in finite, persistent data sets

stanfordstreamdatamanager 3 Using a Traditional Database
[Diagram labels: User/Application, Loader, Query, Result, Table R, Table S]

stanfordstreamdatamanager 4 New Approach for Data Streams
[Diagram labels: User/Application, Register Continuous Query (Standing Query), Stream Query Processor, Input streams, Result]

stanfordstreamdatamanager 5 Example Continuous (Standing) Queries
Web
–Amazon’s best sellers over last hour
Network Intrusion Detection
–Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida”
Finance
–Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes

stanfordstreamdatamanager 6 Data Stream Management System
[Diagram labels: Data Stream Management System (DSMS), Register Continuous Query, Input Streams, Streamed Result, Stored Result, Stored Tables, Archive]

stanfordstreamdatamanager 7 Data Stream Management System
[Diagram labels: Data Stream Management System (DSMS), Register Continuous Query, Input Streams, Streamed Result]
DSMS System Components
–Execution plan operators and synopsis data stores
–Query processor
–Memory manager
–Metadata and statistics catalog
–User/application interface
–Operator scheduler
–and others

stanfordstreamdatamanager 8 Primer on Database Query Processing
[Diagram: Declarative Query -> Preprocessing -> Canonical form -> Query Optimization -> Best query execution plan -> Query Execution -> Results; the Database System holds the Data]

stanfordstreamdatamanager 9 Traditional Query Optimization
Optimizer: Finds “best” query plan to process this query
Executor: Runs chosen plan to completion
Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms
[Diagram edge labels: Query; Chosen query plan; Which statistics are required; Estimated statistics; Data, auxiliary structures, statistics]

stanfordstreamdatamanager 10 Optimizing Continuous Queries Poses New Challenges
Continuous queries are long-running
Stream properties can change while query runs
–Data properties: value distributions
–Arrival properties: bursts, delays
System conditions can change
Performance of a fixed plan can change significantly over time
Adaptive processing: use plan that is best for current conditions

stanfordstreamdatamanager 11 Rest of this Talk
[Diagram labels: Register Continuous Query, Input Streams, Streamed Result, DSMS System Components, Query Processor]
StreaMon Adaptive Query Processing Engine
–Adaptive ordering of commutative filters
–Adaptive caching for multiway joins
–Adaptive use of input stream properties for resource mgmt.

stanfordstreamdatamanager 12 Traditional Optimization → StreaMon
Traditional optimization
–Optimizer: Finds “best” query plan to process this query
–Executor: Runs chosen plan to completion
–Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms
–[Diagram edge labels: Query; Chosen query plan; Which statistics are required; Estimated statistics]
StreaMon
–Re-optimizer: Ensures that plan is efficient for current characteristics
–Profiler: Monitors current stream and system characteristics
–Executor: Executes current plan on incoming stream tuples
–[Diagram edge labels: Decisions to adapt; Combined in part for efficiency]

stanfordstreamdatamanager 13 Pipelined Filters
Commutative filters over a stream
Example: Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida”
Simple to complex filters
–Boolean predicates
–Table lookups
–Pattern matching
–User-defined functions
[Diagram: Packets -> Filter1 -> Filter2 -> Filter3 -> Bad packets]

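stanfordstreamdatamanager 14 Pipelined Filters: Problem Definition
Continuous Query: $F_1 \wedge F_2 \wedge \dots \wedge F_n$
Plan: Tuples $\rightarrow F_{\sigma(1)} \rightarrow F_{\sigma(2)} \rightarrow \dots \rightarrow F_{\sigma(n)}$, where $\sigma$ is the chosen ordering of the filters
Goal: Minimize expected cost to process a tuple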
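The goal on this slide can be made concrete with a small sketch. It computes the expected per-tuple cost of one filter ordering under the simplifying assumption that filters drop tuples independently, which the talk does not assume in general. The function name `expected_cost` and the cost and drop-rate numbers are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def expected_cost(order: Sequence[int],
                  costs: Sequence[float],
                  drop_rates: Sequence[float]) -> float:
    """Expected cost to process one tuple through filters in `order`.

    Assumption (not from the slides): filters are independent, so the
    probability that a tuple reaches a filter is the product of the
    pass rates (1 - drop rate) of all filters before it.
    """
    reach_probability = 1.0  # every tuple reaches the first filter
    total = 0.0
    for f in order:
        total += reach_probability * costs[f]
        reach_probability *= 1.0 - drop_rates[f]
    return total


# Two unit-cost filters: filter 0 drops 50% of tuples, filter 1 drops 90%.
costs = [1.0, 1.0]
drop_rates = [0.5, 0.9]
cost_a = expected_cost([0, 1], costs, drop_rates)  # 1 + 0.5 * 1
cost_b = expected_cost([1, 0], costs, drop_rates)  # 1 + 0.1 * 1
print(cost_a, cost_b)
```

Placing the filter with the higher drop rate first lowers the expected cost here, which matches the informal goal of dropping a tuple as cheaply as possible.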

stanfordstreamdatamanager 15 Pipelined Filters: Example
[Figure: input tuples flow through filters F1, F2, F3, F4 to produce output tuples]
Informal Goal: If tuple will be dropped, then drop it as cheaply as possible

stanfordstreamdatamanager 16 Why is Our Problem Hard?
Filter drop-rates and costs can change over time
Filters can be correlated
–E.g., Protocol = HTTP and DestPort = 80

stanfordstreamdatamanager 17 Metrics for an Adaptive Algorithm
Speed of adaptivity
–Detecting changes and finding new plan
Run-time overhead
–Re-optimization, collecting statistics, plan switching
Convergence properties
–Plan properties under stationary statistics
[Diagram: StreaMon with Profiler, Re-optimizer, Executor]

stanfordstreamdatamanager 18 Pipelined Filters: Stationary Statistics
Assume statistics are not changing
–Order filters by decreasing drop-rate/cost [MS79, IK84, KBZ86, H94]
–Correlations ⇒ NP-Hard
Greedy algorithm: Use conditional statistics
–$F_{\sigma(1)}$ has maximum drop-rate/cost ratio
–$F_{\sigma(2)}$ has maximum drop-rate/cost ratio for tuples not dropped by $F_{\sigma(1)}$
–And so on
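The Greedy rule above can be sketched over a sample of tuples. This is a minimal illustration, not the StreaMon implementation: it assumes each sampled tuple is recorded as one bit per filter, with 1 meaning that filter would drop the tuple, and it estimates conditional drop rates by counting over the tuples that survive the filters already placed. The name `greedy_order` is hypothetical; the sample is the six-row profile window shown on slide 23.

```python
from typing import List, Sequence


def greedy_order(profile: Sequence[Sequence[int]],
                 costs: Sequence[float]) -> List[int]:
    """Order filters by conditional drop-rate/cost, Greedy style.

    profile[t][f] == 1 is assumed to mean filter f drops sampled tuple t.
    """
    remaining = list(range(len(costs)))
    survivors = list(profile)  # tuples not dropped by filters placed so far
    order: List[int] = []
    while remaining:
        def ratio(f: int) -> float:
            if not survivors:
                return 0.0
            drop_rate = sum(row[f] for row in survivors) / len(survivors)
            return drop_rate / costs[f]

        best = max(remaining, key=ratio)  # ties go to the lowest index
        order.append(best)
        remaining.remove(best)
        survivors = [row for row in survivors if not row[best]]
    return order


# Profile window from slide 23; columns are F1, F2, F3, F4.
PROFILE = [
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]
print(greedy_order(PROFILE, [1, 1, 1, 1]))  # [2, 1, 3, 0], i.e. F3, F2, F4, F1
```

With unit costs this reproduces the ordering F3, F2, F4, F1 that slide 23 derives from the same window.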

stanfordstreamdatamanager 19 Adaptive Version of Greedy
Greedy gives strong guarantees
–4-approximation, best poly-time approx. possible assuming P ≠ NP [CFK03, FLT04, MBM+05]
–For arbitrary (correlated) characteristics
–Usually optimal in experiments
Challenge:
–Online algorithm
–Fast adaptivity to Greedy ordering
–Low run-time overhead
→ A-Greedy: Adaptive Greedy

stanfordstreamdatamanager 20 A-Greedy
Profiler: Maintains conditional filter drop-rates and costs over recent tuples
Executor: Processes tuples with current Greedy ordering
Re-optimizer: Ensures that filter ordering is Greedy for current statistics
[Diagram edge labels: Estimated statistics; Which statistics are required; Combined in part for efficiency; Changes in filter ordering]

stanfordstreamdatamanager 21 A-Greedy’s Profiler
Responsible for maintaining current statistics
–Filter costs
–Conditional filter drop-rates -- exponentially many
Profile Window: Sampled statistics from which required conditional drop-rates can be estimated

stanfordstreamdatamanager 22 Profile Window
[Figure: sampled input tuples are run through F1, F2, F3, F4 and recorded in the Profile Window as one bit per filter, e.g. 0110, 0011, 1001, 1001]

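stanfordstreamdatamanager 23 Greedy Ordering Using Profile Window
Profile Window (columns F1 F2 F3 F4):
1 0 1 0
0 0 0 1
1 0 1 0
0 1 0 0
0 1 0 0
0 0 1 1
Column sums (F1 F2 F3 F4): 2 2 3 2
Reordered with F3 first (F3 F1 F2 F4): 3 2 2 2, then 0 2 1 for the remaining filters
Matrix View → Greedy Ordering (F3 F2 F4 F1):
3 2 2 2
  2 0 1
    1 0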
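The counts on this slide can be recomputed with a short sketch. It assumes that a 1 in the profile window means the filter drops that sampled tuple, and that row i of the matrix view counts, for each filter from position i onward, the tuples it drops among those not dropped by the first i filters in the ordering. The helper names `matrix_view` and `satisfies_greedy` are hypothetical, and the invariant check is one plausible reading of "Greedy for current statistics", not necessarily the exact test used in A-Greedy.

```python
from typing import List, Sequence


def matrix_view(profile: Sequence[Sequence[int]],
                order: Sequence[int]) -> List[List[int]]:
    """Upper-triangular drop counts for the filters in `order`."""
    survivors = list(profile)
    view: List[List[int]] = []
    for i, f in enumerate(order):
        # Count drops by each not-yet-applied filter among surviving tuples.
        view.append([sum(row[g] for row in survivors) for g in order[i:]])
        survivors = [row for row in survivors if not row[f]]
    return view


def satisfies_greedy(view: Sequence[Sequence[int]],
                     order: Sequence[int],
                     costs: Sequence[float]) -> bool:
    """True if each row's first entry has the best count/cost ratio."""
    for i, row in enumerate(view):
        head = row[0] / costs[order[i]]
        for k in range(1, len(row)):
            if row[k] / costs[order[i + k]] > head:
                return False
    return True


# Profile window from this slide; columns are F1, F2, F3, F4.
PROFILE = [
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]
GREEDY = [2, 1, 3, 0]  # F3, F2, F4, F1
UNIT_COSTS = [1, 1, 1, 1]

view = matrix_view(PROFILE, GREEDY)
print(view)  # [[3, 2, 2, 2], [2, 1, 0], [1, 0], [0]]
print(satisfies_greedy(view, GREEDY, UNIT_COSTS))  # True
```

Checking the original ordering F1, F2, F3, F4 with the same helpers returns False, because F3 drops more surviving tuples than F1 in the first row; a violation of this kind is what would trigger a change in the filter ordering.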

stanfordstreamdatamanager 24 A-Greedy’s Re-optimizer
Maintains Matrix View over Profile Window
–Easy to incorporate filter costs
–Efficient incremental update
–Fast detection & correction of changes in Greedy order
Details in [BMM+04]: “Adaptive Processing of Pipelined Stream Filters”, ACM SIGMOD 2004

stanfordstreamdatamanager 25 Next
Tradeoffs and variations of A-Greedy
Experimental results for A-Greedy

stanfordstreamdatamanager 26 Tradeoffs
Suppose:
–Changes are infrequent
–Slower adaptivity is okay
–Want best plans at very low run-time overhead
Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties
Spectrum of A-Greedy variants

stanfordstreamdatamanager 27 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
[Figure: Profile Window and its Matrix View]

stanfordstreamdatamanager 28 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
Sweep | 4-approx. | Less work per sampling step | Slow
Local-Swaps | May get caught in local optima | Less work per sampling step | Slow
Independent | Misses correlations | Lower sampling rate | Fast

stanfordstreamdatamanager 29 Experimental Setup
Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon
Studied convergence properties, run-time overhead, and adaptivity
Synthetic stream-generation testbed
–Can control & vary stream data and arrival properties
DSMS server running on 700 MHz Linux machine, 1 MB L2 cache, 2 GB memory

stanfordstreamdatamanager 30 Converged Processing Rate
[Plot legend: Optimal-Fixed, Sweep, A-Greedy, Independent, Local-Swaps]

stanfordstreamdatamanager 31 Effect of Correlation
[Plot legend: Optimal-Fixed, Sweep, A-Greedy, Independent, Local-Swaps]

stanfordstreamdatamanager 32 Run-time Overhead

stanfordstreamdatamanager 33 Adaptivity
[Plot: x-axis is progress of time (x1000 tuples processed); annotation marks where stream data properties changed]

stanfordstreamdatamanager 34 Remainder of Talk
Adaptive caching for multiway joins
Current and future research directions
Related work

stanfordstreamdatamanager 35 Stream Joins
[Diagram: Sensor R, Sensor S, and Sensor T feed the DSMS, which outputs join results over observations in the last minute]

stanfordstreamdatamanager 36 MJoins [VNB04]
[Diagram: one join pipeline per input stream over Window on R, Window on S, Window on T; the pipelines use operators ⋈R ⋈T, ⋈S ⋈T, and ⋈S ⋈R]

stanfordstreamdatamanager 37 Excessive Recomputation in MJoins
[Diagram: pipeline ⋈R ⋈T over Window on R, Window on S, Window on T]

stanfordstreamdatamanager 38 Materializing Join Subexpressions
[Diagram: Window on R, Window on S, Window on T, with a fully-materialized join subexpression]

stanfordstreamdatamanager 39 Tree Joins: Trees of Binary Joins
[Diagram: tree of binary joins over R, S, T with Window on R, Window on S, Window on T and a fully-materialized join subexpression]

stanfordstreamdatamanager 40 Hard State Hinders Adaptivity
[Diagram: plan switch between two tree-join plans over R, S, T; one materializes W_R ⋈ W_T, the other W_S ⋈ W_T]

stanfordstreamdatamanager 41 Can We Get Best of Both Worlds?
MJoin
–Drawback: Recomputation
Tree Join
–Drawback: Less adaptive
–Drawback: Higher memory use
[Diagram: MJoin pipelines and a tree join over R, S, T, side by side]

stanfordstreamdatamanager 42 MJoins + Caches
[Diagram: an S tuple probes a cache of W_R ⋈ W_T; a cache hit bypasses the pipeline segment ⋈R ⋈T over Window on R, Window on S, Window on T]

stanfordstreamdatamanager 43 MJoins + Caches (contd.)
Caches are soft state
–Enables plan switching with almost no overhead
–Flexible with respect to memory availability
Captures whole spectrum from MJoins to Tree Joins, and plans in between
Challenge: adaptive algorithm to choose join operator orders and caches in pipelines
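The idea of a cache as soft state can be illustrated with a toy sketch. It is not the A-Caching design: it assumes a three-way equijoin of R, S, and T on a single shared key, ignores window expiry, and keeps the cache consistent by simply invalidating an entry whenever R or T receives a tuple with that key. The class `CachedMJoin` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Tuple


class CachedMJoin:
    """Toy S-pipeline of an MJoin with a cache for the R-T segment."""

    def __init__(self) -> None:
        self.r: Dict[Hashable, List[Any]] = defaultdict(list)  # window on R
        self.t: Dict[Hashable, List[Any]] = defaultdict(list)  # window on T
        # Soft state: key -> joined (r, t) pairs; may be dropped at any time.
        self.cache: Dict[Hashable, List[Tuple[Any, Any]]] = {}
        self.hits = 0
        self.misses = 0

    def insert_r(self, key: Hashable, value: Any) -> None:
        self.r[key].append(value)
        self.cache.pop(key, None)  # invalidate the stale entry, if any

    def insert_t(self, key: Hashable, value: Any) -> None:
        self.t[key].append(value)
        self.cache.pop(key, None)

    def drop_cache(self) -> None:
        """Discard all cached results, e.g. under memory pressure."""
        self.cache.clear()

    def process_s(self, key: Hashable, value: Any) -> List[Tuple[Any, Any, Any]]:
        pairs = self.cache.get(key)
        if pairs is None:
            # Cache miss: run the pipeline segment and remember its result.
            self.misses += 1
            pairs = [(rv, tv)
                     for rv in self.r.get(key, [])
                     for tv in self.t.get(key, [])]
            self.cache[key] = pairs
        else:
            # Cache hit: bypass the pipeline segment.
            self.hits += 1
        return [(rv, value, tv) for rv, tv in pairs]


join = CachedMJoin()
join.insert_r(1, "r1")
join.insert_t(1, "t1")
first = join.process_s(1, "s1")   # miss: computed through the pipeline
second = join.process_s(1, "s2")  # hit: served from the cache
join.drop_cache()                 # results stay correct without the cache
third = join.process_s(1, "s3")
print(first, second, third, join.hits, join.misses)
```

Because the windows remain the only authoritative state, dropping the cache changes performance but not results, which is the property that makes plan switching cheap.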

stanfordstreamdatamanager 44 Adaptive Caching (A-Caching)
Adaptive join ordering with A-Greedy or variant
–Join operator orders → candidate caches
Adaptive selection from candidate caches
–Based on profiled costs and benefits of caches
Adaptive memory allocation to chosen caches
Problems are individually NP-Hard
–Efficient approximation algorithms → scalable
Details in [BMW+05]: “Adaptive Caching for Continuous Queries”, ICDE 2005

stanfordstreamdatamanager 45 A-Caching (caching part only)
Profiler: Estimates costs and benefits of candidate caches
Executor: MJoins with caches
Re-optimizer: Ensures that maximum-benefit subset of candidate caches is used
[Diagram edge labels: List of candidate caches; Estimated statistics; Combined in part for efficiency; Add/remove caches]

stanfordstreamdatamanager 46 Performance of Stream-Join Plans (1)
Arrival rates of streams are in the ratio 1:1:1:10; other details of input are given in [BMW+05]
[Figure: join plan over streams R, S, T, U]

stanfordstreamdatamanager 47 Performance of Stream-Join Plans (2)
Arrival rates of streams are in the ratio 15:10:5:1; other details of input are given in [BMW+05]

stanfordstreamdatamanager 48 Remainder of Talk
Current and future research directions
Related work

stanfordstreamdatamanager 49 Research Directions
System complexity is growing beyond control
Up to 80% of IT budgets spent on maintenance [McKinsey]
–“People cost” dominates
Data Management Systems must be self-managing
–Autonomic Systems
–Self-configuration, self-optimization, self-healing, and self-protection
Adaptive processing is key
[Diagram layers: Applications, Data Management System, OS/Hardware]
–Applications: Streams, Sensors, P2P, Privacy, IR, Data Mining, PubSub, Federation
–Data Formats & Access: Records, XML, Text, Web pages, Stored procedures, Web services, Indexes, Views, Partitions
–OS/Hardware: Linux, Windows, Local disks, SAN, RAID, Remote DB, NAS, PDA

stanfordstreamdatamanager 50 New Techniques for Autonomic Systems & Adaptive Processing
Bringing more components under adaptive management
–Ex: Parallelism, overload management, memory allocation, sharing data & computation across queries
Being proactive as well as reactive
–Rio prototype system for adaptive processing in conventional database systems [BBD04]
–Considering uncertainty in statistics to choose robust query execution plans [BBD04]
–Plan logging: A new overall approach to adaptive processing of continuous queries [BB05]

stanfordstreamdatamanager 51 Future Work in DSMSs
Expanding the declarative interface
–Event detection (e.g., regular expressions), data cubes, decision trees and other data mining models
Scaling in data arrival rates
–Graphics Processing Units (GPUs) as stream co-processors, Stanford’s streaming supercomputer
Many others
–Disk usage based on query and stream properties
–Handling missing or imprecise data in streams
–Handling hybrid workloads of continuous & regular queries

stanfordstreamdatamanager 52 Other Work
More on StreaMon
–Using stream properties for resource optimization [TODS]
–Theory of pipelined processing [ICDT 2005]
–System demonstration [SIGMOD 2004]
Stanford’s STREAM DSMS
–Overview papers, source code release, web demo
Other aspects of a DSMS
–Operator scheduling [SIGMOD 2003, VLDB Journal]
–Continuous query language & semantics [VLDB Journal]
–Memory requirements for continuous queries [PODS 2002, TODS]

stanfordstreamdatamanager 53 Other Work (contd.)
Adaptive query processing architectures
–Taxonomy of past work, next steps [CIDR 2005]
–Adaptive processing for conventional database systems [Rio system, Technical report]
–Concurrent use of multiple plans for same query [Technical report]
Summer internship
–Compressing relations using data mining [SIGMOD 2001]

stanfordstreamdatamanager 54 Related Work (Brief)
Adaptive processing of continuous queries
–E.g., Eddies [AH00], NiagaraCQ [CDT+00]
Adaptive processing in conventional databases
–Inter-query adaptivity, e.g., Leo [SLM+01], [BC03]
–Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04], Tukwila [IFF+99]
New approaches to query optimization
–E.g., parametric [GW89, INS+92, HS03], expected-cost based [CHS99, CHG02], error-aware [VN03]
Other DSMSs
–E.g., Aurora, Gigascope, Nile, TelegraphCQ

stanfordstreamdatamanager 55 Summary
New trends demand adaptive query processing
–New applications, e.g., continuous queries, data streams
–Increasing data sizes, query complexity, overall system complexity
CS-wide push towards Autonomic Systems
Our goal: Adaptive Data Management Systems
–StreaMon: Adaptive Data Stream Engine
–Rio: Adaptive Processing in Conventional Databases
Google keywords: “shivnath”, “stanford stream”