Adaptive Processing in Data Stream Systems Shivnath Babu stanfordstreamdatamanager Stanford University.


1 Adaptive Processing in Data Stream Systems Shivnath Babu stanfordstreamdatamanager Stanford University

2 stanfordstreamdatamanager 2 Data Streams New applications -- data as continuous, rapid, time-varying data streams –Network monitoring and traffic engineering –Sensor networks, RFID tags –Financial applications –Telecom call records –Web logs and click-streams –Manufacturing processes Traditional databases -- data stored in finite, persistent data sets

3 stanfordstreamdatamanager 3 Using a Traditional Database [Figure: User/Application, Loader, Query, Result, Table R, Table S]

4 stanfordstreamdatamanager 4 New Approach for Data Streams [Figure: User/Application registers a continuous (standing) query with the Stream Query Processor, which reads input streams and produces the result]

5 stanfordstreamdatamanager 5 Example Continuous (Standing) Queries Web –Amazon’s best sellers over last hour Network Intrusion Detection –Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” Finance –Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes
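As a rough illustration of what a standing query does, the finance example above can be sketched as a small per-tuple handler. This is only a sketch under assumptions that are not in the slides: the class name `DropMonitor`, the tick format (symbol, timestamp in seconds, price), and the reading of "moved down more than 2%" as "current price is more than 2% below the oldest price still inside the 20-minute window" are all hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 20 * 60  # "last 20 minutes"


class DropMonitor:
    """Hypothetical standing query: stocks priced in [lo, hi] that fell more than pct in the window."""

    def __init__(self, lo=20.0, hi=200.0, pct=0.02, window=WINDOW_SECONDS):
        self.lo, self.hi, self.pct, self.window = lo, hi, pct, window
        self.history = {}  # symbol -> deque of (timestamp, price)

    def on_tick(self, symbol, timestamp, price):
        """Process one stream tuple; return True if the symbol matches the query right now."""
        ticks = self.history.setdefault(symbol, deque())
        ticks.append((timestamp, price))
        # Expire ticks that have slid out of the window.
        while ticks[0][0] < timestamp - self.window:
            ticks.popleft()
        oldest_price = ticks[0][1]
        in_range = self.lo <= price <= self.hi
        return in_range and price < oldest_price * (1 - self.pct)


monitor = DropMonitor()
alerts = [
    monitor.on_tick("ABC", 0, 100.0),    # first tick, nothing to compare against
    monitor.on_tick("ABC", 600, 99.0),   # down 1% from the oldest tick
    monitor.on_tick("ABC", 900, 97.0),   # down 3% from the oldest tick
    monitor.on_tick("ABC", 2000, 96.5),  # window now starts at t=900 (price 97.0)
]
print(alerts)
```

The query is registered once and then evaluated on every arriving tuple, which is the main difference from the load-then-query model of a traditional database.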

6 stanfordstreamdatamanager 6 Data Stream Management System [Figure: Data Stream Management System (DSMS) with Register Continuous Query, Input Streams, Streamed Result, Stored Result, Stored Tables, Archive]

7 stanfordstreamdatamanager 7 Data Stream Management System DSMS System Components –Execution plan operators and synopsis data stores –Query processor –Memory manager –Metadata and statistics catalog –User/application interface –Operator scheduler –and others [Figure: DSMS with Register Continuous Query, Input Streams, Streamed Result; the Query Processor is highlighted]

8 stanfordstreamdatamanager 8 Primer on Database Query Processing [Figure: Declarative Query → Preprocessing → Canonical form → Query Optimization → Best query execution plan → Query Execution → Results; the Database System holds the Data]

9 stanfordstreamdatamanager 9 Traditional Query Optimization Optimizer: Finds “best” query plan to process this query Executor: Runs chosen plan to completion Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms [Figure labels: Query, Chosen query plan, Which statistics are required, Estimated statistics, Data, auxiliary structures, statistics]

10 stanfordstreamdatamanager 10 Optimizing Continuous Queries Poses New Challenges Continuous queries are long-running Stream properties can change while query runs –Data properties: value distributions –Arrival properties: bursts, delays System conditions can change Performance of a fixed plan can change significantly over time Adaptive processing: use plan that is best for current conditions

11 stanfordstreamdatamanager 11 Rest of this Talk StreaMon Adaptive Query Processing Engine –Adaptive ordering of commutative filters –Adaptive caching for multiway joins –Adaptive use of input stream properties for resource mgmt. [Figure: DSMS with Register Continuous Query, Input Streams, Streamed Result, DSMS System Components, Query Processor]

12 stanfordstreamdatamanager 12 Traditional Optimization → StreaMon Traditional: Optimizer: Finds “best” query plan to process this query Executor: Runs chosen plan to completion Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms StreaMon: Re-optimizer: Ensures that plan is efficient for current characteristics Profiler: Monitors current stream and system characteristics Executor: Executes current plan on incoming stream tuples Combined in part for efficiency [Figure labels: Query, Chosen query plan, Which statistics are required, Estimated statistics, Decisions to adapt]

13 stanfordstreamdatamanager 13 Pipelined Filters Commutative filters over a stream Example: Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” Simple to complex filters –Boolean predicates –Table lookups –Pattern matching –User-defined functions Filter1 PacketsPackets Bad packets Filter2 Filter3
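Because the filters commute, any ordering yields the same surviving tuples, but the amount of work can differ. A minimal sketch of a filter pipeline follows; the packet records and the three predicates (`is_http`, `dst_in_table`, `matches_ida`) are hypothetical stand-ins for the filters named on the slide, and each filter evaluation is assumed to cost one unit.

```python
def run_pipeline(filters, tuples):
    """Pass each tuple through the filters in order; drop it at the first filter it fails."""
    survivors, evaluations = [], 0
    for t in tuples:
        for passes in filters:
            evaluations += 1
            if not passes(t):
                break  # tuple dropped; later filters are never evaluated
        else:
            survivors.append(t)
    return survivors, evaluations


# Hypothetical packet records.
packets = [
    {"proto": "HTTP", "dst": "10.1.2.3", "path": "/default.ida"},
    {"proto": "HTTP", "dst": "192.168.0.9", "path": "/index.html"},
    {"proto": "DNS", "dst": "10.1.9.9", "path": ""},
]


def is_http(p):
    return p["proto"] == "HTTP"


def dst_in_table(p):
    return p["dst"].startswith("10.1.")  # stand-in for a prefix-table lookup


def matches_ida(p):
    return p["path"].endswith(".ida")  # stand-in for pattern matching


out_a, cost_a = run_pipeline([is_http, dst_in_table, matches_ida], packets)
out_b, cost_b = run_pipeline([matches_ida, dst_in_table, is_http], packets)
print(out_a == out_b, cost_a, cost_b)
```

Both orderings report the same packet, but on this sample the second ordering needs fewer filter evaluations, which is the quantity the filter-ordering problem tries to minimize.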

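14 stanfordstreamdatamanager 14 Pipelined Filters: Problem Definition Continuous Query: $F_1 \wedge F_2 \wedge \dots \wedge F_n$ Plan: Tuples $\rightarrow F_{\sigma(1)} \rightarrow F_{\sigma(2)} \rightarrow \dots \rightarrow F_{\sigma(n)}$, where $\sigma$ is the chosen ordering (a permutation of $1, \dots, n$) Goal: Minimize expected cost to process a tuple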
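The goal can be written out as a formula. The notation here is an assumption, not taken from the slide: $\sigma$ denotes the ordering, $c_k$ the per-tuple cost of filter $F_k$, and $d_{\sigma(j)}$ the conditional drop-rate of $F_{\sigma(j)}$, i.e. the fraction of tuples it drops among those that passed $F_{\sigma(1)}, \dots, F_{\sigma(j-1)}$.

```latex
\mathbb{E}[\mathrm{cost}(\sigma)]
  = \sum_{i=1}^{n} c_{\sigma(i)} \prod_{j=1}^{i-1} \bigl(1 - d_{\sigma(j)}\bigr)
```

For example, with two independent unit-cost filters that drop 90% and 50% of tuples, running the 90% filter first costs $1 + 0.1 \cdot 1 = 1.1$ per tuple, while the reverse order costs $1 + 0.5 \cdot 1 = 1.5$.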

15 stanfordstreamdatamanager 15 Pipelined Filters: Example [Figure: input tuples flow through filters F1, F2, F3, F4; the tuples that pass all filters are the output tuples] Informal Goal: If tuple will be dropped, then drop it as cheaply as possible

16 stanfordstreamdatamanager 16 Why is Our Problem Hard? Filter drop-rates and costs can change over time Filters can be correlated E.g., Protocol = HTTP and DestPort = 80

17 stanfordstreamdatamanager 17 Metrics for an Adaptive Algorithm Speed of adaptivity –Detecting changes and finding new plan Run-time overhead –Re-optimization, collecting statistics, plan switching Convergence properties –Plan properties under stationary statistics [Figure: StreaMon components: Profiler, Re-optimizer, Executor]

18 stanfordstreamdatamanager 18 Pipelined Filters: Stationary Statistics Assume statistics are not changing –Order filters by decreasing drop-rate/cost [MS79, IK84, KBZ86, H94]; optimal when filters are independent –Correlations ⇒ NP-Hard Greedy algorithm: Use conditional statistics –$F_{\sigma(1)}$ has maximum drop-rate/cost ratio –$F_{\sigma(2)}$ has maximum drop-rate/cost ratio for tuples not dropped by $F_{\sigma(1)}$ –And so on
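The greedy rule above can be sketched over a sample of tuples for which every filter's outcome is known. This is an illustrative sketch, not the talk's implementation: the function name `greedy_order`, the 0/1 encoding (1 means the filter drops the tuple), and the sample data are assumptions.

```python
def greedy_order(sample, costs):
    """Order filters greedily by conditional drop-rate / cost.

    sample: list of rows of 0/1 flags; sample[t][k] == 1 means filter k drops tuple t.
    costs:  per-tuple cost of each filter.
    """
    remaining = list(range(len(costs)))
    alive = list(sample)  # tuples not dropped by the filters chosen so far
    order = []
    while remaining:
        def ratio(k):
            drop_rate = sum(row[k] for row in alive) / len(alive) if alive else 0.0
            return drop_rate / costs[k]

        best = max(remaining, key=ratio)
        order.append(best)
        remaining.remove(best)
        # Condition the next choice on the tuples that survive the chosen filter.
        alive = [row for row in alive if not row[best]]
    return order


sample = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0)]
unit_cost_order = greedy_order(sample, [1, 1, 1])
costly_first_filter_order = greedy_order(sample, [4, 1, 1])
print(unit_cost_order, costly_first_filter_order)
```

With unit costs, filter 1 looks attractive on its own (it drops 2 of 5 tuples) but drops nothing once filter 0 has run, so it is placed last. Making filter 0 four times as expensive moves it to the end instead.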

19 stanfordstreamdatamanager 19 Adaptive Version of Greedy Greedy gives strong guarantees –4-approximation, best poly-time approx. possible assuming P ≠ NP [CFK03, FLT04, MBM+05] –For arbitrary (correlated) characteristics –Usually optimal in experiments Challenge: –Online algorithm –Fast adaptivity to Greedy ordering –Low run-time overhead ⇒ A-Greedy: Adaptive Greedy

20 stanfordstreamdatamanager 20 A-Greedy Profiler: Maintains conditional filter drop-rates and costs over recent tuples Executor: Processes tuples with current Greedy ordering Re-optimizer: Ensures that filter ordering is Greedy for current statistics Combined in part for efficiency [Figure labels: Which statistics are required, Estimated statistics, Changes in filter ordering]

21 stanfordstreamdatamanager 21 A-Greedy’s Profiler Responsible for maintaining current statistics –Filter costs –Conditional filter drop-rates -- exponentially many Profile Window: Sampled statistics from which required conditional drop-rates can be estimated

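22 stanfordstreamdatamanager 22 Profile Window [Figure: input tuples pass through F1, F2, F3, F4; for each sampled tuple the Profile Window stores one bit per filter (1 = the filter drops the tuple), e.g., 0110, 0011, 1001, 1001]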

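23 stanfordstreamdatamanager 23 Greedy Ordering Using Profile Window Profile Window (columns F1 F2 F3 F4): 1010 0001 1010 0100 0100 0011 Matrix View ⇒ Greedy Ordering: column sums for F1 F2 F3 F4 are 2 2 3 2; after moving F3 to the front (F3 F1 F2 F4) the first row is 3 2 2 2 and the second row, counted over tuples not dropped by F3, is 0 2 1; the final Greedy ordering is F3 F2 F4 F1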
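The numbers on this slide can be reproduced with a short sketch. The six bit-strings are the profile window from the slide; the function names, the row-by-row layout of the matrix view, and the unit-cost assumption are illustrative choices, not the authors' code.

```python
# Profile window from the slide; columns are F1..F4, 1 = filter drops the tuple.
PROFILE_WINDOW = ["1010", "0001", "1010", "0100", "0100", "0011"]


def matrix_view(window, order):
    """Row i counts, for each filter from position i onward, the tuples it drops
    among those that passed every filter placed before position i in `order`."""
    rows = [[int(ch) for ch in bits] for bits in window]
    view = []
    for i in range(len(order)):
        alive = [r for r in rows if not any(r[k] for k in order[:i])]
        view.append([sum(r[k] for r in alive) for k in order[i:]])
    return view


def satisfies_greedy(view):
    """With unit costs, the ordering is Greedy if each row's first entry is its maximum."""
    return all(row[0] == max(row) for row in view)


initial = matrix_view(PROFILE_WINDOW, [0, 1, 2, 3])  # F1 F2 F3 F4
final = matrix_view(PROFILE_WINDOW, [2, 1, 3, 0])    # F3 F2 F4 F1
print(initial, satisfies_greedy(initial))
print(final, satisfies_greedy(final))
```

The first row for the initial ordering is the column sums 2 2 3 2, which violates the Greedy condition because F3 drops more tuples than F1. The ordering F3 F2 F4 F1 satisfies it in every row.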

24 stanfordstreamdatamanager 24 A-Greedy’s Re-optimizer Maintains Matrix View over Profile Window –Easy to incorporate filter costs –Efficient incremental update –Fast detection & correction of changes in Greedy order Details in [BMM+04]: “Adaptive Ordering of Pipelined Stream Filters”, ACM SIGMOD 2004

25 stanfordstreamdatamanager 25 Next Tradeoffs and variations of A-Greedy Experimental results for A-Greedy

26 stanfordstreamdatamanager 26 Tradeoffs Suppose: –Changes are infrequent –Slower adaptivity is okay –Want best plans at very low run-time overhead Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties Spectrum of A-Greedy variants

27 stanfordstreamdatamanager 27 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
[Figure: Profile Window and the Matrix View maintained over it]

28 stanfordstreamdatamanager 28 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
Sweep | 4-approx. | Less work per sampling step | Slow
Local-Swaps | May get caught in local optima | Less work per sampling step | Slow
Independent | Misses correlations | Lower sampling rate | Fast
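To make "misses correlations" concrete, the sketch below compares an ordering based on unconditional drop counts (in the spirit of the Independent variant) with the conditional greedy ordering on a small hand-built sample. Everything here is an assumption for illustration: unit filter costs, the function names, and the sample, in which filters 0 and 1 are perfectly correlated.

```python
def independent_order(sample):
    """Order by unconditional drop counts, ignoring correlations (unit costs assumed)."""
    n = len(sample[0])
    return sorted(range(n), key=lambda k: -sum(row[k] for row in sample))


def greedy_order(sample):
    """Order by drop counts conditioned on surviving the filters already chosen."""
    remaining, alive, order = list(range(len(sample[0]))), list(sample), []
    while remaining:
        best = max(remaining, key=lambda k: sum(row[k] for row in alive))
        order.append(best)
        remaining.remove(best)
        alive = [row for row in alive if not row[best]]
    return order


def filter_evaluations(sample, order):
    """Total filter evaluations needed to process the sample with this ordering."""
    total = 0
    for row in sample:
        for k in order:
            total += 1
            if row[k]:  # 1 = filter k drops the tuple
                break
    return total


# Filters 0 and 1 always agree; filter 2 is unrelated to them.
sample = (
    [(1, 1, 1)] * 3 + [(1, 1, 0)] * 3
    + [(0, 0, 1)] * 2 + [(0, 0, 0)] * 2
)
ind, gre = independent_order(sample), greedy_order(sample)
print(ind, filter_evaluations(sample, ind))
print(gre, filter_evaluations(sample, gre))
```

The unconditional ordering runs the two correlated filters back to back, so the second one never drops anything. The conditional ordering places filter 2 second and needs fewer evaluations on this sample.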

29 stanfordstreamdatamanager 29 Experimental Setup Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon Studied convergence properties, run-time overhead, and adaptivity Synthetic stream-generation testbed –Can control & vary stream data and arrival properties DSMS server running on 700 MHz Linux machine, 1 MB L2 cache, 2 GB memory

30 stanfordstreamdatamanager 30 Converged Processing Rate [Plot comparing Optimal-Fixed, Sweep, A-Greedy, Independent, and Local-Swaps]

31 stanfordstreamdatamanager 31 Effect of Correlation [Plot comparing Optimal-Fixed, Sweep, A-Greedy, Independent, and Local-Swaps]

32 stanfordstreamdatamanager 32 Run-time Overhead

33 stanfordstreamdatamanager 33 Adaptivity [Plot: x-axis is progress of time (x1000 tuples processed); a marker shows where the stream data properties changed]

34 stanfordstreamdatamanager 34 Remainder of Talk Adaptive caching for multiway joins Current and future research directions Related work

35 stanfordstreamdatamanager 35 Stream Joins [Figure: Sensor R, Sensor S, and Sensor T feed the DSMS, which joins the observations in the last minute and outputs join results]

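36 stanfordstreamdatamanager 36 MJoins (VNB04) [Figure: Window on R, Window on S, Window on T; each stream has its own pipeline in which a new tuple is joined (⋈) with the windows of the other two streams in turn]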

37 stanfordstreamdatamanager 37 Excessive Recomputation in MJoins [Figure: a pipeline that joins with Window on R and then Window on T, shown over Window on R, Window on S, Window on T]

38 stanfordstreamdatamanager 38 Materializing Join Subexpressions [Figure: Window on R, Window on S, Window on T, with a fully-materialized join subexpression]

39 stanfordstreamdatamanager 39 Tree Joins: Trees of Binary Joins [Figure: tree of binary joins over streams R, S, T with Window on R, Window on S, Window on T and a fully-materialized join subexpression]

40 stanfordstreamdatamanager 40 Hard State Hinders Adaptivity [Figure: plan switch between two tree-join plans over R, S, T that materialize different join subexpressions]

41 stanfordstreamdatamanager 41 Can We Get Best of Both Worlds? MJoin: –Recomputation Tree Join: –Less adaptive –Higher memory use [Figure: an MJoin plan and a tree-join plan over R, S, T side by side]

42 stanfordstreamdatamanager 42 MJoins + Caches [Figure: pipeline for an S tuple (⋈R, ⋈T) over Window on R, Window on S, Window on T; a probe into a cache of W_R ⋈ W_T lets the tuple bypass the pipeline segment]

43 stanfordstreamdatamanager 43 MJoins + Caches (contd.) Caches are soft state –Enables plan switching with almost no overhead –Flexible with respect to memory availability Captures whole spectrum from MJoins to Tree Joins, and plans in between Challenge: adaptive algorithm to choose join operator orders and caches in pipelines
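The idea of a cache that lets a tuple bypass a pipeline segment, and of caches as soft state, can be sketched as follows. The schema (R(a), S(a, b), T(b)), the class `CachedPipeline`, and the policy of clearing the cache whenever a window changes are all assumptions made for brevity. The talk does not describe its cache maintenance at this level, and window expiry is omitted here.

```python
class CachedPipeline:
    """Hypothetical pipeline for new R tuples in R(a) JOIN S(a, b) JOIN T(b),
    with a cache over the S-T segment of the pipeline."""

    def __init__(self):
        self.window_s = []  # tuples (a, b)
        self.window_t = []  # b values
        self.cache = {}     # a -> list of (s, t) join results; soft state
        self.probes = 0     # window probes performed, to show the saving

    def insert_s(self, a, b):
        self.window_s.append((a, b))
        self.cache.clear()  # dropping cached entries is always safe

    def insert_t(self, b):
        self.window_t.append(b)
        self.cache.clear()

    def process_r(self, a):
        """Return join results for a new R tuple with join value `a`."""
        if a not in self.cache:  # cache miss: run the full pipeline segment
            matches = []
            self.probes += 1  # probe window on S
            for s in self.window_s:
                if s[0] == a:
                    self.probes += 1  # probe window on T
                    matches += [(s, t) for t in self.window_t if t == s[1]]
            self.cache[a] = matches
        return [(a, s, t) for s, t in self.cache[a]]


p = CachedPipeline()
p.insert_s(1, 10)
p.insert_s(1, 11)
p.insert_t(10)
first = p.process_r(1)
probes_after_first = p.probes
second = p.process_r(1)  # cache hit: bypasses the S and T probes
probes_after_second = p.probes
p.insert_t(11)           # cache is discarded, results stay correct
third = p.process_r(1)
print(first, second, third, p.probes)
```

Because the cache can be emptied at any time without affecting correctness, switching plans or reclaiming memory only costs the recomputation that the cache would otherwise have saved.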

44 stanfordstreamdatamanager 44 Adaptive Caching (A-Caching) Adaptive join ordering with A-Greedy or variant –Join operator orders ⇒ candidate caches Adaptive selection from candidate caches –Based on profiled costs and benefits of caches Adaptive memory allocation to chosen caches Problems are individually NP-Hard –Efficient approximation algorithms ⇒ scalable Details in [BMW+05]: “Adaptive Caching for Continuous Queries”, ICDE 2005

45 stanfordstreamdatamanager 45 A-Caching (caching part only) Profiler: Estimates costs and benefits of candidate caches Executor: MJoins with caches Re-optimizer: Ensures that maximum-benefit subset of candidate caches is used Combined in part for efficiency [Figure labels: List of candidate caches, Estimated statistics, Add/remove caches]

46 stanfordstreamdatamanager 46 Performance of Stream-Join Plans (1) Arrival rates of streams are in the ratio 1:1:1:10; other details of input are given in [BMW+05] [Plot; inset shows a join plan over streams R, T, S, U]

47 stanfordstreamdatamanager 47 Performance of Stream-Join Plans (2) Arrival rates of streams are in the ratio 15:10:5:1; other details of input are given in [BMW+05]

48 stanfordstreamdatamanager 48 Remainder of Talk Current and future research directions Related work

49 stanfordstreamdatamanager 49 Research Directions System complexity is growing beyond control Up to 80% of IT budgets spent on maintenance [McKinsey] –“People cost” dominates Data Management Systems must be self-managing –Autonomic Systems –Self-configuration, self-optimization, self-healing, and self-protection Adaptive processing is key [Figure: Data Management System between Applications and OS/Hardware. Applications: Streams, Sensors, P2P, Privacy, IR, Data Mining, PubSub, Federation. Data Formats & Access: Records, XML, Text, Web pages, Stored procedures, Web services, Indexes, Views, Partitions. OS/Hardware: Linux, Windows, Local disks, SAN, RAID, Remote DB, NAS, PDA]

50 stanfordstreamdatamanager 50 New Techniques for Autonomic Systems & Adaptive Processing Bringing more components under adaptive management –Ex: Parallelism, overload management, memory allocation, sharing data & computation across queries Being proactive as well as reactive –Rio prototype system for adaptive processing in conventional database systems [BBD04] –Considering uncertainty in statistics to choose robust query execution plans [BBD04] –Plan logging: A new overall approach to adaptive processing of continuous queries [BB05]

51 stanfordstreamdatamanager 51 Future Work in DSMSs Expanding the declarative interface –Event detection (e.g., regular expressions), data cubes, decision trees and other data mining models Scaling in data arrival rates –Graphics Processing Units (GPUs) as stream co-processors, Stanford’s streaming supercomputer Many others –Disk usage based on query and stream properties –Handling missing or imprecise data in streams –Handling hybrid workloads of continuous & regular queries

52 stanfordstreamdatamanager 52 Other Work More on StreaMon –Using stream properties for resource optimization [TODS] –Theory of pipelined processing [ICDT 2005] –System demonstration [SIGMOD 2004] Stanford’s STREAM DSMS –Overview papers, source code release, web demo Other aspects of a DSMS –Operator scheduling [SIGMOD 2003, VLDB Journal] –Continuous query language & semantics [VLDB Journal] –Memory requirements for continuous queries [PODS 2002, TODS]

53 stanfordstreamdatamanager 53 Other Work (contd.) Adaptive query processing architectures –Taxonomy of past work, next steps [CIDR 2005] –Adaptive processing for conventional database systems [Rio system, Technical report] –Concurrent use of multiple plans for same query [Technical report] Summer internship –Compressing relations using data mining [SIGMOD 2001]

54 stanfordstreamdatamanager 54 Related Work (Brief) Adaptive processing of continuous queries –E.g., Eddies [AH00], NiagaraCQ [CDT+00] Adaptive processing in conventional databases –Inter-query adaptivity, e.g., Leo [SLM+01], [BC03] –Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04], Tukwila [IFF+99] New approaches to query optimization –E.g., parametric [GW89, INS+92, HS03], expected-cost based [CHS99, CHG02], error-aware [VN03] Other DSMSs –E.g., Aurora, Gigascope, Nile, TelegraphCQ

55 stanfordstreamdatamanager 55 Summary New trends demand adaptive query processing –New applications, e.g., continuous queries, data streams –Increasing data sizes, query complexity, overall system complexity CS-wide push towards Autonomic Systems Our goal: Adaptive Data Management Systems –StreaMon: Adaptive Data Stream Engine –Rio: Adaptive Processing in Conventional Databases Google keywords: “shivnath”, “stanford stream”

