Adaptive Processing in Data Stream Systems
Shivnath Babu, Stanford University
Data Streams
New applications: data arrives as continuous, rapid, time-varying data streams
– Sensor networks, RFID tags
– Network monitoring and traffic engineering
– Financial applications
– Telecom call records
– Web logs and click-streams
– Manufacturing processes
Traditional databases: data is stored in finite, persistent data sets
Using a Traditional Database
[Diagram: a user/application loads data into stored tables R and S and issues one query at a time, receiving one result per query.]
New Approach for Data Streams
[Diagram: a user/application registers a continuous query with a stream query processor, which continuously produces results over the input streams.]
Example Continuous Queries
– Web: Amazon's best sellers over the last hour
– Network intrusion detection: track HTTP packets with a destination address matching a prefix in a given table and content matching "*\.ida"
– Finance: monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes
Data Stream Management System (DSMS)
[Diagram: a DSMS takes input streams and registered continuous queries, produces streamed and stored results, and can archive data and use stored tables.]
Primer on Database Query Processing
[Diagram: a declarative query is preprocessed into a canonical form, query optimization chooses the best query execution plan, and query execution runs that plan over the data to produce results.]
Traditional Query Optimization
[Diagram: the Optimizer finds the "best" plan for a query using estimated statistics from the Statistics Manager, which periodically collects statistics such as data sizes and histograms; the Executor runs the chosen plan to completion over the data, auxiliary structures, and statistics.]
Optimizing Continuous Queries is Challenging
– Continuous queries are long-running
– Stream properties can change while the query runs
  – Data properties: value distributions
  – Arrival properties: bursts, delays
– System conditions can change
– Performance of a fixed plan can change significantly over time
Adaptive processing: use the plan that is best for current conditions
Roadmap
– StreaMon: our adaptive query processing engine
– Adaptive ordering of commutative filters
– Adaptive caching for multiway joins
– Current and future work
  – Similar techniques apply to conventional databases
Traditional Optimization vs. StreaMon
[Diagram: traditional optimization pairs an Optimizer (which finds the "best" plan using estimated statistics from the Statistics Manager) with an Executor that runs the chosen plan to completion. StreaMon instead runs a Profiler that monitors current stream and system characteristics, a Re-optimizer that ensures the plan stays efficient for those characteristics and issues decisions to adapt, and an Executor that executes the current plan on incoming stream tuples; the components are combined in part for efficiency.]
Pipelined Filters
Commutative filters over a stream.
Example: track HTTP packets with a destination address matching a prefix in a given table and content matching "*\.ida"
Filters range from simple to complex:
– Boolean predicates
– Table lookups
– Pattern matching
– User-defined functions
[Diagram: packets flow through Filter1, Filter2, Filter3; bad packets are dropped along the way.]
Pipelined Filters: Problem Definition
Continuous query: F_1 ∧ F_2 ∧ … ∧ F_n
Plan: tuples flow through an ordering F_(1), F_(2), …, F_(n) of the filters
Goal: minimize the expected cost to process a tuple
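To make the objective concrete, here is a minimal sketch (Python, with illustrative names not taken from the talk) that computes the expected per-tuple cost of one ordering, assuming we are given each filter's cost and its drop-rate conditioned on the tuple having survived the earlier filters.

```python
# Minimal sketch: expected per-tuple cost of a filter ordering.
# costs[f] is the cost of evaluating filter f; cond_drop(f, prefix) is the
# probability that f drops a tuple that survived every filter in `prefix`.

def expected_cost(ordering, costs, cond_drop):
    total = 0.0
    p_reach = 1.0  # probability that a tuple reaches the next filter
    prefix = []
    for f in ordering:
        total += p_reach * costs[f]                   # pay for f only if the tuple gets there
        p_reach *= 1.0 - cond_drop(f, tuple(prefix))  # only survivors continue
        prefix.append(f)
    return total

# Toy example with two independent filters: putting the cheap filter first
# costs 5.0 in expectation, putting the expensive filter first costs 5.1.
drop = {0: 0.2, 1: 0.9}
costs = {0: 1.0, 1: 5.0}
print(expected_cost([0, 1], costs, lambda f, prefix: drop[f]))  # 1.0 + 0.8 * 5.0 = 5.0
print(expected_cost([1, 0], costs, lambda f, prefix: drop[f]))  # 5.0 + 0.1 * 1.0 = 5.1
```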
Pipelined Filters: Example
[Diagram: input tuples flow through filters F1, F2, F3, F4 to the output; most tuples are dropped along the way.]
Informal goal: if a tuple will be dropped, then drop it as cheaply as possible.
Why is Our Problem Hard?
– Filter drop-rates and costs can change over time
– Filters can be correlated, e.g., Protocol = HTTP and DestPort = 80
Metrics for an Adaptive Algorithm
– Speed of adaptivity: detecting changes and finding the new plan
– Run-time overhead: re-optimization, collecting statistics, plan switching
– Convergence properties: plan properties under stable statistics
[Diagram: StreaMon's Profiler, Re-optimizer, and Executor each contribute to these metrics.]
Pipelined Filters: Stable Statistics
Assume statistics are not changing:
– For independent filters, order them by decreasing drop-rate/cost ratio [MS79, IK84, KBZ86, H94]
– With correlated filters, finding the optimal ordering is NP-hard
Greedy algorithm using conditional statistics:
– F_(1) has the maximum drop-rate/cost ratio
– F_(2) has the maximum drop-rate/cost ratio over tuples not dropped by F_(1)
– And so on
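A minimal sketch of this greedy rule, assuming a helper cond_drop_rate(f, prefix) that returns filter f's drop-rate over tuples not dropped by the already-ordered prefix (the names are illustrative, not StreaMon's API):

```python
def greedy_ordering(filters, costs, cond_drop_rate):
    """Greedy rule: repeatedly pick the remaining filter with the highest
    conditional drop-rate/cost ratio over tuples that survive the chosen prefix."""
    remaining = list(filters)
    ordering = []
    while remaining:
        best = max(remaining, key=lambda f: cond_drop_rate(f, ordering) / costs[f])
        ordering.append(best)
        remaining.remove(best)
    return ordering
```

Under stable statistics this reproduces the Greedy ordering described above; the challenge addressed next is keeping the ordering consistent with this rule as the conditional statistics drift.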
Adaptive Version of Greedy
Greedy gives strong guarantees:
– 4-approximation, the best polynomial-time approximation possible assuming P ≠ NP [MBM+05]
– Holds for arbitrary (correlated) characteristics
– Usually optimal in experiments
Challenge:
– Online algorithm
– Fast adaptivity to the Greedy ordering
– Low run-time overhead
A-Greedy: Adaptive Greedy
A-Greedy
[Diagram: the Profiler maintains conditional filter drop-rates and costs over recent tuples; the Re-optimizer ensures that the filter ordering is Greedy for the current statistics and pushes changes in the filter ordering to the Executor, which processes tuples with the current Greedy ordering. The components are combined in part for efficiency.]
A-Greedy's Profiler
Responsible for maintaining current statistics:
– Filter costs
– Conditional filter drop-rates: exponentially many combinations!
Profile window: sampled statistics from which the required conditional drop-rates can be estimated
Profile Window
[Diagram: for a sample of recent tuples, the profile window records which of the filters F1-F4 would drop each tuple.]
Greedy Ordering Using Profile Window
[Diagram: the profile window is summarized as a matrix view over the filters, from which the current Greedy ordering is derived.]
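One way to read this picture: the profile window lets the conditional drop-rates be estimated by counting, and the Greedy ordering can be recomputed from those counts. The sketch below assumes each profiled tuple is stored as the set of filters that would drop it; it is illustrative only and does not show the incremental matrix-view maintenance used in A-Greedy.

```python
def greedy_from_profile(num_filters, costs, profile):
    """profile: list of sets; each set holds the filters that would drop that sampled tuple.
    Repeatedly pick the filter with the highest (conditional drop count / cost),
    then keep only the sampled tuples that filter would NOT drop."""
    remaining = set(range(num_filters))
    surviving = list(profile)
    ordering = []
    while remaining:
        best = max(remaining,
                   key=lambda f: sum(f in drops for drops in surviving) / costs[f])
        ordering.append(best)
        remaining.remove(best)
        surviving = [drops for drops in surviving if best not in drops]
    return ordering

# Example: filter 2 drops the most sampled tuples, so it comes first under unit costs.
profile = [{0, 2}, {2}, {1, 2}, {0}, set()]
print(greedy_from_profile(3, {0: 1.0, 1: 1.0, 2: 1.0}, profile))  # [2, 0, 1]
```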
A-Greedy's Re-optimizer
Maintains the matrix view over the profile window:
– Easy to incorporate filter costs
– Efficient incremental update
– Fast detection and correction of changes in the Greedy order
Details in [BMM+04]: "Adaptive Ordering of Pipelined Stream Filters", SIGMOD 2004
Next
– Tradeoffs and variants of A-Greedy
– Experimental results for A-Greedy
Tradeoffs
Suppose:
– Changes are infrequent
– Slower adaptivity is okay
– We want the best plans at very low run-time overhead
There is a three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties, giving a spectrum of A-Greedy variants.
Variants of A-Greedy

Algorithm     | Convergence properties          | Run-time overhead           | Adaptivity
A-Greedy      | 4-approx.                       | High (relative to others)   | Fast
Sweep         | 4-approx.                       | Less work per sampling step | Slow
Local-Swaps   | May get caught in local optima  | Less work per sampling step | Slow
Independent   | Misses correlations             | Lower sampling rate         | Fast
Experimental Setup
– Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon
– Studied convergence properties, run-time overhead, and adaptivity
– Synthetic testbed: can control stream data and arrival properties
– DSMS server running on a 700 MHz Linux machine with 1 MB L2 cache and 2 GB memory
Converged Processing Rate
[Chart: converged processing rate of Optimal-Fixed, A-Greedy, Sweep, Local-Swaps, and Independent.]
Effect of Filter Drop-Rate
[Chart: processing rate of the same algorithms as the filter drop-rate varies.]
Effect of Correlation
[Chart: processing rate of the same algorithms as filter correlation varies.]
Run-time Overhead
Adaptivity
[Chart: behavior over time (x1000 tuples processed); filter selectivities are permuted partway through and the algorithms adapt.]
Roadmap
– StreaMon: our adaptive query processing engine
– Adaptive ordering of commutative filters
– Adaptive caching for multiway joins
– Current and future work
Stream Joins
[Diagram: sensors R, S, and T stream observations to the DSMS, which joins the observations from the last minute and outputs join results.]
MJoins [VNB04]
[Diagram: each newly arrived tuple is joined against the windows on the other streams through a pipeline of join operators; for example, a new S tuple probes the window on R and then the window on T, and similarly for new R and T tuples.]
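As a rough illustration of the MJoin idea (a hand-rolled sketch over hash-indexed windows sharing a single join key, not the [VNB04] implementation), a new tuple is joined against the other windows one probe at a time, with no intermediate state kept between tuples:

```python
# Hash-indexed windows: windows[stream] maps a join-key value to the list of
# window tuples with that key; key_of extracts the (shared) join key from a tuple.

def mjoin_probe(new_tuple, probe_order, windows, key_of):
    """Join a newly arrived tuple against the other windows, one probe at a time.
    Nothing is materialized between tuples, so intermediate results are
    recomputed for every arrival."""
    key = key_of(new_tuple)
    partial = [(new_tuple,)]
    for stream in probe_order:
        matches = windows[stream].get(key, [])
        partial = [combo + (m,) for combo in partial for m in matches]
        if not partial:
            break  # no matches so far, hence no join results for this tuple
    return partial

# Toy usage: a new S tuple with join key 1 probes the windows on R and then T.
windows = {"R": {1: [("r1", 1)]}, "T": {1: [("t1", 1), ("t2", 1)]}}
print(mjoin_probe(("s1", 1), ["R", "T"], windows, key_of=lambda t: t[1]))
```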
Excessive Recomputation in MJoins
[Diagram: every new S tuple re-probes the windows on R and T from scratch, so intermediate join results are recomputed over and over.]
Materializing Join Subexpressions
[Diagram: a join subexpression over two of the windows is fully materialized, so new tuples can probe the materialized result instead of recomputing it.]
Tree Joins: Trees of Binary Joins
[Diagram: the windows on R, S, and T are combined by a tree of binary join operators; each internal node is a fully materialized join subexpression.]
Hard State Hinders Adaptivity
[Diagram: switching between tree-join plans that materialize different join subexpressions (e.g., W_R ⋈ W_T versus W_S ⋈ W_T) requires discarding and rebuilding hard state, making plan switches expensive.]
Can We Get the Best of Both Worlds?
– MJoin drawback: recomputation of intermediate results
– Tree join drawbacks: less adaptive, higher memory use
[Diagram: an MJoin pipeline over R, S, T shown side by side with a tree of binary joins.]
MJoins + Caches
[Diagram: a cache of W_R ⋈ W_T results is added to the MJoin; a new S tuple first probes the cache and, on a hit, bypasses the corresponding pipeline segment.]
MJoins + Caches (contd.)
Caches are soft state:
– Adaptive
– Flexible with respect to memory usage
This captures the whole spectrum from MJoins to tree joins, including plans in between.
Challenge: an adaptive algorithm to choose join operator orders and caches in the pipelines
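A minimal sketch of the probe/bypass step from the previous figure, assuming the cache is a plain dictionary from join key to the matching R-T combinations (illustrative only; invalidating entries as the R and T windows change is omitted):

```python
def probe_with_cache(s_tuple, cache, key_of, probe_segment, max_entries=10_000):
    """Probe the cache for this S tuple's pre-joined R-T matches; on a miss,
    fall through to the normal pipeline segment and remember its result.
    The cache is soft state: it can be dropped or shrunk at any time."""
    key = key_of(s_tuple)
    matches = cache.get(key)
    if matches is None:
        matches = probe_segment(s_tuple)     # the segment is bypassed only on a cache hit
        if len(cache) < max_entries:         # crude memory bound
            cache[key] = matches
    return [(s_tuple,) + m for m in matches]
```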
Adaptive Caching (A-Caching)
– Adaptive join ordering with A-Greedy or a variant
  – The join operator orders determine the candidate caches
– Adaptive selection from the candidate caches
– Adaptive memory allocation to the chosen caches
A-Caching (caching part only)
[Diagram: the Profiler estimates costs and benefits of the candidate caches; the Re-optimizer ensures that the maximum-benefit subset of candidate caches is used and tells the Executor to add or remove caches; the Executor runs MJoins with caches. The components are combined in part for efficiency.]
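One simple reading of "maximum-benefit subset" is a knapsack-style greedy choice over the profiler's estimates. The sketch below is an assumption-laden illustration, not the algorithm of [BMW+05]: it ranks candidate caches by estimated benefit per unit of memory and adds them until a memory budget is exhausted, ignoring interactions between caches.

```python
def choose_caches(candidates, memory_budget):
    """candidates: (cache_id, estimated_benefit, memory_cost) triples from the profiler.
    Greedily pick caches in decreasing benefit-per-unit-memory order until the
    memory budget is used up; interactions between caches are ignored."""
    chosen, used = [], 0.0
    for cache_id, benefit, mem in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if benefit > 0 and used + mem <= memory_budget:
            chosen.append(cache_id)
            used += mem
    return chosen

# Example: with a budget of 25 units, the R-T cache fits, the S-T cache does not,
# and the small R-S cache still fits.
print(choose_caches([("RT", 50.0, 10.0), ("ST", 30.0, 20.0), ("RS", 5.0, 15.0)], 25.0))
# -> ['RT', 'RS']
```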
Performance of Stream-Join Plans (1)
[Chart: a four-way join over streams R, S, T, U; stream arrival rates are in the ratio 1:1:1:10, and the other details of the input are given in [BMW+05].]
Performance of Stream-Join Plans (2)
[Chart: stream arrival rates are in the ratio 15:10:5:1; the other details of the input are given in [BMW+05].]
A-Caching: Results at a Glance
– Adaptively captures the whole spectrum from fully pipelined MJoins to tree-based joins
– Scalable approximation algorithms
– Different types of caches
– Up to 7x improvement over MJoin and 2x improvement over TreeJoin
Details in [BMW+05]: "Adaptive Caching for Continuous Queries", ICDE 2005 (to appear)
Current and Future Work
– Broadening StreaMon's scope, e.g.:
  – Shared computation among multiple queries
  – Parallelism
– Rio: adaptive query processing in conventional database systems
– Plan logging: a new overall approach to address certain "meta issues" in adaptive processing
Related Work
– Adaptive processing of continuous queries, e.g., Eddies [AH00], NiagaraCQ [CDT+00]
– Adaptive processing in conventional databases
  – Inter-query adaptivity, e.g., Leo [SLM+01], [BC03]
  – Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04]
– New approaches to query optimization, e.g., parametric [GW89, INS+92, HS03], expected-cost based [CHS99, CHG02], error-aware [VN03]
Summary
New trends demand adaptive query processing:
– New applications, e.g., continuous queries over data streams
– Increasing data size and query complexity
– A CS-wide push towards autonomic computing
Our goal: an adaptive data management system
– StreaMon: adaptive data stream engine
– Rio: adaptive processing in conventional DBMSs
Google keywords: shivnath, stanford stream
Performance of Stream-Join Plans
Adaptivity to Memory Availability
Plan Logging
Log the profiling and re-optimization history:
– The query is long-running
– Example: a view over the log for R ⋈ S ⋈ T, with columns such as Rate(R), …, selectivity(R,S), the plan chosen, and its cost
[Diagram: the logged plans (P1, P2) lie in a high-dimensional space of statistics such as Rate(R) and selectivity(R,S).]
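As a sketch of how such a log could be used (hypothetical structure, not part of the talk), a lookup can return a previously chosen plan whenever the current statistics are close to statistics recorded at an earlier re-optimization, avoiding a fresh optimizer call:

```python
def lookup_logged_plan(plan_log, current_stats, tolerance=0.2):
    """plan_log: list of (stats_dict, plan_id) entries recorded at past re-optimizations.
    Return a previously chosen plan whose logged statistics are all within
    `tolerance` (relative) of the current statistics, else None (re-optimize)."""
    for logged_stats, plan_id in reversed(plan_log):   # prefer recent entries
        if all(abs(logged_stats[k] - v) <= tolerance * max(abs(v), 1e-9)
               for k, v in current_stats.items()):
            return plan_id
    return None

plan_log = [({"rate_R": 1000.0, "sel_RS": 0.01}, "P1"),
            ({"rate_R": 5000.0, "sel_RS": 0.20}, "P2")]
print(lookup_logged_plan(plan_log, {"rate_R": 4800.0, "sel_RS": 0.22}))  # 'P2'
```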
Uses of Plan Logging
– Reducing re-optimization overhead: create a cache of plans
– Reducing profiling overhead: track how changes in a statistic contribute to changes in the best plan
[Diagram: plans P1 and P2 in the space of Rate(R) and selectivity(R,S).]
Uses of Plan Logging (contd.)
– Tracking the "return on investment" of adaptive processing
  – Track the cost versus benefit of adaptivity
  – Is there a single plan that would have good overall performance?
– Avoiding thrashing
  – Which statistics have only transient changes?
Adaptive Processing in Traditional DBMS
[Diagram: the traditional optimize-then-execute loop (Statistics Manager, Optimizer, Executor), with errors in the estimated statistics propagating into the chosen query plan.]
Proactive Re-optimization with Rio
[Diagram: the Statistics Manager and Profiler are combined for efficiency and collect statistics at run time from random samples; the (re-)optimizer considers (estimate, uncertainty) pairs during optimization to produce "robust" plans; the Executor executes the current plan instead of running one chosen plan to completion.]