Adaptive Processing in Data Stream Systems Shivnath Babu stanfordstreamdatamanager Stanford University.

Presentation transcript:

Adaptive Processing in Data Stream Systems Shivnath Babu stanfordstreamdatamanager Stanford University

2 Data Streams
New applications -- data as continuous, rapid, time-varying data streams
– Network monitoring and traffic engineering
– Sensor networks, RFID tags
– Financial applications
– Telecom call records
– Web logs and click-streams
– Manufacturing processes
Traditional databases -- data stored in finite, persistent data sets

3 Using a Traditional Database
[Diagram labels: User/Application, Loader, Query, Result, Table R, Table S]

4 New Approach for Data Streams
[Diagram labels: User/Application, Register Continuous Query (Standing Query), Stream Query Processor, Input streams, Result]

5 Example Continuous (Standing) Queries
Web
– Amazon’s best sellers over last hour
Network Intrusion Detection
– Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida”
Finance
– Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes
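Illustrative sketch (not from the slides): the Finance query can be read as a standing query over a time-based sliding window. The class name `PriceDropMonitor`, the tuple layout `(ts, symbol, price)`, and the reading of "moved down more than 2%" as "below the window's peak price by more than 2%" are all assumptions made for this example.

```python
from collections import deque


class PriceDropMonitor:
    """Hypothetical standing query: report a stock priced in [lo, hi] whose
    current price is more than `drop` below its peak in the last `window` seconds."""

    def __init__(self, lo=20.0, hi=200.0, drop=0.02, window=20 * 60):
        self.lo, self.hi, self.drop, self.window = lo, hi, drop, window
        self.history = {}  # symbol -> deque of (timestamp, price)

    def on_tuple(self, ts, symbol, price):
        """Process one stream tuple; return True if it satisfies the query."""
        q = self.history.setdefault(symbol, deque())
        q.append((ts, price))
        # Expire tuples that have fallen out of the time-based window.
        while q and q[0][0] < ts - self.window:
            q.popleft()
        peak = max(p for _, p in q)
        in_range = self.lo <= price <= self.hi
        return in_range and price < peak * (1 - self.drop)


monitor = PriceDropMonitor()
monitor.on_tuple(0, "X", 100.0)            # first tuple, nothing to compare against
alert = monitor.on_tuple(600, "X", 97.0)   # 3% below the 10-minute-old peak
```

The query is registered once and evaluated on every arriving tuple, which is the difference from the load-then-query model of a traditional database.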

6 Data Stream Management System
[Diagram labels: Data Stream Management System (DSMS), Register Continuous Query, Input Streams, Streamed Result, Stored Tables, Stored Result, Archive]

7 Data Stream Management System
[Diagram labels: Data Stream Management System (DSMS), Register Continuous Query, Input Streams, Streamed Result, Query Processor]
DSMS System Components
– Execution plan operators and synopsis data stores
– Query processor
– Memory manager
– Metadata and statistics catalog
– User/application interface
– Operator scheduler
– and others

8 Primer on Database Query Processing
[Diagram: Declarative Query → Preprocessing → Canonical form → Query Optimization → Best query execution plan → Query Execution → Results; other labels: Database System, Data]

9 Traditional Query Optimization
Optimizer: Finds “best” query plan to process this query
Executor: Runs chosen plan to completion
Statistics Manager: Periodically collects statistics, e.g., data sizes, histograms
[Diagram labels: Query; Chosen query plan; Which statistics are required; Estimated statistics; Data, auxiliary structures, statistics]

10 Optimizing Continuous Queries Poses New Challenges
Continuous queries are long-running
Stream properties can change while query runs
– Data properties: value distributions
– Arrival properties: bursts, delays
System conditions can change
Performance of a fixed plan can change significantly over time
Adaptive processing: use plan that is best for current conditions

11 Rest of this Talk
[Diagram labels: Register Continuous Query, Input Streams, Streamed Result, Query Processor, DSMS System Components]
StreaMon Adaptive Query Processing Engine
– Adaptive ordering of commutative filters
– Adaptive caching for multiway joins
– Adaptive use of input stream properties for resource mgmt.

12 Traditional Optimization → StreaMon
Traditional optimization
– Optimizer: Finds “best” query plan to process this query
– Executor: Runs chosen plan to completion
– Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms
– [Diagram labels: Query, Chosen query plan, Which statistics are required, Estimated statistics]
StreaMon
– Profiler: Monitors current stream and system characteristics
– Re-optimizer: Ensures that plan is efficient for current characteristics
– Executor: Executes current plan on incoming stream tuples
– [Diagram labels: Decisions to adapt, Combined in part for efficiency]

13 Pipelined Filters
Commutative filters over a stream
Example: Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida”
Simple to complex filters
– Boolean predicates
– Table lookups
– Pattern matching
– User-defined functions
[Diagram labels: Packets, Filter1, Filter2, Filter3, Bad packets]

14 Pipelined Filters: Problem Definition
Continuous Query: $F_1 \wedge F_2 \wedge \dots \wedge F_n$
Plan: Tuples $\rightarrow F_{\sigma(1)} \rightarrow F_{\sigma(2)} \rightarrow \dots \rightarrow F_{\sigma(n)}$
Goal: Minimize expected cost to process a tuple
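Illustrative note (not from the slides): one way to write the goal down, assuming $\sigma$ is the permutation that fixes the filter order, $c_{\sigma(i)}$ is the per-tuple cost of the $i$-th filter in the plan, and $d_{\sigma(j)}$ is the drop-rate of the $j$-th filter measured only on tuples that earlier filters let through. The symbols $c$ and $d$ are introduced here for illustration.

$$
\mathrm{E}[\text{cost per tuple}] \;=\; \sum_{i=1}^{n} c_{\sigma(i)} \prod_{j=1}^{i-1} \bigl(1 - d_{\sigma(j)}\bigr)
$$

For example, with two unit-cost filters where the first drops 60% of tuples, the expected cost is $1 + (1 - 0.6)\cdot 1 = 1.4$ filter evaluations per tuple; placing a filter that drops only 10% first would give $1 + 0.9 = 1.9$.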

15 Pipelined Filters: Example
[Diagram labels: Input tuples, F1, F2, F3, F4, Output tuples]
Informal Goal: If tuple will be dropped, then drop it as cheaply as possible

16 Why is Our Problem Hard?
Filter drop-rates and costs can change over time
Filters can be correlated
– E.g., Protocol = HTTP and DestPort = 80

17 Metrics for an Adaptive Algorithm
Speed of adaptivity
– Detecting changes and finding new plan
Run-time overhead
– Re-optimization, collecting statistics, plan switching
Convergence properties
– Plan properties under stationary statistics
[Diagram labels: StreaMon, Profiler, Re-optimizer, Executor]

18 Pipelined Filters: Stationary Statistics
Assume statistics are not changing
– Order filters by decreasing drop-rate/cost [MS79, IK84, KBZ86, H94]
– Correlations ⇒ NP-Hard
Greedy algorithm: Use conditional statistics
– $F_{\sigma(1)}$ has maximum drop-rate/cost ratio
– $F_{\sigma(2)}$ has maximum drop-rate/cost ratio for tuples not dropped by $F_{\sigma(1)}$
– And so on
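Illustrative sketch (not from the slides): the Greedy rule can be run over a sample of tuples for which we know which filters would drop each one. The function names, the sample layout (one dict per tuple mapping filter name to "drops this tuple"), and the ten-tuple sample itself are assumptions for this example. The sample is built so that F1 and F2 are strongly correlated, which is where ordering by unconditional drop-rate goes wrong.

```python
def greedy_order(sample, costs):
    """Repeatedly pick the filter with the highest drop-rate/cost ratio,
    measured only on sample tuples that survived the filters picked so far."""
    remaining = list(costs)
    survivors = list(sample)
    order = []
    while remaining:
        def ratio(f):
            if not survivors:
                return 0.0
            dropped = sum(1 for t in survivors if t[f])
            return (dropped / len(survivors)) / costs[f]
        best = max(remaining, key=ratio)
        order.append(best)
        remaining.remove(best)
        survivors = [t for t in survivors if not t[best]]
    return order


def independent_order(sample, costs):
    """Order by unconditional drop-rate/cost, ignoring correlations."""
    def ratio(f):
        return (sum(1 for t in sample if t[f]) / len(sample)) / costs[f]
    return sorted(costs, key=ratio, reverse=True)


def expected_cost(sample, costs, order):
    """Average cost per tuple when filters run in `order` and stop at the first drop."""
    total = 0.0
    for t in sample:
        for f in order:
            total += costs[f]
            if t[f]:
                break
    return total / len(sample)


# Hypothetical sample: F1 drops tuples 0-5, F2 drops 0-4 (mostly the same ones), F3 drops 5-8.
sample = [{"F1": i <= 5, "F2": i <= 4, "F3": 5 <= i <= 8} for i in range(10)]
costs = {"F1": 1.0, "F2": 1.0, "F3": 1.0}

g = greedy_order(sample, costs)        # conditional statistics put F3 ahead of F2
ind = independent_order(sample, costs)
```

On this sample the Greedy order costs 1.5 filter evaluations per tuple and the independent order costs 1.8, because F2 drops nothing that F1 has not already dropped.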

19 Adaptive Version of Greedy
Greedy gives strong guarantees
– 4-approximation, best poly-time approx. possible assuming P ≠ NP [CFK03, FLT04, MBM+05]
– For arbitrary (correlated) characteristics
– Usually optimal in experiments
Challenge:
– Online algorithm
– Fast adaptivity to Greedy ordering
– Low run-time overhead
⇒ A-Greedy: Adaptive Greedy

20 A-Greedy
Profiler: Maintains conditional filter drop-rates and costs over recent tuples
Re-optimizer: Ensures that filter ordering is Greedy for current statistics
Executor: Processes tuples with current Greedy ordering
[Diagram labels: Which statistics are required, Estimated statistics, Changes in filter ordering, Combined in part for efficiency]

21 A-Greedy’s Profiler
Responsible for maintaining current statistics
– Filter costs
– Conditional filter drop-rates -- exponentially many
Profile Window: Sampled statistics from which required conditional drop-rates can be estimated

22 Profile Window
[Diagram: Profile Window with one column per filter F1, F2, F3, F4]

23 Greedy Ordering Using Profile Window
[Diagram: Profile Window columns for F1, F2, F3, F4 shown under successive filter orders; Matrix View → Greedy Ordering]

24 A-Greedy’s Re-optimizer
Maintains Matrix View over Profile Window
– Easy to incorporate filter costs
– Efficient incremental update
– Fast detection & correction of changes in Greedy order
Details in [BMM+04]: “Adaptive Ordering of Pipelined Stream Filters”, ACM SIGMOD 2004
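Illustrative sketch (not from the slides): one plausible reading of the Matrix View, assuming entry `view[i][j]` counts the Profile Window tuples that pass the first `i` filters of the current order and are dropped by the filter at position `j`. Under that assumption the order is Greedy when, in every row, the diagonal entry has the best count/cost ratio. The function names and the sample window are hypothetical, and the view is rebuilt from scratch here rather than updated incrementally as the slide describes.

```python
def matrix_view(window, order):
    """Upper-triangular counts over the profile window for a given filter order."""
    n = len(order)
    view = [[0] * n for _ in range(n)]
    for drops in window:          # drops: filter name -> True if it drops the tuple
        for i in range(n):
            for j in range(i, n):
                if drops[order[j]]:
                    view[i][j] += 1
            if drops[order[i]]:
                break             # tuple does not reach rows below i
    return view


def first_violation(view, order, costs):
    """Return (i, j) if the filter at position j beats the one at position i, else None."""
    n = len(order)
    for i in range(n):
        for j in range(i + 1, n):
            if view[i][j] / costs[order[j]] > view[i][i] / costs[order[i]]:
                return (i, j)
    return None


# Hypothetical profile window: F1 drops tuples 0-5, F2 drops 0-4, F3 drops 5-8.
window = [{"F1": i <= 5, "F2": i <= 4, "F3": 5 <= i <= 8} for i in range(10)]
costs = {"F1": 1.0, "F2": 1.0, "F3": 1.0}

stale = ["F1", "F2", "F3"]
stale_view = matrix_view(window, stale)
fixed = ["F1", "F3", "F2"]
fixed_view = matrix_view(window, fixed)
```

With the stale order, row 1 shows that F3 drops three of the tuples that pass F1 while F2 drops none, so positions 1 and 2 should be swapped; after the swap no row violates the check.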

25 Next
Tradeoffs and variations of A-Greedy
Experimental results for A-Greedy

26 Tradeoffs
Suppose:
– Changes are infrequent
– Slower adaptivity is okay
– Want best plans at very low run-time overhead
Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties
Spectrum of A-Greedy variants

27 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
[Diagram labels: Profile Window, Matrix View]

28 Variants of A-Greedy
Algorithm | Convergence Properties | Run-time Overhead | Adaptivity
A-Greedy | 4-approx. | High (relative to others) | Fast
Sweep | 4-approx. | Less work per sampling step | Slow
Local-Swaps | May get caught in local optima | Less work per sampling step | Slow
Independent | Misses correlations | Lower sampling rate | Fast

29 Experimental Setup
Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon
Studied convergence properties, run-time overhead, and adaptivity
Synthetic stream-generation testbed
– Can control & vary stream data and arrival properties
DSMS server running on 700 MHz Linux machine, 1 MB L2 cache, 2 GB memory

30 Converged Processing Rate
[Plot: curves for Optimal-Fixed, Sweep, A-Greedy, Independent, Local-Swaps]

31 Effect of Correlation
[Plot: curves for Optimal-Fixed, Sweep, A-Greedy, Independent, Local-Swaps]

32 Run-time Overhead

33 Adaptivity
[Plot: x-axis “Progress of time (x1000 tuples processed)”; annotation “Stream data properties changed here”]

34 Remainder of Talk
Adaptive caching for multiway joins
Current and future research directions
Related work

35 Stream Joins
[Diagram labels: Sensor R, Sensor S, Sensor T, DSMS, observations in the last minute, join results]

36 MJoins (VNB04)
[Diagram: Window on R, Window on S, Window on T; one pipeline of join operators per stream, labeled ⋈R ⋈T, ⋈S ⋈T, and ⋈S ⋈R]

37 Excessive Recomputation in MJoins
[Diagram: pipeline ⋈R ⋈T; Window on R, Window on S, Window on T]

38 Materializing Join Subexpressions
[Diagram: Window on R, Window on S, Window on T; fully-materialized join subexpression]

39 Tree Joins: Trees of Binary Joins
[Diagram: tree of binary joins over R, S, T; Window on R, Window on T, Window on S; fully-materialized join subexpression]

40 Hard State Hinders Adaptivity
[Diagram: plan switch between two tree-join plans over R, S, T; labels W_R, W_T in the first plan and W_S, W_T in the second]

41 Can We Get Best of Both Worlds?
MJoin
– Recomputation
Tree Join
– Less adaptive
– Higher memory use
[Diagram: MJoin pipelines and a tree join over R, S, T]

42 MJoins + Caches
[Diagram: an S tuple probes a cache and, on a hit, bypasses the pipeline segment ⋈R ⋈T; labels: Window on R, Window on S, Window on T, W_R, W_T, Cache, Probe, Bypass pipeline segment]

43 MJoins + Caches (contd.)
Caches are soft state
– Enables plan switching with almost no overhead
– Flexible with respect to memory availability
Captures whole spectrum from MJoins to Tree Joins, and plans in between
Challenge: adaptive algorithm to choose join operator orders and caches in pipelines
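Illustrative sketch (not from the slides): a toy MJoin pipeline with a soft-state cache, assuming a three-way join R(a) ⋈ S(a, b) ⋈ T(b) and showing only the pipeline for new R tuples. The class `CachedPipeline`, the schema, and the invalidate-on-update policy are assumptions for this example; window expiry is omitted and a real system would likely maintain cache entries incrementally.

```python
from collections import defaultdict


class CachedPipeline:
    """Pipeline for new R tuples: probe the S window on a, then the T window on b.
    A cache keyed on a stores the result of that whole segment."""

    def __init__(self):
        self.window_s = defaultdict(list)  # a -> S tuples (a, b)
        self.window_t = defaultdict(list)  # b -> T tuples (b,)
        self.cache = {}                    # a -> list of joined (s, t) pairs
        self.hits = 0
        self.misses = 0

    def insert_s(self, a, b):
        self.window_s[a].append((a, b))
        self.cache.pop(a, None)            # only the entry for key a can be stale

    def insert_t(self, b):
        self.window_t[b].append((b,))
        self.cache.clear()                 # conservative: drop all cached state

    def process_r(self, a):
        if a in self.cache:                # hit: bypass the pipeline segment
            self.hits += 1
            segment = self.cache[a]
        else:                              # miss: recompute through both join operators
            self.misses += 1
            segment = [(s, t)
                       for s in self.window_s.get(a, [])
                       for t in self.window_t.get(s[1], [])]
            self.cache[a] = segment
        return [((a,), s, t) for s, t in segment]


pipeline = CachedPipeline()
pipeline.insert_s(1, 10)
pipeline.insert_t(10)
first = pipeline.process_r(1)    # miss: computed by probing both windows
second = pipeline.process_r(1)   # hit: served from the cache
```

Because the cache is soft state, it can be emptied at any time (for a plan switch or under memory pressure) and the pipeline still returns the same results, only with more recomputation.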

44 Adaptive Caching (A-Caching)
Adaptive join ordering with A-Greedy or variant
– Join operator orders ⇒ candidate caches
Adaptive selection from candidate caches
– Based on profiled costs and benefits of caches
Adaptive memory allocation to chosen caches
Problems are individually NP-Hard
– Efficient approximation algorithms ⇒ scalable
Details in [BMW+05]: “Adaptive Caching for Continuous Queries”, ICDE 2005

45 A-Caching (caching part only)
Profiler: Estimates costs and benefits of candidate caches
Re-optimizer: Ensures that maximum-benefit subset of candidate caches is used
Executor: MJoins with caches
[Diagram labels: List of candidate caches, Estimated statistics, Add/remove caches, Combined in part for efficiency]

46 Performance of Stream-Join Plans (1)
[Plot, with a join-plan diagram over R, S, T, U]
Arrival rates of streams are in the ratio 1:1:1:10; other details of input are given in [BMW+05]

47 Performance of Stream-Join Plans (2)
[Plot]
Arrival rates of streams are in the ratio 15:10:5:1; other details of input are given in [BMW+05]

48 Remainder of Talk
Current and future research directions
Related work

49 Research Directions
System complexity is growing beyond control
Up to 80% of IT budgets spent on maintenance [McKinsey]
– “People cost” dominates
Data Management Systems must be self-managing
– Autonomic Systems
– Self-configuration, self-optimization, self-healing, and self-protection
Adaptive processing is key
[Diagram: Data Management System between Applications and OS/Hardware]
– Applications: Streams, Sensors, P2P, Privacy, IR, Data Mining, PubSub, Federation
– Data Formats & Access: Records, XML, Text, Web pages, Stored procedures, Web services, Indexes, Views, Partitions
– OS/Hardware: Linux, Windows, Local disks, SAN, RAID, Remote DB, NAS, PDA

50 New Techniques for Autonomic Systems & Adaptive Processing
Bringing more components under adaptive management
– Ex: Parallelism, overload management, memory allocation, sharing data & computation across queries
Being proactive as well as reactive
– Rio prototype system for adaptive processing in conventional database systems [BBD04]
– Considering uncertainty in statistics to choose robust query execution plans [BBD04]
– Plan logging: A new overall approach to adaptive processing of continuous queries [BB05]

51 Future Work in DSMSs
Expanding the declarative interface
– Event detection (e.g., regular expressions), data cubes, decision trees and other data mining models
Scaling in data arrival rates
– Graphics Processing Units (GPUs) as stream co-processors, Stanford’s streaming supercomputer
Many others
– Disk usage based on query and stream properties
– Handling missing or imprecise data in streams
– Handling hybrid workloads of continuous & regular queries

52 Other Work
More on StreaMon
– Using stream properties for resource optimization [TODS]
– Theory of pipelined processing [ICDT 2005]
– System demonstration [SIGMOD 2004]
Stanford’s STREAM DSMS
– Overview papers, source code release, web demo
Other aspects of a DSMS
– Operator scheduling [SIGMOD 2003, VLDB Journal]
– Continuous query language & semantics [VLDB Journal]
– Memory requirements for continuous queries [PODS 2002, TODS]

53 Other Work (contd.)
Adaptive query processing architectures
– Taxonomy of past work, next steps [CIDR 2005]
– Adaptive processing for conventional database systems [Rio system, Technical report]
– Concurrent use of multiple plans for same query [Technical report]
Summer internship
– Compressing relations using data mining [SIGMOD 2001]

54 Related Work (Brief)
Adaptive processing of continuous queries
– E.g., Eddies [AH00], NiagaraCQ [CDT+00]
Adaptive processing in conventional databases
– Inter-query adaptivity, e.g., Leo [SLM+01], [BC03]
– Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04], Tukwila [IFF+99]
New approaches to query optimization
– E.g., parametric [GW89, INS+92, HS03], expected-cost based [CHS99, CHG02], error-aware [VN03]
Other DSMSs
– E.g., Aurora, Gigascope, Nile, TelegraphCQ

55 Summary
New trends demand adaptive query processing
– New applications, e.g., continuous queries, data streams
– Increasing data sizes, query complexity, overall system complexity
CS-wide push towards Autonomic Systems
Our goal: Adaptive Data Management Systems
– StreaMon: Adaptive Data Stream Engine
– Rio: Adaptive Processing in Conventional Databases
Google keywords: “shivnath”, “stanford stream”