Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group.

Slides:



Advertisements
Similar presentations
Chapter 9. Performance Management Enterprise wide endeavor Research and ascertain all performance problems – not just DBMS Five factors influence DB performance.
Advertisements

Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.
CSCI 4717/5717 Computer Architecture
C6 Databases.
MapReduce Online Veli Hasanov Fatih University.
Online Filtering, Smoothing & Probabilistic Modeling of Streaming Data In short, Applying probabilistic models to Streams Bhargav Kanagal & Amol Deshpande.
CS 540 Database Management Systems
The State of the Art in Distributed Query Processing by Donald Kossmann Presented by Chris Gianfrancesco.
IntroductionAQP FamiliesComparisonNew IdeasConclusions Adaptive Query Processing in the Looking Glass Shivnath Babu (Stanford Univ.) Pedro Bizarro (Univ.
Information Capture and Re-Use Joe Hellerstein. Scenario Ubiquitous computing is more than clients! –sensors and their data feeds are key –smart dust.
Adaptive Query Processing for Wide-Area Distributed Data Michael Franklin UC Berkeley Joint work with Tolga Urhan, Laurent Amsaleg, and Anthony Tomasic.
Fjording the Stream: An Architecture for Queries over Streaming Sensor Data Samuel Madden, Michael J. Franklin University of California, Berkeley Proceedings.
Eddies: Continuously Adaptive Query Processing Ron Avnur Joseph M. Hellerstein UC Berkeley.
PSoup Kevin Menard CS 561 4/11/2005. Streaming Queries over Streaming Data Sirish Chandrasekaran UC Berkeley August 20, 2002 with Michael J. Franklin.
Dynamic Plan Migration for Continuous Query over Data Streams Yali Zhu, Elke Rundensteiner and George Heineman Database System Research Group Worcester.
Adaptive Query Processing for Wide-Area Distributed Data Michael Franklin University of Maryland Joint work with Tolga Urhan, Laurent Amsaleg, and Anthony.
1 Continuously Adaptive Continuous Queries (CACQ) over Streams Samuel Madden, Mehul Shah, Joseph Hellerstein, and Vijayshankar Raman Presented by: Bhuvan.
CMPT 300: Final Review Chapters 8 – Memory Management: Ch. 8, 9 Address spaces Logical (virtual): generated by the CPU Physical: seen by the memory.
Adaptive Query Processing for Wide-Area Distributed Data Michael Franklin UC Berkeley Joint work with Tolga Urhan, Laurent Amsaleg, and Anthony Tomasic.
Telegraph: An Adaptive Global- Scale Query Engine Joe Hellerstein.
Challenges in Sensor Network Query Processing Sam Madden NEST Retreat January 15, 2002.
Sensor Networks: Implications for Database Systems and Vice-Versa Michael Franklin January UCB Sensor Day.
Towards Adaptive Dataflow Infrastructure Joe Hellerstein, UC Berkeley.
Streaming Data, Continuous Queries, and Adaptive Dataflow Michael Franklin UC Berkeley NRC June 2002.
Telegraph: A Universal System for Information. Telegraph History & Plans Initial Vision –Carey, Hellerstein, Stonebraker –“Regres”, “B-1” Sweat, ideas.
Eddies: Continuously Adaptive Query Processing Based on a SIGMOD’2002 paper and talk by Avnur and Hellerstein.
Data-Intensive Systems Michael Franklin UC Berkeley
1 04/18/2005 Flux Flux: An Adaptive Partitioning Operator for Continuous Query Systems M.A. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin UC.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage) Jianjun Chen et al Computer Sciences.
施賀傑 何承恩 TelegraphCQ. Outline Introduction Data Movement Implies Adaptivity Telegraph - an Ancestor of TelegraphCQ Adaptive Building.
1 XJoin: Faster Query Results Over Slow And Bursty Networks IEEE Bulletin, 2000 by T. Urhan and M Franklin Based on a talk prepared by Asima Silva & Leena.
Data Streams and Continuous Query Systems
Budget-based Control for Interactive Services with Partial Execution 1 Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety Microsoft Research.
Sensor Database System Sultan Alhazmi
1 Fjording The Stream An Architecture for Queries over Streaming Sensor Data Samuel Madden, Michael Franklin UC Berkeley.
Online aggregation Joseph M. Hellerstein University of California, Berkley Peter J. Haas IBM Research Division Helen J. Wang University of California,
©Ian Sommerville 2000 Software Engineering, 6th edition. Chapter 10Slide 1 Architectural Design l Establishing the overall structure of a software system.
© ETH Zürich Eric Lo ETH Zurich a joint work with Carsten Binnig (U of Heidelberg), Donald Kossmann (ETH Zurich), Tamer Ozsu (U of Waterloo) and Peter.
PermJoin: An Efficient Algorithm for Producing Early Results in Multi-join Query Plans Justin J. Levandoski Mohamed E. Khalefa Mohamed F. Mokbel University.
Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson.
Lecture 15- Parallel Databases (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
An Example Data Stream Management System: TelegraphCQ INF5100, Autumn 2009 Jarle Søberg.
1 Continuously Adaptive Continuous Queries (CACQ) over Streams Samuel Madden SIGMOD 2002 June 4, 2002 With Mehul Shah, Joseph Hellerstein, and Vijayshankar.
Operating Systems 1 K. Salah Module 1.2: Fundamental Concepts Interrupts System Calls.
Eddies: Continuously Adaptive Query Processing Ross Rosemark.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 3: Process-Concept.
CS4432: Database Systems II Query Processing- Part 2.
Telegraph Status Joe Hellerstein. Overview Telegraph Design Goals, Current Status First Application: FFF (Deep Web) Budding Application: Traffic Sensor.
W. Hong & S. Madden – Implementation and Research Issues in Query Processing for Wireless Sensor Networks, ICDE 2004.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
CS 540 Database Management Systems
By: Peter J. Haas and Joseph M. Hellerstein published in June 1999 : Presented By: Sthuti Kripanidhi 9/28/20101 CSE Data Exploration.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases Jianjun Chen et al Computer Sciences Dept. University of Wisconsin-Madison SIGMOD.
Query Optimization for Stream Databases Presented by: Guillermo Cabrera Fall 2008.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
CS4432: Database Systems II Query Processing- Part 1 1.
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
William Stallings Computer Organization and Architecture
Telegraph: An Adaptive Global-Scale Query Engine
Streaming Sensor Data Fjord / Sensor Proxy Multiquery Eddy
Technical Capabilities
TelegraphCQ: Continuous Dataflow Processing for an Uncertain World
PSoup: A System for streaming queries over streaming data
Eddies for Continuous Queries
Adaptive Query Processing (Background)
Information Capture and Re-Use
Challenges in Sensor Network Query Processing
Presentation transcript:

Continuously Adaptive Processing of Data and Query Streams Michael Franklin UC Berkeley April 2002 Joint work w/Joe Hellerstein and the Berkeley DB Group.

2 Overview Autonomic Computing is a technique for making computer systems self-managing. But, real issue is supporting autonomic organizations. They need information in a timely and tailored manner. workflows, business flows, news, business intelligence, sensor input, … The information infrastructure is the “nervous system” of the organization. Adaptive, responsive data management systems are essential.

3 Opportunities for Autonomic Data Mgmt. The functions currently expected of the DBA must be automated. (see next talk) – Data placement, index selection, caching, pre-fetching, “freshening”, … – We are pursuing an approach based on User Profiles. The query processing engine itself must be made “autonomic”. – This is the focus of today’s talk. – But, issues are related: profiles are a type of standing query.

4 Example 1: “Real-Time Business” Event-driven processing B2B and Enterprise apps – Supply-Chain, CRM – Trade Reconciliation, Order Processing etc. (Quasi) real-time flow of events and data Must manage these flows to drive business processes. Mine flows to create and adjust business rules. Can also “tap into” flows for on-line analysis.

5 Example 2: Information Dissemination User Profiles Users Filtered Data Data Sources Doc creation or crawler initiates flow of data towards users. Users + system initiate flow of profiles back towards data.

6 Example 3: Sensor Nets Tiny devices measure the environment. – Berkeley “motes”, Smart Dust, Smart Tags, … Major research thrust at Berkeley. – Apps: Transportation, Seismic, Energy,… Form dynamic ad hoc networks, aggregate and communicate streams of values. Sensors are the stimulus receptors of the autonomic organization.

7 Common Features Centrality of dataflow – Architecture is focused on data movement – Moving streams of data through code in a network Requires intelligent, low-overhead routing Volatility of the environment – Dynamic resources & topology, partial failures – Long-running (never-ending?) tasks – Potential for user interaction during the flow – Large Scale: users, data, resources, … Requires adaptivity

8 Review: Distributed Query Processing SELECT eid,ename,title FROM Emp, Proj, Assign WHERE Emp.eid = Assign.eid AND Proj.pid = Assign.pid AND Emp.loc <> Proj.loc System handles query plan generation & optimization; ensures correct execution. Originally conceived for corporate networks. ©1998 Ozsu and Valduriez

9 – Sources may be unreachable or slow to respond. – Data statistics/cost estimates may be unavailable or unreliable. – No view into internal processing and structure. Traditional, static query processing approaches cannot cope with such problems at run-time. Wide-area + Wrapped sources  Unpredictability

10 Batch processing is inappropriate for many apps. – Must provide feedback to the user as quickly as possible. Data access becomes a cooperative, iterative process: – User may correct, refine, redirect, or change the query. User-in-the-loop  Unpredictability

11 – Mobility Location-centric queries Moving endpoints change data staging needs – Data Streams/Sensors Varying data arrival rates Adapting resolutions Push vs. Pull Queries can be long lived and streams may be infinite so a static plan is doomed to failure. Mobility & Data Streams  Unpredictability

12 Adaptive Query Processing Existing systems are adaptive but at a slow rate. – Collect Stats – Compile and Optimize Query – Eventually collect stats again or change schema – Re-compile and optimize if necessary. [Survey: Hellerstein, Franklin, et al., DE Bulletin 2000] static plans current DBMS

13 Adaptive Query Processing Goal: allow adjustments for runtime conditions. Basic idea: leave “options” in compiled plan. At runtime, bind options based on observed state: – available memory, load, cache contents, etc. Once bound, plan is followed for entire query. [HP88,GW89,IN+92,GC94,AC+96, AZ96,LP97] Dynamic, Parametric, Competitive, … static plans late binding current DBMS

14 Adaptive Query Processing Start with a compiled plan. Observe performance between blocking (or blocked) operators. Re-optimize remainder of plan if divergence from expected data delivery rate [AF+96,UFA98] or data statistics [KD98]. Dynamic, Parametric, Competitive, … static plans late binding inter- operator current DBMS Query Scrambling, MidQuery Re-opt

15 Adaptive Query Processing Join operators themselves can be made adaptive: – to user needs (Ripple Joins [HH99]) – to memory limitations (DPHJ [IF+99]) – to memory limiations and delays (XJoin [UF00]) Plan Re-optimization can also be done in mid- operation (Convergent QP [IHW02]) Dynamic, Parametric, Competitive, … static plans late binding inter- operator current DBMS Query Scrambling, MidQuery Re-opt Ripple Join, XJoin, DPHJ, Convergent QP intra- operator

16 Adaptive Query Processing This is the region that we are exploring in the Telegraph project at Berkeley. Dynamic, Parametric, Competitive, … static plans ??? late binding inter- operator per tuple current DBMS Query Scrambling, MidQuery Re-opt Eddies, CACQ XJoin, DPHJ Convergent QP ??? PSoup intra- operator

17 Telegraph Overview An adaptive system for large-scale dataflow processing.  A convergence of Data Management and Networking approaches. Based on an extensible set of operators: 1) Ingress (data access) operators  Screen Scraper, Napster/Gnutella readers,  File readers, Sensor Proxies 2) Non-Blocking Data processing operators  Selections (filters), XJoins, … 3) Adaptive Routing Operators  Eddies, STeMs, FLuX, etc.

18 Routing Operators: Eddies How to order and reorder operators over time? –Traditionally, use performance, economic/admin feedback –won’t work for never-ending queries over volatile streams Instead, use adaptive record routing. Reoptimization = change in routing policy [Avnur & Hellerstein, SIGMOD 00] static dataflow AB C D eddy A B C D

19 Eddy – Per Tuple Adaptivity Adjusts flow adaptively – Tuples routed through ops in diff. orders – Must visit each operator once before output State is maintained on a per-tuple basis Two complementary routing mechanisms – Back pressure: each operator has a queue, don’t route to ops with full queue – avoids expensive operators. – Lottery Scheduling: Give preference to operators that are more selective. Initial Results showed eddy could “learn” good plans and adjust to changes in stream contents over time. Currently in the process of exploring the inherent tradeoffs of such fine-grained adaptivity.

20 Traditional Hash Joins block when one input stalls. Non-Blocking Operators – Join Build Probe Source A Source B Hash Table A

21 Non-Blocking Operators – Join Hash Table A Hash Table B Symm Hash Join [WA91] blocks only if both stall. Processes tuples as they arrive from sources. Produces all answer tuples and no spurious duplicates. Build Probe Source A Source B

22 SteMs:“State Modules” [Raman 01] A generalization of the Symmetric Join Split join into modules that contain current state of the operators. – Basically a store with one or more indexes. – Allows sharing of state across multiple queries. Supports: XJoin, Continuous Queries, Competitive processing, …

23 Using SteMs for Joins Hash A Hash B Hash C Hash D A B C D SteMs maintain intermediate state for multiple joins. Use Eddy to route tuples through the necessary modules. Be sure to enforce “build then probe” processing. Note, can also maintain results of individual joins in SteMs.

24 CACQ [Madden et al, SIGMOD 02]   Stocks. symbol = ‘MSFT’ Stocks. symbol = ‘APPL’ Query 2Query 1 Stock Quotes ‘MSFT’ ‘APPL’ Stock Quotes   Expect hundreds to thousands of queries over same sources Continuously Adaptive Continuous Queries – combine operators from many queries to improve efficiency (i.e. share work).

25 Combining Queries in CACQ Q1 = Select * From S where S.a = s1 and S.b = s4; Q2 = Select * From S where S.a = s2 and S.b = s5; Q3 = Select * From S where S.a = s3 and S.b = s6; Can also use SteMs to store query specifications. Need a good predicate index if many queries. Eddy now is routing tuples for multiple queries simultaneously CACQ leverages Eddies for adaptive CQ processing Results show advantages over static approaches.

26 In The Beginning Data Query Index Result

27 Pub Sub/CQ/Filtering Queries Data Index Result These systems can be looked at as database systems turned upsidedown.

28 PSoup [Chandrasekaran 02] Logical (?) conclusion: – Treat queries and data as duals. CACQ etc are “filters”: – No historical data. – Answers continuously returned as new data streams into the system. Such continuous answers are often: – Infeasible – intermittent connectivity and “Data Recharging” profiles. – Inefficient – often user does not want to be continuously interrupted.

29 PSoup (cont.) Query processing is treated as an n-way symmetric join between data tuples and query specifications. PSoup model: repeated invocation of standing queries over windows of streaming data.

30 PSoup: Query & Data Duality Queries Index Result Data Index

31 PSoup: Query & Data Duality Queries Index Result Data Index Query

32 PSoup (continued) Queries specify a “window” (e.g. last 10 min) Queries are registered with the system – These are entered into a “Query SteM” Data tuples stream into the system – These are entered into a “Data SteM” Matching is done on the fly (in background). – Effectively keeping shared materialized views. When a query is invoked: – The query window is applied to the materialized answer and results are returned to the user.

33 Data Store IDR.bR.a PSoup Query Store IDPredicate 0<R.a<=5 R.a=4 AND R.b=3 0>R.b>4 R.a>4 AND R.b= PSoup: New Select Query Arrival SELECT * FROM R WHERE R.a <= 4 AND R.b >=3 BEGIN (NOW – 600) END (NOW) NEW QUERY B U I L D R.a<=4 AND R.b>=324

34 PROBE Data Store IDR.bR.a R.a<=4 AND R.b>=324 New Selection Spec (continued) Queries T F 24 F T F D a t a RESULTS

35 Data Store IDR.bR.a Query Store IDPredicate 0<R.a<=5 R.a=4 AND R.b=3 0>R.b>4 R.a>4 AND R.b= R.a<=4 AND R.b>=3 Selection – new data 6353 N EW D ATA B U I L D 63 PSoup

36 PROBE 6353 Query Store IDPredicate 0<R.a<=5 R.a=4 AND R.b=3 0>R.b>4 R.a>4 AND R.b= R.a<=4 AND R.b>=3 Selection – new data (cont.) Queries D a t a RESULTS 53 FTFF T

37 PSoup – Query Invocation PSoup continuously maintains materialized views over streaming data and queries. Data is not returned to user until the query is invoked. – Invocation requires applying “windows” to precomputed results. Adaptive approach allows system to continuously absorb new data and new queries without recompilation. Lots of issues to study: – Query indexing, Spilling to disk, bulk processing

38 Conclusions The goal is to support autonomic organizations. – The information infrastructure is the “nervous system” of such organizations. Existing QP technology is simply too brittle. The Telegraph project at Berkeley is “pushing the envelope” on adaptive query processing. static plans ??? late binding inter- operator per tuple Eddies, CACQ ??? PSoup intra- operator