Query Processing, Resource Management, and Approximation in a Data Stream Management System Jennifer Widom Stanford University stanfordstreamdatamanager.


Presentation transcript:

Query Processing, Resource Management, and Approximation in a Data Stream Management System
Jennifer Widom, Stanford University

2 Formula for a Database Research Project
Pick a simple but fundamental assumption underlying traditional database systems
– Drop it
Reconsider all aspects of data management and query processing
– Many Ph.D. theses
– Prototype from scratch

3 Following the Formula
We followed this formula once before
– The LORE project
– Dropped assumption: Data has a fixed schema declared in advance
– Semistructured data
The STREAM Project
– Dropped assumption: First load data, then index it, then run queries
– Continuous data streams (+ continuous queries)

4 Data Streams
Continuous, unbounded, rapid, time-varying streams of data elements
Occur in a variety of modern applications
– Network monitoring and traffic engineering
– Sensor networks
– Telecom call records
– Financial applications
– Web logs and click-streams
– Manufacturing processes
DSMS = Data Stream Management System

9 DBMS versus DSMS
DBMS: persistent relations. DSMS: transient streams (and persistent relations).
DBMS: one-time queries. DSMS: continuous queries.
DBMS: random access. DSMS: sequential access.
DBMS: access plan determined by query processor and physical DB design. DSMS: unpredictable data characteristics and arrival patterns.

10 The STREAM System
Data streams and stored relations
Declarative language for registering continuous queries
Flexible query plans
Designed to cope with high data rates and query workloads
– Graceful approximation when needed
– Careful resource allocation and usage
Relational, centralized (for now)

11 Contributions to Date
Semantics for continuous queries
Query plans
Exploiting stream constraints
Operator scheduling
Approximation techniques
Resource allocation to maximize precision
Initial running prototype

12 The (Simplified) Big Picture
[Diagram labels: Input streams, Register Query, DSMS, Scratch Store, Streamed Result, Stored Result, Archive, Stored Relations]

13 (Simplified) Network Monitoring
[Diagram labels: Network measurements, Packet traces, Register Monitoring Queries, DSMS, Scratch Store, Intrusion Warnings, Online Performance Metrics, Archive, Lookup Tables]

14 Using Conventional DBMS
Data streams as relation inserts, continuous queries as triggers or materialized views
Problems with this approach
– Inserts are typically batched, high overhead
– Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)
– No notion of approximation, resource allocation
– Current systems don’t scale to large # of triggers
– Views don’t provide streamed results
But we (and others) plan to compare

15 Declarative Language for Continuous Queries
A distinction between STREAM and the Aurora project
– Aurora users directly manipulate one large execution plan
– STREAM compiles declarative queries into individual plans, system may merge plans
– STREAM also supports direct entry of plans
Syntax based on SQL, additional constructs for sliding windows and sampling

21 Example Query 1
Two streams, contrived for ease of examples:
  Orders (orderID, customer, cost)
  Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
  Select Sum(O.cost)
  From Orders O, Fulfillments F [Range 1 Day]
  Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
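
Example Query 1 can be read as "rebuild the windowed relations at time T, then run the ordinary SQL". The Python sketch below follows that reading and is not STREAM's implementation. The tuple classes, the `query1` function, the use of seconds as the time unit, and the half-open window boundary (T minus one day, T] are all assumptions made for illustration.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # window length in seconds (assumed time unit)


@dataclass(frozen=True)
class Order:
    ts: int
    order_id: int
    customer: str
    cost: float


@dataclass(frozen=True)
class Fulfillment:
    ts: int
    order_id: int
    clerk: str


def query1(orders, fulfillments, now):
    """Evaluate Example Query 1 from scratch at time `now`.

    Returns 0 when nothing joins (SQL's Sum would return NULL instead).
    """
    # Orders carries no window in the query, so every order seen up to `now` counts.
    o_rel = [o for o in orders if o.ts <= now]
    # Fulfillments [Range 1 Day]: timestamps in (now - DAY, now] (assumed boundary).
    f_rel = [f for f in fulfillments if now - DAY < f.ts <= now]
    return sum(
        o.cost
        for o in o_rel
        for f in f_rel
        if o.order_id == f.order_id and f.clerk == "Sue" and o.customer == "Joe"
    )
```

A real engine would maintain the sum incrementally as fulfillments enter and leave the window; recomputing from scratch only mirrors the definition.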

25 Example Query 2
Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost
  Select F.clerk, Max(O.cost)
  From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
  Where O.orderID = F.orderID
  Group By F.clerk
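
The partitioned window in Example Query 2 keeps a separate 5-row window per clerk. A minimal Python sketch of that idea follows. It leaves out the 10% sampling step and assumes that `orderID` is unique in Orders, so a dictionary lookup can stand in for the join. The function names and the `(order_id, clerk)` tuple layout are illustrative only.

```python
from collections import defaultdict, deque


def partitioned_window(fulfillments, rows=5):
    """Keep the `rows` most recent (order_id, clerk) tuples for each clerk."""
    windows = defaultdict(lambda: deque(maxlen=rows))
    for order_id, clerk in fulfillments:  # tuples are given in arrival order
        windows[clerk].append((order_id, clerk))  # deque drops the oldest row itself
    return windows


def max_cost_per_clerk(order_costs, fulfillments, rows=5):
    """order_costs maps order_id -> cost for every order seen so far."""
    result = {}
    for clerk, window in partitioned_window(fulfillments, rows).items():
        costs = [order_costs[oid] for oid, _ in window if oid in order_costs]
        if costs:  # clerks with no joining order produce no output row
            result[clerk] = max(costs)
    return result
```

For example, if Sue has fulfilled six orders, only her five most recent ones contribute to her maximum.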

26 Semantics of Database Languages
An often neglected topic
Traditional relational databases are in reasonable shape
– Relational algebra and SQL
But triggers were a mess
The semantics of an innocent-looking continuous query over data streams may not be obvious

27 A Nonobvious Continuous Query
Stream of stock quotes: Stocks(ticker, price)
Monitor last 10 minutes of quotes:
  Select * From Stocks [Range 10 minutes]
Is result a relation, a stream, or something else?
If a relation, what exactly does it contain?
If a stream, how does query differ from:
  Select * From Stocks [Range 1 minute]
or
  Select * From Stocks [∞]

28 Our Semantics and Language for Continuous Queries
Abstract: interpretation for CQs based on certain “black boxes”
Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences
Goals
– CQs over multiple streams and relations
– Exploit relational semantics to the extent possible
– Easy queries should be easy to write, simple queries should do what you expect

29 Relations and Streams
Assume global, discrete, ordered time domain (more on this later)
Relation
– Maps time T to set-of-tuples R
Stream
– Set of (tuple, timestamp) elements

30 Conversions
Streams to relations: window specification
Relations to streams: special operators Istream, Dstream, Rstream
Relations to relations: any relational query language

31 Conversion Definitions
Stream-to-relation
– S [W] is a relation: at time T it contains all tuples in window W applied to stream S up to T
– When W = ∞, contains all tuples in stream S up to T
Relation-to-stream
– Istream(R) contains all (r, T) where r ∈ R at time T but r ∉ R at time T–1
– Dstream(R) contains all (r, T) where r ∈ R at time T–1 but r ∉ R at time T
– Rstream(R) contains all (r, T) where r ∈ R at time T
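
The three relation-to-stream operators can be sketched in a few lines of Python. The sketch assumes a relation is given as a list of snapshots (one set of tuples per discrete time step, starting at T = 0) and that the relation is empty before time 0. Both the representation and the function names are illustrative.

```python
def istream(snapshots):
    """(r, T) for every r in R at time T that was not in R at time T-1."""
    out, prev = [], set()
    for t, cur in enumerate(snapshots):
        out.extend((r, t) for r in sorted(cur - prev))
        prev = cur
    return out


def dstream(snapshots):
    """(r, T) for every r in R at time T-1 that is no longer in R at time T."""
    out, prev = [], set()
    for t, cur in enumerate(snapshots):
        out.extend((r, t) for r in sorted(prev - cur))
        prev = cur
    return out


def rstream(snapshots):
    """(r, T) for every r in R at time T."""
    return [(r, t) for t, cur in enumerate(snapshots) for r in sorted(cur)]
```

With snapshots `[{"a"}, {"a", "b"}, {"b"}]`, Istream reports the insertions of `a` and `b`, Dstream reports the deletion of `a` at time 2, and Rstream repeats the full contents at every step.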

32 Abstract Semantics
Take any relational query language
Can reference streams in place of relations
– But must convert to relations using any window specification language (default window = [∞])
Can convert relations to streams
– For streamed results
– For windows over relations (note: converts back to relation)

33 Query Result at Time T
Use all relations at time T
Use all streams up to T, converted to relations
Compute relational result
Convert result to streams if desired

34 Time
Easiest: global system clock
– Stream elements and relation updates timestamped on entry to system
Application-defined time
– Streams and relation updates contain application timestamps, may be out of order
– Application generates “heartbeat”
  Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress
– Query results in application time
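
One way to picture the heartbeat is as a promise that no element with a timestamp at or below the heartbeat value will arrive later. Under that assumption (the slide does not define the heartbeat precisely), out-of-order input can be buffered and released in timestamp order. The `ReorderBuffer` class below is a hypothetical sketch of this, not part of STREAM.

```python
import heapq


class ReorderBuffer:
    """Buffers out-of-order elements until a heartbeat makes them safe to emit."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # arrival counter, used as a tie-breaker so payloads are never compared

    def push(self, ts, payload):
        heapq.heappush(self._heap, (ts, self._seq, payload))
        self._seq += 1

    def heartbeat(self, ts):
        """Release, in timestamp order, every buffered element with timestamp <= ts."""
        released = []
        while self._heap and self._heap[0][0] <= ts:
            t, _, payload = heapq.heappop(self._heap)
            released.append((t, payload))
        return released
```

Elements pushed as timestamps 3, 1, 2 come out as 1, 2, 3 once a heartbeat of 3 arrives. Anything with a later timestamp stays buffered.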

36 Abstract Semantics – Example 1
  Select F.clerk, Max(O.cost)
  From O [∞], F [Rows 1000]
  Where O.orderID = F.orderID
  Group By F.clerk
Maximum-cost order fulfilled by each clerk in last 1000 fulfillments
At time T: entire stream O and last 1000 tuples of F as relations
Evaluate query, update result relation at T

37 Abstract Semantics – Example 1
  Select Istream(F.clerk, Max(O.cost))
  From O [∞], F [Rows 1000]
  Where O.orderID = F.orderID
  Group By F.clerk
At time T: entire stream O and last 1000 tuples of F as relations
Evaluate query, update result relation at T
Streamed result: new element (result tuple, T) whenever a result tuple changes from T–1

39 Abstract Semantics – Example 2
Relation CurPrice(stock, price)
  Select stock, Avg(price)
  From Istream(CurPrice) [Range 1 Day]
  Group By stock
Average price over last day for each stock
Istream provides history of CurPrice
Window on history, back to relation, group and aggregate

40 Concrete Language – CQL
Relational query language: SQL
Window spec. language derived from SQL-99
– Tuple-based, time-based, partitioned
Syntactic shortcuts and defaults
– So easy queries are easy to write and simple queries do what you expect
Equivalences
– Basis for query-rewrite optimizations
– Includes all relational equivalences, plus new stream-based ones

41 Two Extremely Simple CQL Examples
  Select * From Strm
Had better return Strm (it does)
– Default ∞ window for Strm
– Default Istream for result
  Select * From Strm, Rel Where Strm.A = Rel.B
Often want “NOW” window for Strm
But may not want as default

42 Query Execution
When a continuous query is registered, generate a query plan
– Users can also register plans directly
Plans composed of three main components:
– Operators (as in most conventional DBMS’s)
– Inter-operator queues (as in many conventional DBMS’s)
– State (synopses)
Global scheduler for plan execution

43 Operators and State
State (synopses)
– Summarize tuples seen so far (exact or approximate) for operators requiring history
– To implement windows
Example: synopsis join
– Sliding-window join
– Approximation of full join
[Diagram labels: join operator (⋈) with State 1 and State 2]
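
A join operator with one synopsis per input can be sketched as a symmetric sliding-window join: each arriving tuple probes the other input's synopsis and is then stored in its own. The Python class below is an illustrative sketch under assumed details (row-based windows, join key in the first tuple field). It is not the STREAM operator.

```python
from collections import deque


class WindowJoin:
    """Symmetric join over two row-based sliding windows (one synopsis per input)."""

    def __init__(self, left_rows, right_rows, key=lambda t: t[0]):
        # State 1 and State 2 from the slide: bounded synopses of each input.
        self.state = {"L": deque(maxlen=left_rows), "R": deque(maxlen=right_rows)}
        self.key = key

    def arrive(self, side, tup):
        """Process one tuple arriving on side "L" or "R"; return the new join results."""
        other = "R" if side == "L" else "L"
        k = self.key(tup)
        matches = [u for u in self.state[other] if self.key(u) == k]
        self.state[side].append(tup)  # the oldest tuple falls out when the window is full
        if side == "L":
            return [(tup, u) for u in matches]
        return [(u, tup) for u in matches]
```

Because old tuples fall out of the synopses, a late partner finds nothing to join with. This is the sense in which the windowed join approximates the full join.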

44 Simple Query Plan
[Diagram labels: Stream 1, Stream 2, Stream 3; join operators (⋈); State 1, State 2, State 3, State 4; outputs Q1 and Q2; Scheduler]

45 Some Issues in Query Plan Generation
Compatibility and conversions for streams and relations (+/- streams)
State sharing, incremental computation
Windowed joins: multiway versus 2-way
Windows in general: push down, pull up, split, merge, …
Time coordination, operator-level heartbeats

47 Memory Overhead in Query Processing
Queues + State
Continuous queries keep state indefinitely
Online requirements suggest using memory rather than disk
– But we realize this assumption is shaky
Goal: minimize memory use while providing timely, accurate answers

48 Reducing Memory Overhead
Two main techniques to date
1) Exploit constraints on streams to reduce state
2) Clever operator scheduling to reduce queue sizes

51 Exploiting Stream Constraints
For most queries, unbounded memory is required for arbitrary streams [PODS ’01]
But streams may exhibit constraints that reduce, bound, or even eliminate state
Conventional database constraints
– Redefined for streams
– “Relaxed” for stream environment

52 Stream Constraints
Each constraint type defines adherence parameter k
Clustered(k) for attribute S.A
Ordered(k) for attribute S.A
Referential-Integrity(k) for join S1 ⋈ S2

53 Algorithm for Exploiting Constraints
Input
– Any Select-Project-Join query over streams
– Any set of k-constraints
Output
– Query execution plan that reduces or eliminates state based on k-constraints
– If constraints violated, get approximate result

56 Constraint Examples
Orders (orderID, cost)
Fulfillments (orderID, portion, clerk)
Query: many-one join F ⋈ O
Clustered(k) on F.orderID
– Matched O tuples discarded after k arrivals of non-matching F’s
Referential-Integrity(k)
– F tuples retained for at most k arrivals of O tuples
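
The Clustered(k) rule can be sketched as a join that counts, for each matched order, how many non-matching fulfillments have arrived since its last match, and drops the order once that count reaches k. The Python class below is one reading of the slide, not the published algorithm. Resetting the counter on every match and having each order arrive before its fulfillments are assumptions, and all names are hypothetical.

```python
class ClusteredJoin:
    """Many-one join F ⋈ O that purges O state under an assumed Clustered(k) on F.orderID."""

    def __init__(self, k):
        self.k = k
        self.orders = {}  # order_id -> cost: the join state kept for O
        self.misses = {}  # matched order_id -> non-matching F arrivals since its last match

    def on_order(self, order_id, cost):
        self.orders[order_id] = cost

    def on_fulfillment(self, order_id, portion, clerk):
        out = []
        if order_id in self.orders:
            out.append((order_id, portion, clerk, self.orders[order_id]))
            self.misses[order_id] = 0  # start (or restart) counting for this order
        # Every other matched order has now seen one more non-matching F tuple.
        for oid in list(self.misses):
            if oid != order_id:
                self.misses[oid] += 1
                if self.misses[oid] >= self.k:
                    # The cluster for this order is assumed finished: free its state.
                    del self.misses[oid]
                    del self.orders[oid]
        return out
```

If the stream violates the constraint, a late fulfillment finds its order already purged and produces no output. This matches the slide's remark that violated constraints give an approximate result.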

57 Operator Scheduling
Global scheduler invokes run method of query plan operators with “timeslice” parameter
Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …
– First scheduler: round-robin
– Second scheduler: minimize queue sizes
– Third scheduler: minimize combination of queue sizes and latency

58 Scheduling Algorithm
Goal: minimize total queue size for unpredictable, bursty stream arrival patterns
Proven: within constant factor of any “clairvoyant” strategy for some queries
Empirical results: large savings over naive strategies for many queries
But minimizing queue sizes is at odds with minimizing latency

59 Scheduling Algorithm (contd.)
Always schedule ready chain with steepest downslope
Plan progress chart: memory delta per unit time
Independent scheduling of operators makes mistakes
Consider chains of operators
[Progress chart: Memory (vertical axis) against Computation time (horizontal axis), with operators Op1, Op2, Op3, Op4]
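
The slide gives only the rule "schedule the ready chain with the steepest downslope". One way to make that concrete is to plot cumulative processing time against remaining tuple size for a linear operator path and group operators along the lower envelope of that chart. The Python sketch below takes this reading. The per-operator `(time_per_tuple, selectivity)` inputs, the function names, and the readiness test are assumptions for illustration.

```python
def progress_chart(ops):
    """ops: list of (time_per_tuple, selectivity) along one operator path.

    Returns points (cumulative time, remaining size), starting at (0, 1).
    """
    points = [(0.0, 1.0)]
    for cost, sel in ops:
        t, s = points[-1]
        points.append((t + cost, s * sel))
    return points


def chains(ops):
    """Group operators into chains along the lower envelope of the progress chart.

    Returns [(first_op, last_op, downslope), ...] with inclusive 0-based indices.
    Every time_per_tuple must be positive.
    """
    pts = progress_chart(ops)
    result, i = [], 0
    while i < len(pts) - 1:
        t0, s0 = pts[i]
        # Steepest memory drop per unit time from point i; ties go to the longer segment.
        best = max(range(i + 1, len(pts)),
                   key=lambda j: ((s0 - pts[j][1]) / (pts[j][0] - t0), j))
        slope = (s0 - pts[best][1]) / (pts[best][0] - t0)
        result.append((i, best - 1, slope))
        i = best
    return result


def pick_chain(chain_list, queued):
    """queued[i] = tuples waiting at operator i. Returns the chain to run, or None."""
    ready = [c for c in chain_list
             if any(queued[i] > 0 for i in range(c[0], c[1] + 1))]
    return max(ready, key=lambda c: c[2]) if ready else None
```

With four unit-cost operators of selectivity 1, 0.25, 1 and 0.5, the first two form one chain (downslope 0.375) and the last two another (0.0625). Scheduling Op1 alone would free no memory, which is why the chain, not the single operator, is the scheduling unit.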

60 Approximation
Why approximate?
– Memory requirement too high, even with constraints and clever scheduling
– Can’t process streams fast enough for query load

61 Approximation (cont’d)
Static: rewrite queries to add (or shrink) sampling or windows
+ User can participate, predictable behavior
– Doesn’t consider dynamic conditions
Dynamic: modify query plan (insert sampling operators, shrink windows, load shedding)
+ Adapts to current resource availability
– How to convey to user? (major open issue)
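
A sampling operator inserted into a plan can be as simple as a Bernoulli filter whose keep rate is chosen from the current load. The Python sketch below illustrates that idea only. `SampleOperator`, `shed_rate`, and the rule "keep capacity / arrival rate" are assumptions, not the load-shedding policy used in STREAM.

```python
import random


class SampleOperator:
    """Keeps each tuple independently with probability `rate` (Bernoulli sampling)."""

    def __init__(self, rate, seed=None):
        self.rate = rate
        self._rng = random.Random(seed)  # private generator so runs can be reproduced

    def process(self, tuples):
        return [t for t in tuples if self._rng.random() < self.rate]


def shed_rate(arrival_rate, capacity):
    """Fraction of tuples to keep so that the kept load does not exceed capacity."""
    if arrival_rate <= capacity:
        return 1.0  # no shedding needed
    return capacity / arrival_rate
```

A static rewrite would fix `rate` in the query text, as in the "10% Sample" clause. A dynamic scheme would recompute it from `shed_rate` as arrival rates change.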

62 Resource Allocation to Maximize Precision
Query plan: tree of operators
Each operator has function from resource allocation R to precision P
Simple (FP, FN) precision model
– FP = probability of answer being false positive
– FN = # false negatives per correct answer

63 Precision of an Operator
Precision: FP = False Positives, FN = False Negatives
[Diagram labels: join operator (⋈) with State 1 and State 2; Approx. Answer versus Exact Answer]

64 Precision of Two Operators
[Diagram labels: join operator (⋈) with State 1 and State 2, precision (FP1, FN1); second operator with State 3, precision (FP2, FN2); Approx. Answer versus Exact Answer at each operator]

65 The Problem Statement
Given a plan, precision function for each operator in the plan, and R total resources, allocate R to operators to maximize result precision
Solution
– For each operator type: formula for calculating output precision given input precision and operator resource allocation
– Assume precision for input streams
– Becomes optimization problem

66 The Holy Grail
Given:
– Declarative query
– Resources
– Constraints on streams
Generate plan and resource allocation that takes advantage of constraints and maximizes precision
Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user

67 The Stream Systems Landscape
(At least) three general-purpose DSMS prototypes underway
– STREAM (Stanford)
– Aurora (Brown, Brandeis, MIT)
– TelegraphCQ (Berkeley)
All will be demo’d at SIGMOD ’03
Cooperating to develop stream system benchmark
– Goal: demonstrate that conventional systems are far inferior for data stream applications

68 Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma