Query Processing, Resource Management, and Approximation in a Data Stream Management System.

Slides:



Advertisements
Similar presentations
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Advertisements

Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification.
Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Kaushik Chakrabarti(Univ Of Illinois) Minos Garofalakis(Bell Labs) Rajeev Rastogi(Bell Labs) Kyuseok Shim(KAIST and AITrc) Presented at 26 th VLDB Conference,
1 11. Streaming Data Management Chapter 18 Current Issues: Streaming Data and Cloud Computing The 3rd edition of the textbook.
Understanding Operating Systems Fifth Edition
Maintaining Sliding Widow Skylines on Data Streams.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
A Data Stream Management System for Network Traffic Management Shivnath Babu Stanford University Lakshminarayanan Subramanian Univ. California, Berkeley.
Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams Presented by: Andy Mason and Sheng Zhong Ahmed M.Ayad and Jeffrey.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
What's inside a router? We have yet to consider the switching function of a router - the actual transfer of datagrams from a router's incoming links to.
Xyleme A Dynamic Warehouse for XML Data of the Web.
An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations Presenter: Liyan Zhang Presentation of ICS
Tutorial 6 & 7 Symbol Table
Building a Data Stream Management System Prof. Jennifer Widom Joint project with Prof. Rajeev Motwani and a team of graduate studentshttp://www-db.stanford.edu/stream.
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
8-1 Outline  Overview of Physical Database Design  File Structures  Query Optimization  Index Selection  Additional Choices in Physical Database Design.
Database Systems More SQL Database Design -- More SQL1.
CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research Prototype to Industrial Tool Name: DH(Dong Hwi) kwak Date:
SWIM 1/9/20031 QoS in Data Stream Systems Rajeev Motwani Stanford University.
Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.
Query Processing Presented by Aung S. Win.
1.A file is organized logically as a sequence of records. 2. These records are mapped onto disk blocks. 3. Files are provided as a basic construct in operating.
STREAM The Stanford Data Stream Management System.
Practical Database Design and Tuning. Outline  Practical Database Design and Tuning Physical Database Design in Relational Databases An Overview of Database.
CSE314 Database Systems More SQL: Complex Queries, Triggers, Views, and Schema Modification Doç. Dr. Mehmet Göktürk src: Elmasri & Navanthe 6E Pearson.
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
An Integration Framework for Sensor Networks and Data Stream Management Systems.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
CSCE Database Systems Chapter 15: Query Execution 1.
Database Management 9. course. Execution of queries.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
DANIEL J. ABADI, ADAM MARCUS, SAMUEL R. MADDEN, AND KATE HOLLENBACH THE VLDB JOURNAL. SW-Store: a vertically partitioned DBMS for Semantic Web data.
Query Optimization Arash Izadpanah. Introduction: What is Query Optimization? Query optimization is the process of selecting the most efficient query-evaluation.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Chapter 16 Practical Database Design and Tuning Copyright © 2004 Pearson Education, Inc.
1 STREAM: The Stanford Data Stream Management System STanfordstREamdatAManager 陳盈君 吳哲維 林冠良.
Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.
Complex Queries over Web Repositories Sriram Raghavan and Hector Garcia-Molina Computer Science Department Stanford University Gülfem IŞIKLAR M.Mirac KOCATÜRK.
Lecture 15- Parallel Databases (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
CS4432: Database Systems II Query Processing- Part 2.
Lec 7 Practical Database Design and Tuning Copyright © 2004 Pearson Education, Inc.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
CPSC 404, Laks V.S. Lakshmanan1 Overview of Query Evaluation Chapter 12 Ramakrishnan & Gehrke (Sections )
Evaluating Window Joins over Unbounded Streams Jaewoo Kang Jeffrey F. Naughton Stratis D. Viglas {jaewoo, naughton, Univ. of Wisconsin-Madison.
Query Processing CS 405G Introduction to Database Systems.
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –
More SQL: Complex Queries, Triggers, Views, and Schema Modification
Practical Database Design and Tuning
S. Sudarshan CS632 Course, Mar 2004 IIT Bombay
Parallel Databases.
A paper on Join Synopses for Approximate Query Answering
Chapter 15 QUERY EXECUTION.
Practical Database Design and Tuning
Lecture 2- Query Processing (continued)
One-Pass Algorithms for Database Operations (15.2)
Evaluation of Relational Operations: Other Techniques
Adaptive Query Processing (Background)
Presentation transcript:

Query Processing, Resource Management, and Approximation in a Data Stream Management System

Introduction Two fundamental differences between DSMS and DBMS 1.In addition to managing traditional stored data such as relations, a DSMS must handle multiple continuous, unbounded, possibly rapid and time-varying data streams. 2.Due to the continuous nature of the data, a DSMS typically supports long- running continuous queries, which are expected to produce answers in a continuous and timely fashion. Goal – Building general purpose DSMS –supports a declarative query language and can cope with high data rates and thousands of continuous queries –multi-query optimization, judicious resource allocation, and sophisticated scheduling to achieve high performance –environments where data rates and query load may exceed available resources In these cases the system is designed to provide approximate answers to continuous queries.

Query Language CQL: Continuous Query Language. –All queries are continuous, as opposed to the one-time queries supported by an DBMS STREAM vs Relation –Streams have the notion of an arrival order, they are unbounded, and they are append-only. A stream can be thought of as a set of pairs indicating that a tuple, s, arrives on the stream at time, t. Streams may result from queries or subqueries –Relations are unordered, and they support updates and deletions as well as insertions (all of which are timestamped). In addition to relations stored by the DSMS, relations may result from queries or subqueries CQL: From clause may be followed by an optional sliding window specification, enclosed in brackets, and an optional sampling clause –a window specification consists of an optional partitioning clause (Partition By), a mandatory window size (Rows, Range), and an optional filtering predicate (WEHER) –A sampling clause (Sample) specifies that a random sample of the data elements from the stream

Examples Q1:Select Count(*) From Requests S [Range 1 Day Preceding] Where S.domain = ‘stanford.edu’ Q2:Select Count(*) From Requests S [Partition By S.client_id Rows 10 Preceding Where S.domain = ‘stanford.edu’] Where S.URL Like ‘ Q3:Select T.URL From (Select client_id, URL From Requests S, Domains R Where S.domain = R.domain And R.type = ’commerce’) T Sample(10) Where T.client_id Between 1 And 1000

Formal Semantics A window specification is applied to a stream up to a specific time t and the result is a finite set of tuples which is treated as a relation. Istream (for “insert stream”) applied to relation R contains a stream element whenever tuple s is in R at time t but not in at time t-1 Dstream (for “delete stream”) applied to relation R contains a stream element whenever tuple s is in at time t-1 but not in at time t Using the mappings in the figure, CQL queries can freely mix relations and streams. The result of a query at time t is obtained by taking all relations at time t, all streams up to time t converted to relations by their window specifications, and applying conventional relational semantics. If the outermost operator is Istream or Dstream then the query result is converted to a stream, otherwise it remains as a relation.

Stream Ordering and Timestamps Assumptions –So that we can evaluate row-based (Rows) and timebased (Range) sliding windows, all stream elements arrive in order, timestamped according to a global clock. –So that we can coordinate streams with relation states, all relation updates are timestamped according to the same global clock as streams. –So that we can generate query results, the global clock provides periodic “heartbeats” that tell us when no further stream elements or relation updates will occur with a timestamp lower than the heartbeat value.

Inactive and Weighted Queries Inactive query –When a query is inactive, the system may not maintain the answer to the query as new data arrives. –an inactive query may be activated at any time –Influences decisions about query plans and resource allocation Weight –Indicating relative importance of a query –Weights might also influence scheduling decisions –inactive queries may be thought of as queries with negligible weight

Query Plans Three different types of components of a query plan –Query operators, similar to a traditional DBMS. Each operator reads a stream of tuples from a set of input queues, processes the tuples based on its semantics, and writes its output tuples into a single output queue. –Inter-operator queues, also similar to the approach taken by some traditional DBMS’s. Queues connect different operators and define the paths along which tuples flow as they are being processed. –Synopses, used to maintain state associated with operators

Synopsis A synopsis summarizes the tuples seen so far at some intermediate operator in a running query plan, as needed for future evaluation of that operator. For full precision a join operator must remember all the tuples it has seen so far on each of its input streams, so it maintains one synopsis for each Synopsis sizes grow without bound if full precision is expected in the query result An important feature to support is synopses that use some kind of summarization technique to limit their size –fixed-size hash tables, sliding windows, reservoir samples, quantile estimates, and histograms.

Generic methods of Operator and Synopsis Operator Class –create, with parameters specifying the input queues, output queue, and initial memory allocation. –changeMem, with a parameter indicating a dynamic decrease or increase in allocated memory. –run, with a parameter indicating how much work the operator should perform before returning control to the scheduler Synopsis Class –create, with a parameter specifying an initial memory allocation. –changeMem, with a parameter indicating a dynamic decrease or increase in allocated memory. –insert and delete, with a parameter indicating the data element to be inserted into or deleted from the synopsis. –query, whose parameters and behavior depend on the synopsis type. For example, in a hash-table synopsis this method might look for matching tuples with a particular key value, while for a sliding window synopsis this method might support a full window scan.

An Example of query plans for Q1 and Q2 Two plans contain three operators Two plans contain four synopses Two plans contain four queues Two plans share a subplan joining streams R and S Execution of query operators is controlled by a global scheduler

Resource Sharing in Query Plans Important topics are yet to be addressed –For now we are considering resource sharing and approximation separately. That is, we do not introduce sharing that intrinsically introduces approximate query results, such as merging subexpressions with different window sizes, sampling rates, or filters. Doing so may be a very effective technique when resources are limited, but we have not yet explored it in sufficient depth to report here. –Our techniques so far are based on exact common subexpressions. Detecting and exploiting subexpression containment is a topic of future work that poses some novel challenges due to window specifications, timestamps and ordering, and sampling in our query language. A shared queue maintains a pointer to the first unread tuple for each operator that reads from the queue, and it discards tuples once they have been read by all parent operators. If two queries with a common subexpression produce parent operators with very different consumption rates, then it may be preferable not to use a shared subplan.

Synopsis Sharing When several operators read from the same queue, and when more than one of those operators builds some kind of synopsis, then it may be beneficial to introduce synopsis sharing –Which operator is responsible for managing the shared synopsis (e.g., allocating memory, inserting tuples)? –If the synopses required by the different operators are not of identical types or sizes, is there a theory of “synopsis subsumption” (and synopsis overlap) that we can rely on? –If the synopses are identical, how do we cope with the different rates at which operators may “consume” data in the synopses?

Resource Management Relevant resources in a DSMS –memory, computation, I/O if disk is used, and network bandwidth Focusing initially on memory consumed by query plan synopses and queues –in many cases reducing memory overhead has a natural side-effect of reducing other resource requirements as well Resource Management –An algorithm for incorporating known constraints on input data streams to reduce synopsis sizes –An algorithm for operator scheduling that minimizes queue sizes

Exploiting Constraints Over Data Streams Additional information about streams –gathering statistics over time –constraint specifications at stream-registration time Use this information to reduce resource requirements without sacrificing query result precision –An alternate and more dynamic technique is for the streams to contain punctuations, which specify run-time constraints Example [Join Order(order_id) and Fulfillments(item_id)] –In the general case, requires synopses of unbounded size –What if we know that all tuples for a given orderID and itemID arrive on O before the corresponding tuples arrive on F? –We need not maintain a join synopsis for the F operand at all –In practice, constraints may not be adhered to by data streams strictly no more than k tuples with a different orderID appear between two tuples with same orderID

Scheduling (1) Query plans are executed via a global scheduler, which calls the run methods of query plan operators Example –O1 operates on input queue q1 and writing results to queue q2 which is the input queue of O2 –O1 takes one time unit to operate on a batch of n tuples from q1 and it has 20% selectivity –O2 takes one time unit to operate on tuples and let us assume that its output is not queued by the system (final result)

Scheduling (2) Two possible scheduling strategies –Tuples are processed to completion in the order they arrive at q1 Each batch of n tuples in q1 is processed by O1 and then O2 based on arrival time, requiring two time units overall. –If there is a batch of n tuples in q1, then O1 operates on them using one time unit, producing n/5 new tuples in q2. Otherwise, if there are any tuples in q2, then up to n/5 of these tuples are operated on by Q2, requiring one time unit.

Approximations The system will not be able to provide continuous and timely exact answers to all registered queries –multiple unbounded and possibly rapid incoming data streams –multiple complex continuous queries with timeliness requirements –finite computation and memory resources

Static Approximation Window Reduction –two considerations If W is a duplicate-elimination operator, then shrinking W’s window can actually increase its output rate If W is part of the right-hand subtree of a negation construct (e.g., NOT EXISTS or EXCEPT), then reducing the size of output may have the effect of increasing output further up the query plan Sampling Rate Reduction –Although changing the sampling rate at an operator will not reduce the resource requirements of it will reduce the output rate.

Dynamic Approximation Synopsis Compression –maintaining a sample of the intended synopsis content –using histograms for aggregation –compressed wavelets for aggregation –using Bloom filters for duplicate elimination, set difference, or set intersection Sampling and Load Shedding –approximation techniques that reduce queue sizes –introduce one or more sample operators into the query plan, or to reduce the sampling rate at existing operators –simply drop tuples from queues when the queues grow too large, a technique sometimes referred to as load shedding –load shedding may drop chunks of tuples at a time

Resource Management and Approximation: Discussion Many important challenges –We need a means of monitoring synopsis and queue sizes and determining when dynamic reduction measures (e.g., window size reduction, load shedding) should kick in –Even if we have a good algorithm for initial allocation of memory to synopses and queues, we need a reallocation algorithm to handle the inevitable changes in data rates and distributions –The ability to add, delete, activate, and deactivate queries at any time forces all resource allocation schemes, including static ones, to provide a means of making incremental changes