Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson.

Presentation transcript:

Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson

About Me: 3rd-year ISYE major; minor in Computer Science; from Austin, TX; have visited every state but Alaska; intern at Deloitte Consulting focusing on SAP implementation.

Agenda: Background/Motivation; PSoup Introduction; System Overview; Query Processing Techniques; Implementation; Performance; Aggregation Queries; Conclusions; Critique.

Background/Motivation. Continuous Query (CQ) systems treat queries as fixed entities and stream data over them. Previous systems allowed streaming of either data or queries, but not both. They continuously deliver results as they are computed, which can be infeasible or inefficient in settings such as data recharging and monitoring.

PSoup: Introduction. PSoup is a query processor based on the Telegraph query processing framework. It allows both data and queries to be streamed. It partially stores results to support disconnected operation and to improve data throughput and response time.

PSoup: System Overview. The user initially registers a query specification with the system. The system returns a handle that can be used to invoke the results of the query later. Example query: SELECT * FROM Data_Stream D_s WHERE (D_s.a < x) AND (D_s.b > y) BEGIN(NOW - 10) END(NOW); The BEGIN-END clause allows three kinds of window: snapshot (constant beginning and ending time), landmark (constant beginning and variable ending time), and sliding window (variable beginning and ending time). Window size is limited by the size of memory.
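The three window kinds differ only in whether each bound is fixed or moves with NOW. A minimal sketch of that distinction follows; the Bound and Window classes are hypothetical names for illustration, not PSoup's own structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    value: float
    relative: bool  # True: offset from NOW; False: absolute (constant) time

    def resolve(self, now):
        return now + self.value if self.relative else self.value


@dataclass(frozen=True)
class Window:
    begin: Bound
    end: Bound

    def kind(self):
        # Classify by which bounds are constant and which vary with NOW.
        if not self.begin.relative and not self.end.relative:
            return "snapshot"
        if not self.begin.relative and self.end.relative:
            return "landmark"
        if self.begin.relative and self.end.relative:
            return "sliding"
        return "unsupported"  # variable begin with constant end is not listed on the slide

    def resolve(self, now):
        # Concrete (begin, end) time range at the moment of invocation.
        return self.begin.resolve(now), self.end.resolve(now)


# BEGIN(NOW - 10) END(NOW) from the example query is a sliding window.
sliding = Window(Bound(-10, True), Bound(0, True))
landmark = Window(Bound(100, False), Bound(0, True))
snapshot = Window(Bound(100, False), Bound(200, False))
```

Resolving the same sliding window at two different invocation times yields two different ranges, which is why the window is applied at invocation rather than at registration.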

PSoup: System Overview. PSoup treats the execution of query streams as a join of query and data streams. It maintains State Modules (SteMs) for queries and data: one Query SteM for all queries in the system, and one Data SteM for each data stream.

PSoup: Query Processing Techniques. Overview. PSoup assigns each query a unique queryID, which it returns to the user. The client can disconnect, reconnect, and execute the query to obtain updated results. PSoup continuously matches data to query predicates in the background and stores the results in its Results Structure. When a query is invoked, PSoup applies the appropriate input window to the Results Structure to return the current results.

PSoup: Query Processing Techniques. Entry of new query specifications. A new query is split into two parts: the Standing Query Clause (SQC), which consists of the SELECT-FROM-WHERE clauses, and the BEGIN-END clause, which is stored in a separate WindowsTable structure. The SQC is inserted into the Query SteM and then used to probe the Data SteMs corresponding to the tables in the FROM clause. The resulting tuples are stored in the Results Structure.

PSoup: Query Processing Techniques. Entry of new data. Each new tuple is assigned a globally unique tupleID and a physical timestamp (physicalID) based on the system clock. It is inserted into the appropriate Data SteM and then used to probe the Query SteM to determine which SQCs it satisfies. The tupleIDs and physicalIDs are stored in the Results Structure.
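The two entry paths are symmetric: a new query probes stored data, and a new tuple probes stored queries. A rough single-stream sketch is below. MiniPSoup and its method names are made up for illustration, and the linear scans stand in for the indexed probes the real system uses.

```python
import itertools
import time


class MiniPSoup:
    """Toy single-stream model of the query/data symmetry (hypothetical names)."""

    def __init__(self):
        self.query_stem = {}  # queryID -> predicate (the SQC's WHERE clause)
        self.windows = {}     # queryID -> BEGIN-END clause, kept apart from the SQC
        self.data_stem = {}   # tupleID -> (physicalID, row)
        self.results = {}     # queryID -> list of (physicalID, tupleID)
        self._next_qid = itertools.count(1)
        self._next_tid = itertools.count(1)

    def register_query(self, predicate, window):
        qid = next(self._next_qid)
        self.query_stem[qid] = predicate
        self.windows[qid] = window
        # The new SQC probes the data that has already arrived.
        self.results[qid] = [
            (ts, tid) for tid, (ts, row) in self.data_stem.items() if predicate(row)
        ]
        return qid

    def insert_tuple(self, row, timestamp=None):
        tid = next(self._next_tid)
        ts = time.time() if timestamp is None else timestamp  # physicalID
        self.data_stem[tid] = (ts, row)
        # The new tuple probes the queries that are already registered.
        for qid, predicate in self.query_stem.items():
            if predicate(row):
                self.results[qid].append((ts, tid))
        return tid
```

Because both paths write into the same results map, a query sees matching tuples that arrived before it was registered as well as those that arrive afterwards.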

PSoup: Query Processing Techniques Selection Queries over a single stream

PSoup: Query Processing Techniques Join Queries Over Multiple Streams

PSoup: Query Processing Techniques. Query invocation and result construction. The Results Structure maintains information about which tuples in the Data SteM(s) satisfy which SQCs in the Query SteM. For each result tuple of each query, it stores the tupleID and physicalID of all constituent base tuples of that result tuple. The results of a query can be accessed by its queryID and are ordered by timestamp (physicalID).
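Since results are ordered by physicalID, applying a window at invocation time can be a pair of binary searches. The sketch below assumes a per-query list of (physicalID, tupleID) pairs sorted by physicalID; the invoke function and that list layout are illustrative assumptions, not the paper's data structure.

```python
from bisect import bisect_left, bisect_right


def invoke(results, begin, end):
    """Return tupleIDs whose physicalID lies in [begin, end].

    `results` is a list of (physicalID, tupleID) pairs sorted by physicalID.
    """
    # (begin,) sorts before every (begin, tupleID), so ties on begin are kept.
    lo = bisect_left(results, (begin,))
    # (end, inf) sorts after every (end, tupleID), so ties on end are kept.
    hi = bisect_right(results, (end, float("inf")))
    return [tuple_id for _, tuple_id in results[lo:hi]]


stored = [(1, 10), (5, 11), (8, 12), (12, 13)]
current = invoke(stored, 5, 10)  # window covering timestamps 5 through 10
```

The stored matches are not recomputed on invocation; only the slice selected by the window changes.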

PSoup: Implementation. Eddy. Each tuple has a predicate attribute and an Interest List dictating where it is to be routed. The Eddy provides Stream Prefix Consistency by storing new and temporary tuples separately, in a New Tuple Pool (NTP) and a Temporary Tuple Pool (TTP). It begins by selecting a tuple from the NTP and then processes everything in the TTP before picking another tuple from the NTP.
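The NTP/TTP discipline can be sketched as a loop that drains the temporary pool before admitting the next new tuple. In this sketch run_eddy and the process callback are hypothetical, and routing by Interest List is not modelled; process simply returns whatever temporary tuples a step generates.

```python
from collections import deque


def run_eddy(new_tuples, process):
    """Process tuples so that all work derived from one new tuple finishes
    before the next new tuple is admitted."""
    ntp = deque(new_tuples)  # New Tuple Pool
    ttp = deque()            # Temporary Tuple Pool
    trace = []
    while ntp:
        current = ntp.popleft()
        trace.append(current)
        ttp.extend(process(current))
        # Drain the TTP completely, including temporaries of temporaries.
        while ttp:
            temp = ttp.popleft()
            trace.append(temp)
            ttp.extend(process(temp))
    return trace


def derive_once(t):
    # Toy step: each new tuple spawns a single temporary tuple.
    return [] if t.endswith("'") else [t + "'"]


order = run_eddy(["a", "b"], derive_once)
```

The resulting order interleaves each new tuple with its own temporaries, so no later arrival is processed while earlier work is still pending.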

PSoup: Implementation. Data SteM. Uses a tree-based index over the data to give probing queries efficient access, with one red-black tree for every attribute. Also maintains a hash-based index over tupleIDs for fast access.
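A rough model of this layout keeps one ordered index per attribute plus a hash index on tupleID. The sketch swaps the red-black trees for sorted Python lists searched with bisect, which give the same range-probe behaviour at a higher insertion cost; the DataSteM class and its methods are illustrative names only.

```python
from bisect import bisect_left, bisect_right, insort


class DataSteM:
    def __init__(self, attributes):
        self.by_id = {}  # hash index: tupleID -> row
        # One ordered index per attribute: sorted list of (value, tupleID).
        # Stand-in for the per-attribute red-black trees.
        self.by_attr = {attr: [] for attr in attributes}

    def insert(self, tuple_id, row):
        self.by_id[tuple_id] = row
        for attr, index in self.by_attr.items():
            insort(index, (row[attr], tuple_id))

    def range_probe(self, attr, low, high):
        """Return tupleIDs with low <= row[attr] <= high."""
        index = self.by_attr[attr]
        lo = bisect_left(index, (low,))
        hi = bisect_right(index, (high, float("inf")))
        return [tuple_id for _, tuple_id in index[lo:hi]]


stem = DataSteM(["a", "b"])
stem.insert(1, {"a": 5, "b": 1})
stem.insert(2, {"a": 9, "b": 7})
stem.insert(3, {"a": 2, "b": 7})
```

A range predicate from a newly registered query probes the ordered index, and the tupleIDs it returns are then resolved to rows through the hash index.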

PSoup: Implementation. Query SteM. Allows sharing of work between queries that have overlapping FROM clauses. Uses red-black trees to index the single-attribute, single-relation boolean factors of a query.
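Indexing predicates by their constants lets one probe by a data value find every satisfied factor without testing each query in turn. A minimal sketch for factors of the form attr > constant on a single attribute follows. The GreaterThanIndex class is hypothetical, and a sorted list again stands in for the red-black tree.

```python
from bisect import bisect_left, insort


class GreaterThanIndex:
    """Index of boolean factors `attr > constant` for one attribute."""

    def __init__(self):
        self.entries = []  # sorted list of (constant, queryID)

    def add(self, constant, query_id):
        insort(self.entries, (constant, query_id))

    def probe(self, value):
        # A factor `attr > c` holds exactly when c < value, so every entry
        # strictly before the cut point is satisfied by this data value.
        cut = bisect_left(self.entries, (value,))
        return [query_id for _, query_id in self.entries[:cut]]


index = GreaterThanIndex()
index.add(5, 1)   # query 1: attr > 5
index.add(10, 2)  # query 2: attr > 10
index.add(7, 3)   # query 3: attr > 7
satisfied = index.probe(8)
```

Other comparison operators would need their own index or a mirrored cut, which this sketch leaves out.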

PSoup: Implementation. Query SteM. For queries involving joins of multiple attributes, the tree structure doesn't work; instead, a linked list called the predicateList is used. The Query SteM contains an array in which each cell represents a query. At the beginning of a probe by a data tuple, each cell is set to the number of boolean factors in the corresponding query. Every time the tuple satisfies a boolean factor, the cell value is decremented. At the end of the probe, a cell value of 0 means the data tuple satisfies that query.
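The counting scheme can be sketched in a few lines. Here predicate_list is a plain Python list of (query index, factor) pairs standing in for the linked predicateList, and the probe function name is an assumption for illustration.

```python
def probe(predicate_list, factor_counts, row):
    """Return indexes of queries whose boolean factors are all satisfied.

    predicate_list: list of (query_index, factor), factor being row -> bool.
    factor_counts:  factor_counts[i] is the number of factors of query i.
    """
    # Reset each cell to the query's factor count at the start of the probe.
    remaining = list(factor_counts)
    for query_index, factor in predicate_list:
        if factor(row):
            remaining[query_index] -= 1
    # A cell that reached 0 means every factor of that query was satisfied.
    return [q for q, left in enumerate(remaining) if left == 0]


factors = [
    (0, lambda r: r["a"] < r["b"]),  # query 0, multi-attribute factor
    (0, lambda r: r["a"] > 0),       # query 0, second factor
    (1, lambda r: r["b"] == 3),      # query 1, single factor
]
counts = [2, 1]
matched = probe(factors, counts, {"a": 1, "b": 3})
```

Copying the counts on every probe keeps the registered factor totals intact, so the same arrays can serve the next data tuple.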

PSoup: Implementation. Results Structure. Stores metadata indicating which tuples satisfy which SQCs. This can be accomplished either by the previously mentioned bitmap or by associating with each query a linked list of the data tuples that satisfy it. Ordering by timestamp is simple for single-table queries. For join queries, the oldest timestamp is typically used.

PSoup: Performance. Implemented in Java with customized versions of the Eddy and SteMs. Examined the performance of two versions of PSoup against a baseline. PSoup-Partial (PSoup-P): maintains results corresponding to SQCs in the Results Structure, and applies the BEGIN-END clauses to retrieve current results on query invocation. PSoup-Complete (PSoup-C): continuously maintains, in linked lists, the results corresponding to the current input window for each query. NoMat (baseline): measurements of a system that doesn't materialize results.

PSoup: Performance. Storage requirements. NoMat: storage cost = space taken to store the base data streams within the maximum window over which queries are supported, plus the size of the structures. PSoup-P: storage cost = storage cost of NoMat + size of the Results Structure (either bit array or linked list). PSoup-C: storage cost >> storage cost of PSoup-P, since PSoup-C always stores the current results of the standing queries at a given time.

PSoup: Performance. Experimental setup. Varied window sizes ( ) and the number (1-8) and type of boolean factors. Measured response time and maximum supportable data arrival rate. Examined both PSoup-P and PSoup-C with and without predicate indexes. Tested a scheme to remove redundancies arising from joins. Used synthetically generated query ( ) and data streams.

PSoup: Performance Response Time vs. Window Size

PSoup: Performance Response Time vs. # Interval Predicates

PSoup: Performance Data Arrival Rate vs. # SQCs

PSoup: Performance. Summary of results. Materializing the results of queries supports higher query invocation rates. Indexing queries and lazily applying windows improves maximum data throughput. PSoup-C requires more memory. PSoup-C optimizes query invocation rate; PSoup-P optimizes data arrival rate.

PSoup: Performance. Removing redundancy in join processing: entry of a query specification or new data; composite tuples in joins.

PSoup: Aggregation Queries. PSoup can support aggregate functions. Data structures can only be shared across queries with an identical SELECT-PROJECT-JOIN clause.

PSoup: Conclusions. Treats data and query streams analogously. Can support queries that require access to data that arrived both before and after the query. Materializes results to cut down on response time and to support disconnected operation. Enables data recharging and monitoring. Future work: write data streams to disk and execute queries over them; transfer queries between disk and memory, allowing query execution to be scheduled; confront resource constraints when dealing with infinite streams; build a query browser for temporal data.

Critique. Strengths: very well written and easy to follow; clear examples and an excellent explanation of the performance results; a strong method that reduces processing time as the number of interval predicates increases. Weaknesses: insufficient data on storage costs; the experiments tested only one multiple-relation boolean factor for joins, which is unrealistic; the paper didn't address whether the same (or a similar) query could be entered twice and accidentally given two IDs.