1 NiagaraCQ: A Scalable Continuous Query System for Internet Databases CS561 Presentation Xiaoning Wang.

Slides:



Advertisements
Similar presentations
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Advertisements

Query Optimization Reserves Sailors sid=sid bid=100 rating > 5 sname (Simple Nested Loops) Imperative query execution plan: SELECT S.sname FROM Reserves.
Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
CS 540 Database Management Systems
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Segmentation and Paging Considerations
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Query Evaluation. SQL to ERA SQL queries are translated into extended relational algebra. Query evaluation plans are represented as trees of relational.
On Reducing Communication Cost for Distributed Query Monitoring Systems. Fuyu Liu, Kien A. Hua, Fei Xie MDM 2008 Alex Papadimitriou.
Memory Management (II)
Chapter 8 File organization and Indices.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
NiagaraCQ A Scalable Continuous Query System for Internet Databases.
Chapter 4 Parallel Sort and GroupBy 4.1Sorting, Duplicate Removal and Aggregate 4.2Serial External Sorting Method 4.3Algorithms for Parallel External Sort.
Computer Organization and Architecture
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research Prototype to Industrial Tool Name: DH(Dong Hwi) kwak Date:
Data Warehouse View Maintenance Presented By: Katrina Salamon For CS561.
Practical Database Design and Tuning. Outline  Practical Database Design and Tuning Physical Database Design in Relational Databases An Overview of Database.
NiagaraCQ A Scalable Continuous Query System for Internet Databases Jianjun Chen, David J DeWitt, Feng Tian, Yuan Wang University of Wisconsin – Madison.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage) Jianjun Chen et al Computer Sciences.
Event-Condition-Action Rule Languages over Semistructured Data George Papamarkos.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 12: Overview.
Database Management 9. course. Execution of queries.
Data Streams and Continuous Query Systems
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Chapter 13 Query Processing Melissa Jamili CS 157B November 11, 2004.
Query Optimization Arash Izadpanah. Introduction: What is Query Optimization? Query optimization is the process of selecting the most efficient query-evaluation.
Copyright © Curt Hill Query Evaluation Translating a query into action.
Daniel J. Abadi · Adam Marcus · Samuel R. Madden ·Kate Hollenbach Presenter: Vishnu Prathish Date: Oct 1 st 2013 CS 848 – Information Integration on the.
Chapter 16 Practical Database Design and Tuning Copyright © 2004 Pearson Education, Inc.
12.1Database System Concepts - 6 th Edition Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Join Operation Sorting 、 Other.
Computer Studies (AL) Memory Management Virtual Memory I.
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
16.7 Completing the Physical- Query-Plan By Aniket Mulye CS257 Prof: Dr. T. Y. Lin.
CS4432: Database Systems II Query Processing- Part 2.
Lec 7 Practical Database Design and Tuning Copyright © 2004 Pearson Education, Inc.
1  2004 Morgan Kaufmann Publishers Chapter Seven Memory Hierarchy-3 by Patterson.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
CS 540 Database Management Systems
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases Jianjun Chen et al Computer Sciences Dept. University of Wisconsin-Madison SIGMOD.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Practical Database Design and Tuning
CS 540 Database Management Systems
CS 440 Database Management Systems
Parallel Databases.
Database Management System
SOFTWARE DESIGN AND ARCHITECTURE
NiagaraCQ : A Scalable Continuous Query System for Internet Databases
Chapter 12: Query Processing
Chapter 15 QUERY EXECUTION.
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
Practical Database Design and Tuning
Lecture 2- Query Processing (continued)
Chapter 12 Query Processing (1)
Overview of Query Evaluation
Evaluation of Relational Operations: Other Techniques
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Chapter 11: Indexing and Hashing
COMP755 Advanced Operating Systems
Evaluation of Relational Operations: Other Techniques
Presentation transcript:

1 NiagaraCQ: A Scalable Continuous Query System for Internet Databases CS561 Presentation Xiaoning Wang

2 Presentation Outline Why NiagaraCQ? What’s NiagaraCQ? Details Experiments

3 We need Continuous Queries Allows users to obtain new results from a database without having to issue the same query repeatedly. Especially useful in an environment like the Internet comprised of large amounts of frequently changing information. Need to be able to support millions of queries due to the scale of the Internet.

4 What’s NiagaraCQ? A distributed database system for querying distributed XML data sets using a query language like XML-QL. It allows a very large number of users to be able to register continuous queries in a high- level query language such as XML-QL.

5 Anything else? Novelty: query optimization strategy

6 What’s the Approach Group!!! Assumption ?

7 Advantages of Grouping Group optimization has the following benefits: Grouped queries can share computations. Common execution plans of grouped queries can reside in memory, significantly saving on I/O costs compared to executing each query separately. Grouping makes it possible to test the “firing” conditions of many continuous queries together, avoiding unnecessary invocations.

8 Can this traditional grouping tech be applied into Continuous Query directly? NO!!!

9 Why not? Previous group optimization efforts focused on finding an optimal plan for a small number of similar queries. Not applicable to a continuous query system for the following reasons: Computationally too expensive to handle a large number of queries. Not designed for an environment like the web.

10 Incremental group optimization approach Uniqueness of NiagaraCQ

11 NiagaraCQ Supports scalable continuous query processing over multiple, distributed XML files by deploying incremental group optimization.

12 It’s a novel approach in which queries are grouped according to their signatures. Groups are created for existing queries according to their signatures, which represent similar structures among the queries. Groups allow the common parts of two or more queries to be shared. Each individual query in a query group shares the results from the execution of the group plan. Incremental group optimization

13 Cont. When a new query arrives, existing groups are considered as possible optimization choices instead of re-grouping all the queries in the system. New query is merged into existing groups whose signatures match that of the query. Existing queries are not re-grouped

14 query-split scheme Incremental group optimization scheme employs a query-split scheme. After the signature of a new query is matched, the sub- plan corresponding to the signature is replaced with a scan of the output file produced by the matching group. Optimization process then continues with the remainder of the query tree is a bottom-up fashion until the entire query has been analyzed. If no group “matches” a signature of the new query, a new query group for this signature is created in the system. Thus, each continuous query is split into several smaller queries such that inputs of each of these queries are monitored using the same techniques that are used for the inputs of user-defined continuous queries.

15 Cont. Advantages for query-split scheme: Main advantage is that it can be implemented using a general query engine with only minor modifications. Anther advantage is that the approach is very scalable.

16 Cont. Since queries are continuous being added and removed from groups, over time the quality of the group can deteriorate, leading to a reduction in the overall performance of the system. One or more groups may require “dynamic regrouping” to re-establish their effectiveness.

17 Types of Continuous Queries Change-base continuous queries Fired as soon as new relevant data becomes available. i.e. In stocks, day-traders would probably want to know the desired price information immediately. Provide better response time, but waste system resources when instantaneous answers are not really required.

18 Types of Continuous Queries Timer-based continuous queries Executed only as time intervals specified by the submitting user. i.e. In stocks, long-term investors may be satisfied being notified every hour. Can be supported more efficiently, thus query systems that support these continuous queries should be more scalable. However, since users can specify various overlapping time intervals for their continuous queries, grouping timer-based queries is much more difficult than grouping purely change- based queries.

19 NiagaraCQ (cont.) Considering only the changed portion of each updated XML file and not the entire file. Monitor and detect data source changes using both push and pull models on heterogeneous sources. Caching mechanism is used to obtain good performance with limited amounts of memory.

20 Using Expression Signature Represent the same syntax structure, but possibly different constant values, in different queries. Specific implementation of the signature concept.

21 Group Group signature Quotes.Quote.Symbol in quotes.xml constant =

22 Group Group constant table …. INTCDest.i MSFTDest.j …. Constant_valueDestination_buffer

23 Group plan Previous query plan Quotes.xml File Scan Select Symbol=“INTC ” Trigger Action ITrigger Action J

24 Group plan New plan Quotes.xmlConstant Table File Scan Join Trigger Action ITrigger Action J Symbol=Constant_value Split ……

25 Buffer design The destination buffer for the split operator is needed. Pipelined scheme Intermediate scheme

26 Pipelined scheme Split Operator ……. buffer Operator buffer

27 Disadvantage of pipeline Such a scheme doesn’t work for grouping timer- based CQ’s. It’s difficult for a split operator to determine which tuple should be stored and how long they should be stored for. The query structure is a directed graph, not a tree and hence the plan may be too complicated for a general XML-QL query engine to execute The combine plan may be very large requires resources beyond the limits of system. A large portion of the query plan may not need to be executed at each query invocation. One query may block many other queries.

28 Materialized Intermediate Files Query Split with Materialized Intermediate Files Split Trig.Act.J File Scan Trig.Act.J …. File_i File_j

29 Advantages for this design Every query is scheduled independently. Only the necessary queries are executed. Queries after the Split operator will be in standard, tree- structured query format and thus can be scheduled and executed by a general query engine. Each query in the system is about the size of a common user query, so that it can be executed without consuming an unusual amount of system resources. This approach handles intermediate files and original data files uniformly. The potential bottleneck of pipelined approach is avoided.

30 General Selection Predicates Format: “Attribute op Constant” Attribute is a path expression without wildcards in it. Op includes “-”, “ ”, because these formats dominate in the selection queries.

31 Problem for range-query groups One potential problem for range-query groups is that the intermediate files may contain a large number of duplicated tuples because range predicated of the different queries might overlap.

32 Solution: Virtual Intermediate Files Split Trig.Act.J File Scan Trig.Act.J …. Virtual Intermediate File_i VI File_i Real Intermediate File …… Only one

33 Join Operators Join operators are usually expensive, sharing common join operations can significantly reduce the amount of computation.

34 Join first? Or selection first? Advantage Disadvantage Solution: choose a better one based on a cost model

35 Timer-based Continuous Queries Grouped in the same way as change-based queries except that the time information needs to be recorded at installation time. Challenges Hard to monitor the timer events of those queries. Sharing the common computation becomes difficult due to the various time intervals. Timer-based continuous queries fires at specific times, but only if the corresponding input files have not been modified.

36 Incremental Evaluation Incremental evaluation allows queries to be invoked only on the changed data. For each file, on which CQ’s are defined, NiagaraCQ keeps a “delta file” that contains recent changes. Queries are run over the delta files whenever possible instead of their original files. A time stamp is added to each tuple in the delta file. NiagaraCQ fetches only tuples that were added to the delta file since the query’s last firing time.

37 Memory Caching Caching is used to obtain good performance with a limited amount of memory. Caches query plans, system data structures, and data files for better performance.

38 What should be cached? Grouped query plans. We assume that the number of query groups is relatively small. Recently accessed file. The event list for monitoring the timer- based events. But it can be large, so we keep only a “time window” of this list

39 Experimental Results Hardware Settings Conducted on a Sun Ultra 6000 with 1GB of RAM, running JDK1.2 on Solaris 2.6. Data Sets Database of stock information consisting of two XML files, “quotes.xml” and “companies.xml”. “quotes.xml” contains stock information on about 5000 NASDAQ companies. The size of the file is about 2 MB. “companies.xml” contains related company information. The size of the file is about 1 MB.

40 Comparison

41 Comparison

42 System Status and Future Work A prototype version of NiagaraCQ has been developed, which includes a Group Optimizer, Continuous Query Manager, Event Detector, and Data Manager. Incremental group optimizer is still at a preliminary stage. However, incremental group optimization has been shown to be a promising way to achieve good performance and scalability. Intend to extent incremental group optimization to queries containing operators other than selection and join (i.e. aggregation).