Data Streams and Continuous Query Systems

Data Streams and Continuous Query Systems. CS 240B: Professor Zaniolo. Eric Sytwu, Joseph Joswig.

Outline: Review of Data Streams; NiagaraCQ; TelegraphCQ; Conclusion; Bibliography.

Data Sets vs. Data Streams. Data sets: infrequently changing data. Ex.: employee personnel table, contact database, library system. Data streams: data arriving continuously. Ex.: stock streamer, sensor networks, weather monitoring system.

Traditional Database Query: In a traditional query, the query engine returns a subset of the data that is currently in the system. [Figure labels: End User / Application; Query; Results; Query Processor; Static Data Sets.]

Continuous Queries: Continuous queries are persistent queries that allow users to get new results as new information enters the system. [Figure labels: End User / Application; Query; Continuous Results; Query Processor; Workspace; Data Streams.]
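The contrast between the two query styles above can be sketched in a few lines of Python. This is only an illustration: the function names `one_shot_query` and `continuous_query` and the `(symbol, price)` tuple layout are assumptions, not part of either system.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Quote = Tuple[str, float]  # (symbol, price); an assumed layout for this sketch


def one_shot_query(table: List[Quote], pred: Callable[[Quote], bool]) -> List[Quote]:
    """Traditional query: evaluate once over the data stored right now."""
    return [row for row in table if pred(row)]


def continuous_query(stream: Iterable[Quote], pred: Callable[[Quote], bool]) -> Iterator[Quote]:
    """Continuous query: stay registered and yield each new match as it arrives."""
    for item in stream:
        if pred(item):
            yield item


def is_msft(q: Quote) -> bool:
    return q[0] == "MSFT"


stored = [("MSFT", 60.0), ("INTC", 31.0)]
arriving = iter([("INTC", 32.0), ("MSFT", 48.0), ("MSFT", 52.0)])

snapshot = one_shot_query(stored, is_msft)               # answered once
incremental = list(continuous_query(arriving, is_msft))  # grows as data arrives
```

The generator never "finishes" on a real stream; here the stream is a finite iterator only so that the example terminates.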

NiagaraCQ

Goal: Allow users to obtain new results from a database without having to issue the same query repeatedly. Develop a system that allows a large number of users to register continuous queries using a high-level language like XML-QL.

What's wrong with previous continuous querying systems? Previous group optimization efforts focused on finding an optimal plan for a small number of queries. They are computationally too expensive to handle a large number of queries, and they were not designed for the web, which is constantly changing.

Benefits of NiagaraCQ: Based on group optimization. Grouped queries can share computation. Common execution plans of grouped queries reside in memory, saving on I/O costs compared to executing each query separately.

How do we get the benefits? Incremental group optimization. Groups are created for existing queries according to their signatures, which represent similar structures among the queries. Each individual query in a query group shares the results from the execution of the group plan. When a new query is submitted, the group optimizer considers existing groups as potential optimization choices, and the new query is merged into an existing group whose signature matches.
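A minimal sketch of incremental grouping, assuming that a signature is simply the query's structure with its constant abstracted away. The `SelectionQuery` fields, the `signature` function, and the `GroupOptimizer` class are hypothetical names for illustration; they are not NiagaraCQ's actual data structures.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class SelectionQuery(NamedTuple):
    query_id: str
    source: str    # e.g. "quotes.xml"
    path: str      # e.g. "Quotes.Quote.Symbol"
    op: str        # e.g. "="
    constant: str  # e.g. "INTC"


Signature = Tuple[str, str, str]


def signature(q: SelectionQuery) -> Signature:
    """Assumed signature: everything about the query except its constant."""
    return (q.source, q.path, q.op)


class GroupOptimizer:
    """Each new query joins the existing group with the same signature, if any."""

    def __init__(self) -> None:
        self.groups: Dict[Signature, List[SelectionQuery]] = defaultdict(list)

    def register(self, q: SelectionQuery) -> Signature:
        sig = signature(q)
        self.groups[sig].append(q)  # a new group is created on first use
        return sig


opt = GroupOptimizer()
opt.register(SelectionQuery("I", "quotes.xml", "Quotes.Quote.Symbol", "=", "INTC"))
opt.register(SelectionQuery("J", "quotes.xml", "Quotes.Quote.Symbol", "=", "MSFT"))
opt.register(SelectionQuery("K", "quotes.xml", "Quotes.Quote.Price", ">", "50"))
```

Queries I and J land in one group because they differ only in their constants; query K starts a second group. Existing groups are never rebuilt when a query arrives, which is what makes the grouping incremental.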

Example: XML-QL query. Expression signature: Quotes.Quote.Symbol in quotes.xml = constant

Query Plan. [Figure: two separate query plans. Each has its own File Scan of Quotes.xml, its own Select (Symbol = "INTC" in one, Symbol = "MSFT" in the other), and its own trigger action (Trigger Action I, Trigger Action J).]

Group Plan. [Figure: one shared group plan. A File Scan of Quotes.xml and a File Scan of the constant table feed a Join on Symbol = constant value; a Split operator routes the join output to Trigger Action I, Trigger Action J, ...]
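The group plan can be sketched as a single scan joined against a constant table, followed by a split. The row layout, the `constant_table` contents, and the function name `group_plan` below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Constant table: one row per registered query (constant value, destination).
constant_table: List[Tuple[str, str]] = [("INTC", "I"), ("MSFT", "J")]

# Hypothetical contents of Quotes.xml, already parsed into rows.
quotes = [
    {"Symbol": "INTC", "Price": 31.0},
    {"Symbol": "MSFT", "Price": 60.0},
    {"Symbol": "IBM", "Price": 110.0},
]


def group_plan(rows: List[dict], constants: List[Tuple[str, str]]) -> Dict[str, List[dict]]:
    """One shared scan, one join on Symbol = constant, then a split by destination."""
    # Index the constant table so the join is a hash lookup.
    by_constant = defaultdict(list)
    for value, dest in constants:
        by_constant[value].append(dest)

    outputs = defaultdict(list)
    for row in rows:                                      # shared file scan
        for dest in by_constant.get(row["Symbol"], []):   # join with constant table
            outputs[dest].append(row)                     # split to trigger action
    return dict(outputs)


results = group_plan(quotes, constant_table)
```

Adding a query with the same signature only adds a row to the constant table; the scan and the join are not duplicated.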

Materialized Intermediate Files. [Figure: query split with intermediate files. The Split operator writes to File_i, File_j, ...; a File Scan of each intermediate file feeds Trigger_Act_i, Trigger_Act_j, ...]

General Selection Predicates: predicates of the form "Attribute op Constant", where Attribute is a path expression without wildcards and op is one of "=", "<", ">", …

Join Operators: A join signature in our approach contains the names of the two data sources and the predicate for the join. Join queries with the same join signature are grouped together.

Processing Continuous Queries (CQM = continuous query manager, ED = event detector, DM = data manager, QE = query engine)
1. CQM adds continuous queries with file and timer information to enable ED to monitor events.
2. ED asks DM to monitor changes to files.
3. When a timer event happens, ED asks DM for the last modified time of files.
4. DM informs ED of changes to push-based data sources.
5. If file changes and timer events are satisfied, ED provides CQM with a list of firing CQs.
6. CQM invokes QE to execute firing CQs.
7. The file scan operator calls DM to retrieve selected documents.
8. DM returns only the data changes between the last fire time and the current fire time.
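Steps 3, 5 and 8 of this processing loop can be sketched as follows. The sketch assumes integer timestamps and an in-memory change log; the classes `DataManager` and `ContinuousQuery` and the function `on_timer` are hypothetical names, not NiagaraCQ's interfaces.

```python
from typing import Dict, List, Tuple

Change = Tuple[int, str]  # (modification time, changed content); assumed layout


class DataManager:
    """Keeps a change log per file and serves deltas (assumed design)."""

    def __init__(self) -> None:
        self.log: Dict[str, List[Change]] = {}

    def record(self, file: str, time: int, content: str) -> None:
        self.log.setdefault(file, []).append((time, content))

    def last_modified(self, file: str) -> int:
        changes = self.log.get(file, [])
        return changes[-1][0] if changes else -1

    def delta(self, file: str, last_fire: int, now: int) -> List[str]:
        """Step 8: only the changes with last_fire < time <= now."""
        return [c for t, c in self.log.get(file, []) if last_fire < t <= now]


class ContinuousQuery:
    def __init__(self, file: str) -> None:
        self.file = file
        self.last_fire = -1


def on_timer(dm: DataManager, cqs: List[ContinuousQuery], now: int) -> Dict[str, List[str]]:
    """Steps 3 and 5: on a timer event, fire only queries whose file has changed."""
    fired = {}
    for cq in cqs:
        if dm.last_modified(cq.file) > cq.last_fire:
            fired[cq.file] = dm.delta(cq.file, cq.last_fire, now)
            cq.last_fire = now
    return fired


dm = DataManager()
cq = ContinuousQuery("quotes.xml")
dm.record("quotes.xml", 1, "MSFT 60")
dm.record("quotes.xml", 2, "MSFT 48")
first = on_timer(dm, [cq], now=5)    # both changes are new
dm.record("quotes.xml", 7, "MSFT 52")
second = on_timer(dm, [cq], now=10)  # only the change at time 7
third = on_timer(dm, [cq], now=15)   # nothing changed: the query does not fire
```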

Experimental Results: Performed on a Sun Ultra 6000 with 1 GB RAM running JDK 1.2 on Solaris 2.6.

Experimental Results

Analysis of NiagaraCQ. Pros: Scalable to a large number of queries and users. Works with both change-based and timer-based continuous queries. Better performance, with less I/O required to execute queries. Cons: No dynamic re-grouping, so groups eventually become sub-optimal. Assumes queries have a common structure, which is not always the case. Incremental grouping works only for select and join as of now; eventually, aggregation may be included.

NiagaraCQ in Review: The goal was to develop an Internet-scale continuous query system using group optimization, based on the assumption that many continuous queries on the Internet will have some similarities. Proposed a novel "incremental grouping" methodology. Supports both timer-based and change-based queries.

TelegraphCQ

TelegraphCQ Design Overview. Focus: continuously adaptive query processing of high-volume and highly variable data streams. Large scale; deeply networked nature; unpredictability of the environment; need for close user interaction; data constantly moving and changing.

TelegraphCQ Restrictions: Data is pushed to the query processor. Data arrival rate can be high and bursty. Processing is on the fly: data can be stored, but real-time, one-pass analysis is important. Ordering of data is of significant importance.

Design Goals: scheduling and resource management for groups of queries; support for out-of-core (non-main-memory) data; variable adaptivity; dynamic QoS support; parallel cluster-based processing and distributed computation.

TelegraphCQ: A complete redesign and re-implementation of the Telegraph system with a focus on support for shared, continuous query processing over query and data streams. The name distinguishes it from the Telegraph project's broader focus on adaptive dataflow in general and emphasizes the challenges addressed in the new implementation.

Telegraph Module Types
Ingress and Caching: interface with external data sources. TeSS: HTML/XML screen scraper. TelNape: interfaces with popular P2P networks. Local caching hides network delays.
Query Processing: pipelined, non-blocking versions of standard relational operators such as joins, selections, projections, grouping and aggregation, and duplicate elimination. State Modules (SteMs).
Adaptive Routing: the ability to "re-optimize" the plan on a continuous basis while a query is running. Eddies. Flux (Fault-tolerant, Load-balancing eXchange): opaque dataflow module that handles buffering and reordering of streams.

Eddies. Role: continuously route tuples among a set of other modules according to a routing policy. An eddy intercepts tuples and chooses the order in which they travel between modules. An eddy can shut down each module when the end of all of its input streams is reached and the modules have completed current processing. Not designed as a general-purpose scheduler: no enforcement of resource management policies. Multiple eddies run as parallel threads on queries with disjoint sets of tables and streams.
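As a rough sketch of per-tuple routing, the class below sends each tuple to its remaining operators in an order chosen from observed statistics. The routing policy here (try the operator that has dropped the largest share of tuples first) is an assumption picked for simplicity; the slides do not specify a policy, and the class and operator names are hypothetical.

```python
from typing import Callable, Dict


class Eddy:
    """Routes each tuple through all operators, choosing the order per tuple."""

    def __init__(self, operators: Dict[str, Callable[[dict], bool]]) -> None:
        self.operators = operators
        self.seen = {name: 0 for name in operators}    # tuples routed to each operator
        self.passed = {name: 0 for name in operators}  # tuples each operator let through

    def _pass_rate(self, name: str) -> float:
        # Operators with no history get 0.0, so each one is tried early.
        return self.passed[name] / self.seen[name] if self.seen[name] else 0.0

    def route(self, tup: dict) -> bool:
        """Return True if the tuple survives every operator."""
        pending = set(self.operators)
        while pending:
            # Assumed policy: lowest observed pass rate first (ties broken by name).
            name = min(sorted(pending), key=self._pass_rate)
            pending.remove(name)
            self.seen[name] += 1
            if not self.operators[name](tup):
                return False  # dropped; remaining operators never see it
            self.passed[name] += 1
        return True


eddy = Eddy({
    "is_msft": lambda t: t["symbol"] == "MSFT",
    "above_50": lambda t: t["price"] > 50,
})
stream = [
    {"symbol": "INTC", "price": 60},
    {"symbol": "MSFT", "price": 48},
    {"symbol": "MSFT", "price": 52},
    {"symbol": "IBM", "price": 110},
    {"symbol": "MSFT", "price": 60},
]
output = [t for t in stream if eddy.route(t)]
```

The result set is the same for any operator order; what changes with the policy is how much work is spent on tuples that are eventually dropped, and the order can shift as the statistics change while the query runs.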

Adaptive Processing with Eddies and SteMs. A SteM is a temporary repository of tuples, essentially corresponding to half of a traditional join operator. It stores homogeneous tuples (i.e., tuples spanning the same set of tables) formed during query processing. It supports insert (build), search (probe), and optionally delete (eviction) operations. Two kinds of tuples can be routed to a SteM. When a tuple t in T (a build tuple) is routed to SteM_T, t is added to the set of tuples in SteM_T. When a tuple p ∉ T (a probe tuple) is routed to SteM_T, SteM_T returns concatenated matches for it to the Eddy. These concatenated matches are the tuples in {p} join SteM_T that satisfy all query predicates that can be evaluated on the columns in p and T. [Figure: streams S and T feed the Eddy; the Eddy sends S build and T probe tuples to SteM_S and T build and S probe tuples to SteM_T; the SteMs return ST matches to the Eddy.]
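The build and probe operations can be sketched with a hash index per SteM. This is an illustrative equi-join only; the `SteM` class, the `"sym"` join column, and the routing loop standing in for the Eddy are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


class SteM:
    """Half of a join: stores the tuples of one table, indexed on the join key."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.index: Dict[object, List[dict]] = defaultdict(list)

    def build(self, tup: dict) -> None:
        """Insert a tuple from this SteM's own table."""
        self.index[tup[self.key]].append(tup)

    def probe(self, tup: dict) -> List[dict]:
        """Return concatenated matches for a tuple from the other table."""
        return [{**match, **tup} for match in self.index.get(tup[self.key], [])]


stem_s, stem_t = SteM("sym"), SteM("sym")

# Interleaved arrivals from streams S and T (hypothetical data).
arrivals = [
    ("S", {"sym": "MSFT", "price": 60}),
    ("T", {"sym": "MSFT", "headline": "earnings"}),
    ("S", {"sym": "INTC", "price": 31}),
    ("S", {"sym": "MSFT", "price": 52}),
]

matches: List[dict] = []
for source, tup in arrivals:
    # Stand-in for the Eddy: build into the tuple's own SteM, probe the other.
    own, other = (stem_s, stem_t) if source == "S" else (stem_t, stem_s)
    own.build(tup)
    matches.extend(other.probe(tup))
```

Because every tuple is built before it probes, each matching pair is produced exactly once, whichever side arrives first.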

Fjords: inter-module communications API. Fjords form the links between modules, support a mixture of push (streaming) and pull (static) operations for query plans, allow modules to ignore the specifics of the data source, and support non-blocking dequeue operations.
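A minimal sketch of the push/pull idea, using Python's standard `queue` module. The class names `PushQueue` and `PullQueue`, and the use of `None` to mean "no data right now", are assumptions for illustration rather than the Fjords API.

```python
import queue
from typing import Any, Iterator, List, Optional


class PushQueue:
    """Producer-driven link: the source enqueues whenever data arrives."""

    def __init__(self) -> None:
        self._q: queue.Queue = queue.Queue()

    def enqueue(self, item: Any) -> None:
        self._q.put(item)

    def dequeue(self) -> Optional[Any]:
        # Non-blocking: return None instead of waiting when nothing is ready.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


class PullQueue:
    """Consumer-driven link: each dequeue asks the static source for one item."""

    def __init__(self, source: Iterator[Any]) -> None:
        self._source = source

    def dequeue(self) -> Optional[Any]:
        return next(self._source, None)


def drain_ready(link) -> List[Any]:
    """A module reads whatever is available without knowing the link type."""
    items = []
    while (item := link.dequeue()) is not None:
        items.append(item)
    return items


sensor = PushQueue()
sensor.enqueue({"temp": 21})
table = PullQueue(iter([{"id": 1}, {"id": 2}]))

from_stream = drain_ready(sensor)  # one reading is ready now
idle = sensor.dequeue()            # nothing ready: returns immediately
from_table = drain_ready(table)
```

The consuming module calls the same `dequeue` on both links, which is the sense in which a module can ignore the specifics of its data source.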

System Specifications: built on the PostgreSQL platform; process-per-connection model; coded in C/C++.

Example: Landmark Query. The input windows of these queries have a fixed beginning point in the timeline and a forward-moving endpoint. Example: "Select all the days after the hundredth trading day on which the closing price of MSFT has been greater than $50. Keep this query standing in the system for a thousand trading days."
    SELECT closingPrice, timestamp
    FROM ClosingStockPrices
    WHERE stockSymbol = 'MSFT' AND closingPrice > 50.00
    for (t = 101; t <= 1100; t++) {
        WindowIs(ClosingStockPrices, 101, t);
    }
Sample stream (symbol, trading day, closing price): MSFT 101 $60; MSFT 102 $48; MSFT 103 $52; MSFT 104 $60
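The landmark window can be simulated over the four sample rows from the slide. This is a sketch of the window semantics only, not of how TelegraphCQ executes the query; the function name `landmark_query` is hypothetical, and the loop is cut off at day 104 because the sample data ends there.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, float]  # (trading day, closing price) for MSFT

# The four sample rows from the slide.
closing_stock_prices: List[Row] = [(101, 60.0), (102, 48.0), (103, 52.0), (104, 60.0)]


def landmark_query(rows: List[Row], start: int, end: int) -> Iterator[Tuple[int, list]]:
    """For each t, evaluate the query over the window [start, t]."""
    for t in range(start, end + 1):                       # for (t = start; t <= end; t++)
        window = [r for r in rows if start <= r[0] <= t]  # WindowIs(rows, start, t)
        yield t, [(price, day) for day, price in window if price > 50.00]


results = dict(landmark_query(closing_stock_prices, 101, 104))
```

The left edge of the window stays at day 101 while the right edge advances, so each result contains all earlier matches plus any new one.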

Example: Sliding Window Query. The input windows of these queries have forward-moving beginning and end points. Example: "On every third trading day starting today, calculate the average closing price of MSFT for the three most recent trading days. Keep the query standing for fifty trading days."
    SELECT AVG(closingPrice)
    FROM ClosingStockPrices
    WHERE stockSymbol = 'MSFT'
    for (t = ST; t < ST + 50; t += 3) {
        WindowIs(ClosingStockPrices, t - 2, t);
    }
Sample stream (symbol, trading day, closing price): MSFT 101 $60; MSFT 102 $48; MSFT 103 $52; MSFT 104 $56; MSFT 105 $55; MSFT 106 $58; MSFT 107 $52; MSFT 108 $60
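The sliding window can be simulated over the sample rows in the same way. The start day ST = 103 and the shortened standing period of 6 days are assumptions chosen so that every window is fully covered by the sample data; `sliding_avg` is a hypothetical name.

```python
from typing import Dict, Iterator, Tuple

# Sample MSFT closing prices from the slide, keyed by trading day.
closing_stock_prices: Dict[int, float] = {
    101: 60.0, 102: 48.0, 103: 52.0, 104: 56.0,
    105: 55.0, 106: 58.0, 107: 52.0, 108: 60.0,
}


def sliding_avg(prices: Dict[int, float], st: int, days: int,
                hop: int = 3, width: int = 3) -> Iterator[Tuple[int, float]]:
    """For t = st, st + hop, ..., average the prices in the window [t - width + 1, t]."""
    for t in range(st, st + days, hop):  # for (t = ST; t < ST + days; t += 3)
        window = [prices[d] for d in range(t - width + 1, t + 1) if d in prices]
        if window:                       # skip empty windows instead of returning NULL
            yield t, sum(window) / len(window)


results = dict(sliding_avg(closing_stock_prices, st=103, days=6))
```

With these inputs the query fires on days 103 and 106, over the windows 101 to 103 and 104 to 106. Both edges of the window move, so consecutive results share no rows when the hop equals the width.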

Example: Temporal Band Join Query. These queries join tuples in one stream with tuples in another based on timestamp. Example: "For the five most recent trading days starting today, select all stocks that closed higher than MSFT on a given day. Keep the query standing for twenty trading days."
    SELECT c2.*
    FROM ClosingStockPrices AS c1, ClosingStockPrices AS c2
    WHERE c1.stockSymbol = 'MSFT' AND c2.stockSymbol != 'MSFT'
      AND c2.closingPrice > c1.closingPrice AND c2.timestamp = c1.timestamp
    for (t = ST; t < ST + 20; t++) {
        WindowIs(c1, t - 4, t);
        WindowIs(c2, t - 4, t);
    }
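A sketch of the band join semantics as a nested loop over two windows of the same stream. The sample rows below are invented for this illustration (the slide gives no data for this query), and `window_is` and `band_join` are hypothetical helper names.

```python
from typing import Any, Dict, Iterator, List, Tuple

Row = Dict[str, Any]

# Hypothetical sample data, invented for this sketch (not from the slides).
closing_stock_prices: List[Row] = [
    {"stockSymbol": "MSFT", "timestamp": 1, "closingPrice": 50.0},
    {"stockSymbol": "INTC", "timestamp": 1, "closingPrice": 55.0},
    {"stockSymbol": "IBM", "timestamp": 1, "closingPrice": 45.0},
    {"stockSymbol": "MSFT", "timestamp": 2, "closingPrice": 60.0},
    {"stockSymbol": "INTC", "timestamp": 2, "closingPrice": 58.0},
    {"stockSymbol": "IBM", "timestamp": 2, "closingPrice": 65.0},
]


def window_is(rows: List[Row], lo: int, hi: int) -> List[Row]:
    """Rows whose timestamp lies in [lo, hi]."""
    return [r for r in rows if lo <= r["timestamp"] <= hi]


def band_join(rows: List[Row], st: int, days: int) -> Iterator[Tuple[int, List[Row]]]:
    for t in range(st, st + days):  # for (t = ST; t < ST + days; t++)
        c1 = [r for r in window_is(rows, t - 4, t) if r["stockSymbol"] == "MSFT"]
        c2 = [r for r in window_is(rows, t - 4, t) if r["stockSymbol"] != "MSFT"]
        yield t, [b for a in c1 for b in c2
                  if b["timestamp"] == a["timestamp"]
                  and b["closingPrice"] > a["closingPrice"]]


results = dict(band_join(closing_stock_prices, st=1, days=2))
winners = {t: [(r["stockSymbol"], r["timestamp"]) for r in rows]
           for t, rows in results.items()}
```

On day 1 only INTC closed above MSFT; on day 2 the five-day window still contains day 1, so the result holds INTC for day 1 and IBM for day 2.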

Pros and Cons of System. Pros: Focus on extreme adaptability. New code is multithreaded to help boost system parallelism and enhance performance, particularly in multiprocessor scenarios. Cons: Code is not fully multithreaded (existing PostgreSQL code). Queries are separated into classes for processing based on disjoint footprints. Still in early development stages. Issues still need to be solved. No extensive performance analysis.

TelegraphCQ Future Work. Egress modules: include fault tolerance in the delivery of results (e.g., in mobile networks); improved interface with overlay networks. Cluster and distributed implementations: extension of the Flux module; integration with the TAG system.

Conclusion and Review. NiagaraCQ: a system that establishes scalability with a general strategy of incremental group optimization. TelegraphCQ: a system that combines prior work in Fjords, Eddies, and PSoup in order to query streaming data on large scales. Other data streaming solutions: Aurora, STREAM, StreamMill.

Thank You for your time!

Bibliography
J. Chen, D. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proc. of the ACM SIGMOD Conf. on Management of Data, 2000.
Xiaoning Wang. NiagaraCQ presentation. www.cs.wpi.edu/~cs561/s03/talks/niagara-cq.ppt
Chandrasekaran, et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. UC Berkeley. CIDR Conference, 2003.