Evaluating Window Joins Over Unbounded Streams By Nishant Mehta and Abhishek Kumar.


Problem
Processing joins over unbounded streams.
Solution: moving window joins. Queries have window predicates: given two streams R and S, we are only interested in R tuples that have arrived in the last t1 seconds and S tuples that have arrived in the last t2 seconds.

Moving Window Join
λ_a: arrival rate of stream A; λ_b: arrival rate of stream B
T_a: time window size of stream A; T_b: time window size of stream B

Central Point
The paper proposes a cost model for evaluating moving window joins. Using this cost model, it proposes strategies for maximizing the efficiency of join processing in different scenarios.

Background
Implementation strategies for joins. Example: R ⋈ S with join condition a = b.
Nested Loop Join (brute force): for each record t in R, retrieve every record s from S and test whether the two satisfy the condition t[a] = s[b].
Hash Join:
1. Inputs: a build input (the smaller one) and a probe input.
2. Scan the build input and generate a hash table using a hashing function on attribute "a".
3. For each probe row, compute the hash key's value, scan the corresponding hash bucket, and produce the matches.
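The two strategies above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides or the paper: rows are assumed to be plain tuples, and the function names and the positional arguments a and b (column indexes of the join attributes) are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(R, S, a, b):
    """Brute force: compare every record t in R with every record s in S."""
    return [(t, s) for t in R for s in S if t[a] == s[b]]


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on the (smaller) build input, then probe it."""
    table = defaultdict(list)
    for row in build:                      # step 2: scan the build input
        table[row[build_key]].append(row)
    matches = []
    for row in probe:                      # step 3: probe one bucket per row
        for candidate in table.get(row[probe_key], []):
            matches.append((candidate, row))
    return matches


R = [(1, "r1"), (2, "r2")]
S = [(2, "s1"), (3, "s2"), (2, "s3")]
print(nested_loop_join(R, S, 0, 0))
print(hash_join(R, S, 0, 0))
```

Both calls produce the same pairs; the hash join only inspects the rows whose key hashes to the probed bucket, while the nested loop inspects every row of S for every row of R.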

Algorithm: moving window join (NLJ)
For each arrival of a new tuple from stream A:
1. Scan stream B's window to find any matching tuples and propagate them to the result.
2. Insert the new tuple into stream A's window.
3. Invalidate all expired tuples in stream A's window. These are just the tuples whose timestamp is now outside the current time window.
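The three steps above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' implementation: tuples are reduced to a timestamp and a join key, timestamps are assumed to arrive in non-decreasing order, and the class and method names are hypothetical. One addition beyond the slide: the probe skips tuples in the opposite window whose timestamp has already expired, since that window is only cleaned when its own stream delivers a tuple.

```python
from collections import deque


class WindowJoinNLJ:
    """Nested-loop moving window join over two streams "A" and "B"."""

    def __init__(self, t_a, t_b):
        self.size = {"A": t_a, "B": t_b}          # time window sizes
        self.window = {"A": deque(), "B": deque()}

    def on_arrival(self, stream, ts, key):
        other = "B" if stream == "A" else "A"

        # Step 1: scan the opposite window for matching tuples.
        # (Assumption: expired tuples still in that window are ignored.)
        cutoff = ts - self.size[other]
        matches = [
            (key, ts, o_ts)
            for o_ts, o_key in self.window[other]
            if o_key == key and o_ts > cutoff
        ]

        # Step 2: insert the new tuple into its own window.
        own = self.window[stream]
        own.append((ts, key))

        # Step 3: invalidate tuples that fell out of the time window.
        while own and own[0][0] <= ts - self.size[stream]:
            own.popleft()

        return matches


join = WindowJoinNLJ(t_a=10, t_b=5)
join.on_arrival("A", 0, "k")
print(join.on_arrival("B", 3, "k"))   # joins with the A tuple from time 0
```

A deque keeps the window in arrival order, so invalidation only has to look at the oldest entries.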

Questions
1. How can we measure the efficiency of a moving window join evaluation strategy, since the traditional metric of execution time to completion does not apply?
2. Can an algorithm for a moving window join take advantage of asymmetries in the rates of the input streams?
3. How can we deal with cases in which an input stream is so fast that the system cannot keep up?
4. If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?

Cost Model
Evaluating window joins in 3 scenarios:
1. One stream is faster than the other. The question is whether we can exploit this to optimize the performance of the join algorithm.
2. Resources are insufficient to keep up with the speed of the input streams, i.e. the service rate is slower than the arrival rate.
3. Memory is the constraining resource. The problem is the following: given a fixed amount of memory and flexible window times, how can we adjust the window sizes so that the number of tuples produced is maximized?

Cost Model
The traditional cardinality-based cost model cannot produce cost estimates for streams, since the algorithm may never complete. We need something that measures the rate at which output is generated, and then optimize the algorithm to maximize this rate. This is called rate-based query optimization.

Cost Model
A cost formula for computing the window join: each arrival from stream A triggers three tasks (probe, insert, invalidate), and the same holds for each arrival from stream B. The same formula is used from here on to evaluate the cost of the different join implementations; only parameters such as probe(b) change. For NLJ, for example, probe(b) becomes B · C_n, where C_n is the cost of accessing a single tuple and B is the number of tuples in B's window.

Cost Model
The cost of a single join operation can be divided into 2 independent components, one for each input stream. The cost of one direction is the unit cost of joining A tuples to B tuples plus the invalidation and insertion cost for tuples into B, i.e. the aggregate cost of accessing window B in a single time unit.

Related Work
Query Scrambling
Adaptive Query Processing
Streaming algorithms for hash join (SHJ and XJoin)
Diag-Join (data warehouse environment): most warehouse joins are performed on foreign keys, and matching tuples are likely to be found physically close to each other, within the time frame of their creation.
Babu and Widom: proposed an architecture for a general-purpose stream data management system and identified research problems in continuous query processing over streams.

Cost of Nested Loops Join A to B
For nested loop joins:
Cost of the nested loop join = cost of accessing one tuple × number of tuples accessed in unit time
Number of tuples accessed = λ_a · B = λ_a · T_b · λ_b = λ_a · probe(b)
Cost of insertion = 1 tuple, i.e. only the inserted tuple
Cost of invalidation = 1 tuple, on average
Cost of a single tuple access = C_n
Putting it all together gives the cost formula for NLJ.
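As a worked illustration of how these quantities combine, the sketch below adds the probing cost to the window maintenance cost for one join direction. It assumes that insertion and invalidation are each charged as one tuple access per arriving B tuple, which follows the wording of the slides but is not a formula copied from them; the function and parameter names are hypothetical.

```python
def nlj_one_way_cost(rate_a, rate_b, t_b, c_n):
    """Per-time-unit cost of joining A tuples to window B with NLJ.

    rate_a, rate_b: arrival rates (tuples per time unit)
    t_b:            time window size of stream B
    c_n:            cost of accessing a single tuple
    """
    window_b = rate_b * t_b               # B = lambda_b * T_b tuples in window B
    probe = rate_a * window_b * c_n       # every A arrival scans all of window B
    # Assumption: one insertion and one invalidation per arriving B tuple.
    maintain = rate_b * (c_n + c_n)
    return probe + maintain


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
print(nlj_one_way_cost(rate_a=10, rate_b=70, t_b=10, c_n=1))
```

With these inputs window B holds 700 tuples, probing costs 7000 accesses per time unit, and maintenance adds 140, so the probing term dominates as soon as the window is large.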

Cost of Hash Join A to B
If hash join (HJ) is used, the cost of probe(b) and invalidate(b) is a function of the hash bucket size in window B.
A typical probe requires one key hashing plus one key comparison for each tuple in the bucket.
Number of tuples in a hash bucket in window B = T_b · λ_b / |B|
Again, putting everything together gives the cost formula for HJ.
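The bucket-size expression can be turned into a rough per-probe estimate. The sketch below is an assumption-laden illustration rather than the paper's formula: it treats |B| as the number of hash buckets, assumes tuples spread evenly across buckets, and introduces hypothetical cost constants c_hash (one key hashing) and c_cmp (one key comparison).

```python
def hash_probe_cost(rate_b, t_b, num_buckets, c_hash, c_cmp):
    """Estimated cost of one probe into window B's hash table."""
    # Tuples per bucket = T_b * lambda_b / |B| (uniform spread assumed).
    bucket_size = rate_b * t_b / num_buckets
    # One hashing of the probe key, then one comparison per bucket entry.
    return c_hash + bucket_size * c_cmp


# Hypothetical numbers: 700 tuples in the window spread over 100 buckets.
print(hash_probe_cost(rate_b=70, t_b=10, num_buckets=100, c_hash=2, c_cmp=1))
```

Compared with a nested loop probe, which touches every tuple in the window, the hash probe pays a fixed hashing cost but scans only one bucket.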

Cost of Full Joins
Full joins are categorized into two types:
Symmetric joins: the same join mechanism is used from A to B as well as from B to A, e.g. HHJ (hash joins from both sides) and NNJ (nested loop joins from both sides).
Asymmetric joins: a combination of HJ and NLJ is used, for example HNJ (nested loop join from A to B and hash join from B to A).
The cost of each full join is built from the one-directional cost formulas.

Cost Curves for Full Joins
What results do we see in these graphs?

Observations from the previous graphs
When the difference between the input stream rates is minimal, HJ outperforms every other join mechanism.
As the difference increases, the cost of HJ increases considerably and exceeds that of HNJ.
At about 70 tuples/sec (graph 1) and 140 tuples/sec (graph 2), there is a performance crossover point.

Determining Crossover Points
In graph 1 the crossover point was 70 tuples/sec, which is roughly when input stream B is 7 times faster than stream A.
To calculate crossover points accurately, we use the cost formulas obtained previously to derive an equation for the crossover point.
How is this equation useful? For given streams, we can determine when NLJ will outperform HJ, depending on the ratio of the arrival rates of the input streams.

Maximizing Efficiency of Processing Joins
The following 3 scenarios are considered:
1. One stream is much faster than the other.
2. Computing resources are insufficient to keep up with the speed of the input streams.
3. Memory resources are limited.

Exploiting Asymmetry in Input Stream Speeds
Some assumptions:
The two time windows are fixed.
The aggregate speed of the two streams is less than the system's service rate μ (λ_a + λ_b < μ).
An inequality derived from the cost formulas determines the likely winner between NLJ and HJ: if the inequality holds, NLJ will outperform HJ; otherwise HJ will outperform NLJ.

Graphs supporting the previous hypothesis
What observations can we make from these graphs?
With an increasing mismatch between the input rates, the performance of HHJ degrades before that of HNJ.
After the thrashing point is reached, the performance degradation of HNJ is less severe than that of HHJ.

Maximizing the Number of Result Tuples with Limited Computing Resources
This scenario arises in the following cases:
Evaluation of expensive predicates.
The input streams are faster than the join operator's service rate (λ_a + λ_b > μ).
Consequences:
Not all result tuples can be generated, or else the system falls behind.
The streams need to be "regulated" by dropping some tuples.
But what policy should be adopted when regulating the streams? There are 3 basic choices:
1) Proportional to input rates.
2) Proportional to window size.
3) Equal distribution.

We have a winner!
The equal distribution strategy is the winner in this case.
Mathematical analysis of the cost model proposed in the paper also confirms the result: the maximum number of output tuples is generated when the ratio of the two input stream rates is equal to 1.
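The three regulation policies and the reason equal distribution wins can be illustrated numerically. The sketch below rests on labeled assumptions that are not taken from the slides: the admitted rates always add up to the service rate mu, neither stream is slower than its assigned share, and the output rate of a window join is taken to be proportional to r_a · r_b · (T_a + T_b) for admitted rates r_a and r_b. All function names are hypothetical.

```python
def admitted_rates(policy, rate_a, rate_b, t_a, t_b, mu):
    """Split the service rate mu between the two streams."""
    if policy == "rate":          # 1) proportional to input rates
        share_a = rate_a / (rate_a + rate_b)
    elif policy == "window":      # 2) proportional to window size
        share_a = t_a / (t_a + t_b)
    elif policy == "equal":       # 3) equal distribution
        share_a = 0.5
    else:
        raise ValueError(f"unknown policy: {policy}")
    return mu * share_a, mu * (1 - share_a)


def relative_output(r_a, r_b, t_a, t_b):
    """Assumed proportionality: output rate ~ r_a * r_b * (T_a + T_b)."""
    return r_a * r_b * (t_a + t_b)


# Hypothetical workload: the streams deliver 400 tuples/sec, the join serves 200.
params = dict(rate_a=100, rate_b=300, t_a=20, t_b=10, mu=200)
for policy in ("rate", "window", "equal"):
    r_a, r_b = admitted_rates(policy, **params)
    print(policy, relative_output(r_a, r_b, params["t_a"], params["t_b"]))
```

Because r_a + r_b is fixed at mu, the product r_a · r_b is largest when the two admitted rates are equal, which matches the ratio-of-1 result stated above.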

Maximizing the Number of Result Tuples with Limited Memory
Assumptions:
We have a variable time window.
The arrival rate is constant.
Memory is a constraint, hence memory allocation strategies are needed.
What are the different ways in which we can allocate memory to the streams?
All to one: we allocate all resources to one stream, either the slower one or the faster one.
Proportional to the arrival rate, either directly or inversely.
Equal distribution (our winner in the last case). Will equal distribution win again?

A new winner!
The Max A strategy, which allocates all memory to the slower stream, is the clear winner.
In this strategy, we keep the slower stream in memory and let the faster one probe against it and pass by, thus maximizing the number of result tuples.
Mathematical analysis of the cost model confirms this result.
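A small calculation shows why giving all memory to the slower stream comes out ahead. The sketch below is an illustration under assumptions that are not taken from the slides: memory is counted in tuples, a window holding mem tuples of a stream with rate r spans mem / r time units, and the output rate is taken to be proportional to λ_a · λ_b · (T_a + T_b), which simplifies to λ_b · mem_a + λ_a · mem_b. Stream A is assumed to be the slower stream, and all names are hypothetical.

```python
STRATEGIES = ("all_to_slower", "all_to_faster", "proportional", "inverse", "equal")


def allocate(strategy, memory, rate_a, rate_b):
    """Return (mem_a, mem_b); stream A is assumed to be the slower stream."""
    total = rate_a + rate_b
    share_a = {
        "all_to_slower": 1.0,
        "all_to_faster": 0.0,
        "proportional": rate_a / total,   # directly proportional to arrival rate
        "inverse": rate_b / total,        # inversely proportional to arrival rate
        "equal": 0.5,
    }[strategy]
    mem_a = memory * share_a
    return mem_a, memory - mem_a


def relative_output(mem_a, mem_b, rate_a, rate_b):
    """Assumed: output ~ rate_a*rate_b*(T_a + T_b), with T_x = mem_x / rate_x."""
    return rate_b * mem_a + rate_a * mem_b


def best_strategy(memory, rate_a, rate_b):
    return max(
        STRATEGIES,
        key=lambda s: relative_output(*allocate(s, memory, rate_a, rate_b), rate_a, rate_b),
    )


# Hypothetical workload: A delivers 10 tuples/sec, B delivers 100 tuples/sec.
print(best_strategy(memory=1000, rate_a=10, rate_b=100))
```

Each tuple slot given to the slower stream is weighted by the faster stream's rate, so every slot moved from the faster stream's window to the slower one increases the estimated output.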

Conclusions and Future Work
A unit-time-based model for analyzing the expected performance of moving window joins is introduced.
The proposed cost model divides the join cost into two independent terms, each corresponding to one of the two join directions.
This work can be extended to a cost model that goes beyond single joins and covers full query plans.
Algorithms other than NLJ and HJ can be modeled and evaluated.