CS4432: Database Systems II Query Processing- Part 2.



Overview of Query Execution
SQL Query → Compile → Optimize → Execute

Logical Plans vs. Physical Plans
– A physical plan specifies how each operator will execute (which algorithm). E.g., a join can be nested-loop, hash-based, merge-based, or sort-based.
– Each logical plan maps to multiple physical plans.
[Figure: a logical plan and one of its physical plans]

Evaluating Relational Operators

Top-Down vs. Bottom-Up Evaluation
Top-Down Evaluation
– The top operator requests a tuple from the operator below it (recursively)
– Tuples flow only when requested (pull-based)
Bottom-Up Evaluation
– The bottom operators push their tuples upward
– Tuples flow when ready (push-based)
Most DBMSs apply Top-Down Evaluation.
[Figure: a query plan whose top operator is a projection on “title”]
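The pull-based (top-down) model can be sketched in a few lines. This is only an illustration under stated assumptions: the class names Scan and Project, the helper run, and the open/get_next/close method names are hypothetical, and rows are modeled as Python dicts.

```python
class Scan:
    """Leaf operator (hypothetical): hands out stored rows one at a time."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def open(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None              # no more tuples
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Project:
    """Keeps only the requested columns of each row pulled from its child."""
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def open(self):
        self.child.open()

    def get_next(self):
        row = self.child.get_next()  # pull one tuple from the operator below
        if row is None:
            return None
        return {c: row[c] for c in self.columns}

    def close(self):
        self.child.close()


def run(root):
    """Drive the plan from the top: tuples flow only when requested."""
    root.open()
    out = []
    while (row := root.get_next()) is not None:
        out.append(row)
    root.close()
    return out
```

For example, run(Project(Scan(movies), ["title"])) pulls rows through the projection one at a time; nothing is computed until the top operator asks for it.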

Common Techniques for Evaluating Operators
Algorithms for evaluating relational operators use a few simple ideas extensively:
– Indexing: use WHERE conditions to retrieve a small set of tuples (selections, joins).
– Iteration: sometimes it is faster to scan all tuples even if there is an index. (And sometimes we can scan the data entries in an index instead of the table itself.)
– Partitioning: by sorting or hashing, we can partition the input tuples and replace an expensive operation with similar operations on smaller inputs.

Another Categorization
One-Pass Algorithms
– Need one pass over the input relation(s)
– Put limitations on the size of the inputs vs. memory
Two-Pass Algorithms
– Need two passes over the input relation(s)
– Put limitations on the size of the inputs vs. memory
Multi-Pass Algorithms
– Scale to any size and may need several passes over the input relation(s)

Categorizing Algorithms
By underlying technique
– Sort-based
– Hash-based
– Index-based
By the number of times data is read from disk (passes)
– One-pass
– Two-pass
– Multi-pass (more than 2)
By what the operators work on
– Tuple-at-a-time, unary
– Full-relation, unary
– Full-relation, binary

Common Statistics over Relation R
– B(R): # of blocks needed to hold all tuples of R
– T(R): # of tuples in R
– S(R): # of bytes in each tuple of R
– V(R, A): # of distinct values of attribute R.A
– M: # of memory buffers available
R is “clustered” → R’s tuples are packed into blocks → accessing R requires B(R) I/Os
R is “not clustered” → R’s tuples are distributed over the blocks → accessing R can require up to T(R) I/Os

Example: Join (R, S), One-Pass Iteration
Assume S is smaller than R.
Open(): read S into memory
GetNext():
    for b in blocks of R:
        for t in tuples of b:
            if t matches a tuple s of S:
                return join(t, s)
    return NotFound
Close(): clean up memory
For this join algorithm to work:
– S must fit in memory
– One additional buffer is needed for R
Key metrics:
– Memory requirement: M >= B(S) + 1
– I/O cost: B(S) + B(R)
Notes:
– Can use prefetching for R
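A runnable sketch of the one-pass join may help make the memory check and the I/O count concrete. It rests on several assumptions that are not in the slides: each relation is a list of blocks, each block is a list of Python tuples, the join is an equi-join on one column position, and the function name one_pass_join is hypothetical. The sketch materializes the whole result instead of returning one tuple per GetNext() call.

```python
from collections import defaultdict


def one_pass_join(r_blocks, s_blocks, r_key, s_key, m):
    """Equi-join R and S in one pass, holding all of S in memory.

    Returns (joined tuples, number of block I/Os).
    """
    if len(s_blocks) + 1 > m:
        raise ValueError("need M >= B(S) + 1 buffers")

    ios = 0
    table = defaultdict(list)            # in-memory search structure over S
    for block in s_blocks:               # Open(): read S into memory
        ios += 1
        for s in block:
            table[s[s_key]].append(s)

    out = []
    for block in r_blocks:               # stream R through a single buffer
        ios += 1
        for t in block:
            for s in table.get(t[r_key], []):
                out.append(t + s)
    return out, ios
```

With B(R) = 2 and B(S) = 1, the sketch reports 3 I/Os, matching B(S) + B(R).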

Example: Duplicate Elimination (Distinct R), One-Pass Iteration
– Keep a main-memory search data structure D (a search tree or hash table) that stores one copy of each tuple → M-1 buffers
– Read in each block of R one at a time (table scan) → 1 buffer
– For each tuple, check whether it appears in D
  – If yes, skip it
  – If not, add it to D and to the output buffer
What are the constraints for this algorithm to work in one pass?
– The distinct tuples of R must fit in M-1 buffers: B(δ(R)) <= M-1
– As an approximation: B(δ(R)) <= M
What is the I/O cost?
– B(R)
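As a minimal sketch of this one-pass duplicate elimination, assuming R is given as a list of blocks of hashable tuples and using a Python set in the role of D (the function name one_pass_distinct is hypothetical):

```python
def one_pass_distinct(r_blocks):
    """Yield each distinct tuple of R once, in first-seen order."""
    seen = set()                 # plays the role of D (the M-1 buffers)
    for block in r_blocks:       # one block of R in memory at a time
        for t in block:
            if t not in seen:    # first occurrence: remember it and output it
                seen.add(t)
                yield t
```

Because tuples are emitted as soon as they are first seen, the operator does not have to wait for the end of the input.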

Example: Duplicate Elimination (Cont’d)
What if relation R is sorted?
How does the duplicate elimination operator work?
– No need for the M-1 buffers (we keep only the last reported tuple)
Are there any size constraints for one pass?
– No (1 memory buffer handles R of any size)
What is the I/O cost?
– B(R)
Each operator must know the properties of its input relations (sorted or not, grouped or not, …); this makes a big difference in execution and performance.
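For sorted input, the same operator needs only constant state. A sketch under the assumption that R arrives as a list of blocks whose tuples are already in sorted order (the name distinct_sorted is hypothetical):

```python
def distinct_sorted(r_blocks):
    """Yield distinct tuples from sorted input, remembering only the last one."""
    last = object()              # sentinel that compares unequal to every tuple
    for block in r_blocks:
        for t in block:
            if t != last:        # duplicates are adjacent in sorted input
                yield t
                last = t
```

No search structure is kept, which is why there is no size constraint on R in this case.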

Example: Group By, One-Pass Iteration
– Keep a main-memory search data structure D (a search tree or hash table) that stores one entry for each group → M-1 buffers
– Read in each block of R one at a time (table scan) → 1 buffer
– For each tuple, update its group’s statistics
What are the constraints for this algorithm to work in one pass?
– The groups must fit in M-1 buffers
– This cannot be written in terms of B(R) or T(R); worst case: each tuple is its own group
What is the I/O cost?
– B(R)
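A small sketch of one-pass grouping, assuming the statistics of interest are COUNT and SUM, tuples are Python tuples addressed by column position, and D is a dict (the function name one_pass_group_by is hypothetical):

```python
def one_pass_group_by(r_blocks, key, value):
    """Return {group key: (count, sum)} after a single scan of R."""
    groups = {}                                  # plays the role of D
    for block in r_blocks:                       # one block of R at a time
        for t in block:
            count, total = groups.get(t[key], (0, 0))
            groups[t[key]] = (count + 1, total + t[value])
    return groups                                # available only after all input is read
```

The result is returned only after the last block has been processed, since any later tuple could still change a group's statistics.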

Example: Set Union (R, S), One-Pass Iteration
Assume S is smaller than R.
– Read the smaller relation (S) into main memory → M-1 buffers
– Use a main-memory search structure D so tuples can be inserted and found quickly
– Produce S’s tuples to the output as you read them
– Read R one block at a time → 1 buffer
  – If the tuple exists in D, skip it
  – Otherwise, write it to the output
What are the constraints for this algorithm to work in one pass?
– Min(B(R), B(S)) <= M-1 (or M as an approximation)
What is the I/O cost?
– B(R) + B(S)
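The union steps can be sketched as follows, assuming both inputs are true sets (no internal duplicates), relations are lists of blocks of hashable tuples, and S is the smaller input. The name one_pass_union is hypothetical.

```python
def one_pass_union(r_blocks, s_blocks, m):
    """Yield the set union of R and S, keeping all of S in memory."""
    if len(s_blocks) > m - 1:
        raise ValueError("need B(S) <= M - 1 buffers")
    d = set()                        # search structure D over S
    for block in s_blocks:
        for t in block:
            d.add(t)
            yield t                  # S's tuples go straight to the output
    for block in r_blocks:           # R streams through one buffer
        for t in block:
            if t not in d:           # already produced from S: skip
                yield t
```

Because the function is a generator, the memory check runs when the first tuple is requested.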

Blocking vs. Non-Blocking Operators
– A blocking operator cannot produce any output tuples until it has processed all of its input
– A non-blocking operator can produce output tuples without waiting until all input is consumed
For the operators we have seen so far, which are blocking?
– Join, duplicate elimination, union → non-blocking
– Grouping → blocking
Others?
– Selection, projection → non-blocking
– Sorting → blocking

Two-Pass Algorithms

Sort-based two-pass algorithms
– The first pass sorts each operand on some parameter(s)
– The second pass relies on the sort results and can be pipelined
Hash-based two-pass algorithms
How I/O is counted:
– First pass: do a prep pass and write the intermediate result back to disk → we count reading + writing
– Second pass: read from disk and compute the final results → we count reading only (if it is the final pass)

Example: 2-Pass External Sort (Sort R)
– Phase 1: Read M blocks at a time, sort them, and write them to disk as one run. Each run is sorted and is M blocks long (we have B(R)/M runs).
– Phase 2: Merge the B(R)/M runs and produce the sorted output (each run must have one memory buffer).
What is the I/O cost? What are the constraints for this algorithm to work in two passes?

Example: 2-Pass External Sort (constraints)
What are the constraints for this algorithm to work?
– Phase 1 → no constraints
– Phase 2 → each run must have a memory buffer, plus one buffer for output: B(R)/M <= M-1
– Approximately: B(R)/M <= M, i.e., B(R) <= M^2

Example: 2-Pass External Sort (I/O cost)
What is the I/O cost?
– Phase 1 → 2 x B(R) (reading and writing)
– Phase 2 → B(R) (reading)
– Total → 3 B(R)
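The two phases and the 3 B(R) count can be checked with a small in-memory simulation. This is a sketch, not a real external sort: "disk" is just Python lists, R is assumed to be a list of blocks, the function name two_pass_sort is hypothetical, and the Phase 2 merge uses the standard-library heapq.merge.

```python
import heapq


def two_pass_sort(r_blocks, m):
    """Simulate 2-pass external sort; return (sorted tuples, block I/Os)."""
    b = len(r_blocks)
    ios = 0
    runs = []

    # Phase 1: read M blocks at a time, sort them, write them back as one run.
    for i in range(0, b, m):
        chunk = r_blocks[i:i + m]
        ios += len(chunk)                        # read the chunk
        runs.append(sorted(t for block in chunk for t in block))
        ios += len(chunk)                        # write the run

    # Phase 2 needs one buffer per run plus one output buffer.
    if len(runs) > m - 1:
        raise ValueError("too many runs: need B(R)/M <= M - 1")

    ios += b                                     # read every run block once
    return list(heapq.merge(*runs)), ios
```

For four blocks and M = 3, the sketch builds two runs and reports 12 I/Os, which is 3 B(R).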

Sort-Based Duplicate Elimination (Distinct R)
Same as sorting, except that:
– While merging in Phase 2, eliminate the duplicates and produce one copy from each group of identical tuples
What is the I/O cost? What are the constraints for this algorithm to work in two passes?
– Same as for the sorting operator itself
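Only the merge step differs from plain sorting, as this sketch suggests. It makes the same simplifying assumptions as an in-memory simulation would: R is a list of blocks, runs are Python lists, and the function name sort_based_distinct is hypothetical.

```python
import heapq


def sort_based_distinct(r_blocks, m):
    """Two-pass duplicate elimination: sorted runs, then a de-duplicating merge."""
    # Phase 1: identical to external sort (one sorted run per M blocks).
    runs = [
        sorted(t for block in r_blocks[i:i + m] for t in block)
        for i in range(0, len(r_blocks), m)
    ]
    if len(runs) > m - 1:
        raise ValueError("too many runs: need B(R)/M <= M - 1")

    # Phase 2: merge the runs, emitting one copy per group of identical tuples.
    out = []
    for t in heapq.merge(*runs):
        if not out or out[-1] != t:
            out.append(t)
    return out
```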

Sort-Based Join (Join R, S)
Remember: for the one-pass join, the smaller relation must fit in memory
– B(S) <= M
What if both relations are large?

Naïve Two-Pass JOIN (Sort-Join)
1. Sort R and S on the join key
2. Merge and join the sorted R and S
Step 1 (sorting each relation):
– R → 2-pass sort → sorted R
– S → 2-pass sort → sorted S

Naïve Two-Pass JOIN: Step 2 (merge and join R and S)
– Read one block from each sorted relation at a time, and join the tuples whose join-key values appear in both relations
– When one block is consumed, read the next block from its relation
– Memory holds the current block of sorted R, the current block of sorted S, and one output buffer for the joined output
What is the I/O cost? What are the constraints for this algorithm to work in two passes?
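The merge-and-join step over two sorted inputs can be sketched like this. The sketch ignores blocks and works on two already-sorted Python lists of tuples; the function name merge_join and the key parameter are hypothetical. It handles repeated join-key values on both sides by pairing every matching R tuple with every matching S tuple.

```python
def merge_join(sorted_r, sorted_s, key=lambda t: t[0]):
    """Equi-join two inputs that are already sorted on the join key."""
    out = []
    i = j = 0
    while i < len(sorted_r) and j < len(sorted_s):
        kr, ks = key(sorted_r[i]), key(sorted_s[j])
        if kr < ks:
            i += 1                               # R is behind: advance R
        elif kr > ks:
            j += 1                               # S is behind: advance S
        else:
            # Find the end of the group of S tuples sharing this key.
            j_end = j
            while j_end < len(sorted_s) and key(sorted_s[j_end]) == kr:
                j_end += 1
            # Pair every R tuple with this key against that S group.
            while i < len(sorted_r) and key(sorted_r[i]) == kr:
                for s in sorted_s[j:j_end]:
                    out.append(sorted_r[i] + s)
                i += 1
            j = j_end
    return out
```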

Naïve Two-Pass JOIN: I/O cost
– Sorting R: 4 B(R)
– Sorting S: 4 B(S)
– Merge and join: B(R) + B(S)
– Total I/O cost = 5 (B(R) + B(S))
Notice: we counted writing the sorted output, since it is an intermediate result.

Naïve Two-Pass JOIN: constraints
– Sorting step (from the sorting algorithm): B(R) <= M^2 and B(S) <= M^2
– Merge-and-join step: no constraints
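The cost and feasibility rules for the naïve sort-join reduce to a few lines of arithmetic. A sketch, where the function name naive_sort_join_cost is hypothetical and the approximate constraint B <= M^2 from the sorting algorithm is used:

```python
def naive_sort_join_cost(b_r, b_s, m):
    """Return (feasible, total I/Os) for the naive two-pass sort-join."""
    feasible = b_r <= m * m and b_s <= m * m   # each relation sortable in two passes
    sort_r = 4 * b_r                           # 3 B(R) to sort + B(R) to write the result
    sort_s = 4 * b_s
    merge = b_r + b_s                          # read both sorted relations once
    return feasible, sort_r + sort_s + merge
```

For B(R) = 1000, B(S) = 500 and M = 101, this gives 7500 I/Os, i.e., 5 (B(R) + B(S)).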

Efficient Two-Pass JOIN (Sort-Merge-Join)
Main idea: combine Pass 2 of the sort with the join
– Phase 1 (of sorting, as is): produce the sorted runs of R (we have B(R)/M) and the sorted runs of S (we have B(S)/M)
– Phase 2 (merge and join): memory holds one buffer for each sorted run of both R and S, plus one buffer for the join output

Efficient Two-Pass JOIN (Sort-Merge-Join): I/O cost
– Phase 1 on R: 2 B(R)
– Phase 1 on S: 2 B(S)
– Phase 2 (merge and join): B(R) + B(S)
– Total cost = 3 (B(R) + B(S))

Efficient Two-Pass JOIN (Sort-Merge-Join): constraints
– Phase 1: no constraints
– Phase 2: the number of runs must fit in memory: B(R)/M + B(S)/M <= M → B(R) + B(S) <= M^2
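The sort-merge-join numbers can be checked the same way. In this sketch the function name sort_merge_join_cost is hypothetical, run counts are rounded up, and the feasibility test reserves one buffer for the join output (runs <= M - 1), which is slightly stricter than the approximation B(R) + B(S) <= M^2.

```python
import math


def sort_merge_join_cost(b_r, b_s, m):
    """Return (feasible, total I/Os) for the efficient sort-merge-join."""
    runs = math.ceil(b_r / m) + math.ceil(b_s / m)   # sorted runs of R and S
    feasible = runs <= m - 1                         # one buffer per run + one for output
    phase1 = 2 * b_r + 2 * b_s                       # read and write each relation once
    phase2 = b_r + b_s                               # read all runs once while joining
    return feasible, phase1 + phase2
```

For B(R) = 1000, B(S) = 500 and M = 101, there are 15 runs and the total is 4500 I/Os, i.e., 3 (B(R) + B(S)), compared with 7500 for the naïve version on the same inputs.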