CS411 Database Systems Kazuhiro Minami 11: Query Execution.

Slides:



Advertisements
Similar presentations
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Advertisements

1 Lecture 23: Query Execution Friday, March 4, 2005.
Lecture 13: Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data.
CS CS4432: Database Systems II Operator Algorithms Chapter 15.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Implementation of Other Relational Algebra Operators, R. Ramakrishnan and J. Gehrke1 Implementation of other Relational Algebra Operators Chapter 12.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Dr. Kalpakis CMSC 661, Principles of Database Systems Query Execution [15]
Nested-Loop joins “one-and-a-half” pass method, since one relation will be read just once. Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in.
15.3 Nested-Loop Joins By: Saloni Tamotia (215). Introduction to Nested-Loop Joins  Used for relations of any side.  Not necessary that relation fits.
1 Chapter 10 Query Processing: The Basics. 2 External Sorting Sorting is used in implementing many relational operations Problem: –Relations are typically.
Lecture 24: Query Execution Monday, November 20, 2000.
Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna.
1 Lecture 22: Query Execution Wednesday, March 2, 2005.
Query Execution :Nested-Loop Joins Rohit Deshmukh ID 120 CS-257 Rohit Deshmukh ID 120 CS-257.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
1 Query Processing: The Basics Chapter Topics How does DBMS compute the result of a SQL queries? The most often executed operations: –Sort –Projection,
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
15.3 Nested-Loop Joins - Medha Pradhan - ID: CS 257 Section 2 - Spring 2008.
Sorting and Query Processing Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 29, 2005.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.
CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 242 Database Systems II Query Execution.
CPS216: Advanced Database Systems Notes 06:Query Execution (Sort and Join operators) Shivnath Babu.
Query Execution Optimizing Performance. Resolving an SQL query Since our SQL queries are very high level, the query processor must do a lot of additional.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
RELATIONAL JOIN Advanced Data Structures. Equality Joins With One Join Column External Sorting 2 SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid=S1.sid.
DBMS 2001Notes 5: Query Processing1 Principles of Database Management Systems 5: Query Processing Pekka Kilpeläinen (partially based on Stanford CS245.
CPS216: Data-Intensive Computing Systems Query Execution (Sort and Join operators) Shivnath Babu.
CPS216: Advanced Database Systems Notes 07:Query Execution (Sort and Join operators) Shivnath Babu.
CS4432: Database Systems II Query Processing- Part 3 1.
CS 257 Chapter – 15.9 Summary of Query Execution Database Systems: The Complete Book Krishna Vellanki 124.
Lecture 24 Query Execution Monday, November 28, 2005.
Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.
CS4432: Database Systems II Query Processing- Part 2.
CPS216: Advanced Database Systems Notes 07:Query Execution (Sort and Join operators) Shivnath Babu.
CSCE Database Systems Chapter 15: Query Execution 1.
Query Processing CS 405G Introduction to Database Systems.
Lecture 17: Query Execution Tuesday, February 28, 2001.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
CS 540 Database Management Systems
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.
1 Lecture 23: Query Execution Monday, November 26, 2001.
Chapter 10 The Basics of Query Processing. Copyright © 2005 Pearson Addison-Wesley. All rights reserved External Sorting Sorting is used in implementing.
CS4432: Database Systems II Query Processing- Part 1 1.
CS 540 Database Management Systems
CS 440 Database Management Systems
Evaluation of Relational Operations
Lecture 2- Query Processing (continued)
Lecture 24: Query Execution
Lecture 13: Query Execution
Query Execution Index Based Algorithms (15.6)
Lecture 23: Query Execution
Data-Intensive Computing Systems Query Execution (Sort and Join operators) Shivnath Babu.
Evaluation of Relational Operations: Other Techniques
Lecture 22: Query Execution
CSE 444: Lecture 25 Query Execution
Lecture 22: Query Execution
Lecture 11: B+ Trees and Query Execution
CSE 544: Query Execution Wednesday, 5/12/2004.
Lecture 22: Friday, November 22, 2002.
Lecture 24: Query Execution
Lecture 20: Query Execution
Presentation transcript:

CS411 Database Systems Kazuhiro Minami 11: Query Execution

Goals of Query Execution Given a RA operator such as Join, we want to design an algorithm to implement it What factor do we need to consider? What are the following cost parameters? –B(R) –T(R) –V(R, a) –M–M 2

Challenges in Query Execution Remember that only place we can modify data within a block is main memory However, we often don’t have enough main memory buffer to keep all the records of a relation Q1. We want to sort records in a table, which is bigger than main memory. How can we do this? Q2. We want to join relations, which do not fit in main memory. How can do this? 3

Q1: How to sort records in a large table? 4 Disk Main memory buffer 2, 35, 8 9, 43, 8 10, 19, 7 1, 54, 3 2, 57, 7 9, 101, 7 Each block or page contains 2 records Read/write blocks You can sort records on memory

Divide-and-conquer Approach 1.Divide the unsorted list into two sublists of about half the size 2.Sort each sublist recursively 3.Merge the two sublists back into one sorted list 5 MergeSort Q: Can we apply this technique to our problem?

(Human) Merge Sort Algorithm 1.Receive an unsorted list from your parent 2.If the list contains more than one, a)divide it into two unsorted sublist of the same size b)Find two children (i.e., your classmates) and pass them each of the sublists c)Receive your children’s sorted sublists from the smallest elements of the two list 3.Return elements in the sorted list from the smallest one while coordinating with your sibling 6

Example Merge Process Merge

Important differences from the standard Merge-sort Divide an unsorted list into sublist of size M –Q: why M? Combine multiple sorted sublists into a single sorted list –Q: how many sublists do we merge? 8 Merge

External Merge-Sort Phase one: load M blocks in memory, sort –Result: B(R)/M sorted sublists of size M M blocks of main memory Disk... M blocks B(R)/M sorted sublists R

Phase Two Merge M – 1 runs into a new run Result: runs have now M (M – 1) blocks M blocks of main memory Disk... Input M-1 Input 1 Input 2.. Output M-1 sublists

Cost of Two-Phase, Multiway Merge Sort Step 1: sort M-1 sublists of size M, write –Cost: 2B(R) Step 2: merge M-1 sublists, but include each tuple only once –Cost: B(R) Total cost: 3B(R), Assumption: B(R) <= M 2

Update this figure: Cost of External Merge Sort Number of passes: Think differently –Given B = 4KB, M = 64MB, R = 0.1KB –Pass 1: runs of length M/R = Have now sorted runs of records –Pass 2: runs increase by a factor of M/B – 1 = Have now sorted runs of 10,240,000,000 = records –Pass 3: runs increase by a factor of M/B – 1 = Have now sorted runs of records Nobody has so much data ! Can sort everything in 2 or 3 passes !

If input relations are sorted, we can do many other operations easily Duplicate removal  (R) Grouping and aggregation operations Binary operations: R ∩ S, R U S, R – S Check total cost and assumption of each method with the textbook 15.4

If two relations are sorted, we can perform Join (Simple Sort-based Join) Sort both R and S on the join attribute: –Cost: 4B(R)+4B(S) (because need to write to disk) Read both relations in sorted order, match tuples –Cost: B(R)+B(S) Difficulty: many tuples in R may match many in S –If at least one set of tuples fits in M, we are OK –Otherwise need nested loop, higher cost Total cost: 5B(R)+5B(S) Assumption: B(R) <= M 2, B(S) <= M 2, and the tuples with a common value for the join attributes fit in M.

The Problem Regarding Too Many Tuples with the same join attributes Sorted lists a a a a a a Disk Main memory buffers Join (a,a) a a Output buffer (a,a)(a,a)(a,a)(a,a)(a,a)(a,a)

We could do better if there are not many tuples with the same join attribute (Sort-Merge-Join) Idea: compute the join during the merge phase Total cost: 3B(R)+3B(S) Assumption: B(R) + B(S) <= M 2 Sorted sublists R S Main memory join M-1 input buffers Output bufferdisk

Q2: How to join two large tables without sorting them? 17 Disk R S Main memory

Tuple-based Nested Loop Joins Join R S for each tuple r in R do for each tuple s in S do if r and s join then output (r,s) Cost: T(R) T(S), or B(R) B(S) if R and S are clustered Q: How many memory buffers do we need?

Block-based Nested Loop Joins for each (M-1) blocks bs of S do for each block br of R do for each tuple s in bs do for each tuple r in br do if r and s join then output(r,s) Organize access to both argument relations by blocks Use as much main memory as we can to store tuples belonging to relation S, the relation of the outer loop

Block-based Nested Loop Joins... R & S Hash table for block of S (k < B-1 pages) Input buffer for R Output buffer... Join Result joined tuples

Block-based Nested Loop Joins Cost: –Read S once: cost B(S) –Outer loop runs B(S)/(M-1) times, and each time need to read R: costs B(S)B(R)/(M-1) –Total cost: B(S) + B(S)B(R)/(M-1) Notice: it is better to iterate over the smaller relation first S R: S=outer relation, R=inner relation

Divide-and-conquer Approach Again If one of input relations fit into main memory, our job is easy So, we want to divide relations R and S into sub- relations R 1,…,R n and S 1,…,S n such that R S = (R 1 S 1 ) U … U (R n S n ) Q: How can we divide R and S? 22

Hashing-Based Algorithms Hash all the tuples of input relations using an appropriate hash key such that: –All the tuples that need to be considered together to perform an operation goes to the same bucket Perform the operation by working on a bucket (a pair of buckets) at a time – Apply a one-pass algorithm for the operation Reduce the size of input relations by a factor of M

Partitioned Hash Join R S Step 1: –Hash S into M buckets –send all buckets to disk Step 2 –Hash R into M buckets –Send all buckets to disk Step 3 –Join every pair of buckets

Partitioned Hash-Join Partition tuples in R and S using join attributes as key hash Tuples in partition R i only match tuples in partition S i. Relation R OUTPUT 2 INPUT 1 hash function h M-1 Buckets R1R1 R2R2 R M-1... Relation S OUTPUT 2 INPUT 1 hash function h M-1 Buckets S1S1 S2S2 S M-1...

Partitioned Hash-Join: Second Pass Read in a partition of R i, hash it using another hash function h’ Scan matching partition of S, search for matches. Buckets for R Buckets for S R1R1 R2R2 R M-1 S1S1 S2S2 S M-1 hash function h’ Main Memory Buffers M-1 blocks Join

Partitioned Hash Join Cost: 3B(R) + 3B(S) Assumption: min(B(R), B(S)) <= M 2

Sort-based vs. Hash-based Algorithms Hash-based algorithms for binary operations have a size requirement only on the smaller of two input relations Sort-based algorithms sometimes allow us to produce a result in sorted order and take advantage of that sort later Hash-based algorithm depends on the buckets being of equal size, which may not be true if the number of different hash keys is small

Two-Pass Algorithms Based on Index

Index-based Algorithms The existence of an index on one ore more attributes of a relations makes available some algorithm that would not be feasible without the index Useful for selection operations Also, algorithms for join and other binary operations use indexes to good advantage

Clustering indexes In a clustered index all tuples with the same value of the key are clustered on as few blocks as possible a a aa a a a aa All the ‘a’ tuples Q: how many blocks do we need to read?

Index Based Selection Selection on equality:  a=v (R) Clustered index on a: cost B(R)/V(R,a) Unclustered index on a: cost T(R)/V(R,a) We here ignore the cost of reading index blocks

Index Based Join R S Assume S has an index on the join attribute Iterate over R, for each tuple fetch corresponding tuple(s) from S Assume R is clustered. Cost: –If index is clustered: B(R) + T(R)B(S)/V(S,a) –If index is unclustered: B(R) + T(R)T(S)/V(S,a)

Index Based Join Assume both R and S have a sorted index (B- tree) on the join attribute Then perform a merge join (called zig-zag join) Cost: B(R) + B(S) Index

Summary One-pass algorithms (Read the textbook 15.2) –Read the data only once from disk –Usually, require at least one of the input relations fit in main memory Nested-Loop Join algorithms –Read one relation only once, while the other will be read repeatedly from disk Two-pass algorithms –First pass: read data from disk, process it, write it to the disk –Second pass: read the data for further processing