Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.

Why Do We Learn This?
– How can SQL queries be implemented efficiently in a relational database system?
– As a DBA, how do you tune your database system for better performance?
– Must-know material if you want a job at Oracle, the Microsoft SQL Server group, the IBM DB2 group, Google, …

Outline
– Logical/physical operators
– Cost parameters and sorting
– One-pass algorithms
– Nested-loop joins
– Two-pass algorithms

Query Processing
[Figure: layered architecture. User/Application → (query or update) → Query compiler → (query execution plan) → Execution engine → (record, index requests) → Index/record mgr. → (page commands) → Buffer manager → (read/write pages) → Storage manager → storage]

Logical vs. Physical Operators
– Logical operators: what they do, e.g., union, selection, projection, join, grouping, …
– Physical operators: how they do it, e.g., nested loop join, sort-merge join, hash join, index join

Query Execution Plans
SELECT P.buyer FROM Purchase P, Person Q WHERE P.buyer = Q.name AND Q.city = 'tallahassee' AND Q.phone > ' '
[Figure: query plan tree. Leaves Purchase and Person, annotated (Table scan) and (Index scan), feed a join on Buyer = name (Simple Nested Loops), followed by the selection City = 'tallahassee' AND phone > ' ' and the projection on buyer.]
Query plan:
– logical tree
– implementation choice at every node
– scheduling of operations
Some operators are from relational algebra, and others (e.g., scan, group) are not.

How Do We Combine Operations?
The iterator model: each operation is implemented by three functions:
– Open: sets up the data structures and performs initializations
– GetNext: returns the next tuple of the result, or NotFound if there are no more tuples to return
– Close: ends the operation and cleans up the data structures
Enables pipelining!

Example: Table Scan
[Figure: slide image not included in the transcript]

The Iterator Model
Query execution always begins with the root iterator:
– Parent.Open() invokes Child.Open() and sometimes Child.GetNext()
– Parent.GetNext() invokes Child.GetNext() or nothing
– Parent.Close() invokes Child.Close()
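
The Open/GetNext/Close protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names TableScan and Select, the method names open/get_next/close, the use of None for NotFound, and the in-memory list standing in for a table are all hypothetical and not taken from the slides.

```python
class TableScan:
    """Leaf iterator over an in-memory list of tuples (stand-in for a stored table)."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0  # initialize the scan position

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # plays the role of NotFound
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Select:
    """Selection iterator: pulls tuples from its child and keeps the matching ones."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()  # Parent.Open() invokes Child.Open()

    def get_next(self):
        # Keep pulling from the child until a tuple qualifies or the child is exhausted.
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()  # Parent.Close() invokes Child.Close()


def run(root):
    """Drive the plan from the root iterator, one tuple at a time."""
    root.open()
    out = []
    while (row := root.get_next()) is not None:
        out.append(row)
    root.close()
    return out


people = [("ann", "tallahassee"), ("bob", "miami"), ("cy", "tallahassee")]
plan = Select(TableScan(people), lambda r: r[1] == "tallahassee")
result = run(plan)
```

No intermediate result is stored between the scan and the selection; each tuple flows through the whole plan before the next one is fetched, which is what makes pipelining possible.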

Query Execution: Materialization vs. Pipelining
Materialization: the result of each operation is stored on disk until it is needed by another operation
– high disk I/O
– one operation at a time
Pipelining: the result of each operation is created in main memory and passed directly to the operation that uses it
– saves disk I/O
– operations are performed simultaneously
– fails when the size of the results and intermediate data structures exceeds the limits of main memory

Cost Parameters
Cost parameters:
– M = number of blocks that fit in main memory
– B(R) = number of blocks holding R
– T(R) = number of tuples in R
– V(R,a) = number of distinct values of the attribute a in R
Estimating the cost:
– Important in optimization (next lecture)
– Compute I/O cost only (memory ↔ disk): accesses to disk are much slower than accesses to RAM!
– We compute the cost to read the tables
– We don't compute the cost to write the final result (because of pipelining)

Classification of Algorithms
One-pass algorithms read the data from disk only once. They work when at least one of the relations fits in main memory
– Exception: projection and selection
Two-pass algorithms read the data from disk twice. They work for data that does not fit in main memory, but not for the largest imaginable data
– Sorting-based
– Hash-based
Multi-pass algorithms read the data three or more times, and are natural, recursive generalizations of the two-pass algorithms. They work without limit on the size of the data

Table Scanning
If the table is clustered (its blocks consist only of records from this table):
– Table scan: if we know where the blocks are. Cost: B(R)
– Index scan: if we have an index to find the blocks. Cost: B(R)
If the table is unclustered (its records are placed on blocks with other tables):
– May need one read for each record
– Cost: T(R)

One-pass Algorithms
Selection σ(R), projection π(R)
– Both are tuple-at-a-time algorithms: read a block at a time, use one memory buffer, and produce the output
– Cost: B(R) if R is clustered, T(R) if R is unclustered
– Assumption: M ≥ 1 for the input buffer, regardless of B
[Figure: input buffer → unary operator → output buffer]

One-pass Algorithms
Duplicate elimination: δ(R)
– Need to keep a dictionary in memory in order to maintain one copy of every tuple we have seen: a balanced search tree, O(n log n), or a hash table, O(n)
– Memory requirements: 1 buffer to store one block of input tuples; M − 1 buffers to store one copy of each distinct tuple
– Cost: B(R)
– Assumption: B(δ(R)) <= M
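
A minimal sketch of one-pass duplicate elimination, assuming the relation is given as a Python list of blocks (each block a list of hashable tuples) and that the set of distinct tuples fits in memory. The function name one_pass_distinct and the block-read counter are hypothetical, added only to make the B(R) cost visible.

```python
def one_pass_distinct(blocks):
    """Return the distinct tuples in first-seen order and the number of block reads."""
    seen = set()   # in-memory dictionary holding one copy of each distinct tuple
    out = []
    reads = 0
    for block in blocks:      # one input buffer: read a block at a time
        reads += 1
        for t in block:
            if t not in seen:  # hash lookup, O(1) expected per tuple
                seen.add(t)
                out.append(t)
    return out, reads


blocks = [[1, 2, 2], [3, 1, 4], [4, 4, 5]]
distinct, io = one_pass_distinct(blocks)
```

Each block is read exactly once, so the read count equals B(R) regardless of how many duplicates the relation contains.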

One-pass Algorithms
Grouping: γ city, sum(price) (R)
– Need to keep a dictionary in memory
– Also store the sum(price) for each city
– Cost: B(R)
– Assumption: number of cities fits in memory
Binary operations: R ∩ S, R ∪ S, R – S
– Assumption: min(B(R), B(S)) <= M
– One buffer is used to read the blocks of the larger relation, while M − 1 buffers are needed to house the entire smaller relation and its main-memory data structures
– Cost: B(R)+B(S)
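
The grouping case can be sketched the same way: one dictionary entry per city holds the running sum. The function name and the (city, price) tuple layout are assumptions made for this illustration.

```python
def one_pass_sum_by_city(blocks):
    """Compute sum(price) per city in a single scan over the blocks of R."""
    totals = {}  # one entry per distinct city; must fit in memory
    for block in blocks:
        for city, price in block:
            totals[city] = totals.get(city, 0) + price
    return totals


sales = [[("tallahassee", 10), ("miami", 5)], [("tallahassee", 7), ("tampa", 1)]]
totals = one_pass_sum_by_city(sales)
```

Memory use depends on the number of groups rather than on the size of R, which matches the assumption stated on the slide.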

Nested Loop Joins
Tuple-based nested loop join R ⋈ S
– R = outer relation, S = inner relation
    for each tuple r in R do
        for each tuple s in S do
            if r and s joinable then output (r, s)
– Cost: T(R) T(S), sometimes T(R) B(S)

Nested Loop Joins
Block-based nested loop join
    for each (M-1) blocks bs of S do
        for each block br of R do
            for each tuple s in bs do
                for each tuple r in br do
                    if r and s joinable then output (r, s)

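Nested Loop Joins
[Figure: buffer layout for the block-based join. Blocks of S (k < M-1 pages) are held in memory next to one input buffer for R and one output buffer; R and S are read from disk and the join result is written out.]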

Nested Loop Joins
Block-based nested loop join, cost:
– Read S once: cost B(S)
– Outer loop runs B(S)/(M-1) times, and each time we need to read R: cost B(S)B(R)/(M-1)
– Total cost: B(S) + B(S)B(R)/(M-1)
Notice: it is better to iterate over the smaller relation first, i.e., S should be the smaller relation
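
The block-based algorithm and its cost formula can be checked with a small simulation. This is a sketch under assumptions: relations are Python lists of blocks, a "read" is just a counter increment, and the function name block_nested_loop_join is hypothetical. The number of outer iterations is rounded up, so the counted cost is B(S) + ⌈B(S)/(M-1)⌉·B(R).

```python
from math import ceil


def block_nested_loop_join(R, S, M, joinable):
    """Join R and S (lists of blocks) using M buffers; return pairs and block reads."""
    out = []
    reads = 0
    chunk = M - 1  # M-1 buffers hold blocks of S, 1 buffer holds a block of R
    for i in range(0, len(S), chunk):
        s_chunk = S[i:i + chunk]
        reads += len(s_chunk)      # S is read once overall
        for br in R:
            reads += 1             # R is re-read for every chunk of S
            for bs in s_chunk:
                for s in bs:
                    for r in br:
                        if joinable(r, s):
                            out.append((r, s))
    return out, reads


R = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e")]]   # B(R) = 3
S = [[(2, "x")], [(4, "y")], [(4, "z")], [(9, "w")]]           # B(S) = 4
M = 3
pairs, io = block_nested_loop_join(R, S, M, lambda r, s: r[0] == s[0])
expected_io = len(S) + ceil(len(S) / (M - 1)) * len(R)
```

With these toy inputs the counter and the formula agree at 10 block reads. Swapping the roles so that the 3-block relation is the chunked one gives 3 + 2·4 = 11, which is a tie-breaker only in this tiny case but illustrates why the smaller relation belongs in the outer loop.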

Summary of One-pass Algorithms
Operators     M required          Disk I/O
σ, π          1                   B
γ, δ          B                   B
∪, ∩, −, ×    min(B(R), B(S))     B(R)+B(S)
⋈             any M ≥ 2           B(R)B(S)/M

Two-pass Algorithms
Relations are larger than what the one-pass algorithms can handle
Philosophy:
– Pass 1: organize the data properly into a group of sub-relations, and write them out to disk again (sorting, indexing)
– Pass 2: reread each sub-relation and do the operation

Review: Main-Memory Merge Sort
[Figure: slide image not included in the transcript]

Sorting
Two-Pass Multi-way Merge Sort (2PMMS)
– Step 1: read M blocks at a time, sort, write. Result: ⌈B/M⌉ runs of length M on disk (the last run may be shorter)
– Step 2: merge M-1 runs at a time, write to disk (M-1 input buffers, 1 output buffer)
Cost: 3B(R)
– One read and one write for step 1
– One read for merging at step 2
Assumption: B(R) ≤ M²
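
A compact sketch of the two steps, assuming blocks are Python lists and "disk" is just another list of sorted runs. The function name two_pass_sort is hypothetical; heapq.merge from the standard library performs the multi-way merge of step 2.

```python
import heapq


def two_pass_sort(blocks, M):
    """Sort a relation given as a list of blocks using M buffers; return (sorted, #runs)."""
    # Step 1: read M blocks at a time, sort them in memory, write out a sorted run.
    runs = []
    for i in range(0, len(blocks), M):
        chunk = [t for block in blocks[i:i + M] for t in block]
        runs.append(sorted(chunk))
    # Two passes suffice only if all runs can be merged at once with M-1 input buffers.
    if len(runs) > M - 1:
        raise ValueError("too many runs for a single merge pass")
    # Step 2: multi-way merge, one input buffer per run and one output buffer.
    return list(heapq.merge(*runs)), len(runs)


blocks = [[9, 3], [7, 1], [8, 2], [6, 4], [5, 0]]   # B = 5
result, n_runs = two_pass_sort(blocks, M=3)          # ceil(5/3) = 2 runs
```

The explicit check on the number of runs is where the size limit comes from: roughly B/M runs must not exceed M-1, which gives the B(R) ≤ M² bound up to the rounding.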

Example
[Figure: slide image not included in the transcript]

Example
1 GB of memory, block size 64 KB
– M = 2^30 / 2^16 = 2^14 = 16K blocks
A relation occupying B blocks can be sorted if
– B ≤ (16K)^2 = 2^28
– Each block is 64 KB = 2^16 bytes
– So the largest table that can be sorted is 2^28 * 2^16 = 2^44 bytes = 16 terabytes!
Even on a modest machine, 2PMMS is sufficient to sort all but an incredibly large relation in two passes
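
The arithmetic can be verified directly. Working it through with 1 GB = 2^30 bytes and 64 KB = 2^16 bytes, and using the B ≤ M² bound, the product 2^28 blocks × 2^16 bytes per block comes to 2^44 bytes, that is 16 TB in binary units. The variable names below are illustrative only.

```python
memory_bytes = 2**30          # 1 GB of main memory
block_bytes = 2**16           # 64 KB per block

M = memory_bytes // block_bytes       # number of memory buffers
max_blocks = M**2                     # 2PMMS limit: B(R) <= M^2
max_bytes = max_blocks * block_bytes  # largest sortable relation in bytes
max_terabytes = max_bytes // 2**40    # binary terabytes
```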

Two-Pass Algorithms Based on Sorting
Duplicate elimination δ(R)
Simple idea: like sorting, but include no duplicates
– Step 1: sort runs of size M, write. Cost: 2B(R)
– Step 2: merge M-1 runs, but include each tuple only once. Cost: B(R)
Total cost: 3B(R). Assumption: B(R) <= M²

Two-Pass Algorithms Based on Sorting
Grouping: γ city, sum(price) (R)
– Same as before: sort based on city, then compute the sum(price) for each group
– As before: compute sum(price) during the merge phase
Total cost: 3B(R). Assumption: B(R) <= M²

Two-Pass Algorithms Based on Sorting
Binary operations: R ∩ S, R ∪ S, R – S
Idea: sort R, sort S, then do the right thing
A closer look:
– Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S)
– Step 2: merge all x runs from R; merge all y runs from S; output a tuple on a case-by-case basis (x + y <= M)
Total cost: 3B(R)+3B(S). Assumption: B(R)+B(S) <= M²

Two-Pass Algorithms Based on Sorting
Join R ⋈ S
– Start by sorting both R and S on the join attribute. Cost: 4B(R)+4B(S) (because we need to write to disk)
– Read both relations in sorted order, match tuples. Cost: B(R)+B(S)
Difficulty: many tuples in R may match many in S
– If at least one set of tuples fits in M, we are OK
– Otherwise we need a nested loop, at higher cost
Total cost: 5B(R)+5B(S). Assumption: B(R) <= M², B(S) <= M²
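
The matching phase, including the "many tuples match many tuples" case, can be sketched as follows. This is an in-memory illustration only: Python's sorted stands in for the external sort, the function name merge_join is hypothetical, and the group of matching S tuples is assumed to fit in memory (the easy case named on the slide).

```python
def merge_join(R, S, key_r, key_s):
    """Equi-join two lists of tuples by sorting both and merging on the join key."""
    R = sorted(R, key=key_r)
    S = sorted(S, key=key_s)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        kr, ks = key_r(R[i]), key_s(S[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the whole group of S tuples that share this key.
            j_end = j
            while j_end < len(S) and key_s(S[j_end]) == kr:
                j_end += 1
            # Pair every R tuple with this key against the whole S group.
            while i < len(R) and key_r(R[i]) == kr:
                for s in S[j:j_end]:
                    out.append((R[i], s))
                i += 1
            j = j_end
    return out


R = [(3, "r3"), (1, "r1"), (2, "r2a"), (2, "r2b")]
S = [(2, "s2a"), (4, "s4"), (2, "s2b"), (1, "s1")]
joined = merge_join(R, S, lambda t: t[0], lambda t: t[0])
```

Key 2 appears twice on each side and therefore yields four output pairs; the inner group scan is the part that degrades to a nested loop when neither group fits in memory.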

Summary of Two-pass Algorithms Based on Sorting
Operators     M required               Disk I/O
γ, δ          √B                       3B
∪, ∩, −       √(B(R)+B(S))             3(B(R)+B(S))
⋈             √max(B(R), B(S))         5(B(R)+B(S))

Two-Pass Algorithms Based on Hashing
Idea: partition a relation R into buckets, on disk
– Each bucket has size approx. B(R)/M
– Does each bucket fit in main memory? Yes if B(R)/M <= M, i.e., B(R) <= M²
[Figure: relation R (B(R) blocks) is read from disk through one input buffer; hash function h routes each tuple to one of M-1 output buffers, and the resulting M-1 partitions are written back to disk.]

Hash Based Algorithms for δ
δ(R) = duplicate elimination
– Step 1: partition R into buckets
– Step 2: apply δ to each bucket (may read it into main memory and apply the one-pass algorithm)
Cost: 3B(R). Assumption: B(R) <= M²

Hash Based Algorithms for γ
γ(R) = grouping and aggregation
– Step 1: partition R into buckets. Choose the hash function depending only on the grouping attributes
– Step 2: apply γ to each bucket (may read it into main memory)
Cost: 3B(R). Assumption: B(R) <= M²

Hash-based Join R ⋈ S
Simple version: main-memory hash-based join
– Scan S, build buckets in main memory
– Then scan R and join
Requirement: min(B(R), B(S)) <= M

Partitioned Hash Join R ⋈ S
– Step 1: hash S into M buckets; send all buckets to disk
– Step 2: hash R into M buckets (the same hash function on the join attributes); send all buckets to disk
– Step 3: join every pair of buckets
– Cost: 3B(R) + 3B(S)
– Assumption: at least one full bucket of the smaller relation must fit in memory: min(B(R), B(S)) <= M²

Partitioned Hash Join
– Partition both relations using hash function h: R tuples in partition i will only match S tuples in partition i
– Read in a partition of R, hash it using h2 (≠ h!). Scan the matching partition of S, search for matches
[Figure, partitioning phase: the original relation is read from disk through one input buffer, hash function h distributes tuples over M-1 output buffers, and M-1 partitions are written to disk.]
[Figure, join phase: partitions of R & S are read from disk; an in-memory hash table for partition Si (< M-1 pages) is built with hash function h2, an input buffer streams Ri, and an output buffer collects the join result.]
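
The two phases can be sketched in memory as follows. The assumptions are substantial and worth stating: partitions are plain Python lists rather than disk files, Python's built-in hash plays the role of h, and the dict built per partition plays the role of h2. The function names partition and partitioned_hash_join are hypothetical.

```python
from collections import defaultdict


def partition(rel, key, n_buckets):
    """Phase 1: route every tuple to a bucket chosen by hashing its join key (role of h)."""
    buckets = [[] for _ in range(n_buckets)]
    for t in rel:
        buckets[hash(key(t)) % n_buckets].append(t)
    return buckets


def partitioned_hash_join(R, S, key_r, key_s, n_buckets):
    """Equi-join R and S by partitioning both, then joining matching partitions."""
    r_parts = partition(R, key_r, n_buckets)
    s_parts = partition(S, key_s, n_buckets)
    out = []
    # Phase 2: tuples in partition i of R can only match tuples in partition i of S.
    for r_part, s_part in zip(r_parts, s_parts):
        table = defaultdict(list)          # in-memory hash table (role of h2)
        for r in r_part:
            table[key_r(r)].append(r)
        for s in s_part:                   # stream the matching partition of S
            for r in table.get(key_s(s), []):
                out.append((r, s))
    return out


R = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
S = [(2, "x"), (5, "y"), (7, "z")]
joined = partitioned_hash_join(R, S, lambda t: t[0], lambda t: t[0], n_buckets=3)
```

Only one partition's hash table is alive at a time, so the memory needed is governed by the largest partition of the build relation rather than by the relation as a whole.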

Summary of Two-pass Algorithms Based on Hashing
Operators     M required               Disk I/O
γ, δ          √B                       3B
∪, ∩, −       √min(B(R), B(S))         3(B(R)+B(S))
⋈             √min(B(R), B(S))         3(B(R)+B(S))

Index Based Algorithms
Zero-pass!
In a clustering index, all tuples with the same value of the key are clustered on as few blocks as possible
[Figure: tuples with key value a stored together on adjacent blocks]

Index Based Selection
Selection on equality: σ a=v (R)
– Clustered index on a: cost B(R)/V(R,a)
– Unclustered index on a: cost T(R)/V(R,a)
Example: B(R) = 2000, T(R) = 100,000, V(R, a) = 20; compute the cost of σ a=v (R)
Cost of table scan:
– If R is clustered: B(R) = 2000 I/Os
– If R is unclustered: T(R) = 100,000 I/Os
Cost of index based selection:
– If index is clustered: B(R)/V(R,a) = 100
– If index is unclustered: T(R)/V(R,a) = 5000
Notice: when V(R,a) is small, an unclustered index is useless
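
The four costs in the example follow directly from the parameters; the short check below only restates that arithmetic, with dictionary keys chosen for readability.

```python
B, T, V = 2000, 100_000, 20   # B(R), T(R), V(R,a) from the example

costs = {
    "table scan, R clustered": B,
    "table scan, R unclustered": T,
    "clustered index on a": B // V,
    "unclustered index on a": T // V,
}

# The unclustered index (5000 I/Os) loses to a plain scan of a clustered table (2000 I/Os).
index_worse_than_scan = costs["unclustered index on a"] > costs["table scan, R clustered"]
```

The unclustered index only pays off when T(R)/V(R,a) drops below B(R), i.e., when V(R,a) exceeds the average number of tuples per block (50 here).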

Index Based Join R ⋈ S
Assume S has an index on the join attribute
– Iterate over R; for each tuple, fetch the corresponding tuple(s) from S
– Assume R is clustered. Cost:
    If the index is clustered: B(R) + T(R)B(S)/V(S,a)
    If the index is unclustered: B(R) + T(R)T(S)/V(S,a)
Assume both R and S have a sorted index (B+ tree) on the join attribute
– Then perform a merge join (called zig-zag join)
– Cost: B(R) + B(S)
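
To get a feel for these formulas, they can be evaluated on sample statistics. The numbers below are hypothetical: S reuses the statistics of the selection example (B = 2000, T = 100,000, V = 20), and the values for R are invented purely for illustration. The function names are also assumptions.

```python
def index_join_cost(B_R, T_R, B_S, T_S, V_S_a, clustered_index):
    """I/O cost of an index join with R clustered as the outer relation."""
    # Each R tuple probes the index on S and fetches the matching S tuples.
    per_probe = (B_S if clustered_index else T_S) / V_S_a
    return B_R + T_R * per_probe


def zigzag_join_cost(B_R, B_S):
    """I/O cost of a merge (zig-zag) join when both relations have a sorted index."""
    return B_R + B_S


# Hypothetical statistics, not from the slides.
B_R, T_R = 1000, 10_000
B_S, T_S, V_S_a = 2000, 100_000, 20

clustered = index_join_cost(B_R, T_R, B_S, T_S, V_S_a, clustered_index=True)
unclustered = index_join_cost(B_R, T_R, B_S, T_S, V_S_a, clustered_index=False)
zigzag = zigzag_join_cost(B_R, B_S)
```

With these made-up numbers the index join is far more expensive than the zig-zag join, because every one of the T(R) outer tuples pays for its own probe. The index join becomes attractive mainly when the outer relation is small or V(S,a) is large.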

Tallahassee, Florida, 2016 Have Fun!