Lecture 22: Query Execution

Slides:

Advertisements

Similar presentations

1 Lecture 23: Query Execution Friday, March 4, 2005.

Advertisements

Lecture 13: Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data.

CS CS4432: Database Systems II Operator Algorithms Chapter 15.

Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.

Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.

Lecture 24: Query Execution Monday, November 20, 2000.

1 Lecture 22: Query Execution Wednesday, March 2, 2005.

Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.

1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.

1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.

CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.

CSCE Database Systems Chapter 15: Query Execution 1.

1 Lecture 23: Query Execution Wednesday, March 8, 2006.

CSE 544: Relational Operators, Sorting Wednesday, 5/12/2004.

CS4432: Database Systems II Query Processing- Part 3 1.

CS411 Database Systems Kazuhiro Minami 11: Query Execution.

Lecture 24 Query Execution Monday, November 28, 2005.

Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

CS4432: Database Systems II Query Processing- Part 2.

CSCE Database Systems Chapter 15: Query Execution 1.

Lecture 17: Query Execution Tuesday, February 28, 2001.

Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.

CS 440 Database Management Systems Lecture 5: Query Processing 1.

Hash Tables and Query Execution March 1st, Hash Tables Secondary storage hash tables are much like main memory ones Recall basics: –There are n.

Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.

CS 540 Database Management Systems

Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.

Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.

Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.

1 Lecture 23: Query Execution Monday, November 26, 2001.

CS4432: Database Systems II Query Processing- Part 1 1.

CS 540 Database Management Systems

CS 440 Database Management Systems

Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.

Lecture 16: Relational Operators

Evaluation of Relational Operations

Database Management Systems (CS 564)

Evaluation of Relational Operations: Other Operations

Cse 344 April 25th – Disk i/o.

Implementation of Relational Operations (Part 2)

Relational Operations

Database Applications (15-415) DBMS Internals- Part VII Lecture 19, March 27, 2018 Mohammad Hammoud.

Sidharth Mishra Dr. T.Y. Lin CS 257 Section 1 MH 222 SJSU - Fall 2016

Selected Topics: External Sorting, Join Algorithms, …

Query Execution Two-pass Algorithms based on Hashing

(Two-Pass Algorithms)

April 27th – Cost Estimation

Lecture 2- Query Processing (continued)

Lecture 25: Query Execution

Relational Algebra Friday, 11/14/2003.

Lecture 24: Query Execution

Lecture 25: Query Optimization

Lecture 13: Query Execution

Query Execution Index Based Algorithms (15.6)

Lecture 23: Query Execution

Data-Intensive Computing Systems Query Execution (Sort and Join operators) Shivnath Babu.

Evaluation of Relational Operations: Other Techniques

Overview of Query Evaluation: JOINS

Lecture 22: Query Execution

Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.

CSE 444: Lecture 25 Query Execution

External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.

Lecture 11: B+ Trees and Query Execution

CSE 544: Query Execution Wednesday, 5/12/2004.

Lecture 23: Monday, November 25, 2002.

Lecture 22: Friday, November 22, 2002.

Evaluation of Relational Operations: Other Techniques

Lecture 24: Query Execution

Lecture 20: Query Execution

Presentation transcript:

Lecture 22: Query Execution Monday, March 6, 2006

Outline Query execution: 15.1 – 15.5

Architecture of a Database Engine SQL query Parse Query Logical plan Select Logical Plan Query optimization Select Physical Plan Physical plan Query Execution

Logical Algebra Operators Union, intersection, difference Selection s Projection P Join |x| Duplicate elimination d Grouping g Sorting t

Logical Query Plan T3(city, c) SELECT city, count(*) FROM sales GROUP BY city HAVING sum(price) > 100 P city, c T2(city,p,c) s p > 100 T1(city,p,c) g city, sum(price)→p, count(*) → c sales(product, city, price) T1, T2, T3 = temporary tables

Logical Query Plan Purchase(buyer, city) Person(name, phone) SELECT P.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND P.city=‘seattle’ AND Q.phone > ‘5430000’ Purchase(buyer, city) Person(name, phone) buyer  City=‘seattle’ phone>’5430000’ Buyer=name Purchase Person

Physical Query Plan Query Plan: logical tree SELECT S.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘5430000’ Purchase Person Buyer=name City=‘seattle’ phone>’5430000’ buyer (Simple Nested Loops)  (Table scan) (Index scan) Query Plan: logical tree implementation choice at every node scheduling of operations. Some operators are from relational algebra, and others (e.g., scan) are not.

Question in Class Logical operator: Product(pname, cname) || Company(cname, city) Propose three physical operators for the join, assuming the tables are in main memory:

Question in Class Product(pname, cname) |x| Company(cname, city) 1000000 products 1000 companies How much time do the following physical operators take if the data is in main memory ? Nested loop join time = Sort and merge = merge-join time = Hash join time =

Cost Parameters The cost of an operation = total number of I/Os result assumed to be delivered in main memory Cost parameters: B(R) = number of blocks for relation R T(R) = number of tuples in relation R V(R, a) = number of distinct values of attribute a M = size of main memory buffer pool, in blocks

Cost Parameters Clustered table R: Unclustered table R: Blocks consists only of records from this table B(R) << T(R) Unclustered table R: Its records are placed on blocks with other tables B(R)  T(R) When a is a key, V(R,a) = T(R) When a is not a key, V(R,a)

Selection and Projection Selection s(R), projection P(R) Both are tuple-at-a-time algorithms Cost: B(R) Unary operator Input buffer Output buffer

Main Memory Hash Join Hash join: R |x| S Scan S, build buckets in main memory Then scan R and join Cost: B(R) + B(S) Assumption: B(S) <= M

Duplicate Elimination Duplicate elimination d(R) Hash table in main memory Cost: B(R) Assumption: B(d(R)) <= M

Grouping Grouping: Product(name, department, quantity) gdepartment, sum(quantity) (Product)  Answer(department, sum) Main memory hash table Question: How ?

Nested Loop Joins for each tuple r in R do for each tuple s in S do Tuple-based nested loop R ⋈ S Cost: T(R) B(S) when S is clustered Cost: T(R) T(S) when S is unclustered for each tuple r in R do for each tuple s in S do if r and s join then output (r,s)

Nested Loop Joins We can be much more clever Question: how would you compute the join in the following cases ? What is the cost ? B(R) = 1000, B(S) = 2, M = 4 B(R) = 1000, B(S) = 3, M = 4 B(R) = 1000, B(S) = 6, M = 4

Nested Loop Joins Block-based Nested Loop Join for each (M-2) blocks bs of S do for each block br of R do for each tuple s in bs for each tuple r in br do if “r and s join” then output(r,s)

Hash table for block of S Nested Loop Joins R & S Join Result Hash table for block of S (M-2 pages) . . . . . . . . . Input buffer for R Output buffer

Nested Loop Joins Block-based Nested Loop Join Cost: Read S once: cost B(S) Outer loop runs B(S)/(M-2) times, and each time need to read R: costs B(S)B(R)/(M-2) Total cost: B(S) + B(S)B(R)/(M-2) Notice: it is better to iterate over the smaller relation first R |x| S: R=outer relation, S=inner relation

Partitioned Hash Algorithms Idea: partition a relation R into buckets, on disk Each bucket has size approx. B(R)/M M main memory buffers Disk Relation R OUTPUT 2 INPUT 1 hash function h M-1 Partitions . . . 1 2 B(R) Does each bucket fit in main memory ? Yes if B(R)/M <= M, i.e. B(R) <= M2

Duplicate Elimination Recall: d(R) = duplicate elimination Step 1. Partition R into buckets Step 2. Apply d to each bucket (may read in main memory) Cost: 3B(R) Assumption:B(R) <= M2

Grouping Recall: g(R) = grouping and aggregation Step 1. Partition R into buckets Step 2. Apply g to each bucket (may read in main memory) Cost: 3B(R) Assumption:B(R) <= M2

Partitioned Hash Join R |x| S Step 1: Step 2 Step 3 Hash S into M buckets send all buckets to disk Step 2 Hash R into M buckets Send all buckets to disk Step 3 Join every pair of buckets

Hash table for partition Hash-Join B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h M-1 Partitions . . . Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i. Partitions of R & S Input buffer for Ri Hash table for partition Si ( < M-1 pages) B main memory buffers Disk Output buffer Join Result hash fn h2 Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches. 14

Partitioned Hash Join Cost: 3B(R) + 3B(S) Assumption: min(B(R), B(S)) <= M2

External Sorting Problem: Sort a file of size B with memory M Where we need this: ORDER BY in SQL queries Several physical operators Bulk loading of B+-tree indexes. Will discuss only 2-pass sorting, for when B < M2 4

External Merge-Sort: Step 1 Phase one: load M bytes in memory, sort M . . . . . . Disk Disk Main memory Runs of length M bytes

External Merge-Sort: Step 2 Merge M – 1 runs into a new run Result: runs of length M (M – 1) M2 Input 1 . . . Input 2 . . . Output . . . . Input M Disk Disk Main memory If B <= M2 then we are done 7

Cost of External Merge Sort Read+write+read = 3B(R) Assumption: B(R) <= M2 8

Duplicate Elimination Duplicate elimination d(R) Idea: do a two step merge sort, but change one of the steps Question in class: which step needs to be changed and how ? Cost = 3B(R) Assumption: B(d(R)) <= M2 Step 2: merge M-1 runs, but include each tuple only once cost B(R)

Grouping Grouping: ga, sum(b) (R) Same as before: sort, then compute the sum(b) for each group of a’s Total cost: 3B(R) Assumption: B(R) <= M2

Merge-Join Join R |x| S Step 1a: initial runs for R Step 1b: initial runs for S Step 2: merge and join

Merge-Join . . . . . . M1 = B(R)/M runs for R M2 = B(S)/M runs for S Input 1 . . . Input 2 . . . Output . . . . Input M Disk Disk Main memory M1 = B(R)/M runs for R M2 = B(S)/M runs for S If B <= M2 then we are done 7

Two-Pass Algorithms Based on Sorting Join R |x| S If the number of tuples in R matching those in S is small (or vice versa) we can compute the join during the merge phase Total cost: 3B(R)+3B(S) Assumption: B(R) + B(S) <= M2

Index Based Selection Selection on equality: sa=v(R) Clustered index on a: cost B(R)/V(R,a) Unclustered index on a: cost T(R)/V(R,a)

Index Based Selection B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan (assuming R is clustered): B(R) = 2,000 I/Os Index based selection: If index is clustered: B(R)/V(R,a) = 100 I/Os If index is unclustered: T(R)/V(R,a) = 5,000 I/Os Lesson: don’t build unclustered indexes when V(R,a) is small ! cost of sa=v(R) = ?

Index Based Join R S Assume S has an index on the join attribute Iterate over R, for each tuple fetch corresponding tuple(s) from S Assume R is clustered. Cost: If index is clustered: B(R) + T(R)B(S)/V(S,a) If index is unclustered: B(R) + T(R)T(S)/V(S,a)

Index Based Join Assume both R and S have a sorted index (B+ tree) on the join attribute Then perform a merge join called zig-zag join Cost: B(R) + B(S)

Summary of External Join Algorithms Block Nested Loop: B(S) + B(R)*B(S)/M Partitioned Hash: 3B(R)+3B(S); min(B(R),B(S)) <= M2 Merge Join: 3B(R)+3B(S B(R)+B(S) <= M2 Index Join: B(R) + T(R)B(S)/V(S,a)