April 27th – Cost Estimation

Presentation transcript:

CSE 344: April 27th – Cost Estimation

Administrivia
  HW5 is out. Please verify that you can run queries.
  Midterm: May 9th, 9:30-10:20, MLR 301
  Review (in class): May 7th
  Practice exam: May 4th
  Through parallelism: next week's material

Index Based Selection
  Example: B(R) = 2,000, T(R) = 100,000, V(R, a) = 20. Cost of σ_{a=v}(R) = ?
  Table scan: B(R) = 2,000 I/Os
  Index based selection:
    If index is clustered: B(R) * 1/V(R,a) = 100 I/Os
    If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os
  Lesson: Don't build unclustered indexes when V(R,a) is small!
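The three selection costs above can be checked with a few lines of Python. This is only a sketch: the function name `selection_costs` and the dictionary keys are made up for illustration, and the formulas assume values of `a` are uniformly distributed, as the slide does.

```python
def selection_costs(B, T, V):
    """Estimated I/Os for sigma_{a=v}(R), assuming uniformly distributed values.

    B = number of pages B(R), T = number of tuples T(R),
    V = number of distinct values V(R, a).
    """
    return {
        "table_scan": B,             # read every page once
        "clustered_index": B / V,    # matching tuples sit on adjacent pages
        "unclustered_index": T / V,  # up to one I/O per matching tuple
    }


costs = selection_costs(B=2000, T=100_000, V=20)
for method, ios in costs.items():
    print(f"{method}: {ios:.0f} I/Os")
```

With these numbers the unclustered index (5,000 I/Os) is more expensive than scanning the whole table (2,000 I/Os), which is the point of the lesson on the slide.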

Outline
  Join operator algorithms
    One-pass algorithms (Sec. 15.2 and 15.3)
    Index-based algorithms (Sec. 15.6)
  Note about readings:
    In class, we discuss only algorithms for joins
    Other operators are easier: read the book

Join Algorithms
  Hash join
  Nested loop join
  Sort-merge join

Hash Join
  Hash join: R ⋈ S
  Scan R, build buckets in main memory
  Then scan S and join
  Cost: B(R) + B(S)
  Which relation to build the hash table on?
  One-pass algorithm when B(R) ≤ M
    M = number of memory pages available

Hash Join Example
  Patient(pid, name, address)
  Insurance(pid, provider, policy_nb)
  Patient ⋈ Insurance, two tuples per page

  Patient
    pid  name    address
    1    'Bob'   'Seattle'
    2    'Ela'   'Everett'
    3    'Jill'  'Kent'
    4    'Joe'   'Seattle'

  Insurance
    pid  provider  policy_nb
    2    'Blue'    123
    4    'Prem'    432
    4    'Prem'    343
    3    'GrpH'    554

Hash Join Example
  Patient ⋈ Insurance
  Memory M = 21 pages (some large-enough number)
  Showing pid only; each bracket is one page with two tuples
  Disk:
    Patient pages:   [1 2] [3 4] [9 6] [8 5]
    Insurance pages: [2 4] [6 6] [4 3] [1 3] [2 8] [8 9]

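Hash Join Example
  Step 1: Scan Patient and build hash table in memory
  Can be done in method open()
  Memory M = 21 pages, hash h: pid % 5
  Each Patient page (first [1 2]) is read into the input buffer; after the scan the hash table holds pids 1, 2, 3, 4, 5, 6, 8, 9
  Disk:
    Patient pages:   [1 2] [3 4] [9 6] [8 5]
    Insurance pages: [2 4] [6 6] [4 3] [1 3] [2 8] [8 9]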

Hash Join Example
  Step 2: Scan Insurance and probe into hash table
  Done during calls to next()
  Memory M = 21 pages, hash h: pid % 5
  Each Insurance page is read into the input buffer (first [2 4], later [4 3], ...) and its pids are probed against the hash table
  Matching pairs go to the output buffer (first 2 with 2, then 4 with 4): write to disk or pass to next operator
  Keep going until read all of Insurance
  Say "close()" deletes the hash table
  Cost: B(R) + B(S)
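The two steps of the example can be sketched in Python as follows. The page contents are the pid values shown in the slides (the Insurance page order follows the transcript text and may differ from the original figure); the function name `one_pass_hash_join` and the page-read counter are illustrative additions, not part of the lecture.

```python
from collections import defaultdict

# pid values only, two tuples per page, as in the slides.
patient_pages = [[1, 2], [3, 4], [9, 6], [8, 5]]
insurance_pages = [[2, 4], [6, 6], [4, 3], [1, 3], [2, 8], [8, 9]]


def one_pass_hash_join(build_pages, probe_pages, num_buckets=5):
    """Join on pid; returns (output pairs, number of pages read)."""
    page_reads = 0

    # Step 1 (open): scan the build relation and hash every tuple.
    buckets = defaultdict(list)
    for page in build_pages:
        page_reads += 1
        for pid in page:
            buckets[pid % num_buckets].append(pid)

    # Step 2 (next): scan the probe relation one page at a time.
    output = []
    for page in probe_pages:
        page_reads += 1
        for pid in page:
            for candidate in buckets.get(pid % num_buckets, []):
                if candidate == pid:  # same bucket does not imply same key
                    output.append((candidate, pid))
    return output, page_reads


pairs, reads = one_pass_hash_join(patient_pages, insurance_pages)
print(pairs[:2], len(pairs), reads)
```

Each relation is read exactly once, so the counter ends at B(Patient) + B(Insurance) = 4 + 6 = 10 page reads, matching the cost formula B(R) + B(S).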

Nested Loop Joins
  Tuple-based nested loop R ⋈ S
  R is the outer relation, S is the inner relation

    for each tuple t1 in R do
      for each tuple t2 in S do
        if t1 and t2 join then output (t1,t2)

  What is the cost?
  Cost: B(R) + T(R) B(S)
  Multiple-pass since S is read many times

Page-at-a-time Refinement

    for each page of tuples r in R do
      for each page of tuples s in S do
        for all pairs of tuples t1 in r, t2 in s
          if t1 and t2 join then output (t1,t2)

  What is the cost?
  Cost: B(R) + B(R)B(S)

Page-at-a-time Refinement
  Input buffer for Patient: holds one page, first [1 2]
  Input buffer for Insurance: holds one page at a time ([2 4], then [4 3], then [2 8], ...)
  Output buffer: receives the matching pairs (for example 2 with 2)
  Disk:
    Patient pages:   [1 2] [3 4] [9 6] [8 5]
    Insurance pages: [2 4] [6 6] [4 3] [1 3] [2 8] [8 9]
  Keep going until read all of Insurance
  Then repeat for next page of Patient… until end of Patient
  Cost: B(R) + B(R)B(S)
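A small Python sketch of the page-at-a-time loop, run on the pid values from the example, makes the cost formula concrete. The counter `page_reads` and the function name are illustrative assumptions; a real system would be reading pages through a buffer manager rather than from Python lists.

```python
# pid values only, two tuples per page, as in the slides.
patient_pages = [[1, 2], [3, 4], [9, 6], [8, 5]]
insurance_pages = [[2, 4], [6, 6], [4, 3], [1, 3], [2, 8], [8, 9]]


def page_nested_loop_join(outer_pages, inner_pages):
    """Page-at-a-time nested loop join on pid; returns (pairs, page reads)."""
    page_reads = 0
    output = []
    for r in outer_pages:          # one read per outer page
        page_reads += 1
        for s in inner_pages:      # inner relation is rescanned per outer page
            page_reads += 1
            for t1 in r:
                for t2 in s:
                    if t1 == t2:
                        output.append((t1, t2))
    return output, page_reads


pairs, reads = page_nested_loop_join(patient_pages, insurance_pages)
print(len(pairs), reads)
```

With B(Patient) = 4 and B(Insurance) = 6 the counter reaches 4 + 4 * 6 = 28, which is B(R) + B(R)B(S).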

Block-Nested-Loop Refinement

    for each group of M-1 pages r in R do
      for each page of tuples s in S do
        for all pairs of tuples t1 in r, t2 in s
          if t1 and t2 join then output (t1,t2)

  What is the cost?
  Cost: B(R) + B(R)B(S)/(M-1)
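The block refinement can be sketched the same way. In the sketch below, `M=3` is a hypothetical memory size chosen so that the four Patient pages split into two groups, and the `math.ceil` in the cost function is an assumption for the case where B(R) is not a multiple of M-1 (the slide writes the plain quotient).

```python
import math

# pid values only, two tuples per page, as in the slides.
patient_pages = [[1, 2], [3, 4], [9, 6], [8, 5]]
insurance_pages = [[2, 4], [6, 6], [4, 3], [1, 3], [2, 8], [8, 9]]


def block_nested_loop_join(outer_pages, inner_pages, M):
    """Load M-1 outer pages at a time, then scan the inner relation once per group."""
    group_size = M - 1             # one page is kept for the inner input buffer
    page_reads = 0
    output = []
    for start in range(0, len(outer_pages), group_size):
        group = outer_pages[start:start + group_size]
        page_reads += len(group)
        outer_tuples = [t for page in group for t in page]
        for s in inner_pages:
            page_reads += 1
            for t1 in outer_tuples:
                for t2 in s:
                    if t1 == t2:
                        output.append((t1, t2))
    return output, page_reads


def block_nlj_cost(B_R, B_S, M):
    """B(R) + ceil(B(R)/(M-1)) * B(S); the ceiling is an added assumption."""
    return B_R + math.ceil(B_R / (M - 1)) * B_S


pairs, reads = block_nested_loop_join(patient_pages, insurance_pages, M=3)
print(reads, block_nlj_cost(4, 6, 3), block_nlj_cost(100, 100, 11))
```

The measured page reads (16) agree with the formula for the toy data, and for two 100-page relations with M = 11 the formula gives 1,100 I/Os instead of the 10,100 of the page-at-a-time version.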

Sort-Merge Join
  Sort-merge join: R ⋈ S
  Scan R and sort in main memory
  Scan S and sort in main memory
  Merge R and S
  Cost: B(R) + B(S)
  One-pass algorithm when B(S) + B(R) <= M
  Typically, this is NOT a one-pass algorithm

Sort-Merge Join Example
  Step 1: Scan Patient and sort in memory
  Memory M = 21 pages
  Memory after the scan (before sorting): [1 2] [4 3] [9 6] [8 5]
  Disk:
    Patient pages:   [1 2] [3 4] [9 6] [8 5]
    Insurance pages: [2 4] [6 6] [4 3] [1 3] [2 8] [8 9]

Sort-Merge Join Example
  Step 2: Scan Insurance and sort in memory
  Memory M = 21 pages
  Memory, sorted Patient:   1 2 3 4 5 6 8 9
  Memory, sorted Insurance: 1 2 2 3 3 4 4 6 6 8 8 9
  Disk:
    Patient pages:   [1 2] [3 4] [9 6] [8 5]
    Insurance pages: [2 4] [6 6] [4 3] [1 3] [2 8] [8 9]

Sort-Merge Join Example
  Step 3: Merge Patient and Insurance
  Memory M = 21 pages
  Memory, sorted Patient:   1 2 3 4 5 6 8 9
  Memory, sorted Insurance: 1 2 2 3 3 4 4 6 6 8 8 9
  Output buffer: first 1 with 1, then 2 with 2, ...
  Keep going until end of first relation
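The three steps can be sketched in Python on the pid values of the example. This is a minimal in-memory version that assumes both relations fit in memory (the one-pass case); the function name `sort_merge_join` is illustrative.

```python
def sort_merge_join(r_keys, s_keys):
    """One-pass sort-merge join on a single key; both inputs fit in memory."""
    r = sorted(r_keys)   # Step 1: sort the first relation
    s = sorted(s_keys)   # Step 2: sort the second relation
    output = []
    i = j = 0
    # Step 3: merge, advancing whichever side has the smaller key.
    while i < len(r) and j < len(s):
        if r[i] < s[j]:
            i += 1
        elif r[i] > s[j]:
            j += 1
        else:
            key = r[i]
            # Find the run of equal keys on each side, then emit all pairs.
            i_end = i
            while i_end < len(r) and r[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end] == key:
                j_end += 1
            for a in r[i:i_end]:
                for b in s[j:j_end]:
                    output.append((a, b))
            i, j = i_end, j_end
    return output


patient = [1, 2, 3, 4, 9, 6, 8, 5]
insurance = [2, 4, 6, 6, 4, 3, 1, 3, 2, 8, 8, 9]
result = sort_merge_join(patient, insurance)
print(result[:2], len(result))
```

The first two output pairs are (1, 1) and (2, 2), in line with the output buffer contents shown in Step 3. Handling runs of equal keys on both sides matters here because Insurance contains duplicate pids.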

Index Nested Loop Join
  R ⋈ S
  Assume S has an index on the join attribute
  Iterate over R, for each tuple fetch corresponding tuple(s) from S
  Cost:
    If index on S is clustered: B(R) + T(R) * (B(S) * 1/V(S,a))
    If index on S is unclustered: B(R) + T(R) * (T(S) * 1/V(S,a))
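The two cost formulas can be written as one small function. The example call below is hypothetical: it takes R with B(R) = 100 and T(R) = 10,000, S with B(S) = 100 and T(S) = 1,000, and assumes V(S, a) = 1,000 (the join attribute is a key of S), a value the slides do not state.

```python
def index_nlj_cost(B_R, T_R, B_S, T_S, V_S_a, clustered):
    """Estimated I/Os for an index nested loop join with an index on S.a.

    Clustered:   B(R) + T(R) * B(S) / V(S, a)
    Unclustered: B(R) + T(R) * T(S) / V(S, a)
    """
    per_value = B_S if clustered else T_S
    return B_R + T_R * per_value / V_S_a


# Hypothetical inputs; V_S_a=1000 assumes the join attribute is a key of S.
clustered_cost = index_nlj_cost(100, 10_000, 100, 1_000, 1_000, clustered=True)
unclustered_cost = index_nlj_cost(100, 10_000, 100, 1_000, 1_000, clustered=False)
print(clustered_cost, unclustered_cost)
```

Under these assumptions the clustered index gives 1,100 I/Os and the unclustered index 10,100 I/Os; the gap comes entirely from the per-probe term, since the scan of R costs B(R) in both cases.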

Logical Query Plan 1
  Supplier(sid, sname, scity, sstate)
  Supply(sid, pno, quantity)

    SELECT sname
    FROM Supplier x, Supply y
    WHERE x.sid = y.sid
      and y.pno = 2
      and x.scity = 'Seattle'
      and x.sstate = 'WA'

  Plan (top to bottom):
    π sname
    σ pno=2 ∧ scity='Seattle' ∧ sstate='WA'   (T < 1)
    ⋈ sid = sid   (T = 10000)
    Supply, Supplier

  T(Supplier) = 1000, B(Supplier) = 100, V(Supplier, scity) = 20, V(Supplier, sstate) = 10
  T(Supply) = 10000, B(Supply) = 100, V(Supply, pno) = 2500
  M = 11

Logical Query Plan 2
  Supplier(sid, sname, scity, sstate)
  Supply(sid, pno, quantity)

    SELECT sname
    FROM Supplier x, Supply y
    WHERE x.sid = y.sid
      and y.pno = 2
      and x.scity = 'Seattle'
      and x.sstate = 'WA'

  Plan (top to bottom):
    π sname
    ⋈ sid = sid   (T = 4)
    σ scity='Seattle' ∧ sstate='WA' on Supplier   (T = 5)
    σ pno=2 on Supply   (T = 4)

  Selection estimates: T = 5 and T = 4. Very wrong! Why?
  Join output: T = 4. Different estimate (compare with T < 1 in Logical Query Plan 1)

  T(Supplier) = 1000, B(Supplier) = 100, V(Supplier, scity) = 20, V(Supplier, sstate) = 10
  T(Supply) = 10000, B(Supply) = 100, V(Supply, pno) = 2500
  M = 11
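The tuple-count estimates on the two logical plans can be reproduced with the usual rule T(R) / V(R, a) per equality predicate, multiplying the V values when several predicates are combined. The sketch below assumes uniform value distributions and independent predicates; the function name `estimate_selection` is illustrative.

```python
from math import prod


def estimate_selection(T, distinct_counts):
    """Estimated output size of a conjunction of equality predicates.

    Assumes uniform distributions and independence between predicates:
    T divided by the product of V(R, a) over the predicate attributes.
    """
    return T / prod(distinct_counts)


# Logical Query Plan 2: selections are applied before the join.
supplier_est = estimate_selection(1000, [20, 10])    # scity and sstate
supply_est = estimate_selection(10_000, [2500])      # pno
city_only_est = estimate_selection(1000, [20])       # scity alone

# Logical Query Plan 1: all three predicates applied to the join result.
plan1_est = estimate_selection(10_000, [2500, 20, 10])

print(supplier_est, supply_est, city_only_est, plan1_est)
```

One plausible reading of the "Very wrong! Why?" note, not spelled out in the slides, is that scity and sstate are correlated: every supplier in Seattle is also in WA, so the independence assumption shrinks the estimate from 50 to 5 without justification. The sketch also shows how the same query gets a final estimate below 1 in Plan 1 and 4 in Plan 2.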

Physical Plan 1
  Supplier(sid, sname, scity, sstate)
  Supply(sid, pno, quantity)

  Plan (top to bottom):
    π sname
    σ pno=2 ∧ scity='Seattle' ∧ sstate='WA'   (T < 1)
    ⋈ sid = sid, block nested loop join   (T = 10000)
    Scan Supply, Scan Supplier

  Cost: B(R) + B(R)B(S)/(M-1)
  Total cost: 100 + 100*100/10 = 1100

  T(Supplier) = 1000, B(Supplier) = 100, V(Supplier, scity) = 20, V(Supplier, sstate) = 10
  T(Supply) = 10000, B(Supply) = 100, V(Supply, pno) = 2500
  M = 11

Physical Plan 2
  Supplier(sid, sname, scity, sstate)
  Supply(sid, pno, quantity)

  Plan (top to bottom):
    π sname
    ⋈ sid = sid, main memory join   (T = 4)
    Supplier side: σ sstate='WA' (T = 5) over σ scity='Seattle' (T = 50), unclustered index lookup Supplier(scity)
    Supply side: σ pno=2 (T = 4), unclustered index lookup Supply(pno)

  Cost of Supply(pno) = 4
  Cost of Supplier(scity) = 50
  Total cost: 54

  T(Supplier) = 1000, B(Supplier) = 100, V(Supplier, scity) = 20, V(Supplier, sstate) = 10
  T(Supply) = 10000, B(Supply) = 100, V(Supply, pno) = 2500
  M = 11

Physical Plan 3
  Supplier(sid, sname, scity, sstate)
  Supply(sid, pno, quantity)

  Plan (top to bottom):
    π sname
    σ scity='Seattle' ∧ sstate='WA'
    ⋈ sid = sid, clustered index join with Supplier   (T = 4)
    σ pno=2 on Supply, unclustered index lookup Supply(pno)   (T = 4)

  Cost of Supply(pno) = 4
  Cost of index join = 4
  Total cost: 8

  T(Supplier) = 1000, B(Supplier) = 100, V(Supplier, scity) = 20, V(Supplier, sstate) = 10
  T(Supply) = 10000, B(Supply) = 100, V(Supply, pno) = 2500
  M = 11
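The totals for Physical Plans 2 and 3 follow from the statistics with a little arithmetic. The sketch below makes two assumptions that the slides leave implicit: an unclustered index lookup costs one I/O per matching tuple, and each probe of the Supplier index on sid in Plan 3 costs one I/O because sid identifies a single supplier. The variable and function names are illustrative.

```python
T_SUPPLIER, V_SUPPLIER_SCITY = 1000, 20
T_SUPPLY, V_SUPPLY_PNO = 10_000, 2500


def unclustered_lookup_cost(T, V):
    """Assumed cost model: one I/O per matching tuple, T(R) / V(R, a) matches."""
    return T // V


# Physical Plan 2: two unclustered index lookups, then a main memory join.
supply_pno_cost = unclustered_lookup_cost(T_SUPPLY, V_SUPPLY_PNO)
supplier_scity_cost = unclustered_lookup_cost(T_SUPPLIER, V_SUPPLIER_SCITY)
plan2_total = supply_pno_cost + supplier_scity_cost

# Physical Plan 3: look up Supply(pno), then probe Supplier once per result.
matching_supply_tuples = T_SUPPLY // V_SUPPLY_PNO
io_per_supplier_probe = 1          # assumption: sid is a key of Supplier
plan3_index_join_cost = matching_supply_tuples * io_per_supplier_probe
plan3_total = supply_pno_cost + plan3_index_join_cost

print(plan2_total, plan3_total)
```

Under this model Plan 2 costs 4 + 50 = 54 I/Os and Plan 3 costs 4 + 4 = 8 I/Os, the totals given on the slides. Plan 3 wins because it never touches the 50 Seattle suppliers that do not supply part 2.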

Query Optimizer Summary
  Input: A logical query plan
  Output: A good physical query plan
  Basic query optimization algorithm:
    Enumerate alternative plans (logical and physical)
    Compute estimated cost of each plan
    Choose plan with lowest cost
  This is called cost-based optimization
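The basic algorithm above can be sketched in a few lines of Python. This is a toy illustration, not how a real optimizer is structured: the plans are just names mapped to cost functions, and the costs are the totals computed for the three physical plans earlier in these slides.

```python
def choose_cheapest(plans):
    """Cost-based optimization in miniature.

    `plans` maps a plan name to a zero-argument function that returns the
    estimated cost of that plan; the cheapest (cost, name) pair is returned.
    """
    costed = [(estimate(), name) for name, estimate in plans.items()]
    return min(costed)


# Enumerated alternatives, with the cost estimates from the slides.
plans = {
    "Physical Plan 1 (block nested loop join)": lambda: 100 + 100 * 100 // 10,
    "Physical Plan 2 (two index lookups, main memory join)": lambda: 4 + 50,
    "Physical Plan 3 (index lookup, index join)": lambda: 4 + 4,
}

best_cost, best_plan = choose_cheapest(plans)
print(best_plan, best_cost)
```

The hard parts of a real optimizer are hidden inside the two steps this sketch takes for granted: enumerating a useful set of alternative plans, and estimating each cost accurately from table statistics.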

Disk Scheduling
  Query optimization
    Good DB design
    Good estimation
    Hardware independent
  All disk I/Os are not created equal
    Sectors close to each other are preferable to read