Query and Join Optimization 11/5. Overview Recap of Merge Join Optimization Logical Optimization Histograms (How Estimates Work. Big problem!) Physical.


Query and Join Optimization 11/5

Overview: Recap of Merge; Join Optimization; Logical Optimization; Histograms (How Estimates Work. Big problem!); Physical Optimizer (if we have time)

Recap on Merge

Key (Simple) Idea: To find an element that is no larger than all elements in two sorted lists, one only needs to compare the minimum elements from each list. If A_1 <= A_2 <= … <= A_N and B_1 <= B_2 <= … <= B_M, then min{A_1, B_1} <= A_i for i = 1…N and min{A_1, B_1} <= B_j for j = 1…M.

Merge BIG sorted files to produce BIGGER sorted files with SMALL memory. [Animation: two sorted files on disk are merged through main memory one page at a time. The smaller of the two front elements is repeatedly moved to the output page; when an input page is used up (What next?), the next page of that file is read. The output so far reads 1, 2, 5, 7, 11, …]

We can merge lists of arbitrary length with only 3 buffer pages. If the lists have N and M pages, the cost is 2(N+M) I/Os: every page is read once and written once. What if we merge B lists with B+1 buffer pages?
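A minimal sketch of the 3-buffer merge, assuming each file is modeled as a Python list of sorted pages (the function name `merge_sorted_files` and the sample pages are illustrative, not from the slides). It keeps one input page per file plus one output page in memory and counts page reads and writes.

```python
def merge_sorted_files(file_a, file_b, page_size):
    """Merge two sorted paged files with 2 input pages and 1 output page."""
    ios = 0
    files = [file_a, file_b]
    next_page = [0, 0]      # index of the next page to read from each file
    buffers = [[], []]      # the single in-memory input page for each file
    output, out_page = [], []

    def refill(i):
        # Read the next page of file i only when its buffer is empty.
        nonlocal ios
        if not buffers[i] and next_page[i] < len(files[i]):
            buffers[i] = list(files[i][next_page[i]])
            next_page[i] += 1
            ios += 1

    while True:
        refill(0)
        refill(1)
        live = [i for i in (0, 1) if buffers[i]]
        if not live:
            break
        # Compare only the two front (minimum) elements.
        i = min(live, key=lambda k: buffers[k][0])
        out_page.append(buffers[i].pop(0))
        if len(out_page) == page_size:      # output page full: write it
            output.append(out_page)
            out_page = []
            ios += 1
    if out_page:                            # flush the last partial page
        output.append(out_page)
        ios += 1
    return output, ios


file_a = [[1, 5], [7, 11], [20, 31]]        # N = 3 pages (made-up data)
file_b = [[2, 22], [23, 24], [25, 30]]      # M = 3 pages (made-up data)
merged, ios = merge_sorted_files(file_a, file_b, page_size=2)
print(merged, ios)
```

With full pages on both sides, the I/O count comes out to 2(N+M): N+M reads plus N+M writes.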

Query Optimization

Optimization Order the operations within a query to reduce the cost. Major component of the database – Most mysterious & important – Heart of Query Processing (QP) “QP is not rocket science. When you flunk out of QP, we make you go build rockets.” – anonymous

Join Optimization

RA Reminder Sailors(sid,sname,rating,age) Reserves(sid,bid,date) Boats(bid,bname,color) Warning: Keys… “Find Names of sailors who’ve reserved a red boat” “Find the names of sailors who reserved both a red boat and a green boat”

Schema for Examples Reserves: – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages. Sailors: – Each tuple is 50 bytes long, 80 tuples per page, 500 pages. Sailors ( sid : integer, sname : string, rating : integer, age : real) Reserves ( sid : integer, bid : integer, day : dates, rname : string)

What is this doing? You too can type EXPLAIN! (you may also want to know ANALYZE) When it’s slow, you’d like to know!
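As a small illustration of asking a database for its plan, here is a sketch using SQLite through Python's standard `sqlite3` module. This is a swapped-in system: the slide most likely has a server DBMS in mind (PostgreSQL, for example, offers `EXPLAIN` and `EXPLAIN ANALYZE`), while SQLite's closest equivalent is `EXPLAIN QUERY PLAN`. The table definitions are assumptions based on the example schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sailors(sid INTEGER PRIMARY KEY, sname TEXT, rating INTEGER, age REAL);
CREATE TABLE Reserves(sid INTEGER, bid INTEGER, day TEXT, rname TEXT);
""")

# Ask SQLite how it would execute the join, without running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid = S1.sid"
).fetchall()

for row in plan:
    print(row[-1])   # last column is the human-readable plan step
```

The exact wording of the plan steps depends on the SQLite version, so the output is not reproduced here.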

Joins: One of the most important operations for performance. Many, many algorithms: all fun. SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid = S1.sid What is this in RA?

Some dry notation: Given a relation R, define the following two functions. T(R) = “# of tuples in R”, B(R) = “# of pages/blocks in R”. NB: I omit ceilings in calculations. A good exercise is to put them in the appropriate places! NB2: We don’t count the cost of writing the output to disk!

Nested loop join

Nested Loop Joins: tuple-based nested loop, R ⋈ S
  for each tuple r in R do
    for each tuple s in S do
      if r and s join then output (r,s)
Cost: B(R) + T(R) B(S). Why? With B(R) = 500, T(R) = 50,000, B(S) = 1000, T(S) = 200,000, that is about 5e7 IOs: ~140 hours! What is the cost if we switch R and S?

Block Nested Loop Joins
  for each (M-1) blocks br of R do
    for each block bs of S do
      for each tuple s in bs do
        for each tuple r in br do
          if r and s join then output(r,s)
Let M be the number of blocks in memory (M=11). NLJ = B(R) + T(R)B(S); BNLJ = B(R) + B(R)B(S)/(M-1). With B(R) = 500, T(R) = 50,000, B(S) = 1000, T(S) = 200,000: NLJ = 140 hrs, BNLJ = 0.14 hrs.
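The two cost formulas can be checked with a few lines of arithmetic. This is a sketch; the 10 ms per page I/O is an assumption inferred from the slide's own "5e7 IOs ≈ 140 hours", not a number the slides state.

```python
IO_SECONDS = 0.01   # assumption: 10 ms per page I/O


def nlj_ios(b_outer, t_outer, b_inner):
    """Tuple-based nested loop: B(R) + T(R) * B(S)."""
    return b_outer + t_outer * b_inner


def bnlj_ios(b_outer, b_inner, m):
    """Block nested loop with M memory blocks: B(R) + B(R) * B(S) / (M - 1)."""
    return b_outer + b_outer * b_inner / (m - 1)


def hours(ios):
    return ios * IO_SECONDS / 3600


B_R, T_R, B_S, T_S, M = 500, 50_000, 1000, 200_000, 11

nlj = nlj_ios(B_R, T_R, B_S)            # R as the outer relation
nlj_swapped = nlj_ios(B_S, T_S, B_R)    # S as the outer relation
bnlj = bnlj_ios(B_R, B_S, M)

print(f"NLJ  (R outer): {nlj:>12,} IOs, {hours(nlj):7.2f} h")
print(f"NLJ  (S outer): {nlj_swapped:>12,} IOs, {hours(nlj_swapped):7.2f} h")
print(f"BNLJ (R outer): {bnlj:>12,.0f} IOs, {hours(bnlj):7.2f} h")
```

Swapping the relations in the tuple-based join roughly doubles the cost here, because the cost is dominated by T(outer) * B(inner).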

Nested Loop Joins: Block-based Nested Loop Join – Still a smart cross product. Nevertheless, useful! NB: it is faster to iterate over the smaller relation first. R ⋈ S: R = outer relation, S = inner relation.

Smarter than Cross Products

Index Nested Loop Joins: index-based nested loop, R ⋈ S on A
  for each tuple r in R do
    use the index on S.A to find all tuples s such that r.A = s.A
Clustered B+ tree on S.A. All distinct values fit on a page. How much does this join cost? ~ B(R) + T(R)*3 (rule of thumb). Does not evaluate the full cross product!
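An in-memory sketch of the index nested loop idea. A Python dictionary stands in for the B+ tree on S.A, which is an assumption made for brevity: it supports the equality lookups this join needs but not range scans. The helper names and sample rows are illustrative.

```python
from collections import defaultdict


def build_index(rows, key):
    """Stand-in for an index on the join column: key value -> matching rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def index_nested_loop_join(outer_rows, inner_index, key):
    """For each outer tuple, probe the index instead of scanning the inner relation."""
    for r in outer_rows:
        for s in inner_index.get(r[key], []):
            yield (r, s)


reserves = [{"sid": 1, "bid": 101}, {"sid": 3, "bid": 102}, {"sid": 1, "bid": 103}]
sailors = [{"sid": 1, "sname": "a"}, {"sid": 2, "sname": "b"}, {"sid": 3, "sname": "c"}]

sailors_by_sid = build_index(sailors, "sid")
result = list(index_nested_loop_join(reserves, sailors_by_sid, "sid"))
for r, s in result:
    print(r["bid"], s["sname"])
```

Only pairs that actually match are ever touched, which is why the full cross product is never evaluated.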

Sort Merge

Join: Sort-Merge (R ⋈ S) Sort R and S on the join column, then scan them to do a “merge” (on join col.), and output result tuples. R is scanned once; each S group is scanned once per matching R tuple. – Multiple scans of an S group are likely to find needed pages in buffer. If R, S are already sorted on the join key, SMJ is awesome!
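A compact in-memory sketch of the merge step, including the rescan of an S group for every matching R tuple. Python's `sorted` replaces the external sort here, and the function name and sample tuples are illustrative assumptions.

```python
def sort_merge_join(r_rows, s_rows, key):
    """Sort both inputs on the join key, then merge; returns matching (r, s) pairs."""
    r = sorted(r_rows, key=key)
    s = sorted(s_rows, key=key)
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        if key(r[i]) < key(s[j]):
            i += 1                      # advance R: no S tuple can match r[i]
        elif key(r[i]) > key(s[j]):
            j += 1                      # advance S: no R tuple can match s[j]
        else:
            k = key(r[i])
            # Find the end of the S group that shares this key.
            j_end = j
            while j_end < len(s) and key(s[j_end]) == k:
                j_end += 1
            # Each R tuple with this key scans the S group once.
            while i < len(r) and key(r[i]) == k:
                for s_row in s[j:j_end]:
                    out.append((r[i], s_row))
                i += 1
            j = j_end
    return out


reserves = [(22, 101), (58, 103), (22, 102), (31, 101)]        # (sid, bid)
sailors = [(58, "d"), (22, "a"), (44, "c"), (31, "b")]         # (sid, sname)
pairs = sort_merge_join(reserves, sailors, key=lambda t: t[0])
print(pairs)
```

If both inputs held the same key in every tuple, the inner rescan would touch every S tuple for every R tuple, which is the M*N worst case.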

Example of Sort-Merge Join Cost: 6M + 6N + (M+N) – The cost of scanning, M+N, could be M*N (very unlikely! When does this happen?) – Here M (resp. N) is the size in pages of R (resp. S)

Sort Merge v. Nested Loops steel cage match If we have 100 buffer pages, Reserves is 1000 pages and Sailors 500 pages, then – Sort both in two passes: 2 * 2 * 1000 + 2 * 2 * 500 = 6000 – Merge phase: 1000 + 500 = 1500, so 7500 IOs What is BNLJ? – 500 + 1000*500/99 = 5550 But, if we have 35 buffer pages? – Sort Merge has same behavior (still 2 pass) – BNLJ? ~ 15k IOs! NB: SMJ assumes both relations are sorted in two passes
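The comparison can be reproduced numerically. In this sketch the pass count assumes a standard external merge sort (pass 0 builds runs of B pages, later passes merge B-1 runs at a time); that sort model and the function names are assumptions, while the page counts come from the slide.

```python
import math


def sort_passes(pages, b):
    """Passes of external merge sort with b buffer pages (assumed model)."""
    runs = math.ceil(pages / b)          # pass 0: sorted runs of b pages each
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (b - 1))  # each merge pass is (b-1)-way
        passes += 1
    return passes


def smj_ios(m, n, b):
    """Sort both inputs (read + write per pass), then one merging scan."""
    return 2 * m * sort_passes(m, b) + 2 * n * sort_passes(n, b) + (m + n)


def bnlj_ios(outer, inner, b):
    """Block nested loop as on the slide: outer + outer * inner / (b - 1)."""
    return outer + outer * inner / (b - 1)


RESERVES, SAILORS = 1000, 500
for b in (100, 35):
    print(b, smj_ios(RESERVES, SAILORS, b), int(bnlj_ios(SAILORS, RESERVES, b)))
```

Sort-merge stays at 7500 I/Os for both buffer sizes, while block nested loop grows from about 5.5k to about 15k I/Os as memory shrinks.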

A simple optimization: Merged! Observe: the last phase of the external sort is a merge, and we can merge the merge phases. – Create sorted runs: 2 * (1000 + 500) I/Os. Each run is of length (B-1) (approximately). There are 1000/(B-1) + 500/(B-1) such runs. – One can create runs of length 2(B-1) using what’s called a tournament sort, used in PostgreSQL. – If (1000 + 500)/(2(B-1)) < B-1 then one page from every run fits in memory, roughly if (M+N) < 2B^2, which holds whenever max{M, N} < B^2. So we’ll say max{M, N} < B^2 implies cost 3(M+N).

Hash Join

Hash-Join (1) Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i. (2) Read a partition of R, hash it using h2. Scan matching partition of S for matches. Should h = h2? [Figure: phase 1 reads the original relation through 1 input buffer and hashes it with h into B-1 output buffers, one per partition on disk; phase 2 builds a hash table with h2 for partition Ri (k < B-1 pages) and streams Si through an input buffer to an output buffer for the join result.]
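An in-memory sketch of the two phases. Python lists stand in for the on-disk partitions, and a `dict` plays the role of the second hash function h2; both are simplifying assumptions, as are the function name and sample data.

```python
def hash_join(r_rows, s_rows, key, num_partitions):
    """Two-phase hash join: partition with h, then build and probe per partition."""

    def h(k):
        return hash(k) % num_partitions   # partitioning hash function h

    # Phase 1: partition both relations with the same function h.
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for r in r_rows:
        r_parts[h(key(r))].append(r)
    for s in s_rows:
        s_parts[h(key(s))].append(s)

    # Phase 2: R tuples in partition i can only match S tuples in partition i.
    out = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = {}                         # in-memory hash table (the dict acts as h2)
        for r in r_part:
            table.setdefault(key(r), []).append(r)
        for s in s_part:                   # stream the matching S partition
            for r in table.get(key(s), []):
                out.append((r, s))
    return out


sailors = [(22, "a"), (31, "b"), (44, "c"), (58, "d")]          # (sid, sname)
reserves = [(22, 101), (58, 103), (22, 102), (31, 101)]        # (sid, bid)
pairs = hash_join(sailors, reserves, key=lambda t: t[0], num_partitions=3)
print(sorted(pairs))
```

Using a different function in phase 2 matters: every key in partition i already collides under h, so reusing h for the in-memory table would put them all in one bucket.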

How much memory does Hash join need to perform well? Good case = perform the join in 2 passes. 1st Point: How large are the partitions? – R is of size M – S is of size N (wlog M < N) – Partition R into B-1 buffer pages (why B-1?) How many partitions result? How big are they? Roughly, each partition of R is f M/(B-1), where f is some fudge factor.

How much memory does Hash join need to perform well? Good case = perform the join in 2 passes. 2nd Question: During the probe phase, how much memory do we need? Key: Only the smaller partition needs to fit! Buffer needs to fit 1 partition of R, 1 page of S, & output: B > f M / (B-1), i.e., B^2 > fM. The little dog!

Sort-Merge v. Hash Join: In partitioning phase, read+write both R, S; 2(M+N). In matching phase, read both R, S; M+N I/Os. Given a minimum amount of memory (what is this, for each?) both have a cost of 3(M+N) I/Os. Minimum memory: HJ: B^2 > min{M, N} pages – i.e., the smaller relation. SMJ: B^2 > max{M, N} pages – i.e., the larger relation. Hash Join superior if relation sizes differ greatly. Why?
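The two memory conditions are easy to compare directly. This sketch ignores the fudge factor f, and the buffer count B = 25 is a made-up value chosen to fall between the two thresholds for the example relations.

```python
def hash_join_fits(m, n, b):
    """HJ reaches 3(M+N) I/Os when B^2 exceeds the SMALLER relation."""
    return b * b > min(m, n)


def sort_merge_fits(m, n, b):
    """SMJ reaches 3(M+N) I/Os when B^2 exceeds the LARGER relation."""
    return b * b > max(m, n)


M, N = 500, 1000      # Sailors and Reserves, in pages
B = 25                # assumed buffer pages: B^2 = 625
print(hash_join_fits(M, N, B), sort_merge_fits(M, N, B))
```

With B = 25 the hash join still qualifies because only the 500-page relation has to satisfy the bound, while sort-merge does not; the wider the size gap, the larger the range of B where this happens.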

Further Comparisons of Hash and Sort Joins Hash Joins are highly parallelizable. Sort-Merge less sensitive to data skew and result is sorted

Observations about Hash-Join: In-memory hash table speeds up matching tuples, so a little more memory is needed (fudge factor). If the hash function does not partition uniformly, one or more R partitions may not fit in memory. – What then? Can apply hash-join technique recursively to do the join of this R-partition with the corresponding S-partition. – SKEW!

Recall: Logical Optimization

Single block SQL to RA. SELECT DISTINCT S.sid FROM Sailors S, Reserves R WHERE S.sid = R.sid AND S.rating > 8 (Highly rated sailors.) SELECT S.sid, COUNT(DISTINCT Bid) FROM Sailors S, Reserves R WHERE S.sid = R.sid AND S.rating > 8 GROUP BY S.sid HAVING COUNT(DISTINCT Bid) > 5 (Highly rated sailors who reserve many different boats and how many boats they reserve.) How would you optimize these?

Logical Optimization Summary Use query equivalence to compute same output via different plans – Key reason to use an algebra Often logical rewritings applied heuristically: – Always convert selection + cross product to Join Asymptotic reduction – Push down selections and projections Often, but not always a good idea!

Physical Optimization

One concept: Pipelining. Intermediate results: could write them to disk or pipeline them to the next operator. [RA tree, bottom to top: Reserves ⋈ (sid=sid) Sailors, then σ bid=100, then σ rating > 5, then π sname.] We can apply selection & projection “on the fly”. Why?

Overview of Query Optimization: A plan is a tree of R.A. ops with a choice of algorithm for each op. – Each operator is typically implemented using a “pull” interface. Two main issues: – For a given query, what plans are considered? Algorithm to search plan space for cheapest (estimated) plan. – How is the cost of a plan estimated? Ideally: Want to find best plan. Practically: Avoid worst plans! We will study the System R approach.
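One way to picture a pull interface and pipelining is with Python generators, where each operator asks its child for the next tuple only when its own parent asks. This is an illustrative sketch; the operator names and sample rows are assumptions, and real engines expose open/next/close style iterators rather than generators.

```python
def scan(rows):
    """Leaf operator: yields stored tuples one at a time."""
    for row in rows:
        yield row


def select(child, predicate):
    """Selection applied on the fly: no intermediate result is stored."""
    for row in child:
        if predicate(row):
            yield row


def project(child, columns):
    """Projection applied on the fly."""
    for row in child:
        yield tuple(row[c] for c in columns)


sailors = [
    {"sid": 22, "sname": "a", "rating": 7},
    {"sid": 31, "sname": "b", "rating": 3},
    {"sid": 58, "sname": "c", "rating": 10},
]

# Plan tree: project(sname) <- select(rating > 5) <- scan(Sailors)
plan = project(select(scan(sailors), lambda r: r["rating"] > 5), ["sname"])

first = next(plan)        # pulls just enough input to produce one output tuple
rest = list(plan)         # drains the remaining tuples
print(first, rest)
```

Selection and projection look at one tuple at a time and keep no state, which is why they can be pipelined without writing anything to disk.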

Highlights of System R Optimizer Impact: Most widely used; works well for < 10 joins. Cost estimation: Approximate art at best. – Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes. – Considers combination of CPU and I/O costs. Enumerates an entire plan space: – Too many plans so only left-deep plans considered. Left-deep plans allow output of each operator to be pipelined into the next operator – Cartesian products avoided. There are other styles now… rule based.

An Example

How does it get those costs?

Cost Estimation

For each plan considered, must estimate cost: – Must estimate cost of each operation in plan tree. You know or can guess this: We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.) All estimates depend on input cardinality… – Must also estimate size of result for each operation in tree! For selections and joins, assume independence of predicates. Let’s see how to estimate…

Estimating Results Sizes Estimate the reduction factor Column = value (e.g. Salary = 100k) – If there is an index I then 1/#keys(I) – If no index? 1/10 Column1 = Column2 (e.g. Salary = Age ) – Index I1, I2, then 1/max(#keys(I1), #keys(I2)) – If no Index? 1/10 Column > value – If Index I then (High(I) – value) / (High(I) – Low(I)) – No Index? 1/2 Later: Do better with histograms
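The reduction-factor rules can be written down as small functions. This is a sketch of the slide's rules only; the clamping to [0, 1], the function names, and the rating range 1 to 10 in the example are assumptions.

```python
import math


def rf_column_equals_value(num_keys=None):
    """Column = value: 1/#keys(I) with an index, else the default 1/10."""
    return 1 / num_keys if num_keys else 1 / 10


def rf_column_equals_column(num_keys_1=None, num_keys_2=None):
    """Column1 = Column2: 1/max(#keys) when both indexes exist, else 1/10."""
    if num_keys_1 and num_keys_2:
        return 1 / max(num_keys_1, num_keys_2)
    return 1 / 10


def rf_column_greater_than(value, low=None, high=None):
    """Column > value: (High - value) / (High - Low) with an index, else 1/2."""
    if low is None or high is None:
        return 1 / 2
    fraction = (high - value) / (high - low)   # assumes high > low
    return max(0.0, min(1.0, fraction))        # clamp: an added safeguard


def estimate_rows(input_rows, reduction_factors):
    """Independence assumption: multiply the reduction factors together."""
    return input_rows * math.prod(reduction_factors)


sailors_rows = 500 * 80   # 500 pages, 80 tuples per page (example schema)
rf = rf_column_greater_than(8, low=1, high=10)   # rating > 8, assumed range 1..10
print(rf, estimate_rows(sailors_rows, [rf]))
```

Multiplying factors is only as good as the independence assumption; correlated predicates can make the estimate far too small.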

Histograms

The idea of a histogram is to make “buckets” and count how many values fall in each bucket. How to choose the buckets? – Equiwidth & Equidepth. It turns out that high-frequency values are very important.

Abstract Example [Figure: bar chart of Frequency against Values.] How do we compute how many values lie between 8 and 10? (Yes, it’s obvious.) Problem: Counts take too much space!

The Uniform is Red How much space does this take to store?

Fundamental Tradeoffs Want high resolution (like the full counts) Want low space (like uniform) Histograms are a compromise!

The Uniform is Red A query How do you estimate # of tuples? What about point queries?

Equi width All buckets roughly the same width

Equidepth All buckets contain roughly the same number of items
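A minimal sketch of building equidepth buckets by sorting and cutting into equal-sized chunks. The function name and data are illustrative, and the sketch ignores a detail real systems must handle: a value with many duplicates can straddle a bucket boundary.

```python
def equidepth_buckets(values, num_buckets):
    """Return (low, high, count) per bucket, each holding about n/num_buckets items."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(num_buckets):
        chunk = ordered[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


skewed = [1, 1, 1, 1, 1, 1, 2, 3, 10, 20, 30, 40]   # made-up, skewed toward 1
print(equidepth_buckets(skewed, 3))
```

The buckets come out narrow where the data is dense and wide where it is sparse, which is the opposite trade-off from equiwidth.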

Range Query: x in [5,8] All buckets roughly the same width

Histograms Simple, intuitive and popular Parameters # of buckets and type Can extend to many attributes (multidimensional)

Maintaining Histograms Histograms require that we update them! – Typically, you must run/schedule a command to update statistics on the database – Out of date histograms can be terrible! There is research work on self-tuning histograms and the use of query feedback – Oracle 11g

Nasty example: 1. we insert many tuples with value > … 2. we do not update the histogram 3. we ask for values > 20?

When estimates behave badly If we underestimate the number of tuples, what kinds of plans suffer? If we overestimate the number of tuples, what kinds of plans suffer? Think about using unclustered indexes…. We could have used that index! Or we could have used a hash join in one pass instead of sorting in two!

Compressed Histograms: One popular approach: 1. Store the most frequent values and their counts explicitly. 2. Keep an equiwidth or equidepth one for the rest of the values. People continue to try all manner of fanciness here: wavelets, graphical models, entropy models, …