Database Management Systems (CS 564)

Slides:



Advertisements
Similar presentations
Overview of Query Evaluation (contd.) Chapter 12 Ramakrishnan and Gehrke (Sections )
Advertisements

Query Optimization Reserves Sailors sid=sid bid=100 rating > 5 sname (Simple Nested Loops) Imperative query execution plan: SELECT S.sname FROM Reserves.
Query Optimization CS634 Lecture 12, Mar 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
CS 540 Database Management Systems
Lecture 13: Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Query Optimization Chapters 14.
Query Optimization Goal: Declarative SQL query
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
1 Relational Query Optimization Module 5, Lecture 2.
Relational Query Optimization 198:541. Overview of Query Optimization  Plan: Tree of R.A. ops, with choice of alg for each op. Each operator typically.
Query Rewrite: Predicate Pushdown (through grouping) Select bid, Max(age) From Reserves R, Sailors S Where R.sid=S.sid GroupBy bid Having Max(age) > 40.
Relational Query Optimization (this time we really mean it)
Query Optimization Overview Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems December 1, 2005 Some slide content derived.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Query Optimization Chapter 15.
Query Optimization Overview Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems December 2, 2004 Some slide content derived.
Overview of Query Optimization v Plan : Tree of R.A. ops, with choice of alg for each op. –Each operator typically implemented using a `pull’ interface:
Query Optimization, part 2 CS634 Lecture 13, Mar Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Overview of Implementing Relational Operators and Query Evaluation
Introduction to Database Systems1 Relational Query Optimization Query Processing: Topic 2.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 12: Overview.
Query Optimization. overview Histograms A histogram is a data structure maintained by a DBMS to approximate a data distribution Equiwidth vs equidepth.
Database systems/COMP4910/Melikyan1 Relational Query Optimization How are SQL queries are translated into relational algebra? How does the optimizer estimates.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Implementing Relational Operators and Query Evaluation Chapter 12.
Introduction to Query Optimization, R. Ramakrishnan and J. Gehrke 1 Introduction to Query Optimization Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Introduction to Query Optimization Chapter 13.
CPSC 404, Laks V.S. Lakshmanan1 Overview of Query Evaluation Chapter 12 Ramakrishnan & Gehrke (Sections )
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 15 – Query Optimization.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
Hash Tables and Query Execution March 1st, Hash Tables Secondary storage hash tables are much like main memory ones Recall basics: –There are n.
Implementation of Database Systems, Jarek Gryz1 Relational Query Optimization Chapters 12.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Database Applications (15-415) DBMS Internals- Part IX Lecture 20, March 31, 2016 Mohammad Hammoud.
Query Optimization. overview Application Programmer (e.g., business analyst, Data architect) Sophisticated Application Programmer (e.g., SAP admin) DBA,
CHAPTER 19 Query Optimization. CHAPTER 19 Query Optimization.
CS 540 Database Management Systems
CS 440 Database Management Systems
Database Management System
CS222P: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Lecture 16: Relational Operators
Introduction to Query Optimization
Overview of Query Optimization
COST ESTIMATION FOR THE RELATIONAL ALGEBRA OPERATIONS MIT 813 GROUP 15 PRESENTATION.
Chapter 15 QUERY EXECUTION.
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
Introduction to Database Systems
Examples of Physical Query Plan Alternatives
File Processing : Query Processing
Query Optimization Overview
Relational Query Optimization
Database Applications (15-415) DBMS Internals- Part IX Lecture 21, April 1, 2018 Mohammad Hammoud.
Query Optimization Overview
Query Optimization.
Lecture 2- Query Processing (continued)
Relational Query Optimization
Overview of Query Evaluation
Implementation of Relational Operations
Lecture 13: Query Execution
Relational Query Optimization
CS222P: Principles of Data Management Notes #13 Set operations, Aggregation, Query Plans Instructor: Chen Li.
Evaluation of Relational Operations: Other Techniques
Database Applications (15-415) DBMS Internals- Part X Lecture 22, April 3, 2018 Mohammad Hammoud.
Relational Query Optimization (this time we really mean it)
CS222: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Relational Query Optimization
Relational Query Optimization
Evaluation of Relational Operations: Other Techniques
Presentation transcript:

Database Management Systems (CS 564) Fall 2017 Lecture 24

Disaster-prevention mode! Query Optimization Disaster-prevention mode! CS 564 (Fall'17)

Recap Memory hierarchy and secondary storage Buffer management and replacement policies File, page and record organizations Indexing and access paths Evaluating relational operators Query optimization Putting all the above together to efficiently answer queries CS 564 (Fall'17)

Example Logical Query Plan Physical Query Plans 𝜋buyer SELECT DISTINCT P.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘Madison’ ORDER BY P.buyer 𝜋buyer 𝜎city = ‘Madison’ ⨝buyer=name Purchase Person 𝜏buyer Assume that Purchase has a B+tree index on buyer and Person has a B+tree index on city Logical Query Plan Hash Join Table Scan B+tree Scan (Index Leaf Nodes) Hash-based Projection Person Purchase External Merge-sort Quicksort for Internal, B=20 Index Nested Loop Join Indexed Scan Table Scan Sort-based Projection Person Purchase Physical Query Plans CS 564 (Fall'17)

Query Optimization A given logical query plan (LQP) can have several possible physical query plans (PQP) with (potentially) different costs Ideal goal: for any given LQP Create the space of all possible PQPs Estimate the cost of each plan Pick the optimal (i.e. cheapest/fastest) PQP CS 564 (Fall'17)

Query Optimization (Cont.) Real-world goal: just prevent the disaster; i.e. avoid the clearly awful PQPs “Query optimization is a metaphor for life itself! It is hard to even know what an optimal plan would be, but it is feasible to avoid the obviously bad plans!” –Jeff Naughton CS 564 (Fall'17)

Query Execution Query Parser Query Optimizer Catalog SQL Query Query Parser LQP Query Optimizer Catalog Plan Generator Plan Cost Estimator Optimized PQP Query Plan Evaluator Operator Evaluators CS 564 (Fall'17)

Plan Generator Also called plan enumerator Explores various PQPs for a given LQP Challenge: space of all possible plans is huge Use rules to help determine what plans to enumerate, and also consults cost models Two main types of rules for enumerating plans Logical: algebraic rewrites Use relational algebraic equivalence to rewrite LQP itself Physical: various physical implementations of operations Use different implementations for a given logical operation in LQP CS 564 (Fall'17)

Algebraic Rewriting Rewrite a given RA query into another that is equivalent (a logical property) but might be faster (a physical property) Exploit formal properties of RA operators A few rewrite rules Single-operator Rewrites Unary operators Binary operators Cross-operator Rewrites CS 564 (Fall'17)

Algebraic Rewriting: Example 𝜋buyer 𝜎city = ‘Madison’ ⨝buyer=name Purchase Person 𝜏buyer 𝜎total > 25 𝜋buyer 𝜎total > 25 ⨝buyer=name Purchase Person 𝜏buyer 𝜎 city = ‘Madison’ 𝜋buyer 𝜎city = ‘Madison’ ∧ total > 25 ⨝buyer=name Purchase Person 𝜏buyer 𝜋buyer 𝜎city = ‘Madison’ ⨝buyer=name Purchase Person 𝜏buyer 𝜎total > 25 𝜋buyer 𝜎total > 25 ⨝buyer=name Purchase Person 𝜏buyer 𝜎 city = ‘Madison’ CS 564 (Fall'17)

Algebraic Rewriting: Unary Operators Key unary operators in RA Selection (𝜎) Projection (𝜋) Commutativity of 𝜎 𝜎 𝑝 1 ( 𝜎 𝑝 2 (𝑅))= 𝜎 𝑝 2 ( 𝜎 𝑝 1 (𝑅)) Cascading of 𝜎 𝜎 𝑝 1 ( 𝜎 𝑝 2 (… 𝜎 𝑝 𝑛 (𝑅)…))= 𝜎 𝑝 1 ∧ 𝑝 2 ∧ … ∧ 𝑝 𝑛 (𝑅) Cascading of 𝜋: given 𝐴 𝑖 ⊆ 𝐴 𝑖+1 ; ∀𝑖=1,…, 𝑛−1 𝜋 𝐴 1 ( 𝜋 𝐴 2 (… 𝜋 𝐴 𝑛 (𝑅)…))= 𝜋 𝐴 1 (𝑅) CS 564 (Fall'17)

Algebraic Rewriting: Binary Operators Key binary operator in RA: join (⋈) Commutativity of ⋈ 𝑅⋈𝑆=𝑆⋈𝑅 Associativity of ⋈ (𝑅⋈𝑆)⋈𝑇=𝑅⋈(𝑆⋈𝑇) Other binary operators Cartesian product and set operators Which ones satisfy the above properties? CS 564 (Fall'17)

Cross-operator Algebraic Rewriting Commuting 𝜎 and 𝜋, if they are on the same set A of attributes 𝜎 𝑝 𝐴 ( 𝜋 𝐴 (𝑅))= 𝜋 𝐴 ( 𝜎 𝑝(𝐴) (𝑅)) Combining 𝜎 and × 𝜎 𝑝 𝑅×𝑆 =𝑅 ⋈ 𝑝 𝑆 “Pushing the selection” 𝜎 𝑝 𝐴 (𝑅⋈𝑆)= 𝜎 𝑝(𝐴) (𝑅)⋈𝑆 if 𝐴⊆𝑅 𝜎 𝑝 𝐴 𝑅 × 𝑆 = 𝜎 𝑝 𝐴 𝑅 × 𝑆 if 𝐴⊆𝑅 Commuting 𝜋 with × and ⋈ 𝜋 𝐴 𝑅 × 𝑆 = 𝜋 𝐴∩𝑅 𝑅 × 𝜋 𝐴∩𝑆 (𝑆) 𝜋 𝐴 𝑅 ⋈ 𝑝 𝐵 𝑆 = 𝜋 𝐴∩𝑅 𝑅 ⋈ 𝑝(𝐵) 𝜋 𝐴∩𝑆 (𝑆) if 𝐵⊆𝐴 CS 564 (Fall'17)

Algebraic Rewriting: Example ⨝ R(A,B) S(B,C) ⨝ R(A,B) S(B,C) 𝜋B Why might we prefer this plan? CS 564 (Fall'17)

Algebraic Rewriting (Cont.) Basic commutators Push projection through Selection Join Push selection through Projection Re-ordered joins NOT an exhaustive set of rewriting rules CS 564 (Fall'17)

Algebraic Rewriting: Example (Cont.) Which of the following does hold? 𝜋 𝐴 𝑅 × 𝑆 = 𝜋 𝐴 𝑅 × 𝑆 and 𝐴⊆𝑅 𝜋 𝐴 𝑅 ⋈ 𝑝 𝐵 𝑆 = 𝜋 𝐴 𝜋 𝐶∩𝑅 𝑅 ⋈ 𝑝 𝐵 𝜋 𝐶∩𝑆 𝑆 and 𝐶 =𝐴∪𝐵 𝜎 𝑝 1 ∧ 𝑝 2 ∨ 𝑝 3 (𝑅)= 𝜎 𝑝 1 (𝑅)∩ 𝜎 𝑝 2 (𝑅)∪ 𝜎 𝑝 3 (𝑅) 𝜎 𝑝 𝐴 ∧ 𝑞 𝐵 (𝑅⋈𝑆)= 𝜎 𝑝 𝐴 (𝑅)⋈ 𝜎 𝑞(𝐵) (𝑆), 𝐴⊆𝑅 and 𝐵 ⊆𝑆 CS 564 (Fall'17)

Choose Physical Operation Implementation Given a (rewritten) LQP, pick a physical implementation for each logical operation Recall various RA operation implementations and their costs 𝜎: file scan vs. indexed (B+tree vs. hash) 𝜋: hashing-based vs. sorting-based ⋈: BNLJ vs. INLJ vs. SMJ vs. HJ Example: 𝜋 𝐵 (𝜎 𝑝(𝐴) 𝑅 ⋈𝑆) How many with logical rewrites? 16 PQPs 2 options 2 options 4 options CS 564 (Fall'17)

Choose Operation Implementation: Factors Are the indexes clustered or unclustered? Are there multiple matching indexes? Are index-only access paths possible for some operations? Are there “interesting orders” among the tuples? Would sorted outputs benefit downstream operations? Estimated cardinalities of intermediate results? How best to reorder multi-table joins? Hard, open research problem CS 564 (Fall'17)

Join Orderings Exponential number of orderings Example: 𝑅⋈𝑆⋈𝑇⋈𝑈 Associativity of join Example: 𝑅⋈𝑆⋈𝑇⋈𝑈 Almost all RDBMSs consider only left-deep join trees Enables easy pipelining ⨝ R S T U ⨝ R S T U ⨝ R S T U Bushy tree Left-deep tree Right-deep tree CS 564 (Fall'17)

Materialization vs. Pipelining Intermediate results Materialized Stored in a temporary relation When the execution of the corresponding operation is finished, pass the temp table to the next operation Pipelined Pass the intermediate results on-the-fly (as they are generated by the upstream operation) Benefits Display the outputs to the user incrementally Parallelize execution on multi-core CPUs CS 564 (Fall'17)

Blocking Operations Purchase Person Purchase Person Hash Join Table Scan B+tree Scan (Index Leaf Nodes) Hash-based Projection Purchase Person External Merge-sort Quicksort for Internal, B=20 Sort-merge Join Table Scan Hash-based Projection Purchase Person External Merge-sort Quicksort for Internal, B=20 Hash join results can be pipelined to projection Sorting results have to be materialized and cannot be pipelined CS 564 (Fall'17)

Iterator Interface Enables us to abstract away individual physical operation implementation details Makes pipelining easier Each physical operation should support three functions Open(): initialize the operation’s “state”, get arguments, allocate input and output buffers GetNext(): ask the operation to “deliver” next output tuple; pass it on; if blocking, wait Close(): clear operation’s state, free up space CS 564 (Fall'17)

Plan Cost Estimation Do not know the exact cost of PQPs (why?) For each PQP considered by the plan generator, the plan cost estimator computes estimated cost of the PQP Weighted sum of plan operations’ I/O costs and CPU costs Distributed RDBMSs also include network cost Challenge: Given a PQP, compute overall cost Pipelining vs. blocking ops; cannot simply add costs Hard to estimate cardinality of intermediate relations Remember we ignored OUTPUT cost!? CS 564 (Fall'17)

Plan Cost Estimation (Cont.) Estimating the cost of a query plan involves Estimating the cost of each operation in the plan, which depends on Input cardinalities Algorithm cost (known) Estimating the cardinalities of intermediate results Need statistics about input relations Tracked in system catalog CS 564 (Fall'17)

System Catalog Set of pre-defined relations for metadata about DB (schema) For each relation Relation name, File name and structure (heap file, clustered B+tree, etc.) Attribute names and types Integrity constraints Indexes For each index Index name, structure (B+tree, hash, etc.), search key For each view View name View definition CS 564 (Fall'17)

System Catalog: Example SQLite select * from sqlite_master; type        name        tbl_name    rootpage    sql ----------  ----------  ----------  ----------  ----------------------------------------------------table       User        User        2           CREATE TABLE User ( UID CHAR(20), Name CHAR(50), ... index       sqlite_aut  User        3                                                                    table       Event       Event       4           CREATE TABLE Event ( EID CHAR(20), Name CHAR(50),... index       sqlite_aut  Event       5                                                                    table       Participat  Participat  6           CREATE TABLE ParticipateIn ( EID CHAR(20), UID CH... index       sqlite_aut  Participat  7                                                                                            CS 564 (Fall'17)

Statistics in System Catalog RDBMS periodically collects stats about the data Can be slightly out-of-date For each table R Cardinality, i.e., number of tuples, NTuples(R) Size, i.e., number of pages, NPages(R), or just NR For each Index X Cardinality, i.e., number of distinct keys IKeys(X) Size, i.e., number of pages IPages(X) (for a B+ tree, this is the number of leaf pages only) Height (for tree indexes) IHeight(X) Min and max key values in index ILow(X), IHigh(X); i.e. key ranges More advanced query optimizers Histograms Wavelets CS 564 (Fall'17)

Plan Cost Estimation (Cont.) Most RDBMSs use various heuristics to make cost estimation tractable Example: complex predicates 𝜎 𝑝 1 ∧ 𝑝 2 (𝑅) Suppose selectivity of 𝑝 1 is 5% and selectivity of 𝑝 2 is 10% Q: What is the selectivity of 𝑝 1 ∧ 𝑝 1 ? Heuristic: assume independence of predicates Hence, selectivity of 𝑝 1 ∧ 𝑝 1 ≈0.05∗0.1=0.005 i.e. 0.5% A: Not enough info! CS 564 (Fall'17)

Example: System R Query Optimizer Unit of optimization: query block Example Optimize one block at a time Treat nested blocks as calls to a subroutine Execute inner block once per outer tuple In reality more complex optimization For each block, consider the following plans All available access paths, for each relation in FROM clause All join permutations of left-deep join trees SELECT S.name FROM Student S WHERE S.age IN (SELECT MAX (S2.age) FROM Student S2 GROUP BY S2.class) Outer block Nested block CS 564 (Fall'17)

System R Optimizer: Plan Enumeration Two main cases Single-relation plans Multiple-relation plans Single-relation plan (no joins) Access paths File scan Index scan(s) CS 564 (Fall'17)

System R Optimizer: Single-relation Plans Index scan(s) Clustered, non-clustered More than one index may “match” predicates One (clustered) index X matching one or more selects Cost = (IPages(X)+NPages(R)) * product of reduction factors of matching selects Choose the plan with the least estimated cost Merge/pipeline selection and projection (and aggregate) rid intersection techniques Index aggregate evaluation CS 564 (Fall'17)

System R Optimizer: Example Employee(SSN, Name, Address, Salary, DeptID) 1,000 data pages 10K tuples 100 pages in B+tree #depts: 10 Salary range: 10K–200K SELECT Name FROM Employee WHERE DeptID = 8 AND Salary > 40000 B+tree index on DeptID #tuples retrieved: (1/10) * 10,000 Clustered index: (1/10) * (100+1,000) pages Unclustered index: (1/10) * (100+10,000) pages B+tree index on Salary Clustered index: (200-40)/(200-10) * (100+1,000) pages Unclustered index: … File scan: 1,000 pages CS 564 (Fall'17)

Recap Plan generator and cost estimator work in tandem Rules determine what PQPs are enumerated Logical: algebraic rewrites of LQP Physical: operation implementations and ordering alternatives Cost models and heuristics help approximating the costs of the PQPs Active research area Parametric query optimization, multi-objective query optimization, multiple query optimization, online query optimization, dynamic re-optimization, etc. CS 564 (Fall'17)

Transaction Management Next Up Transaction Management Questions? CS 564 (Fall'17)