Access Path Selection in a RDBMS Shahram Ghandeharizadeh Computer Science Department University of Southern California.


System R
- Grand-daddy of RDBMS
  - Started in 1975 at IBM San Jose Research Lab.
  - Won the ACM Software System Award.
  - Introduced fundamental database concepts such as SQL, locking, logging, and cost-based query optimization.

Four Phases of SQL Processing
- Parsing
  - Checks for correct SQL syntax.
  - Computes the list of items to be retrieved, the table(s) referenced, and the boolean combination of simple predicates.
- Optimization
  - Looks up the tables in the database catalog for their existence, statistics, and available access paths.
  - Computes the execution plan with minimum cost.
  - Output: an execution plan in the Access Specification Language (ASL).
- Code generation
  - The code generator is a table-driven program that translates ASL trees into machine language code.
  - The parse tree is replaced by executable machine code and its associated data structures. This code can be stored in the database for later execution.
- Execution
  - Executes the machine code by invoking the System R internal storage system (RSS) via the storage system interface (RSI) to scan each of the physically stored relations referenced by the query.

Research Storage System (RSS)
- Maintains physical storage of relations and the access paths on these relations.
- Implements locking and logging.
- RSS represents a relation as:
  - A collection of tuples stored in 4KB pages.
  - Columns of a tuple are physically contiguous.
  - No tuple spans a page.
  - Pages are organized into logical units called segments.
  - Segments may contain one or more relations.
  - Each tuple is tagged with the identification of the relation to which it belongs.
  - No relation spans more than one segment.

RSS (Cont…)
- Access tuples using a scan: OPEN, NEXT, and CLOSE. A scan returns one tuple at a time.
- Supports two types of scans:
  1. Segment scan: Finds all tuples of a relation. Each non-empty page of the segment is referenced only once.
  2. Index scan: B+-trees.
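The OPEN / NEXT / CLOSE protocol can be pictured with a small sketch. Everything here is illustrative: the class name SegmentScan, the in-memory page layout, and the sample rows are assumptions, not the RSI's actual interface.

```python
from typing import Iterator, List, Optional, Tuple

# A stored row: (relation tag, column values). The layout is hypothetical.
Row = Tuple[str, tuple]


class SegmentScan:
    """Hypothetical tuple-at-a-time scan over the pages of one segment."""

    def __init__(self, pages: List[List[Row]], relation: str) -> None:
        self.pages = pages
        self.relation = relation
        self.pages_touched = 0
        self._cursor: Optional[Iterator[tuple]] = None

    def open(self) -> None:
        self.pages_touched = 0
        self._cursor = self._walk()

    def _walk(self) -> Iterator[tuple]:
        for page in self.pages:
            if not page:                  # empty pages are skipped
                continue
            self.pages_touched += 1       # each non-empty page is visited once
            for tag, values in page:
                if tag == self.relation:  # tuples carry their relation's tag
                    yield values

    def next(self) -> Optional[tuple]:
        if self._cursor is None:
            raise RuntimeError("scan is not open")
        return next(self._cursor, None)   # None signals end of scan

    def close(self) -> None:
        self._cursor = None


# A segment shared by two relations, with one empty page (made-up data).
segment = [
    [("Student", ("Bob", 3.7)), ("Professor", ("Ada",))],
    [],
    [("Student", ("Mary", 3.0))],
]
scan = SegmentScan(segment, "Student")
scan.open()
rows = []
while (row := scan.next()) is not None:
    rows.append(row)
touched = scan.pages_touched
scan.close()
```

The scan touches both non-empty pages even though one of them also holds a Professor tuple, which is why a shared segment makes a segment scan more expensive.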

Optimizer
- Formulates a cost prediction for each access plan, using the following cost formula:
  COST = Page fetches + W * (RSI Calls)
- W is an adjustable weighting factor between I/O and CPU.
- RSI calls is an approximation for CPU utilization.
- Assumptions:
  - The WHERE tree is considered to be in conjunctive normal form.
  - Every conjunct is called a boolean factor.

Optimizer (Motivation)
- Given a query, there are many ways to execute it. The optimizer must identify the best execution plan.
- Example:
  SELECT name, title, sal
  FROM Emp, Job
  WHERE Emp.Job = Job.Job AND Title = 'CLERK'
- Decide the order in which to perform the operators:
  - Process "Title = 'CLERK'" followed by the join, or
  - Process the join "Emp.Job = Job.Job" followed by "Title = 'CLERK'".
- Decide which access path to use: segment scan, clustered index, or non-clustered index.
- Decide the join algorithm: nested loops versus merge-scan.
- This paper tries to answer all the above questions!

How?
- Enumerate the different execution plans,
- Estimate the cost of performing each plan,
- Pick the cheapest plan.
- What is the definition of cost?
  COST = Page fetches + W * (RSI Calls)
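A minimal sketch of "estimate each plan, pick the cheapest" using the cost formula. The plan names, the page-fetch and RSI-call counts, and the value of W are made-up numbers for illustration only.

```python
def plan_cost(page_fetches: float, rsi_calls: float, w: float) -> float:
    """COST = page fetches + W * (RSI calls)."""
    return page_fetches + w * rsi_calls


# Hypothetical candidates: (estimated page fetches, estimated RSI calls).
candidate_plans = {
    "segment_scan": (200, 1600),
    "index_scan": (30, 160),
}
W = 0.5  # assumed weighting between I/O and CPU

costs = {name: plan_cost(pages, calls, W)
         for name, (pages, calls) in candidate_plans.items()}
cheapest = min(costs, key=costs.get)
```

Raising W makes plans that return many tuples through the RSI look worse relative to plans that mostly pay for I/O.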

Conjunctive Normal Form
- A formula is in conjunctive normal form (CNF) if it is a conjunction of clauses, where each clause is a disjunction of literals:
  - A AND B
  - ~A AND (B OR C)
  - (A OR B) AND (D OR ~E)
- Is ~(B OR C) in CNF? No. Fix it by carrying the negation inside: ~B AND ~C
- How about (A AND B) OR C? Not in CNF either. Transform it to (A OR C) AND (B OR C)
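Both rewrites can be checked by brute force over every truth assignment. This is only a verification sketch; the helper name equivalent is an assumption, and it is not how an optimizer normalizes predicates.

```python
from itertools import product


def equivalent(f, g, n_vars: int) -> bool:
    """True if f and g agree on every assignment of n_vars booleans."""
    return all(f(*values) == g(*values)
               for values in product([False, True], repeat=n_vars))


# ~(B OR C)  versus  ~B AND ~C
de_morgan_ok = equivalent(
    lambda b, c: not (b or c),
    lambda b, c: (not b) and (not c),
    2,
)

# (A AND B) OR C  versus  (A OR C) AND (B OR C)
distribute_ok = equivalent(
    lambda a, b, c: (a and b) or c,
    lambda a, b, c: (a or c) and (b or c),
    3,
)
```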

CNF
- Why?
  - Every tuple returned to the user must satisfy every boolean factor.
  - If a tuple fails a boolean factor, discard it from further consideration.

Database Catalog
- System R maintains statistics for each relation T:
  - NCARD(T), number of records in T.
  - TCARD(T), number of pages in the segment that hold tuples of T.
  - P(T), fraction of data pages in the segment that hold tuples of relation T:
    P(T) = TCARD(T) / (# of non-empty pages in the segment)
- For each index I on relation T:
  - ICARD(I), number of distinct keys in index I.
  - NINDX(I), number of pages in index I.

Maintenance of Statistics

Selectivity Factor (F)
- Corresponds to the expected fraction of tuples that will satisfy the predicate.
- Column = value
  - F = 1 / ICARD(column index) if there is an index on the column, assuming an even distribution of tuples among the index key values.
  - F = 1/10 otherwise.

Clustered Index
- Assume a student table: Student(name, age, gpa, major)
- t(Student) = 16, P(Student) = 4

  name    age  gpa  major
  Bob     21   3.7  CS
  Mary    24   3    ECE
  Tom     20   3.2  EE
  Kathy   18   3.8  LS
  Kane    19   3.8  ME
  Lam     22   2.8  ME
  Chang   18   2.5  CS
  Vera    17   3.9  EE
  Louis   32   4    LS
  Martha  29   3.8  CS
  James   24   3.1  ME
  Pat     19   2.8  EE
  Chris   22   3.9  CS
  Chad    28   2.3  LS
  Leila   20   3.5  LS
  Shideh  16   4    CS

[Figure: Number of Records per GPA, plotted against the actual GPA values]

ESTIMATING NUMBER OF RESULTING RECORDS
- For exact-match selection predicates, assume a uniform distribution of records across the number of unique values. E.g., the selection predicate is gpa = 3.3
- For range selection predicates, assume a uniform distribution of records across the range of available values defined by min and max. In this case, one must think about the interval. E.g., gpa > 3.5
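As a worked example, the two uniformity assumptions can be applied to the sixteen gpa values of the Student table and compared with the true counts. This is a sketch; the variable names are illustrative.

```python
# gpa column of the 16 Student records, in the order listed on the slide.
gpas = [3.7, 3.0, 3.2, 3.8, 3.8, 2.8, 2.5, 3.9,
        4.0, 3.8, 3.1, 2.8, 3.9, 2.3, 3.5, 4.0]

n = len(gpas)
distinct = len(set(gpas))

# Exact match (gpa = 3.3): records spread evenly over the distinct values.
est_equal = n / distinct
actual_equal = sum(1 for g in gpas if g == 3.3)

# Range (gpa > 3.5): records spread evenly over the interval [min, max].
low, high = min(gpas), max(gpas)
value = 3.5
est_range = n * (high - value) / (high - low)
actual_range = sum(1 for g in gpas if g > value)
```

On this data the range estimate is roughly 4.7 records while 8 records actually qualify, because the gpa values are skewed toward the high end rather than uniform.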

Selectivity Factor (F)
- Column > value
  - F = (high key value - value) / (high key value - low key value), as long as the column is an arithmetic type and the value is known at access path selection time.
  - F = 1/3 otherwise (column is not arithmetic).

Selectivity Factor (F)
- Column < value
  - F = (value - low key value) / (high key value - low key value), as long as the column is an arithmetic type and the value is known at access path selection time.
  - F = 1/3 otherwise (column is not arithmetic).

Selectivity Factor (F)
- Value1 < Column < Value2
  - F = (Value2 - Value1) / (high key value - low key value), as long as the column is arithmetic.
  - F = 1/4 otherwise.
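The equality and range rules can be collected into a few small functions. This is a sketch under assumptions: the function names are invented, and clamping the result to [0, 1] is an added safeguard that the slides do not mention.

```python
from typing import Optional


def _clamp(x: float) -> float:
    # Added safeguard (an assumption): keep F inside [0, 1].
    return min(1.0, max(0.0, x))


def f_equal(icard: Optional[int]) -> float:
    """Column = value: 1/ICARD with an index, 1/10 otherwise."""
    return 1 / icard if icard else 1 / 10


def f_greater(value, low, high, arithmetic: bool = True) -> float:
    """Column > value: linear interpolation, or 1/3 as the default."""
    if not arithmetic or value is None or high == low:
        return 1 / 3
    return _clamp((high - value) / (high - low))


def f_less(value, low, high, arithmetic: bool = True) -> float:
    """Column < value: linear interpolation, or 1/3 as the default."""
    if not arithmetic or value is None or high == low:
        return 1 / 3
    return _clamp((value - low) / (high - low))


def f_between(v1, v2, low, high, arithmetic: bool = True) -> float:
    """Value1 < Column < Value2: interval ratio, or 1/4 as the default."""
    if not arithmetic or high == low:
        return 1 / 4
    return _clamp((v2 - v1) / (high - low))
```

For a column whose keys run from 0 to 100, f_greater(75, 0, 100) gives 0.25 and f_between(25, 75, 0, 100) gives 0.5.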

Selectivity Factor (F)
- Column in (list of values)
- Join predicate, Column 1 = Column 2
- Disjunctive predicate

Selectivity Factor (F)
- Conjunctive predicate
- Negation
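The transcript names these cases without giving formulas. The sketch below uses the rules as they are usually quoted from the Selinger et al. paper; treat them as an assumption to check against the paper itself. The function names are illustrative.

```python
from typing import Optional


def f_or(f1: float, f2: float) -> float:
    """Disjunctive predicate: F1 + F2 - F1 * F2."""
    return f1 + f2 - f1 * f2


def f_and(f1: float, f2: float) -> float:
    """Conjunctive predicate: F1 * F2 (assumes independent columns)."""
    return f1 * f2


def f_not(f: float) -> float:
    """Negation: 1 - F."""
    return 1 - f


def f_in_list(n_items: int, f_equal: float) -> float:
    """Column IN (list): n * F(column = value), capped at 1/2."""
    return min(0.5, n_items * f_equal)


def f_join(icard1: Optional[int] = None, icard2: Optional[int] = None) -> float:
    """Column1 = Column2: 1 / max(ICARD) over available indexes, else 1/10."""
    known = [c for c in (icard1, icard2) if c]
    return 1 / max(known) if known else 1 / 10
```

Note that the conjunctive rule multiplies the factors, which is only accurate when the predicates are statistically independent.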

Interesting order
- A query block’s GROUP BY or ORDER BY clauses may correspond to the order of records in an access path. This tuple order is an interesting order.
- Example queries: Student(name, age, gpa, major) with a B+-tree on the gpa attribute
  SELECT name FROM Student WHERE gpa < 3.0 ORDER BY gpa
  SELECT gpa, count(*) FROM Student WHERE gpa < 3.0 GROUP BY gpa

B+-Tree
- A B+-tree on the gpa attribute
- Root key: 3.6
- Leaf entries, each pairing a gpa key with a record identifier:
  Keys below 3.6: (2.3, (1,1)) (2.5, (1,2)) (2.8, (1,3)) (2.8, (1,4)) (3, (2,1)) (3.1, (2,2)) (3.2, (2,3)) (3.5, (2,4))
  Keys above 3.6: (3.7, (3,1)) (3.8, (3,2)) (3.8, (3,3)) (3.8, (3,4)) (3.9, (4,1)) (3.9, (4,2)) (4, (4,3)) (4, (4,4))
- Data pages: the 16 Student records.

Single Relation Access Paths
- Single relation access paths are simple selects with ORDER BY and GROUP BY clauses
  SELECT name FROM Student WHERE age < 20
- Without an index, must perform a segment scan. What is the cost?
  - TCARD / P + W * RSISCAN
  - TCARD(T), number of pages in the segment that hold tuples of T
  - P(T), fraction of data pages in the segment that hold tuples of relation T:
    P(T) = TCARD(T) / (# of non-empty pages in the segment)
  - Why? Tuples of Student might be intermixed with tuples of another relation, such as professors, so the scan must read every non-empty page of the segment, which is TCARD / P pages.
  - Example: the student table with TCARD = 100 pages and P(T) =
  - Note that P(T) = 1 when the student table is not intermixed with another table.
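A small sketch of the segment scan cost. The parameter rsi_calls stands in for the slide's RSISCAN term, and the values P = 0.5, 1000 RSI calls, and W = 0.5 are assumptions chosen only to make the arithmetic visible.

```python
def segment_scan_cost(tcard: float, p: float, rsi_calls: float, w: float) -> float:
    """TCARD / P + W * (RSI calls); TCARD / P is the number of
    non-empty pages in the segment, all of which the scan must read."""
    if not 0 < p <= 1:
        raise ValueError("P(T) must be in (0, 1]")
    return tcard / p + w * rsi_calls


# Student shares its segment with another relation (hypothetical P = 0.5).
shared = segment_scan_cost(tcard=100, p=0.5, rsi_calls=1000, w=0.5)

# Student is alone in its segment, so P = 1.
alone = segment_scan_cost(tcard=100, p=1.0, rsi_calls=1000, w=0.5)
```

With P = 0.5 the scan reads 200 pages to find the 100 pages that hold Student tuples; with P = 1 it reads exactly 100.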

Single Relation Access Paths
- Cost of scanning leaf pages and data pages containing the qualifying records
- [Figure: the clustered B+-tree on gpa (root key 3.6) over the Student data pages]

Non-Clustered B+-Tree
- A random I/O for every qualifying record
- Root key: 3.6
- Leaf entries, each pairing a gpa key with a record identifier:
  Keys below 3.6: (2.3, (4,2)) (2.5, (2,3)) (2.8, (2,2)) (2.8, (3,4)) (3, (1,2)) (3.1, (3,3)) (3.2, (1,3)) (3.5, (4,3))
  Keys above 3.6: (3.7, (1,1)) (3.8, (1,4)) (3.8, (2,1)) (3.8, (3,2)) (3.9, (2,4)) (3.9, (4,1)) (4, (3,1)) (4, (4,4))
- Data pages: the 16 Student records, not stored in gpa order.
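The difference between the two indexes can be counted directly from the leaf entries on the slides. The sketch assumes each record identifier is (page, slot) and that only the most recently fetched data page stays in the buffer; under that assumption a non-clustered scan can save an I/O when two consecutive entries happen to share a page, so the count is an approximation of "one random I/O per record".

```python
# (gpa key, (page, slot)) leaf entries copied from the two B+-tree slides.
clustered = [
    (2.3, (1, 1)), (2.5, (1, 2)), (2.8, (1, 3)), (2.8, (1, 4)),
    (3.0, (2, 1)), (3.1, (2, 2)), (3.2, (2, 3)), (3.5, (2, 4)),
    (3.7, (3, 1)), (3.8, (3, 2)), (3.8, (3, 3)), (3.8, (3, 4)),
    (3.9, (4, 1)), (3.9, (4, 2)), (4.0, (4, 3)), (4.0, (4, 4)),
]
non_clustered = [
    (2.3, (4, 2)), (2.5, (2, 3)), (2.8, (2, 2)), (2.8, (3, 4)),
    (3.0, (1, 2)), (3.1, (3, 3)), (3.2, (1, 3)), (3.5, (4, 3)),
    (3.7, (1, 1)), (3.8, (1, 4)), (3.8, (2, 1)), (3.8, (3, 2)),
    (3.9, (2, 4)), (3.9, (4, 1)), (4.0, (3, 1)), (4.0, (4, 4)),
]


def data_page_fetches(leaf_entries, predicate) -> int:
    """Count data-page fetches for an index scan in key order,
    assuming a one-page buffer (an illustrative simplification)."""
    fetches = 0
    previous_page = None
    for key, (page, _slot) in sorted(leaf_entries):
        if not predicate(key):
            continue
        if page != previous_page:
            fetches += 1
            previous_page = page
    return fetches


low_clustered = data_page_fetches(clustered, lambda g: g < 3.0)
low_non_clustered = data_page_fetches(non_clustered, lambda g: g < 3.0)
high_clustered = data_page_fetches(clustered, lambda g: g > 3.6)
high_non_clustered = data_page_fetches(non_clustered, lambda g: g > 3.6)
```

For gpa > 3.6, eight records qualify: the clustered index reads 2 data pages, while the non-clustered index performs 7 fetches across the same four pages.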

R EQUALITY JOIN S: R.A = S.A
- Two algorithms for performing the join operator: nested loops and merge-scan.
- Tuple nested loops:
    for each tuple r in R do
      for each tuple s in S do
        if r.A = s.A then output r,s in the result relation
      end-for
    end-for
- Estimated cost of tuple nested loops:
  TCARD(R)/P(R) + [NCARD(R) × TCARD(S)/P(S)]
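The nested-loops pseudocode and its page-fetch estimate translate almost line for line. This is a sketch: relations are modeled as lists of dicts, and the Emp and Job rows are made-up sample data.

```python
def nested_loops_join(R, S, key):
    """Tuple nested loops: scan all of S once for every tuple of R."""
    result = []
    for r in R:
        for s in S:
            if r[key] == s[key]:
                result.append((r, s))
    return result


def nested_loops_cost(tcard_r, p_r, ncard_r, tcard_s, p_s):
    """TCARD(R)/P(R) + NCARD(R) * TCARD(S)/P(S), page fetches only."""
    return tcard_r / p_r + ncard_r * (tcard_s / p_s)


# Hypothetical sample rows.
emp = [
    {"name": "Smith", "Job": 5},
    {"name": "Jones", "Job": 9},
    {"name": "Doe", "Job": 5},
]
job = [
    {"Job": 5, "title": "CLERK"},
    {"Job": 9, "title": "TYPIST"},
]

pairs = nested_loops_join(emp, job, "Job")
cost = nested_loops_cost(tcard_r=10, p_r=1.0, ncard_r=100, tcard_s=4, p_s=1.0)
```

The second term dominates: the inner relation is rescanned once per outer tuple, which is why the choice of outer relation matters.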

EQUALITY JOIN (Cont…)
- Merge-scan:
  1. Interesting order on R.A (sorted)
  2. Interesting order on S.A (sorted)
  3. Scan R and S in parallel, merging tuples with matching A values
- Estimated cost of merge scan: NINDX(I_R) + NINDX(I_S)
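A minimal merge-scan sketch. The calls to sorted() stand in for an interesting order that an index or an earlier sort would deliver; relations are modeled as lists of dicts, and the sample rows are made up.

```python
def merge_scan_join(R, S, key):
    """Merge two inputs ordered on `key`, pairing tuples with equal keys."""
    R = sorted(R, key=lambda t: t[key])  # stand-in for an ordered access path
    S = sorted(S, key=lambda t: t[key])
    result = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][key] < S[j][key]:
            i += 1
        elif R[i][key] > S[j][key]:
            j += 1
        else:
            value = R[i][key]
            # Find the run of S tuples sharing this key value.
            j_end = j
            while j_end < len(S) and S[j_end][key] == value:
                j_end += 1
            # Pair every matching R tuple with the whole S run.
            while i < len(R) and R[i][key] == value:
                for s in S[j:j_end]:
                    result.append((R[i], s))
                i += 1
            j = j_end
    return result


R = [{"A": 1, "r": "r1"}, {"A": 2, "r": "r2"}, {"A": 2, "r": "r3"}]
S = [{"A": 2, "s": "s1"}, {"A": 2, "s": "s2"}, {"A": 3, "s": "s3"}]
matches = merge_scan_join(R, S, "A")
```

Each input is read once in order; only runs of duplicate keys are revisited, which is what makes an existing interesting order valuable.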

N-Way Join
- N-way joins are performed as a sequence of 2-way joins.
- Utilize pipelining whenever appropriate.
- The ordering of the joins is important. Consider all orderings such that:
  - Join predicates relate the two participating tables together; do not consider cartesian products. For example, if the join clause is (R.A = S.A and R.B = T.B), then it would be a mistake to first compute (S cartesian product T) and then apply R.A = ST.A and R.B = ST.B.
  - Delay computation of cartesian products as much as possible.
  - Consider interesting orders in order to use merge-scan whenever possible.
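The "no cartesian products" rule can be illustrated by filtering join orders for the example clause (R.A = S.A and R.B = T.B). This brute-force enumeration over permutations is only a sketch of the pruning rule, with an invented function name; it is not the search procedure the paper uses.

```python
from itertools import permutations


def connected_orders(tables, join_predicates):
    """Keep only join orders in which every newly added table is linked
    by a join predicate to some table already joined."""
    edges = {frozenset(pair) for pair in join_predicates}
    orders = []
    for order in permutations(tables):
        joined = {order[0]}
        valid = True
        for table in order[1:]:
            if not any(frozenset((table, other)) in edges for other in joined):
                valid = False  # this step would be a cartesian product
                break
            joined.add(table)
        if valid:
            orders.append(order)
    return orders


# R.A = S.A links R and S; R.B = T.B links R and T.
orders = connected_orders(["R", "S", "T"], [("R", "S"), ("R", "T")])
```

Of the six possible orders, the two that start by combining S and T are rejected, leaving four candidates to cost.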

Search Space
- Rather large search space for expressions joining several tables.
- Heuristics prune the search space.

Nested Queries
- Correlation subquery: A subquery with a reference to a value obtained from a candidate tuple of a higher-level query block.

Non-Correlation Sub-queries
- Evaluate the inner query once and use its results to process the outer query.