Highlights of Query Processing And Optimization


Highlights of Query Processing And Optimization Chapters 12 and 13 (pretty much finalized) 12/25/2018

Textbook Reading for Ch. 12-13. In general, read lightly; concentrate on topics that you recognize from lecture. Chapter 12, “Evaluation of Relational Operators”: skim all cost calculation discussions; skip 12.2.2, 12.3 (all), pp. 297-300, and 12.7. Chapter 13, “Relational Query Optimization”: read pp. 310-314; skim from p. 316 (starting at 13.2) to p. 325; skip the rest of the chapter.

Big Picture. SQL query → parse query (using the DB schema) → query optimizer: generate query plans, select a plan (using the DB catalog) → execute plan (against the DB) → output results.

Parsing. Produce a version of the SQL query in Extended Relational Algebra form. Query execution can then be viewed as a matter of executing the relational operators. “Extended RA” refers to RA plus aggregate operators, HAVING, and GROUP BY.

Plan of Study. Investigate how each of the basic RA operators could be executed: selection, projection, and join are most fundamental and will illustrate the principles; then set operations; then extended operations. Later, discuss plan generation and evaluation (very lightly).

Contrast. Compiling a program: there is a sharp distinction between “compile time” and “run time”; translation of the parsed program is fairly mechanical; optimization is just icing on the cake. Processing a query: it’s all done at run time; translating the query involves many non-trivial, dynamic decisions; optimization is hugely important.

Preview: Some Dynamic Factors. SELECT X,Y FROM A,B WHERE A.Z = B.Z. How this is best processed may depend on: the relative and absolute sizes of A and B; the sizes of X, Y, and Z; the logical and physical organization of A and B; what kinds of indices exist for A and B; how much memory is available; and so on.

Access Path. “Access path”: a way of retrieving tuples from a relation, assuming some selection criterion such as WHERE S.id <= 200. Either a scan of the whole file, or an index which has an appropriate match. There might be indexes that don’t provide an access path because they don’t relate to the selection criterion.

Selection Implementation. No relevant index, unsorted data: retrieve all pages. No index, sorted data (on the relevant attribute): binary search to find the first occurrence, then read forward from there. B+ tree index (on the relevant attribute): search the tree to find occurrences; cost depends on whether the index is clustered or not; works for range as well as = selections. Hash index: works only for = selections which match the hash field.
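The "sorted data, no index" case can be sketched in a few lines. This is a minimal in-memory illustration, not how a DBMS reads pages: the function names `select_eq_sorted` and `select_scan` and the dictionary row layout are assumptions made for the example.

```python
def select_eq_sorted(rows, attr, value):
    """Equality selection on rows already sorted by attr.

    Binary search finds the first matching row, then we read forward
    until the attribute value changes.
    """
    lo, hi = 0, len(rows)
    while lo < hi:                      # find first row with attr >= value
        mid = (lo + hi) // 2
        if rows[mid][attr] < value:
            lo = mid + 1
        else:
            hi = mid
    out = []
    while lo < len(rows) and rows[lo][attr] == value:
        out.append(rows[lo])            # sequential read of the matches
        lo += 1
    return out


def select_scan(rows, attr, value):
    """Fallback when the data is unsorted: look at every row."""
    return [r for r in rows if r[attr] == value]


# Hypothetical sample data, sorted on "rating".
sailors = [
    {"sid": 1, "rating": 3},
    {"sid": 2, "rating": 5},
    {"sid": 3, "rating": 5},
    {"sid": 4, "rating": 8},
]
```

Both functions return the same rows; the sorted version simply inspects far fewer of them when the file is large.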

Selection Headaches. Range queries: S.rating > 5. Compound conjunctive conditions: (S.rating = 5) ∧ (S.age > 2). Compound disjunctive conditions: (S.rating = 5) ∨ (S.age > 2). All of these occur frequently in practice. Disjunctive conditions are probably the hardest to optimize in the general case.

Projection Implementation. General strategy: remove unwanted attributes, then remove duplicate tuples. Removing duplicate tuples is the hard part.

Removing Duplicate Tuples. By sorting: use all (projected) attributes as the sort key; duplicates can then be spotted in a final scan over the data. By hashing: hash on all (projected) attributes; the file can be thought of as being “partitioned”, and duplicates will always collide into the same partition (bucket). Next, rehash each bucket with a different hash function and output the unique tuples from the second-level buckets.
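Both duplicate-removal strategies can be sketched on in-memory tuples. The names `distinct_by_sorting` and `distinct_by_hashing` and the `n_partitions` parameter are assumptions for illustration; Python's built-in `set` stands in for the second-level hash table.

```python
def distinct_by_sorting(tuples):
    """Sort on all attributes; duplicates become adjacent and are
    dropped in one final scan."""
    out = []
    for t in sorted(tuples):
        if not out or out[-1] != t:
            out.append(t)
    return out


def distinct_by_hashing(tuples, n_partitions=4):
    """Partition on a hash of all attributes, then de-duplicate each
    partition independently with a second (in-memory) hash table."""
    partitions = [[] for _ in range(n_partitions)]
    for t in tuples:
        # Equal tuples always land in the same partition.
        partitions[hash(t) % n_partitions].append(t)
    out = []
    for part in partitions:
        seen = set()                    # second-level hashing
        for t in part:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out


# Hypothetical projected tuples (rating, age).
projected = [(5, 30), (7, 22), (5, 30), (3, 41), (7, 22)]
```

The sorting version returns its result in sorted order; the hashing version returns the same set of tuples in no particular order.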

Join Implementation. Algebraically, join is × followed by σ. Not implemented that way! Joins occur often enough to deserve special consideration. × can then be implemented as a join! Many algorithms have been tried.

General Join Strategies (R x S). Nested-loop join; sort-merge join; hashed join.

Nested-Loop Join. Algorithm: for each r in R, examine each s in S; output tuples which match the join condition. R is the “outer” relation, S is the “inner”. Cost is proportional to size of R * size of S, i.e., potentially huge. If either R or S fits entirely in memory, use the smaller as the inner relation; this is simple and not too bad: cost proportional to R + S.
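A minimal sketch of the nested-loop algorithm, with a comparison counter added to make the size of R * size of S cost visible. The function name, the `match` predicate, and the counter are assumptions for the example.

```python
def nested_loop_join(R, S, match):
    """Tuple-at-a-time nested-loop join.

    R is the outer relation, S the inner. Returns the joined pairs and
    the number of (r, s) comparisons performed.
    """
    out = []
    comparisons = 0
    for r in R:                 # outer loop
        for s in S:             # inner loop: S is rescanned for every r
            comparisons += 1
            if match(r, s):
                out.append((r, s))
    return out, comparisons


# Hypothetical relations; the join condition compares the first field.
R = [(1, "a"), (2, "b")]
S = [(1, "x"), (1, "y"), (3, "z")]
pairs, n = nested_loop_join(R, S, lambda r, s: r[0] == s[0])
```

Because `match` is an arbitrary predicate, this approach works for any join condition, not only equality.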

Sort-Merge Join. Algorithm (for equi-join): sort both R and S on the join attributes (this effectively “partitions” R and S). In a merge-like phase, scan R and S: for each partition of R, locate the corresponding partition of S and output tuples. Sorts are somewhat costly, but may not be needed if the file is already sorted, has a B+ tree index, etc.
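The sort and merge phases can be sketched as follows, assuming both relations fit in memory. The function name and the `key_r` / `key_s` accessor parameters are hypothetical.

```python
def sort_merge_join(R, S, key_r, key_s):
    """Equi-join by sorting both inputs on the join key, then merging."""
    R = sorted(R, key=key_r)
    S = sorted(S, key=key_s)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        kr, ks = key_r(R[i]), key_s(S[j])
        if kr < ks:
            i += 1                      # no partner in S for this key
        elif kr > ks:
            j += 1                      # no partner in R for this key
        else:
            # Find the end of the matching partition on each side.
            i_end = i
            while i_end < len(R) and key_r(R[i_end]) == kr:
                i_end += 1
            j_end = j
            while j_end < len(S) and key_s(S[j_end]) == kr:
                j_end += 1
            # Every tuple in R's partition joins with every tuple in S's.
            for r in R[i:i_end]:
                for s in S[j:j_end]:
                    out.append((r, s))
            i, j = i_end, j_end
    return out


# Hypothetical relations joined on their first field.
R = [(2, "b"), (1, "a"), (1, "c")]
S = [(3, "z"), (1, "x"), (2, "y")]
joined = sort_merge_join(R, S, lambda r: r[0], lambda s: s[0])
```

If an input is already sorted on the join key, the corresponding `sorted` call is the step that could be skipped.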

Footnote on Disk Sorting. Ch. 7 in textbook (skipped). General plan: read the file (once), write out sorted "runs", merge the runs. In practice, only a few passes over the file are needed. Size of runs can be increased by using more memory buffers, overlapping in-memory sorting with I/O.
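The run-then-merge plan can be sketched with in-memory lists standing in for disk files. The function name `external_sort` and the `run_size` parameter (playing the role of available memory) are assumptions; `heapq.merge` from the standard library performs the multi-way merge.

```python
import heapq


def external_sort(records, run_size):
    """Sort in two phases: build sorted runs of at most run_size
    records, then merge all runs in a single pass."""
    runs = [
        sorted(records[i:i + run_size])          # one in-memory sort per run
        for i in range(0, len(records), run_size)
    ]
    return list(heapq.merge(*runs))              # k-way merge of the runs


data = [5, 3, 8, 1, 9, 2, 7]
result = external_sort(data, run_size=3)
```

A larger `run_size` produces fewer, longer runs, which is the in-memory analogue of using more buffers to reduce the number of merge passes.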

Hashed Join Algorithm. Algorithm (for equi-join): hash each of R and S on the join attributes, using the same hash function; each bucket of collisions is a partition. For each pair of corresponding partitions of R and S: hash again using a different hash function; r and s which hash the second time to the same bucket can be checked; output if they match.
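A minimal sketch of the two hashing phases, assuming everything fits in memory. The function name, the key accessors, and `n_partitions` are hypothetical; a Python dictionary plays the role of the second hash function within each pair of partitions.

```python
from collections import defaultdict


def hash_join(R, S, key_r, key_s, n_partitions=4):
    """Equi-join by partitioning both inputs with the same hash
    function, then joining corresponding partitions."""
    parts_r = [[] for _ in range(n_partitions)]
    parts_s = [[] for _ in range(n_partitions)]
    for r in R:
        parts_r[hash(key_r(r)) % n_partitions].append(r)
    for s in S:
        parts_s[hash(key_s(s)) % n_partitions].append(s)

    out = []
    for pr, ps in zip(parts_r, parts_s):   # only matching partitions meet
        table = defaultdict(list)          # second-level hash table on R's partition
        for r in pr:
            table[key_r(r)].append(r)
        for s in ps:
            for r in table.get(key_s(s), []):
                out.append((r, s))         # keys are equal, so the tuples join
    return out


# Hypothetical relations joined on their first field.
R = [(1, "a"), (2, "b"), (1, "c")]
S = [(1, "x"), (3, "z"), (2, "y")]
joined = hash_join(R, S, lambda r: r[0], lambda s: s[0])
```

The output order depends on how keys fall into partitions, so results are best compared as sets or after sorting.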

Hashed Join Discussion. Can be quite efficient, especially if the partition of R or S fits completely in memory. Doesn’t extend to non-equality joins. Doesn’t take advantage of existing indices, except a hash index exactly on the desired attributes. Doesn’t work well if the hash function doesn’t distribute tuples well.

Set Operations: Union. The trick is to eliminate duplicates. Via sorting: sort R and S separately, using all attributes as the sort key; duplicates can be eliminated during the final merge phase. Via hashing: partition R and S using the same hash function, using all attributes; rehash corresponding partitions using a different function to detect duplicates.

Other Set Operations. Difference: variations of the Union algorithms will work. Intersection: can be treated as a huge equi-join (all attributes joined for equality). Cartesian product: can be treated as a join with no selection condition; rarely needed in practice.

Extended Relational Operators. Techniques similar to those already discussed can be devised. Canonically, the EROs are applied after the basic operators. GROUP BY: form partitions by sorting or hashing. HAVING: similar to σ, operating on partitions. Aggregate functions: apply to the partitions.
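Grouping by hashing, followed by aggregation and a HAVING-style filter, can be sketched as below. The function name `group_by_hash` and its parameters are assumptions for the example, and only a single grouping attribute and a single aggregate are handled.

```python
from collections import defaultdict


def group_by_hash(rows, group_attr, agg_attr, agg, having=lambda v: True):
    """GROUP BY via hashing, then one aggregate per group, then HAVING."""
    groups = defaultdict(list)
    for r in rows:
        # Rows with the same grouping value land in the same partition.
        groups[r[group_attr]].append(r[agg_attr])
    aggregated = {g: agg(vals) for g, vals in groups.items()}
    # HAVING acts like a selection applied to whole groups.
    return {g: v for g, v in aggregated.items() if having(v)}


# Hypothetical sample data.
sailors = [
    {"rating": 5, "age": 30.0},
    {"rating": 5, "age": 40.0},
    {"rating": 7, "age": 20.0},
]
oldest = group_by_hash(sailors, "rating", "age", max)
oldest_over_25 = group_by_hash(sailors, "rating", "age", max, having=lambda v: v > 25)
```

This roughly corresponds to `SELECT rating, MAX(age) ... GROUP BY rating HAVING MAX(age) > 25`.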

Sorting vs. Hashing. Many implementation problems have both a sorting and a hashing solution. Sorting: a DBMS needs sorting anyway, so a general-purpose sorting utility is available; applies in more situations (range selections, etc.); the result is sorted, which might be useful; kind of a blunt instrument sometimes. Hashing: more elegant; makes effective use of large memories.

Naïve Query Processing. SELECT LNAME FROM EMPLOYEE, DEPARTMENT WHERE DNO=DNUMBER AND SALARY>45000 AND DNAME="Software Support"; Naively, this is 1 Cartesian product, followed by 3 selects, followed by one project. The query tree is drawn from the bottom up.

Plans. The query processor might actually execute this as 2 selects, a join, a select and a project. A smarter optimizer might do additional projects. Seem intuitive? It's really intuitive only for the smallest queries. An execution plan (or strategy) is a plan for getting the result of a query. It includes not only the order of operations, but also which implementation (sorting, hashing, etc.) to use.
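The difference between the naive evaluation and a plan that applies the single-table selects first can be sketched on toy data. The table contents are invented for illustration, and the second function is only one possible plan (two selects, a join, a project), not necessarily the one a real query processor would choose.

```python
# Hypothetical sample tables for the EMPLOYEE / DEPARTMENT query.
EMPLOYEE = [
    {"LNAME": "Smith", "SALARY": 50000, "DNO": 1},
    {"LNAME": "Wong", "SALARY": 40000, "DNO": 1},
    {"LNAME": "Zelaya", "SALARY": 60000, "DNO": 2},
]
DEPARTMENT = [
    {"DNAME": "Software Support", "DNUMBER": 1},
    {"DNAME": "Research", "DNUMBER": 2},
]


def naive_plan():
    """Cartesian product, then three selects, then a project."""
    product = [{**e, **d} for e in EMPLOYEE for d in DEPARTMENT]
    rows = [t for t in product if t["DNO"] == t["DNUMBER"]]
    rows = [t for t in rows if t["SALARY"] > 45000]
    rows = [t for t in rows if t["DNAME"] == "Software Support"]
    return [t["LNAME"] for t in rows], len(product)


def selects_first_plan():
    """Single-table selects first, then a join, then a project."""
    emps = [e for e in EMPLOYEE if e["SALARY"] > 45000]
    depts = [d for d in DEPARTMENT if d["DNAME"] == "Software Support"]
    joined = [{**e, **d} for e in emps for d in depts
              if e["DNO"] == d["DNUMBER"]]
    return [t["LNAME"] for t in joined], len(emps) * len(depts)
```

Each function returns the answer together with the number of tuple pairs it had to consider, so the two plans can be compared on the same data.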

Optimization Strategy. Query optimization means finding the best execution strategy. This usually means "picking the best plan" rather than fiddling with some given plan. General strategy: generate a number of plans; evaluate each using statistics in the DB catalog; pick the best one (or at least a good one).
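The generate, evaluate, pick loop can be sketched for a single join. Everything here is an assumption made for illustration: the function names, the plan labels, and the rough page-I/O cost formulas (scan the outer once and the inner once per outer page, or read each relation once when the smaller one fits in the buffer). A real optimizer considers many more plans and uses more detailed cost models.

```python
def candidate_plans(pages_r, pages_s, buffer_pages):
    """Generate a few nested-loop join plans with rough page-I/O costs,
    using only catalog-style statistics (page counts)."""
    plans = {
        "nested-loop, R outer": pages_r + pages_r * pages_s,
        "nested-loop, S outer": pages_s + pages_s * pages_r,
    }
    if min(pages_r, pages_s) <= buffer_pages:
        # The smaller relation can be held in memory: read each side once.
        plans["nested-loop, smaller relation in memory"] = pages_r + pages_s
    return plans


def pick_plan(plans):
    """Return the (name, estimated cost) pair with the lowest cost."""
    return min(plans.items(), key=lambda kv: kv[1])


small_buffer = pick_plan(candidate_plans(1000, 500, buffer_pages=100))
large_buffer = pick_plan(candidate_plans(1000, 500, buffer_pages=600))
```

The same query gets a different plan depending on available memory, which is one of the dynamic factors that makes the choice a run-time decision.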

Pipelining vs. Materialization. How to make the step from one stage of the query processing to the next? Materialization: create a temporary relation as the output of a stage and pass it to the next stage as input. Pipelining: apply the next operator to the output of one stage as that output is generated. This suits unary operators (projection and selection) especially well.
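Python generators give a compact way to sketch the contrast. The operator names `select` and `project` and the two driver functions are assumptions for the example; a list plays the role of the temporary relation.

```python
def select(rows, pred):
    """Selection as a generator: yields each qualifying row as soon as
    it is seen, without storing the result."""
    for r in rows:
        if pred(r):
            yield r


def project(rows, attrs):
    """Projection as a generator (no duplicate elimination)."""
    for r in rows:
        yield tuple(r[a] for a in attrs)


def materialized(rows, pred, attrs):
    """Materialization: build a temporary relation between stages."""
    temp = [r for r in rows if pred(r)]          # whole intermediate result
    return [tuple(r[a] for a in attrs) for r in temp]


def pipelined(rows, pred, attrs):
    """Pipelining: each row flows through select then project in turn."""
    return project(select(rows, pred), attrs)


# Hypothetical sample data.
sailors = [
    {"sname": "Dustin", "rating": 7},
    {"sname": "Lubber", "rating": 3},
    {"sname": "Rusty", "rating": 10},
]
high = lambda r: r["rating"] > 5
```

Calling `list(pipelined(sailors, high, ["sname"]))` yields the same tuples as `materialized(sailors, high, ["sname"])`, but the pipelined version never holds the intermediate selection result in memory.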

Helpful Ideas and Heuristics. Reduce table size before joins: push (or copy) selects and projects as far down the tree as possible; do joins and Cartesian products as late as possible. Do operations in decreasing order of selectivity (if known); the DBMS might keep useful statistics in a catalog. Combine single-table operations when possible (work “on the fly”). Use indexes to advantage.

DB Catalog. At a minimum, the DBMS knows the schema: relation and attribute names, types, field sizes. It also knows what indices exist and the file organization (# tuples per page, sorted/unsorted, etc.). It could also keep statistics: # of tuples in each relation, # pages in the file, the range of keys currently in each relation, disk performance factors.

Speedbumps. Incomplete information: catalog stats, etc. Incomplete model of processing, i.e., calculations are simplified. The number of possible plans is exponential! Result: rather than a truly "optimal" plan, a plan which is "good enough" is chosen.

May be some more slides here...