Highlights of Query Processing and Optimization, Chapters 12 and 13 (pretty much finalized)
Textbook Reading for Ch. 12-13
- In general, read lightly; concentrate on topics that you recognize from lecture.
- Chapter 12, “Evaluation of Relational Operators”
  - Skim all cost calculation discussions
  - Skip 12.2.2; 12.3 (all); pp. 297-300; 12.7
- Chapter 13, “Relational Query Optimization”
  - Read pp. 310-314; skim pp. 316 (starting at 13.2) to 325; skip the rest of the chapter
Big Picture
- SQL query → parse query (uses the DB schema)
- Query optimizer: generate query plans, select a plan (uses the DB catalog)
- Execute plan against the DB, output results
Parsing
- Produce a version of the SQL query in Extended Relational Algebra form
- Query execution can then be viewed as a matter of executing the relational operators
- “Extended RA” refers to RA plus aggregate operators, HAVING, and GROUP BY
Plan of Study
- Investigate how each of the basic RA operators could be executed
  - Selection, projection, and join are the most fundamental and will illustrate the principles
  - Set operations
  - Extended operations
- Later, discuss plan generation and evaluation (very lightly)
Contrast
- Compiling a program:
  - Sharp distinction between “compile time” and “run time”
  - Translation of the parsed program is fairly mechanical
  - Optimization is just icing on the cake
- Processing a query:
  - It’s all done at run time
  - Translating the query involves many non-trivial, dynamic decisions
  - Optimization is hugely important
Preview: Some Dynamic Factors
- SELECT X, Y FROM A, B WHERE A.Z = B.Z
- How this is best processed may depend on: the relative and absolute sizes of A and B; the sizes of X, Y, and Z; the logical and physical organization of A and B; what kind of indices exist for A and B; how much memory is available; etc., etc.
Access Path
- “Access path”: a way of retrieving tuples from a relation, assuming some selection criterion such as WHERE S.id <= 200
- Either:
  - A scan of the whole file, or
  - An index which has an appropriate match
- There might be indexes that don’t provide an access path because they don’t relate to the selection criterion
Selection Implementation
- No relevant index, unsorted data: retrieve all pages
- No index, sorted data (on the relevant attribute): binary search to find the 1st occurrence, then read forward sequentially
- B+ tree index (on the relevant attribute): search the tree to find occurrences
  - Cost depends on whether the index is clustered or not
  - Works for range as well as = selections
- Hash index: works only for = selections which match the hash field
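As a rough illustration of the sorted-data case, here is a minimal Python sketch (not from the textbook): the list rows, the attribute names, and the helper select_eq_sorted are hypothetical stand-ins for a file sorted on the selection attribute.

```python
from bisect import bisect_left

# Sketch of the "no index, sorted data" case: binary search to the first
# qualifying tuple, then read forward until the condition stops holding.
def select_eq_sorted(rows, attr, value):
    keys = [r[attr] for r in rows]       # stand-in for page-level key access
    i = bisect_left(keys, value)         # binary search: 1st occurrence
    out = []
    while i < len(rows) and rows[i][attr] == value:
        out.append(rows[i])              # sequential read of the matches
        i += 1
    return out

rows = sorted(({"id": n, "rating": n % 6} for n in range(1, 11)),
              key=lambda r: r["id"])
print(select_eq_sorted(rows, "id", 7))   # [{'id': 7, 'rating': 1}]
```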
Selection Headaches
- Range queries: S.rating > 5
- Compound conjunctive conditions: (S.rating = 5) ∧ (S.age > 2)
- Compound disjunctive conditions: (S.rating = 5) ∨ (S.age > 2)
- All of these occur frequently in practice
- Disjunctive conditions are probably the hardest to optimize in the general case
Projection Implementation
- General strategy: remove unwanted attributes, then remove duplicate tuples
- Removing duplicate tuples is the hard part
Removing Duplicate Tuples
- By sorting: use all (projected) attributes as the sort key
  - Can spot duplicates in a final scan over the data
- By hashing: hash on all (projected) attributes
  - Can think of the file as being “partitioned”; duplicates will always collide into the same partition (bucket)
  - Next, rehash each bucket with a different hash function
  - Output unique tuples from the 2nd-level buckets
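A small sketch of the hashing approach, assuming tuples are Python dicts and each partition fits in memory; project_distinct and the partition count are invented for illustration, and a Python set stands in for the second-level hash function.

```python
# Hash-based projection with duplicate elimination.
def project_distinct(tuples, attrs, num_partitions=8):
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        projected = tuple(t[a] for a in attrs)    # drop unwanted attributes
        partitions[hash(projected) % num_partitions].append(projected)

    out = []
    for bucket in partitions:    # duplicates always land in the same partition
        seen = set()             # "rehash" within the partition
        for p in bucket:
            if p not in seen:
                seen.add(p)
                out.append(p)
    return out
```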
Join Implementation
- Algebraically, a join is a cross-product (×) followed by a selection (σ)
  - Not implemented that way!
- Joins occur often enough to deserve special consideration
  - A selection over a cross-product can then be implemented as a join!
- Many algorithms have been tried
General Join Strategies (R ⋈ S)
- Nested-loop join
- Sort-merge join
- Hashed join
Nested-Loop Join
- Algorithm: for each r in R, examine each s in S; output tuples which match the join condition
  - R is the “outer” relation, S is the “inner”
- Cost proportional to size of R * size of S, i.e., potentially huge
- If either R or S fits entirely in memory:
  - Use the smaller as the inner relation
  - This is simple and not too bad: cost proportional to R + S
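A minimal nested-loop equi-join sketch in Python, assuming both relations are in-memory lists of dicts; the function and attribute names are illustrative only.

```python
# Nested-loop equi-join: R is the outer relation, S the inner.
# Cost is proportional to |R| * |S| because S is rescanned for every r.
def nested_loop_join(R, S, r_attr, s_attr):
    out = []
    for r in R:                            # outer loop
        for s in S:                        # inner loop (full rescan of S)
            if r[r_attr] == s[s_attr]:     # join condition
                out.append({**r, **s})
    return out

# e.g. nested_loop_join(sailors, reserves, "sid", "sid")  (hypothetical relations)
```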
Sort-Merge Join
- Algorithm (for equi-join):
  - Sort both R and S on the join attributes (this effectively “partitions” R and S)
  - In a merge-like phase, scan R and S: for each partition of R, locate the corresponding partition of S and output matching tuples
- Sorts are somewhat costly
  - But may not be needed if the file is already sorted, has a B+ tree index, etc.
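The merge phase is sketched below, assuming in-memory lists of dicts and a single join attribute on each side; sort_merge_join is an illustrative name, not the textbook's code.

```python
# Sort-merge equi-join: sort both inputs on the join attribute, then scan
# them together, crossing matching "partitions" (runs of equal keys).
def sort_merge_join(R, S, r_attr, s_attr):
    R = sorted(R, key=lambda r: r[r_attr])
    S = sorted(S, key=lambda s: s[s_attr])
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][r_attr] < S[j][s_attr]:
            i += 1                          # advance the side with the smaller key
        elif R[i][r_attr] > S[j][s_attr]:
            j += 1
        else:
            key = R[i][r_attr]              # matching partitions found
            i_end, j_end = i, j
            while i_end < len(R) and R[i_end][r_attr] == key:
                i_end += 1
            while j_end < len(S) and S[j_end][s_attr] == key:
                j_end += 1
            for r in R[i:i_end]:            # output all pairs in the partition
                for s in S[j:j_end]:
                    out.append({**r, **s})
            i, j = i_end, j_end
    return out
```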
Footnote on Disk Sorting
- Ch. 7 in textbook (skipped)
- General plan:
  - Read the file (once)
  - Write out sorted “runs”
  - Merge the runs
- In practice, only a few passes over the file are needed
  - The size of the runs can be increased by using more memory buffers and by overlapping in-memory sorting with I/O
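A toy version of the idea, assuming the records are already in memory (a real DBMS would write the runs to disk); heapq.merge plays the role of the merge pass, and run_size stands in for what fits in the buffer pool.

```python
import heapq

# Toy external merge sort: pass 0 produces sorted runs, then one merge pass.
def external_sort(records, run_size):
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    return list(heapq.merge(*runs))

print(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3))  # [1, 2, 3, 5, 7, 8, 9]
```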
Hashed Join Algorithm
- Algorithm (for equi-join):
  - Hash each of R and S on the join attributes, using the same hash function
    - Each bucket of collisions is a partition
  - For each pair of corresponding partitions of R and S: hash again using a different hash function
    - r and s which hash the second time to the same bucket can be checked; output if they match
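A sketch of the two phases, assuming in-memory lists of dicts; the per-partition Python dict stands in for the second hash function, and hash_join and num_partitions are invented for illustration.

```python
# Hashed equi-join: phase 1 partitions R and S with the same hash function;
# phase 2 re-hashes each pair of corresponding partitions and outputs matches.
def hash_join(R, S, r_attr, s_attr, num_partitions=8):
    def partition(rel, attr):
        parts = [[] for _ in range(num_partitions)]
        for t in rel:
            parts[hash(t[attr]) % num_partitions].append(t)
        return parts

    R_parts, S_parts = partition(R, r_attr), partition(S, s_attr)
    out = []
    for Rp, Sp in zip(R_parts, S_parts):    # only corresponding partitions can join
        table = {}                          # second-level hash on the join key
        for r in Rp:
            table.setdefault(r[r_attr], []).append(r)
        for s in Sp:                        # probe with the other partition
            for r in table.get(s[s_attr], []):
                out.append({**r, **s})      # r and s match on the join key
    return out
```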
Hashed Join Discussion
- Can be quite efficient
  - Especially if the partition of R or S fits completely in memory
- Doesn’t extend to non-equality joins
- Doesn’t take advantage of existing indices, except for a hash index exactly on the desired attributes
- Doesn’t work well if the hash function doesn’t distribute tuples well
Set Operations: Union
- Union: the trick is to eliminate duplicates
- Via sorting: sort R and S separately, using all attributes as the sort key
  - Duplicates can be eliminated during the final merge phase
- Via hashing: partition R and S using the same hash function, on all attributes
  - Rehash corresponding partitions using a different function to detect duplicates
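The sorting route, sketched under the assumption that tuples are plain Python tuples so they sort and compare directly; union_sorted is an illustrative name.

```python
# Union via sorting: sort the combined input on all attributes, then drop
# adjacent duplicates during a single final scan.
def union_sorted(R, S):
    merged = sorted(R + S)             # all attributes act as the sort key
    out = []
    for t in merged:
        if not out or out[-1] != t:    # equal tuples are now adjacent
            out.append(t)
    return out

print(union_sorted([(1, "a"), (2, "b")], [(2, "b"), (3, "c")]))
# [(1, 'a'), (2, 'b'), (3, 'c')]
```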
Other Set Operations
- Difference: variations of the Union algorithms will work
- Intersection: can treat as a huge equi-join (all attributes joined for equality)
- Cartesian Product: can treat as a join with no selection condition
  - Rarely needed in practice
Extended Relational Operators
- Techniques similar to those already discussed can be devised
- Canonically, the EROs are applied after the basic operators
- GROUP BY: form partitions by sorting or hashing
- HAVING: similar to selection (σ), operating on partitions
- Aggregate functions: apply to the partitions
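A sketch of hash-based GROUP BY with a SUM aggregate and a HAVING filter, assuming in-memory dicts; group_by_sum and the example attribute names are hypothetical.

```python
# Hash-based GROUP BY: one partition per group value, aggregate per partition,
# then apply the HAVING predicate to each partition's aggregate.
def group_by_sum(tuples, group_attr, agg_attr, having=lambda total: True):
    groups = {}
    for t in tuples:
        groups.setdefault(t[group_attr], []).append(t[agg_attr])
    totals = {k: sum(vs) for k, vs in groups.items()}        # aggregate
    return {k: v for k, v in totals.items() if having(v)}    # HAVING

# e.g. SELECT dno, SUM(salary) FROM emp GROUP BY dno HAVING SUM(salary) > 100000:
# group_by_sum(emp, "dno", "salary", having=lambda s: s > 100000)
```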
Sorting vs. Hashing
- Many implementation problems have both a sorting and a hashing solution
- Sorting:
  - A DBMS needs sorting anyway, so a general-purpose sorting utility is available
  - Applies in more situations (range selections, etc.)
  - The result is sorted, which might be useful
  - Kind of a blunt instrument sometimes
- Hashing:
  - More elegant
  - Makes effective use of large memories
Naïve Query Processing
- SELECT LNAME FROM EMPLOYEE, DEPARTMENT WHERE DNO=DNUMBER AND SALARY>45000 AND DNAME="Software Support";
- Naively, this is 1 Cartesian product, followed by 3 selects, followed by 1 project
- The query tree is drawn from the bottom up
Plans
- The query processor might actually execute this as 2 selects, a join, a select, and a project
  - A smarter optimizer might do additional projects
- Seem intuitive? It's really intuitive only for the smallest queries
- Execution plan (or strategy): a plan for getting the result of a query
  - Includes not only the order, but what implementation (sorting, hashing, etc.) to use
Optimization Strategy
- Query optimization means finding the best execution strategy
  - Usually means "picking the best plan" rather than fiddling with some given plan
- General strategy:
  - Generate a number of plans
  - Evaluate each using statistics in the DB catalog
  - Pick the best one (or at least a good one)
Pipelining vs. Materialization
- How do we make the step from one stage of query processing to the next?
- Materialization: create a temporary relation as the output of a stage; pass it to the next stage as input
- Pipelining: apply the next operator to the output of one stage as that output is generated
  - Especially for unary ops (projection and selection)
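A sketch of pipelining using Python generators, where each operator pulls tuples from its child on demand instead of materializing a temporary relation; EMPLOYEE, the attribute names, and the predicate are hypothetical.

```python
# Pipelined operators: tuples flow one at a time, no temporary relation.
def scan(relation):                     # base access: read tuples from a relation
    for t in relation:
        yield t

def select(pred, child):                # selection: filter tuples as they arrive
    for t in child:
        if pred(t):
            yield t

def project(attrs, child):              # projection (no duplicate elimination here)
    for t in child:
        yield {a: t[a] for a in attrs}

# Usage (EMPLOYEE is a hypothetical list of dicts):
# plan = project(["lname"], select(lambda t: t["salary"] > 45000, scan(EMPLOYEE)))
# for row in plan:
#     print(row)
```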
Helpful Ideas and Heuristics
- Reduce table size before joins: push (or copy) selects and projects as far down the tree as possible
- Do joins and Cartesian products as late as possible
- Do operations in decreasing order of selectivity (if known)
  - The DBMS might keep useful statistics in a catalog
- Combine single-table operations when possible (work "on the fly")
- Use indexes to advantage
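As a small illustration of the first heuristic, applied to the example query from the earlier slide: the SALARY condition mentions only EMPLOYEE, so it can be pushed below the join (sketched in relational-algebra notation).

```latex
% Selection pushdown: SALARY > 45000 mentions only EMPLOYEE,
% so it can be applied before the join rather than after it.
\sigma_{\text{SALARY}>45000}\bigl(\text{EMPLOYEE} \bowtie_{\text{DNO}=\text{DNUMBER}} \text{DEPARTMENT}\bigr)
\;\equiv\;
\bigl(\sigma_{\text{SALARY}>45000}(\text{EMPLOYEE})\bigr) \bowtie_{\text{DNO}=\text{DNUMBER}} \text{DEPARTMENT}
```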
DB Catalog
- As a minimum, the DBMS knows the schema:
  - Relation and attribute names, types, field sizes
- Also knows:
  - What indices exist
  - File organization: # tuples per page, sorted/unsorted, etc.
- Could also keep statistics:
  - # of tuples in each relation, # of pages in the file
  - Range of keys currently in each relation
  - Disk performance factors
Speedbumps
- Incomplete information
  - Catalog stats, etc.
- Incomplete model of processing
  - I.e., calculations are simplified
- Number of possible plans is exponential!
- Result: rather than truly "optimal", a plan which is "good enough" is chosen