CS222P: Principles of Data Management UCI, Fall 2018 Notes #12 Set operations, Aggregation, Cost Estimation Instructor: Chen Li.

Relational Set Operations
- Intersection and cross-product are special cases of join.
- Union (distinct) and Except are similar; we'll do union.
- Sorting-based approach to union:
  - Sort both relations (on the combination of all attributes).
  - Scan the sorted relations and merge them, discarding duplicates.
  - Alternative: merge the runs from Pass 0+ for both relations (!).
- Hash-based approach to union (from Grace):
  - Partition both R and S using hash function h1.
  - For each S-partition, build an in-memory hash table (using h2), then scan the corresponding R-partition, adding truly new R tuples to the hash table while discarding duplicates.
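The hash-based approach above can be sketched in a few lines. This is only an in-memory illustration under stated assumptions: the function name `hash_union` and the parameter `num_partitions` are made up for this example, Python's built-in `hash` stands in for h1, and a Python `set` stands in for the h2 hash table. A real system would write the partitions to disk.

```python
from collections import defaultdict

def hash_union(r, s, num_partitions=4):
    """Union (distinct) of two lists of tuples: partition, then dedupe per partition."""
    # Phase 1: partition both inputs with h1 (assumed here: hash(t) mod k).
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for t in r:
        r_parts[hash(t) % num_partitions].append(t)
    for t in s:
        s_parts[hash(t) % num_partitions].append(t)

    result = []
    for p in range(num_partitions):
        # Phase 2: build the in-memory table from the S-partition ...
        table = set()
        for t in s_parts[p]:
            if t not in table:
                table.add(t)
                result.append(t)
        # ... then scan the matching R-partition, keeping only truly new tuples.
        for t in r_parts[p]:
            if t not in table:
                table.add(t)
                result.append(t)
    return result

union_rows = hash_union([(1, "a"), (2, "b"), (1, "a")], [(2, "b"), (3, "c")])
```

Equal tuples always land in the same partition, so duplicates only need to be checked within a partition.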

Sets versus Bags
- UNION, INTERSECT, and DIFFERENCE (EXCEPT or MINUS) use set semantics: duplicates are eliminated from the result.
- UNION ALL simply concatenates the two relations, without eliminating duplicates.

Sets versus Bags: Example
  R = {1, 1, 1, 2, 2, 2, 2, 3}
  S = {1, 2, 2}
  R UNION S = {1, 2, 3}
  R UNION ALL S = {1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3}
  R INTERSECT S = {1, 2}
  R EXCEPT S = {3}
- Note that some DBMSs do not support all of these operations (e.g., older versions of MySQL lack INTERSECT and EXCEPT).
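The example above can be checked with a short sketch. Python lists stand in for bags and Python sets for the set-semantics operators; the variable names are chosen for this illustration only.

```python
R = [1, 1, 1, 2, 2, 2, 2, 3]
S = [1, 2, 2]

# Set semantics: duplicates are eliminated.
union = sorted(set(R) | set(S))
intersect = sorted(set(R) & set(S))
except_ = sorted(set(R) - set(S))

# Bag semantics for UNION ALL: plain concatenation, duplicates kept.
union_all = sorted(R + S)

print(union, union_all, intersect, except_)
```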

Aggregate Operations (AVG, MIN, ...)
- Without grouping:
  - In general, requires scanning the full relation.
  - Given an index whose search key includes all attributes in the SELECT or WHERE clauses, can do an index-only scan.
- With grouping:
  - Sort on the group-by attributes, then scan the sorted result and compute the aggregate for each group. (Can improve on this by combining the sorting and aggregate computations.)
  - Or use a similar approach via hashing on the group-by attributes.
  - Given a tree index whose search key includes all attributes in the SELECT, WHERE, and GROUP BY clauses, can do an index-only scan; if the group-by attributes form a prefix of the search key, can retrieve data entries (tuples) in group-by order.
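As a rough illustration of the sort-based grouping strategy, the sketch below sorts on the group-by attribute and then computes AVG in one scan of the sorted rows. The function name `sort_group_avg` and the sample rows are assumptions made for this example, and everything is held in memory rather than sorted externally.

```python
from itertools import groupby
from operator import itemgetter

def sort_group_avg(rows, key_idx, val_idx):
    """Sort on the group-by column, then aggregate each run of equal keys."""
    rows_sorted = sorted(rows, key=itemgetter(key_idx))
    out = []
    for key, group in groupby(rows_sorted, key=itemgetter(key_idx)):
        count = 0
        total = 0
        for row in group:
            count += 1
            total += row[val_idx]
        out.append((key, total / count))
    return out

# Hypothetical rows: (ename, sal in thousands, dname)
rows = [("Joe", 10, "Toy"), ("Sue", 20, "Shoe"), ("Mike", 13, "Shoe"),
        ("Chris", 25, "Toy"), ("Zoe", 50, "Book")]
averages = sort_group_avg(rows, key_idx=2, val_idx=1)
```

Because the input is sorted, only one group's running state is needed at a time.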

Aggregation State Handling
- State interface: init(), next(value), finalize() -> aggregate value
- Consider the following query:
    SELECT E.dname, avg(E.sal) FROM Emp E GROUP BY E.dname
- Input relation Emp:
    ename  sal  dname
    Joe    10K  Toy
    Sue    20K  Shoe
    Mike   13K  Shoe
    Chris  25K  Toy
    Zoe    50K  Book
    …      …    …
- Hash table on h(dname) mod 4, holding (dname, count, sum) per group (shown after the first four tuples):
    dname  count  sum
    Toy    2      35K
    Shoe   2      33K
- Aggregate state (per unfinished group):
    Count: # of values
    Min: min value
    Max: max value
    Sum: sum of values
    Avg: count, sum of values
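The init / next / finalize protocol can be sketched for AVG as follows. The class name `AvgState` and the function `hash_group_avg` are illustrative names, not part of the course code, and a Python dict stands in for the hash table on dname.

```python
class AvgState:
    """Running state for AVG: (count, sum), as in the table above."""

    def __init__(self):          # init()
        self.count = 0
        self.total = 0

    def next(self, value):       # next(value)
        self.count += 1
        self.total += value

    def finalize(self):          # finalize() -> aggregate value
        return self.total / self.count


def hash_group_avg(rows):
    """Hash-based grouping: one AvgState per distinct dname."""
    groups = {}
    for ename, sal, dname in rows:
        state = groups.get(dname)
        if state is None:
            state = groups[dname] = AvgState()
        state.next(sal)
    return {dname: state.finalize() for dname, state in groups.items()}


emp = [("Joe", 10_000, "Toy"), ("Sue", 20_000, "Shoe"), ("Mike", 13_000, "Shoe"),
       ("Chris", 25_000, "Toy"), ("Zoe", 50_000, "Book")]
avg_by_dept = hash_group_avg(emp)
```

AVG keeps both count and sum because the average itself cannot be updated incrementally without them.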

Next: query optimization
(Figure: DBMS architecture diagram with a "We are here…" marker. Components shown: SQL, Query Parser, Query Optimizer, Plan Executor, Relational Operators (+ Utilities), Files of Records, Buffer Manager, Access Methods (Indices), Lock Manager, Transaction Log Manager, Disk Space and I/O Manager, Data Files, Index Files, Catalog Files, WAL.)

Two problems in query optimization
- Problem 1: Estimating the cost of a given query plan
  - Estimate the cost of each operator in the plan
  - Sum the operator costs to get the cost of the plan
- Problem 2: Enumerating query plans and searching for a good one
(Figure: example plan tree over Reserves and Sailors, with a join on sid=sid, selections bid=100 and rating > 5, and a projection on sname.)

For each operator in a plan …
(Figure: example plan tree over Reserves and Sailors, with a join on sid=sid, selections bid=100 and rating > 5, and a projection on sname.)
- Estimate its cost:
  - CPU
  - Disk I/Os
  - Memory
  - Network (if distributed)
  - …
- Estimate the size of its output
  - This information can be used to do estimation for later operators
- Since we already covered cost estimation earlier, we will focus on size estimation

Statistics in catalog
- Optimizers use statistics in the catalog to estimate the cardinalities of operators' inputs and outputs
  - Simplifying assumptions: uniform distribution of attribute values, independence of attributes, etc.
- For each relation R:
  - Cardinality of R (|R|), average R-tuple width, and # of pages in R (||R||); pick any 2 to know all 3 (given the page size)
- For each (indexed) attribute A of R:
  - Number of distinct values |πA(R)|
  - Range of values (i.e., low and high values)
  - Number of index leaf pages
  - Number of index levels (if B+ tree)
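To make the "pick any 2 to know all 3" remark concrete, here is a rough sketch that derives the page count from the cardinality and the average tuple width. It assumes fixed-width tuples packed into pages with no header or slot overhead, which real storage formats do not satisfy exactly; the function name `num_pages` and the numbers are hypothetical.

```python
import math

def num_pages(cardinality, tuple_width, page_size):
    """Approximate ||R|| from |R| and the average tuple width (both in bytes)."""
    tuples_per_page = page_size // tuple_width   # whole tuples that fit on one page
    return math.ceil(cardinality / tuples_per_page)

# Hypothetical relation: 10,000 tuples of 100 bytes on 4 KB pages.
pages = num_pages(10_000, 100, 4096)
```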

Simple Selection Queries (σp)
- Equality predicate (p is "A = val")
  - |Q| ≈ |R| / |πA(R)|
  - Translation: R's cardinality divided by the number of distinct A values
  - Assumes all values are equally likely in R.A (uniform distribution)
  - Ex: SELECT * FROM Emp WHERE age = 23;
  - RF (reduction factor, a.k.a. selectivity) is therefore 1 / |πA(R)|
- Range predicate (p is "val1 ≤ A ≤ val2")
  - |Q| ≈ |R| * (val2 - val1) / (high(R.A) - low(R.A))
  - Translation: selected range size divided by full range size
  - Again assumes a uniform value distribution for R.A
  - (For a one-sided range predicate, simply replace the missing val1 or val2 with the low or high bound)
  - Ex: SELECT * FROM Emp WHERE age ≤ 21;
  - RF is (val2 - val1) / (high(R.A) - low(R.A))
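The two reduction factors can be written down directly. This is a minimal sketch under the uniformity assumption stated on the slide; the function names and the Emp statistics (10,000 tuples, 50 distinct ages between 18 and 68) are made up for illustration.

```python
def eq_selectivity(num_distinct):
    """RF for 'A = val': 1 / |pi_A(R)|."""
    return 1.0 / num_distinct

def range_selectivity(val1, val2, low, high):
    """RF for 'val1 <= A <= val2', clipped to the known [low, high] range."""
    lo = max(val1, low)
    hi = min(val2, high)
    if hi < lo:
        return 0.0               # the predicate lies entirely outside the range
    if high == low:
        return 1.0               # degenerate range: every tuple has the same value
    return (hi - lo) / (high - low)

card_emp = 10_000                # hypothetical |Emp|
est_eq = card_emp * eq_selectivity(50)                      # age = 23
est_range = card_emp * range_selectivity(18, 21, 18, 68)    # age <= 21, val1 = low
```

With these assumed statistics the estimates come out to about 200 tuples for the equality predicate and about 600 for the one-sided range.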

Boolean Selection Predicates
- Conjunctive predicate (p is "p1 and p2")
  - RFp ≈ RFp1 * RFp2
  - Ex 1: SELECT * FROM Emp WHERE age = 21 AND gender = 'm'
  - Assumes independence of the two predicates p1 and p2 (uncorrelated)
  - Ex 2: SELECT * FROM Student WHERE major = 'EE' AND gender = 'm'
- Negative predicate (p is "not p1")
  - RFp ≈ 1 - RFp1
  - Translation: all tuples minus the fraction that satisfy p1
  - Ex: SELECT * FROM Emp WHERE age ≠ 21
- Disjunctive predicate (p is "p1 or p2")
  - RFp ≈ RFp1 + RFp2 - (RFp1 * RFp2)
  - Translation: all (unique) tuples satisfying either predicate
  - Ex: SELECT * FROM Student WHERE major = 'EE' OR gender = 'm'
  - Q: Why?
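A small sketch of the three combination rules, assuming independent predicates as on the slide. The function names and the example reduction factors are illustrative. The last line shows one way to see the OR rule: "p1 or p2" is the negation of "not p1 and not p2", which gives the same number.

```python
def rf_and(rf1, rf2):
    """Conjunction under independence."""
    return rf1 * rf2

def rf_not(rf1):
    """Negation: everything that does not satisfy p1."""
    return 1.0 - rf1

def rf_or(rf1, rf2):
    """Disjunction: add both fractions, subtract the overlap counted twice."""
    return rf1 + rf2 - rf1 * rf2

rf_major_ee = 0.1      # hypothetical RF for major = 'EE'
rf_gender_m = 0.5      # hypothetical RF for gender = 'm'

either = rf_or(rf_major_ee, rf_gender_m)
via_de_morgan = rf_not(rf_and(rf_not(rf_major_ee), rf_not(rf_gender_m)))
```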

Two-way Equijoin Predicates
- Query Q: R join S on R.A = S.B
- Assume join value set containment (FK/PK case)
  - πA(R) is a subset of πB(S) or vice versa
  - Translation: the "smaller" relation's join value set is a subset of the "larger" relation's join value set (where "size" is based on # of unique values)
  - Ex: SELECT * FROM Student S, Dept D WHERE S.major = D.deptname
- |Q| ≈ (|R| * |S|) / max( |πA(R)|, |πB(S)| )
  - Ex: 100 D.deptname values but 85 S.major values used by 10,000 students
  - Estimated size of the result is 1/100 of the cartesian product of Student and Dept
  - Again making a uniformity assumption (i.e., about students' majors)
  - I.e., RF is 1 / max( |πA(R)|, |πB(S)| )
  - Q: Why? (Why not 1/85?)
- Repeat the same principle to deal with N-way joins
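The join-size formula applied to the Student/Dept example, as a sketch. It assumes Dept has exactly 100 tuples (one per deptname, i.e., deptname is a key), which the slide implies but does not state; the function name `equijoin_size` is illustrative.

```python
def equijoin_size(card_r, card_s, distinct_a, distinct_b):
    """|R join S| ~ |R| * |S| / max(distinct A values, distinct B values)."""
    return card_r * card_s / max(distinct_a, distinct_b)

# 10,000 students using 85 majors; 100 departments (assumed one tuple each).
est = equijoin_size(card_r=10_000, card_s=100, distinct_a=85, distinct_b=100)
```

The estimate equals the number of students, which matches the intuition that each student joins with exactly one department in the FK/PK case.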

Improved Estimation: Histograms
- We have been making simplistic assumptions
  - Specifically: uniform distribution of values
  - This is definitely violated (all the time) in reality
  - Violations can lead to huge estimation errors
  - Huge estimation errors can lead to Very Bad Choices
- By the way:
  - In the absence of info, System R assumed 1/3 and 1/10 for range queries and exact match queries, respectively
  - (Q: What might lead to the absence of info?)
- How can we do better in the OTHER direction?
  - Keep track of the most and/or least frequent values
  - Use histograms to better approximate value distributions

Equi-width Histograms
- Divide the domain into B buckets of equal width
  - E.g., partition Kid.age values into buckets
- Store the bucket boundaries and the sum of the frequencies of the values within each bucket

Histogram Construction
- Initial construction:
  - Make one full pass over R to construct an accurate equi-width histogram
    - Keep a running count for each bucket
  - If scanning is not acceptable, use sampling
    - Construct a histogram on Rsample, and scale the frequencies by |R|/|Rsample|
- Maintenance options:
  - Incremental: for each update of R, increment or decrement the corresponding bucket frequencies (Q: Cost?)
  - Periodically recompute: the distribution changes slowly!
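A minimal sketch of the one-pass construction and the sampling scale-up. It assumes an integer-valued attribute with known low and high bounds; the function names and sample data are hypothetical.

```python
def build_equi_width(values, low, high, num_buckets):
    """One pass over the values, keeping a running count per bucket."""
    width = (high - low + 1) / num_buckets       # integer domain [low, high]
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - low) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

def scale_counts(sample_counts, table_size, sample_size):
    """Scale a histogram built on a sample by |R| / |Rsample|."""
    factor = table_size / sample_size
    return [c * factor for c in sample_counts]

sample = [1, 2, 2, 3, 5, 6, 8, 9, 12, 16]        # hypothetical sample of R.A
counts = build_equi_width(sample, low=1, high=16, num_buckets=4)
scaled = scale_counts(counts, table_size=100, sample_size=len(sample))
```

Here the four buckets cover [1,4], [5,8], [9,12], and [13,16].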

Using Equi-width Histograms
- Q: σA=5(R)
  - 5 is in bucket [5,8] (with 19 tuples)
  - Assume a uniform distribution within the bucket
  - Thus |Q| ≈ 19/4 ≈ 5. Actual value is 1.
- Q: σA>=7 & A<=16(R)
  - [7,16] covers [9,12] (27 tuples) and [13,16] (13 tuples)
  - [7,16] partially covers [5,8] (19 tuples)
  - Thus |Q| ≈ 19/2 + 27 + 13 ≈ 50. Actual value is 52.
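The two estimates above can be reproduced with a short sketch. Only the three buckets named on the slide are listed (the rest of the histogram figure is not available), the attribute is assumed to be integer-valued, and `estimate_range` is an illustrative name.

```python
def estimate_range(buckets, lo, hi):
    """Estimate |sigma_{lo <= A <= hi}(R)| from (b_lo, b_hi, count) buckets.

    Assumes integer values and a uniform distribution inside each bucket.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1    # number of covered values
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total

buckets = [(5, 8, 19), (9, 12, 27), (13, 16, 13)]
point_est = estimate_range(buckets, 5, 5)     # A = 5: 19/4
range_est = estimate_range(buckets, 7, 16)    # 7 <= A <= 16: 19/2 + 27 + 13
```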

Equi-height Histogram
- Divide the domain into B buckets with (roughly) the same number of tuples in each bucket
- Store this number and the bucket boundaries
- Intuition: high frequencies are more important than low frequencies

Histogram Construction
- Sort all R.A values, and then take equally spaced slices
  - Ex: 1 2 2 3 3 5 6 6 6 6 6 6 7 7 7 7 8 8 8 8 8 …
- Sampling is also applicable here
- Maintenance:
  - Incremental maintenance
    - Split/merge buckets (B+ tree like)
  - Periodic recomputation
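A sketch of the sort-and-slice construction. The data below is hypothetical (it is not the truncated example from the slide), and `build_equi_height` is an illustrative name.

```python
def build_equi_height(values, num_buckets):
    """Sort, then cut into equally sized slices; returns (low, high, count) per bucket."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for i in range(num_buckets):
        start = i * n // num_buckets
        end = (i + 1) * n // num_buckets
        if start < end:                      # skip empty slices when n < num_buckets
            chunk = data[start:end]
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

values = [9, 1, 7, 2, 10, 2, 8, 3, 9, 4, 7, 5]   # hypothetical R.A values
hist = build_equi_height(values, num_buckets=3)
```

One caveat of this naive slicing: a heavily repeated value can straddle a slice boundary, so two adjacent buckets may share a boundary value. A practical implementation would need to adjust the cut points.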

Using an equi-height histogram
- Q: σA=5(R)
  - 5 is in bucket [1,7] (with 16 tuples)
  - Assume a uniform distribution within the bucket
  - Thus |Q| ≈ 16/7 ≈ 2. (Actual value is 1.)
- Q: σA>=7 & A<=16(R)
  - [7,16] covers [8,9], [10,11], and [12,16] (all with 16 tuples)
  - [7,16] partially covers [1,7] (16 tuples)
  - Thus |Q| ≈ 16/7 + 16 + 16 + 16 ≈ 50. Actual value is 52.
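The same range-estimation idea works for equi-height buckets, which differ only in having unequal widths. In the sketch below, the bucket count of 16 for [8,9], [10,11], and [12,16] is inferred from the arithmetic on the slide, the attribute is assumed to be integer-valued, and `estimate_range` is an illustrative name.

```python
def estimate_range(buckets, lo, hi):
    """Estimate the result size of lo <= A <= hi from (b_lo, b_hi, count) buckets."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1    # covered integer values
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total

# Equi-height buckets: every bucket holds 16 tuples, but the widths differ.
buckets = [(1, 7, 16), (8, 9, 16), (10, 11, 16), (12, 16, 16)]
point_est = estimate_range(buckets, 5, 5)     # 16/7
range_est = estimate_range(buckets, 7, 16)    # 16/7 + 16 + 16 + 16
```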

Can Combine Approaches
- If values are badly skewed:
  - Keep high/low frequency value information in addition to histograms
- Could even apply the idea recursively: keep this sort of information for each bucket
  - "Divide" by converting values to some number of ranges
  - "Conquer" by keeping some statistics for each range
- Some "statistical glitches" to be aware of:
  - Parameterized queries
  - Runtime situations
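One possible way to combine the approaches, sketched under assumptions: keep exact counts for the k most frequent values and fall back to a uniform estimate over the remaining values. This is a simplification for illustration (a single "rest" bucket instead of a full histogram); the function names and the data are hypothetical.

```python
from collections import Counter

def build_stats(values, k):
    """Exact counts for the k most frequent values, plus totals for the rest."""
    freq = Counter(values)
    frequent = dict(freq.most_common(k))
    rest_rows = len(values) - sum(frequent.values())
    rest_distinct = len(freq) - len(frequent)
    return frequent, rest_rows, rest_distinct

def estimate_eq(stats, value):
    """Estimate the number of tuples with A = value."""
    frequent, rest_rows, rest_distinct = stats
    if value in frequent:
        return float(frequent[value])        # exact for tracked values
    if rest_distinct == 0:
        return 0.0
    return rest_rows / rest_distinct         # uniform over the untracked values

values = [6] * 6 + [8] * 5 + [1, 2, 2, 3, 3, 5]   # hypothetical skewed column
stats = build_stats(values, k=2)
est_frequent = estimate_eq(stats, 6)
est_rare = estimate_eq(stats, 5)
```

Tracking the skewed values separately keeps them from inflating the estimate for every other value that would share their bucket.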