2015-6-27Prof. Ghandeharizadeh1 QUERY PROCESSING (CHAPTER 12)

Slides:



Advertisements
Similar presentations
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Advertisements

CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Copyright © 2011 Ramez Elmasri and Shamkant Navathe Algorithms for SELECT and JOIN Operations (8) Implementing the JOIN Operation: Join (EQUIJOIN, NATURAL.
Chapter 15 Algorithms for Query Processing and Optimization Copyright © 2004 Pearson Education, Inc.
CS 540 Database Management Systems
CS CS4432: Database Systems II Operator Algorithms Chapter 15.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Query Evaluation. SQL to ERA SQL queries are translated into extended relational algebra. Query evaluation plans are represented as trees of relational.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
DB performance tuning using indexes Section 8.5 and Chapters 20 (Raghu)
CSCI 5708: Query Processing II Pusheng Zhang University of Minnesota Feb 5, 2004.
CS263 Lecture 19 Query Optimisation.  Motivation for Query Optimisation  Phases of Query Processing  Query Trees  RA Transformation Rules  Heuristic.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Access Path Selection in a RDBMS Shahram Ghandeharizadeh Computer Science Department University of Southern California.
Query Processing and Optimization
ACS-4902 Ron McFadyen Chapter 15 Algorithms for Query Processing and Optimization.
V Storage Manager Shahram Ghandeharizadeh Computer Science Department University of Southern California.
1 Optimization - Selection. 2 The Selection Operation Table: Reserves(sid, bid, day, agent) A page (block) can hold 100 Reserves tuples There are 1,000.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Review for Midterm 2 Shahram Ghandeharizadeh. Midterm 2 Scheduled for April 30 th Scheduled for April 30 th 4 papers 4 papers  Variant indexes.  Access.
Physical Organization Index Structures Query Processing Vishy Poosala.
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
Query Processing & Optimization
Chapter 19 Query Processing and Optimization
1 Query Optimization Vishy Poosala Bell Labs. 2 Outline Introduction Necessary Details –Cost Estimation –Result Size Estimation Standard approach for.
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
CS 338Query Evaluation7-1 Query Evaluation Lecture Topics Query interpretation Basic operations Costs of basic operations Examples Textbook Chapter 12.
Copyright © Curt Hill Query Evaluation Translating a query into action.
12.1Database System Concepts - 6 th Edition Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Join Operation Sorting 、 Other.
Storage and Indexing1 Overview of Storage and Indexing.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 12: Query Processing.
1 Overview of Storage and Indexing Chapter 8. 2 Data on External Storage  Disks: Can retrieve random page at fixed cost  But reading several consecutive.
Chapter 12 Query Processing. Query Processing n Selection Operation n Sorting n Join Operation n Other Operations n Evaluation of Expressions 2.
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 13 – Query Evaluation.
Advance Database Systems Query Optimization Ch 15 Department of Computer Science The University of Lahore.
Query Processing CS 405G Introduction to Database Systems.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 12 – Introduction to.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
Query Processing – Implementing Set Operations and Joins Chap. 19.
CS 540 Database Management Systems
Indexing. 421: Database Systems - Index Structures 2 Cost Model for Data Access q Data should be stored such that it can be accessed fast q Evaluation.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.
CS 540 Database Management Systems
CS 440 Database Management Systems
Database Management System
Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.
INDEXING (CHAPTER 11) 9/16/2018.
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
File Processing : Query Processing
File Processing : Query Processing
Database Management Systems (CS 564)
Chapters 15 and 16b: Query Optimization
Lecture 2- Query Processing (continued)
Chapter 12 Query Processing (1)
Evaluation of Relational Operations: Other Techniques
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
Lecture 22: Query Execution
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Lecture 11: B+ Trees and Query Execution
Database Administration
Yan Huang - CSCI5330 Database Implementation – Query Processing
Evaluation of Relational Operations: Other Techniques
Lecture 20: Query Execution
Presentation transcript:

Prof. Ghandeharizadeh1 QUERY PROCESSING (CHAPTER 12)

Prof. Ghandeharizadeh2 Software Architecture of a DBMS File System Buffer Pool Manager Abstraction of records Index structures Query Interpretor Query Optimizer Relational Algebra operators: , , , , , , , ,  Query Parser

Prof. Ghandeharizadeh3 TERM DEFINITION P(R): Number of pages that constitute R t(R): Number of tuples that constitute R ν(A,R): the number of unique values for attribute A of R min(A,R): the minimum value for attribute A of R max(A,R): the maximum value for attribute A of R P(I R,A ): the number of pages that constitute the B + -tree index on attribute A of R d(I R,A ): the depth of a B + -tree index on attribute A of R lp(I R,A ): the number of leaf pages for a B + -tree index on attribute A of R B(I R,A ): the number of buckets for a hash index on attribute A of R

Prof. Ghandeharizadeh4 HEAP FILE ORGANIZATION Assume a student table: Student(name, age, gpa, major) t(Student) = 16 P(Student) = 4 Bob, 21, 3.7, CS Mary, 24, 3, ECE Tom, 20, 3.2, EE Kathy, 18, 3.8, LS Kane, 19, 3.8, ME Lam, 22, 2.8, ME Chang, 18, 2.5, CS Vera, 17, 3.9, EE Louis, 32, 4, LS Martha, 29, 3.8, CS James, 24, 3.1, ME Pat, 19, 2.8, EE Chris, 22, 3.9, CS Chad, 28, 2.3, LS Leila, 20, 3.5, LS Shideh, 16, 4, CS

Prof. Ghandeharizadeh5 Non-Clustered Hash Index A non-clustered hash index on the age attribute with 4 buckets, h(age) = age % B Bob, 21, 3.7, CS Mary, 24, 3, ECE Tom, 20, 3.2, EE Kathy, 18, 3.8, LS Kane, 19, 3.8, ME Lam, 22, 2.8, ME Chang, 18, 2.5, CS Vera, 17, 3.9, EE Louis, 32, 4, LS Martha, 29, 3.8, CS James, 24, 3.1, ME Pat, 19, 2.8, EE Chris, 22, 3.9, CS Chad, 28, 2.3, LS Leila, 20, 3.5, LS Shideh, 16, 4, CS (21, (1, 1)) (17, (2,4)) (29, (3,2)) (24, (1, 2)) (32, (3,1)) (20, (1,3)) (18, (1, 4)) (22, (2,2)) (22, (4,1)) (19, (2, 1)) (28, (4,2)) (20, (4,3)) (16, (4,4)) (19, (3, 4)) (18, (2,3)) (24, (3,3))

Prof. Ghandeharizadeh6 Clustered Hash Index A clustered hash index on the age attribute with 4 buckets, h(age) = age % B Bob, 21, 3.7, CS Mary, 24, 3, ECE Tom, 20, 3.2, EE Kathy, 18, 3.8, LS Kane, 19, 3.8, ME Lam, 22, 2.8, ME Chang, 18, 2.5, CS Vera, 17, 3.9, EE Louis, 32, 4, LS Martha, 29, 3.8, CS James, 24, 3.1, ME Pat, 19, 2.8, EE Chris, 22, 3.9, CS Chad, 28, 2.3, LS Leila, 20, 3.5, LS Shideh, 16, 4, CS

Prof. Ghandeharizadeh7 Non-Clustered Secondary B + -Tree A non-clustered secondary B+-tree on the gpa attribute Bob, 21, 3.7, CS Mary, 24, 3, ECE Tom, 20, 3.2, EE Kathy, 18, 3.8, LS Kane, 19, 3.8, ME Lam, 22, 2.8, ME Chang, 18, 2.5, CS Vera, 17, 3.9, EE Louis, 32, 4, LS Martha, 29, 3.8, CS James, 24, 3.1, ME Pat, 19, 2.8, EE Chris, 22, 3.9, CS Chad, 28, 2.3, LS Leila, 20, 3.5, LS Shideh, 16, 4, CS (3.7, (1, 1)) (3.8, (3,2)) (3.8, (2,1)) (3.9, (2,4)) (4, (3,1)) (3.8, (1,4)) (3.9, (4,1)) (4, (4,4)) (2.3, (4, 2)) (2.5, (2,3)) (2.8, (2,2)) (3.1, (3,3)) (3.2, (1,3) (2.8, (3,4)) (3, (1,2)) (3.5, (4,3)) 3.6

Prof. Ghandeharizadeh8 Non-Clustered Primary B + -Tree A non-clustered primary B+-tree on the gpa attribute Bob, 21, 3.7, CSMary, 24, 3, ECE Tom, 20, 3.2, EE Kathy, 18, 3.8, LS Kane, 19, 3.8, MELam, 22, 2.8, ME Chang, 18, 2.5, CS Vera, 17, 3.9, EE Louis, 32, 4, LS Martha, 29, 3.8, CS James, 24, 3.1, ME Pat, 19, 2.8, EE Chris, 22, 3.9, CSChad, 28, 2.3, LS Leila, 20, 3.5, LS Shideh, 16, 4, CS (3.7, (3, 1)) (3.8, (3,2)) (3.8, (3,3)) (3.9, (4,2)) (4, (4,3)) (3.8, (3,4)) (3.9, (4,1)) (4, (4,4)) (2.3, (1, 1)) (2.5, (1,2)) (2.8, (1,3)) (3.1, (2,2)) (3.2, (2,3) (2.8, (1,4)) (3, (2,1)) (3.5, (2,4)) 3.6

Prof. Ghandeharizadeh9 Clustered B + -Tree A clustered B+-tree on the gpa attribute It is impossible to have a clustered secondary B + -tree on an attribute. Bob, 21, 3.7, CSMary, 24, 3, ECE Tom, 20, 3.2, EE Kathy, 18, 3.8, LS Kane, 19, 3.8, MELam, 22, 2.8, ME Chang, 18, 2.5, CS Vera, 17, 3.9, EE Louis, 32, 4, LS Martha, 29, 3.8, CS James, 24, 3.1, ME Pat, 19, 2.8, EE Chris, 22, 3.9, CSChad, 28, 2.3, LS Leila, 20, 3.5, LS Shideh, 16, 4, CS

Prof. Ghandeharizadeh10 COST OF PERFORMING SELECT(б θ (R)) OPERATOR Exact match queries (θ = attribute A equals constant C): Heap: Scan the relation one page after another, until end of relation: P(R) Primary B + -tree: Use the constant to traverse the depth of the tree to the leftmost data record, begin to search forward: d(I) + P(R) / ν(A,R) Secondary B + -tree : Locate the left most record using the constant, initiate the search using the records in the leaf nodes. For each index record that matches, perform a random I/O to the data file: d(I) + lp(I) / ν(A,R) + t(R) / ν(A,R) Hash index assuming a uniform distribution of tuples across pages: probe the hash index with the constant: P(R) / B(I)

Prof. Ghandeharizadeh11 COST OF PERFORMING SELECT(б θ (R)) OPERATOR (Cont…) Range queries (θ = attribute A is greater than constant C): Number of tuples that satisfy the selection predicate: t(θ =‘A>C’) = [max(A,R)-C] / [max(A,R)-min(A,R)] × t(R) Heap: Scan the relation one page after another, until end of relation: P(R) Primary B + -tree: d(I) + [t(θ =‘A>C’) × P(R) / t(R)] Secondary B + -tree: d(I) + t(θ =‘A>C’) / [t(R) / Ip(I)] + t(θ =‘A>C’)

Prof. Ghandeharizadeh12 ESTIMATING NUMBER OF RESULTING RECORDS For exact match selection predicates assume a uniform distribution of records across the number of unique values. E.g., the selection predicate is gpa = 3.3 For range selection predicates assume a uniform distribution of records across the range of available values defined by min and max. In this case, one must think about the interval. E.g., gpa >

Prof. Ghandeharizadeh13 ESTIMATING NUMBER OF RESULTING RECORDS (Cont…) Actual GPA Values

Prof. Ghandeharizadeh14 PROJECTION The previous lecture described the selection operator and techniques to estimate both its costs and produced number of tuples. These techniques assume a uniform distribution of values within the minimum and maximum values of an attribute, uniform distribution of tuples across the buckets of a hash index. Projection (π A ( R)) Logically, the system scans R and produces as output the unique values for attribute A (eliminates duplicates). The number of tuples produced by this operator is (π A ( R)) with duplicate elimination and t(R) without duplicate elimination. Without duplicate elimination, the cost of this operator is one scan of relation R; cost equals P(R). With duplicate elimination, its cost equals: sort(R) + P(R).

Prof. Ghandeharizadeh15 EXTERNAL SORTING Sort a relation that is larger than the available main memory. Consider a relation R with P(R) pages and w buffer pages (w ≥ 3). Assume that P(R)/w=2 k =(w-1) l First approach, 2-way merge sort: 1: Read in, sort in main memory, and write out chunks of w pages. This step produces 2 k chunks of w pages. 2: Read in, merge in main memory, and write out pairs of sorted chunks of w pages. This produces 2 k-1 sorted chunks of w × 2 pages. 3: Read in, merge in main memory, and write out pairs of sorted chunks of w pages. This produces 2 k-2 sorted chunks of w × 2 2 pages..:..... k: Read in, merge in main memory, and write out pairs of sorted chunks of w × 2 k-2 pages. This produces 2 sorted chunks of w × 2 k-1 pages. k+1:Read in, merge in main memory, and write out pairs of sorted chunks of w × 2 k-1 pages. This produces 1 sorted chunk of w × 2 k = P(R) pages.

Prof. Ghandeharizadeh16 EXTERNAL SORTING (Cont…) Example: Sort a 20 page relation assuming a five page buffer pool. Merge sort

Prof. Ghandeharizadeh17 EXTERNAL SORTING (Cont…) Each pass requires reading and writing of the entire file. The algorithm's I/O cost is: cost(2-way sort) = 2 × P(R) × [log 2 (P(R)/w) +1] If P(R)/w is not a power of two then we take the ceiling of the logarithm.

Prof. Ghandeharizadeh18 EXTERNAL SORTING (Cont…) Second approach, (w-1)-way merge sort: 1: Read in, sort in main memory, and write out chunks of w pages. This step produces (w-1) l sorted chunks of w pages. 2: Read in, merge in main memory, and write out sets of (w-1) sorted chunks of w pages. This produces (w-1) l-1 sorted chunks of (w-1) w pages. 3: Read in, merge in main memory, and write out sets of (w-1) sorted chunks of (w- 1)w pages. This produces (w-1) l-2 sorted chunks of (w-1) 2 w pages..:..... l: Read in, merge in main memory, and write out sets of (w-1) sorted chunks of (w- 1) l-2 w pages. This produces w-1 sorted chunks of (w-1) l-1 w pages. l+1:Read in, merge in main memory, and write out sets of (w-1) sorted chunks of (w- 1) l-1 w pages. This produces 1 sorted chunk of (w-1) l w = P(R) pages.

Prof. Ghandeharizadeh19 EXTERNAL SORTING (Cont…) Merging (w-1) temporary files is done by reading one page from each stream and simultaneously comparing the values of (w-1) tuples, one from each stream. Example: Sort a 20 page relation assuming a five page buffer pool.

Prof. Ghandeharizadeh20 EXTERNAL SORTING (Cont…) Each pass requires reading and writing of the entire file. The algorithm's I/O cost is: cost((w-1)way sort) = 2 × P(R) × [ l +1] = 2 × P(R) × [log w-1 (P(R)/w) + 1] How does the I/O cost of 2-way merge sort compares with (w-1)-way merge sort for the 20 page relation? 2-way merge sort is more expensive in I/O, but it is simpler to implement. Overall it looses because the time attributed to performing I/O operations is more significant as compared to CPU execution time of a simple routine.

Prof. Ghandeharizadeh21 EQUALITY JOIN Estimated number of output tuples is: t(R) × t(S) / ν(A,R) Two algorithms for performing the join operator: nested loops and merge-scan. Nested-loops: Tuple nested loops: for each tuple r in R do for each tuple s in S do if r.A=s.A then output r,s in the result relation end-for Estimated cost of tuple nested loops: –P(R) + [t(R) × P(S)] P(S)P(R)

Prof. Ghandeharizadeh22 EQUALITY JOIN (Cont…) Page nested loops: for each page rp in R do fix rp in the buffer pool for each page sp in S do for each tuple r in rp do for each tuple s in sp do if r.A=s.A then output r,s in the result end-for unfix rp end-for Estimated cost of page nested loops: –P(R) + [P(R) × P(S)] P(R)P(S)

Prof. Ghandeharizadeh23 EQUALITY JOIN (Cont…) Merge-scan: 1.Sort R on attribute A 2.Sort S on attribute A 3.Scan R and S in parallel, merging tuples with matching A values Estimated cost of merge scan: sort(R) + sort(S) + P(R) + P(S) Tuple nested-loops with alternative index structures: –Heap: P(R) + [t(R) × P(S)] –Clustered hash index: P(R) + [t(R) × P(S) / B(I)] –Non-clustered hash index: P(R) + [t(R) × (P(I) / B(I) + t(S) / ν(A,S))] –Clustered B + -tree: P(R) + [t(R) × (d(I R,A ) + P(S) / ν(A,S))] –Non-clustered B + -tree: P(R) + [t(R) × (d(I R,A ) + lp(I) / ν(A,S) + t(S) / ν(A,S))]

Prof. Ghandeharizadeh24 EQUALITY JOIN (Cont…) Merge scan with alternative index structures. It is not important whether a relation is hashed or not. Assume w buffer pages. We sort a relation using either a 2-way or (w-1)-way merge sort: Example: Assume the following statistics on the Employee and Department relations: t(Dept)=1000 tuples, P(Dept)=100 disk pages, ν(Dept,dno)=1000, ν(Dept,dname)=500. t(Employee)=100,000 tuples and P(Employee)=10,000 pages. Note that 10 tuples of each relation fit on a disk page. Assume that a concatenation of one Employee and one Dept record is wide enough to enable a disk page to hold five such records. Lets compare the cost of two alternative algebraic expressions for processing a query that retrieves those employees that work for the toy department: – б dname=Toy (Emp Dept) –Emp б dname=Toy (Dept)

Prof. Ghandeharizadeh25 EQUALITY JOIN (Cont…) б dname=Toy (Emp Dept) EmpDept Tmp1 б dname=Toy Tmp2 t(Tmp1) = t(Emp) × t(Dept) / ν(Dept, dno) = 100,000 ×1000 / 1000 = 100,000 P(Tmp1) = 100,000 / 5 = 20,000 C( ) = P(Dept) + P(Emp) ×P(Dept) = ,000 ×100 = 1,000,100 (page nested loop) C w (Tmp1) = 20,000 C w (б) = 20,000 t(Tmp2) = t(Tmp1) / ν(Dept, dname) = 100,000 / 500 = 200 P(Tmp2) = 200 / 5 = 40 C w (Tmp2) = 40 Cost = C( ) + C w (Tmp1) + C(б) + C w (Tmp2) = 1,000, , , = 1,040,140 I/O

Prof. Ghandeharizadeh26 EQUALITY JOIN (Cont…) Emp б dname=Toy (Dept) Emp Dept Tmp1 б dname=Toy Tmp2 C w (б) = 100 t(Tmp1) = t(Dept) / ν(Dept, dname) = 1000 / 500 = 2 P(Tmp1) = 1 C w (Tmp1) = 1 C( ) = P(Tmp1) + P(Tmp1) ×P(Emp) = 1 +1×10,000 = 10,001 (page nested loop) t(Tmp2) = t(Emp) × t(Tmp1) / ν(Emp,dno) = 100,000 ×2 / 1000 = 200 P(Tmp2) = 200 / 5 = 40 C w (Tmp2) = 40 Cost = C(б) + C w (Tmp1) + C( ) + C w (Tmp2) = , = 10,142 I/O

Prof. Ghandeharizadeh27 EQUALITY JOIN (Cont…) The role of a query optimizer is to re-arrange operators of a query-tree to minimize the cost of executing a query. It employs heuristic search to compute a low-cost execution-tree: 1.Given a query plan, push the selection and projection operators down the query-tree. The system must be careful with the projection operator. 2.Combine two operators whenever possible, e.g., selection and join can be combined, projection and selection can be combined, projection and join can be combined. 3.Join is associative, consider different orderings of the join operators: (R S) T = R (S T) –Never compute the Cartesian product of two relations. For example if the join clause is (R.A = S.A and R.B = T.B) then it would be a mistake to use the following clause (S Cartesian product T) and R.A = ST.A and R.B = ST.B –Always join an intermediate result with an original relation: (((R S) T) U) is Okay (R S) (T U) is not Okay