Published by Brice Daniel. Modified over 9 years ago.
1
CSCI 453 -- Query Processing
QUERY PROCESSING & OPTIMIZATION
Dr. Awad Khalil
Computer Science Department, AUC
2
Content
- Why Query Optimization?
- Optimization Procedure
  - Syntactic and semantic checking
  - Casting the query into some internal representation
  - Heuristic optimization
  - Semantic Query Optimization
  - Systematic Optimization
  - Selection of the Cheapest Plan
  - Code Generation for the Access Plan
3
Why Query Optimization
When users submit queries to a DBMS, they expect a response that is not only correct and consistent but also timely, that is, produced in an acceptable period of time.
Queries can be written in a number of different ways, many of them inefficient. Therefore, the DBMS should take a query and, before it is run, prepare a version that can be executed efficiently. This process is known as Query Optimization.
In practice, a DBMS is concerned with improving the execution strategy rather than finding the most efficient version, which is effectively impossible for most queries.
The larger a database becomes, the greater the need for a query optimizer. It is one of the strengths of relational databases that query optimization can be done automatically by a software optimizer included within the DBMS software.
4
Optimization Procedure
1. Syntactic and semantic checking.
2. Casting the query into some internal representation.
3. Heuristic optimization (converting to a more efficient form).
4. Semantic query optimization.
5. Systematic optimization.
6. Selection of the cheapest plan.
7. Code generation for the access plan.
5
1- Syntactic and semantic checking
A query is expressed in a high-level language (SQL) and is parsed to check whether it is syntactically correct (i.e. obeys the rules of SQL grammar).
The query is then validated to check whether it is semantically correct (i.e. verified to see that the attributes, tables, views and other objects it names actually exist in the database).
6
2- Casting the query into some internal representation
The query is converted into a form more suitable for machine manipulation. In relational data models, the form is based on relational algebra or relational calculus. Relational algebra is commonly used and is usually manipulated in the form of operator graphs.
Example relations:
S (S#, Sname, Status, City)
SP (S#, P#, Qty)
Query: Get the names of suppliers who supply part P2.
((S Join SP) Where P# = ‘P2’) [Sname]
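The operator-graph idea can be sketched in Python. This is only an illustration, not the internal representation of any particular DBMS: relations are modelled as lists of dicts, the operator tree as nested tuples, and the sample rows and supplier names are made up.

```python
# Hypothetical sample data for S(S#, Sname, Status, City) and SP(S#, P#, Qty).
S = [
    {"S#": "S1", "Sname": "Smith", "Status": 20, "City": "London"},
    {"S#": "S2", "Sname": "Jones", "Status": 10, "City": "Paris"},
    {"S#": "S3", "Sname": "Blake", "Status": 30, "City": "Paris"},
]
SP = [
    {"S#": "S1", "P#": "P1", "Qty": 300},
    {"S#": "S1", "P#": "P2", "Qty": 200},
    {"S#": "S2", "P#": "P2", "Qty": 400},
    {"S#": "S3", "P#": "P3", "Qty": 100},
]

def natural_join(r, s):
    """Combine every pair of rows that agree on all shared attribute names."""
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in set(a) & set(b)):
                out.append({**a, **b})
    return out

def select(r, pred):
    """Keep the rows for which the predicate holds."""
    return [row for row in r if pred(row)]

def project(r, attrs):
    """Keep only attrs and drop duplicate rows (first-seen order)."""
    seen, out = set(), []
    for row in r:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

# ((S Join SP) Where P# = 'P2') [Sname] as an operator tree.
tree = ("project", ["Sname"],
        ("select", lambda t: t["P#"] == "P2",
         ("join", ("rel", S), ("rel", SP))))

def evaluate(node):
    """Evaluate the tree bottom-up, one operator per node."""
    op = node[0]
    if op == "rel":
        return node[1]
    if op == "join":
        return natural_join(evaluate(node[1]), evaluate(node[2]))
    if op == "select":
        return select(evaluate(node[2]), node[1])
    if op == "project":
        return project(evaluate(node[2]), node[1])
    raise ValueError(f"unknown operator: {op}")

names = [row["Sname"] for row in evaluate(tree)]
```

Keeping the query as a tree rather than as text is what lets later steps rearrange operators without re-parsing.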
7
3- Heuristic optimization (Converting to a more efficient form)
Heuristic optimization depends on the syntactic form of the query and not on the semantics of the database. It is based only on the general properties of relational algebra expressions and involves replacing relational algebra expressions with more efficient ones using equivalence-preserving transformation rules.
Objectives:
A- Minimize disk input/output and processing by reducing the size of intermediate tables in a query.
B- Directly reduce the amount of computation involved by rewriting the query as a simpler expression.
C- Reduce the amount of computation involved by rewriting the query as an expression that, although more complex, is less computationally demanding.
8
A- Rules that tend to reduce the size of intermediate tables
1. Perform selection as early as possible (especially before join)
STUDENT (Std#, Sname)
COURSE (Course#, Cname, Instructor)
REGISTRATION (Std#, Course#, Date)
GRADE (Std#, Course#, Grade)
Query: List the names of students registered in the database course.
((STUDENT Join (REGISTRATION Join COURSE)) Where Cname = ‘Database’) [Sname]
9
Perform selection as early as possible (especially before join)
The selection operator can be pushed as far down the operator graph as possible. At intermediate nodes, the operators are pushed down the appropriate branches.
10
Perform selection as early as possible (especially before join)
The selection operator can be passed down again, until it applies directly to COURSE:
(STUDENT Join (REGISTRATION Join (COURSE Where Cname = ‘Database’))) [Sname]
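The effect of pushing the selection down can be checked on a small example. The sketch below assumes made-up rows for STUDENT, COURSE and REGISTRATION (the student names, dates and the second instructor are hypothetical) and compares the sizes of the intermediate tables when the selection is applied last versus first.

```python
def natural_join(r, s):
    """Combine every pair of rows that agree on all shared attribute names."""
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in set(a) & set(b)):
                out.append({**a, **b})
    return out

# Hypothetical sample data.
STUDENT = [
    {"Std#": 1, "Sname": "Ali"},
    {"Std#": 2, "Sname": "Mona"},
    {"Std#": 3, "Sname": "Omar"},
]
COURSE = [
    {"Course#": "C1", "Cname": "Database", "Instructor": "Khalil"},
    {"Course#": "C2", "Cname": "Networks", "Instructor": "Hassan"},
]
REGISTRATION = [
    {"Std#": 1, "Course#": "C1", "Date": "2024-02-01"},
    {"Std#": 1, "Course#": "C2", "Date": "2024-02-01"},
    {"Std#": 2, "Course#": "C2", "Date": "2024-02-02"},
    {"Std#": 3, "Course#": "C1", "Date": "2024-02-03"},
]

# Selection applied last: both joins run on the full tables.
reg_course = natural_join(REGISTRATION, COURSE)
all_rows = natural_join(STUDENT, reg_course)
late_sizes = [len(reg_course), len(all_rows)]
late_names = {row["Sname"] for row in all_rows if row["Cname"] == "Database"}

# Selection pushed down to COURSE: the joins see fewer rows.
db_course = [c for c in COURSE if c["Cname"] == "Database"]
reg_db = natural_join(REGISTRATION, db_course)
db_rows = natural_join(STUDENT, reg_db)
early_sizes = [len(reg_db), len(db_rows)]
early_names = {row["Sname"] for row in db_rows}
```

Both orders produce the same set of names, but on this data the intermediate tables shrink from 4 rows to 2 when the selection runs first.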
11
A- Rules that tend to reduce the size of intermediate tables
2. Perform projection as early as possible
Under certain conditions projection may be commuted with a join. When a projection follows a join, it is possible to push the projection down below the join, but the pushed-down projection must then also keep the join attributes, so the original projection must still be performed after the join.
Unless the cardinalities of the intermediate relations are reduced, the usefulness of pushing a projection before a join is questionable.
12
Perform projection as early as possible
(STUDENT Join ((REGISTRATION Join (COURSE Where Cname = ‘Database’)) [Std#, Course#])) [Sname]
The projection [Std#, Course#] should be pushed down the tree.
13
Perform projection as early as possible
14
A- Rules that tend to reduce the size of intermediate tables
3. Perform select before project
If the selection condition involves only attributes that appear in the projection list, then the two operations can be commuted, e.g.:
((GRADE [Std#, Course#]) Where Std# = 123)
can be commuted to:
(GRADE Where Std# = 123) [Std#, Course#]
15
B- Rules that tend to directly reduce the amount of computation involved – making the expression simpler
1. Combine a cascade of selections into one selection.
Query: Get the full details of courses with course number CSCI-453 where the instructor is Khalil.
((COURSE Where Instructor = ‘Khalil’) Where Course# = ‘CSCI-453’)
can be converted to:
COURSE Where Instructor = ‘Khalil’ AND Course# = ‘CSCI-453’
16
B- Rules that tend to directly reduce the amount of computation involved – making the expression simpler
2. Combine a cascade of projections into one projection.
((COURSE [Cname, Instructor]) [Cname])
can be converted to:
COURSE [Cname]
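The two cascade rules (B1 and B2) can be checked on a toy COURSE table. This is a minimal sketch: the rows other than CSCI-453 taught by Khalil are invented for illustration, and projection is given set semantics (duplicates removed).

```python
# Hypothetical COURSE(Course#, Cname, Instructor) rows.
COURSE = [
    {"Course#": "CSCI-453", "Cname": "Database", "Instructor": "Khalil"},
    {"Course#": "CSCI-210", "Cname": "Data Structures", "Instructor": "Khalil"},
    {"Course#": "CSCI-454", "Cname": "Database", "Instructor": "Hassan"},
]

def select(r, pred):
    """Keep the rows for which the predicate holds."""
    return [row for row in r if pred(row)]

def project(r, attrs):
    """Keep only attrs and drop duplicate rows (first-seen order)."""
    seen, out = set(), []
    for row in r:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

# B1: two selections in cascade versus one selection with AND.
cascade_sel = select(
    select(COURSE, lambda t: t["Instructor"] == "Khalil"),
    lambda t: t["Course#"] == "CSCI-453",
)
single_sel = select(
    COURSE,
    lambda t: t["Instructor"] == "Khalil" and t["Course#"] == "CSCI-453",
)

# B2: two projections in cascade versus the outermost projection alone.
cascade_proj = project(project(COURSE, ["Cname", "Instructor"]), ["Cname"])
single_proj = project(COURSE, ["Cname"])
```

The combined forms scan the table once instead of building and re-scanning an intermediate result.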
17
C- Rules that tend to directly reduce the amount of computation involved – rewriting in a less computationally demanding form
Where P OR (Q AND R)
can be rewritten as:
Where (P OR Q) AND (P OR R)
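That this rewrite preserves the meaning of the condition follows from the distributive law of OR over AND. A brute-force check over all eight truth assignments, written as a small Python sketch (the function names are illustrative only):

```python
from itertools import product

def original(p, q, r):
    """P OR (Q AND R)"""
    return p or (q and r)

def rewritten(p, q, r):
    """(P OR Q) AND (P OR R)"""
    return (p or q) and (p or r)

# Compare the two forms on every combination of truth values.
equivalent = all(
    original(p, q, r) == rewritten(p, q, r)
    for p, q, r in product([False, True], repeat=3)
)
```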
18
4- Semantic Query Optimization
Semantic query optimization transformations use constraints on the database schema to modify queries.
Example: Consider the Join of the two tables:
SP (S#, P#, Qty) and P (P#, Pname, Color, Weight, City)
If SP.P# is a foreign key (with no nulls allowed) and is matched to the primary key P.P#, then
(SP Join P) [S#]
can be transformed to
SP [S#]
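The role of the foreign-key constraint can be seen on a small example. The sketch below uses invented rows for SP and P and shows that the two expressions agree when every SP.P# matches a row of P, and differ once a dangling P# is introduced (which the constraint forbids).

```python
def natural_join(r, s):
    """Combine every pair of rows that agree on all shared attribute names."""
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in set(a) & set(b)):
                out.append({**a, **b})
    return out

def project(r, attrs):
    """Projection with set semantics: a set of value tuples."""
    return {tuple(row[a] for a in attrs) for row in r}

# Hypothetical sample data; P# is the only attribute shared by SP and P.
P = [
    {"P#": "P1", "Pname": "Nut", "Color": "Red", "Weight": 12, "City": "London"},
    {"P#": "P2", "Pname": "Bolt", "Color": "Green", "Weight": 17, "City": "Paris"},
    {"P#": "P3", "Pname": "Screw", "Color": "Blue", "Weight": 17, "City": "Rome"},
]
SP = [
    {"S#": "S1", "P#": "P1", "Qty": 300},
    {"S#": "S1", "P#": "P2", "Qty": 200},
    {"S#": "S2", "P#": "P2", "Qty": 400},
]

# Foreign key holds: the join cannot drop any SP row.
with_join = project(natural_join(SP, P), ["S#"])
without_join = project(SP, ["S#"])

# Foreign key violated (P9 does not exist in P): the rewrite is no longer valid.
SP_bad = SP + [{"S#": "S3", "P#": "P9", "Qty": 100}]
bad_with_join = project(natural_join(SP_bad, P), ["S#"])
bad_without_join = project(SP_bad, ["S#"])
```

The optimizer can only drop the join because the schema guarantees the first situation; the equivalence is a property of the constraint, not of relational algebra alone.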
19
5- Systematic Optimization
Having reorganized the query, the Query Optimizer must then consider how to retrieve the information physically from the database.
The Query Optimizer generates a query plan (also called an execution strategy, access plan, or execution plan) using access routines (access aids or low-level implementation procedures) for the various operations.
In optimizing at this level, an accurate cost estimate must be calculated for each execution strategy. Cost estimation is a time-consuming task.
20
5- Systematic Optimization (Cont’d)
Statistical Information:
Systematic optimizers may make use of the following information:
- Number of tuples.
- Number of blocks used to store these tuples.
- Number of distinct data values.
- Percent of total number of relevant database blocks used by the relation.
- Ordering of tuples in the blocks.
- Blocking factor for each file.
- Existence and type of indexes.
- Number of levels of each index.
- Number of blocks for packed relations.
- Physical clustering of records.
21
5- Systematic Optimization (Cont’d)
Types of cost functions:
1- Costs of accesses to secondary storage:
Cost of searching for, reading and writing data blocks that reside on secondary storage. Temporary, intermediate files may also need to be accessed; this is a significant problem, as the size of intermediate results must also be estimated accurately to calculate the number of I/O operations required. The access cost is the number of blocks that must be brought into main memory for reading plus the number of blocks that must be written out to secondary storage.
2- Computation costs:
Cost of performing in-memory operations in the data buffers during query execution. Operations include:
- Searching for records.
- Sorting records.
- Merging records for a join.
- Performing computations on field values.
3- Communication costs:
Cost of shipping the query and results from the database site to the terminal where the query originated.
22
5- Systematic Optimization (Cont’d)
Goals:
For large databases, the main emphasis is on reducing the cost of accesses to secondary storage, i.e., the number of block transfers between disk and memory.
For small databases, in which most data can be held in memory, the emphasis is on minimizing computation.
In the case of distributed databases, communication costs must also be minimized.
It is difficult to include all cost components in a single weighted cost function. Therefore, most cost functions consider a single factor only, or possibly a combination of one factor estimating I/O and one factor estimating CPU use.
23
5- Systematic Optimization (Cont’d)
Use of Costs:
t(R) - the number of tuples in relation R.
b(R) - the number of blocks needed to store the relation R, if R is in packed form.
bf(R) - the number of tuples per block, also called the blocking factor of R. If R is packed, then b(R) = t(R) / bf(R).
n(A, R) - the number of distinct values of attribute A in relation R. This can be used to approximate the number of tuples that have a particular value.
s(A=c, R) - the selection size: the average number of records that satisfy an equality selection condition A = c on an attribute. If we assume that the values of A are uniformly distributed in R, then s(A=c, R) = t(R) / n(A, R).
24
5- Systematic Optimization (Cont’d)
Example:
Consider the following database:
STUDENT (Stuid, Stuname, Major, Credits)
ENROLL (Course#, Stuid, Grade)
To estimate the number of students in the university with a CS major:
- If there are 10,000 students, then t(STUDENT) = 10,000
- If there are 25 possible major subjects, then n(MAJOR, STUDENT) = 25
- Then, we can estimate the number of CS majors as:
  s(MAJOR=‘CS’, STUDENT) = t(STUDENT) / n(MAJOR, STUDENT) = 10,000 / 25 = 400
- Note that if A is a primary key, then n(A, R) = t(R) and the selection size is 1.
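The estimate above is simple enough to express as code. The following sketch uses hypothetical function names; the blocking factor of 40 is an assumed value, and rounding the block count up to a whole block is an addition to the t(R) / bf(R) formula given on the slides.

```python
import math

def selection_size(t_r, n_a_r):
    """s(A=c, R) = t(R) / n(A, R), assuming values of A are uniformly distributed."""
    return t_r / n_a_r

def blocks_needed(t_r, bf_r):
    """b(R) = t(R) / bf(R) for a packed relation, rounded up to a whole block."""
    return math.ceil(t_r / bf_r)

t_student = 10_000   # t(STUDENT), from the example
n_major = 25         # n(MAJOR, STUDENT), from the example

cs_majors = selection_size(t_student, n_major)     # expected CS majors
by_key = selection_size(t_student, t_student)      # primary key: n(A, R) = t(R)
student_blocks = blocks_needed(t_student, 40)      # bf(STUDENT) = 40 is assumed
```

The uniformity assumption is the weak point: if majors are skewed, the real count for a popular major can be far from t(R) / n(A, R).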
25
5- Systematic Optimization (Cont’d)
Processing Joins:
In examining the systematic cost of evaluating a typical query, say a Join, most query processors focus on the accesses to secondary storage.
This cost involves not only the effort of retrieving the tables that are input to a query and writing out the final result, but also the cost of writing and reading any intermediate tables.
This is especially important when considering the size of intermediate results in a complex query.
26
5- Systematic Optimization (Cont’d)
To calculate the size of a Join, say between R and S, of sizes t(R) and t(S) respectively, we first need to estimate the number of tuples of R that will match tuples of S on the corresponding attributes. There are several distinct possibilities to consider:
1. If there are no common attributes, the Join becomes a Product, and the number of tuples in the result is t(R) * t(S).
2. If the set of common attributes is a key for one relation, then the number of tuples in the Join can be no larger than the number of tuples in the other relation, e.g., if the common attributes are a key for R, then the size of the Join is less than or equal to t(S).
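The two cases above can be written as a small upper-bound function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the fallback for common attributes that are not a key is just the product (the loosest safe bound), and t(ENROLL) = 50,000 is an invented figure.

```python
def join_size_upper_bound(t_r, t_s, has_common_attrs, key_side=None):
    """Upper bound on the number of tuples in R Join S.

    key_side is "R" or "S" when the common attributes form a key of that
    relation, and None otherwise.
    """
    if not has_common_attrs:
        return t_r * t_s      # case 1: the join degenerates to a product
    if key_side == "R":
        return t_s            # case 2: each S tuple matches at most one R tuple
    if key_side == "S":
        return t_r            # case 2, symmetric
    return t_r * t_s          # no tighter bound without further statistics

t_student = 10_000            # from the earlier example
t_enroll = 50_000             # assumed for illustration

# Stuid is the key of STUDENT, so the join has at most t(ENROLL) tuples.
bound_keyed = join_size_upper_bound(t_student, t_enroll, True, key_side="R")
bound_product = join_size_upper_bound(t_student, t_enroll, False)
```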
27
5- Systematic Optimization (Cont’d)
Methods of Processing Joins:
- Nested loops using blocks.
- Sort-merge.
- Using an index or hash key.
28
5- Systematic Optimization (Cont’d)
Nested loops using blocks:
Assume that both R and S are packed relations occupying b(R) and b(S) blocks respectively. If we have two buffers, we can bring the first block of R into the first buffer and bring each block of S in turn into the second buffer, comparing each tuple of the R block with each tuple of the S block before switching in the next S block. When we have finished all the S blocks, we bring the next R block into the first buffer, and so on.
The algorithm can be shown as:
For each block of R
    For each block of S
        For each tuple in the R block
            For each tuple in the S block
                If the tuples satisfy the condition then add to join
End
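The algorithm above translates almost line for line into Python. In this sketch, blocks are simulated as sublists, a counter records how many block reads the two-buffer scheme performs, and all tuples and the blocking factor of 2 are made up for illustration.

```python
def to_blocks(tuples, bf):
    """Pack tuples into blocks of at most bf tuples each."""
    return [tuples[i:i + bf] for i in range(0, len(tuples), bf)]

def block_nested_loop_join(r_blocks, s_blocks, condition):
    """Join two blocked relations using one buffer for R and one for S."""
    result = []
    block_reads = 0
    for r_block in r_blocks:          # bring the next R block into buffer 1
        block_reads += 1
        for s_block in s_blocks:      # bring each S block in turn into buffer 2
            block_reads += 1
            for r in r_block:
                for s in s_block:
                    if condition(r, s):
                        result.append((r, s))
    return result, block_reads

# Hypothetical data: R tuples are (enrolment id, Stuid), S tuples are (Stuid, name).
R = [("e1", 1), ("e2", 2), ("e3", 1), ("e4", 3), ("e5", 4), ("e6", 2)]
S = [(1, "Ali"), (2, "Mona"), (3, "Omar"), (4, "Sara")]

r_blocks = to_blocks(R, 2)            # b(R) = 3
s_blocks = to_blocks(S, 2)            # b(S) = 2

joined, reads = block_nested_loop_join(
    r_blocks, s_blocks, lambda r, s: r[1] == s[0]
)
```

In this sketch the read count comes out as b(R) + b(R) * b(S): every R block is read once, and all of S is re-read for each R block. Writing out the result is not counted.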
29
5- Systematic Optimization (Cont’d)
Cost Functions for JOIN using Nested Loops:
Refer to the textbook “Fundamentals of Database Systems”, Elmasri and Navathe:
Third Edition: pages 618 – 621
Fourth Edition: pages 527 – 529
30
5- Systematic Optimization (Cont’d)
Sort-Merge Join:
In the previous method, the assumption was made that the tuples in the tables are not sorted in any particular way. If both files are sorted on the attribute to be joined, then another access method, Sort-Merge Join, is preferable (see references).
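The merge phase of such a join can be sketched as follows. This is an in-memory illustration that assumes both inputs are already sorted on the join attribute; it ignores blocks and buffers, and the sample tuples are hypothetical.

```python
def merge_join(r, s, key_r, key_s):
    """Equi-join two lists that are already sorted on their join keys."""
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        kr, ks = key_r(r[i]), key_s(s[j])
        if kr < ks:
            i += 1                    # r is behind: advance r
        elif kr > ks:
            j += 1                    # s is behind: advance s
        else:
            # Find the run of s tuples sharing this key value.
            j_end = j
            while j_end < len(s) and key_s(s[j_end]) == kr:
                j_end += 1
            # Pair every r tuple with this key against the whole run.
            while i < len(r) and key_r(r[i]) == kr:
                for k in range(j, j_end):
                    out.append((r[i], s[k]))
                i += 1
            j = j_end
    return out

# Hypothetical data, both sorted on Stuid.
STUDENT = [(1, "Ali"), (2, "Mona"), (4, "Sara")]                 # (Stuid, Stuname)
ENROLL = [("C1", 1, "A"), ("C2", 1, "B"), ("C1", 3, "C"), ("C3", 4, "A")]

pairs = merge_join(STUDENT, ENROLL, lambda t: t[0], lambda t: t[1])
```

Because each input is scanned once in key order, neither file has to be re-read for every block of the other, which is where the advantage over nested loops comes from.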
31
5- Systematic Optimization (Cont’d)
Using an Index or Hash Key:
If one of the files, S, has an index on the common attribute A, or if A is a hash key, then each tuple of R is retrieved in the usual way and the index or hashing algorithm is used to find all the matching records of S. For example, to find STUDENT Join ENROLL, with STUDENT as S and ENROLL as R, we access each ENROLL record in sequence and then use the index on Stuid to find the matching STUDENT record.
The overall cost depends on the type of index as follows:
- If A is the primary key of S and we have a primary index on S, then the access cost is the cost of accessing all blocks of R plus the cost of reading the index and accessing one record of S for each of the tuples in R:
  b(R) + t(R) * (L(indexname) + 1)
  where L(indexname) is the number of levels in a multilevel index, which is equivalent to the average number of index accesses needed to find an entry.
- If the index is a clustering index (a file with a clustering index is one in which data records are physically ordered on a non-key field that does not have a distinct value for each record) with selection size s(A=c, S), the cost is:
  b(R) + t(R) * (L(indexname) + s(A=c, S) / bf(S))
- If we have a hash function (in a hash file, the hash field of a record is subjected to a hashing function which gives the address of the block that contains that record) instead of an index, the cost is:
  b(R) + t(R) * h
  where h is the average number of accesses needed to get a block from its hash key value.
- If the index is a secondary index (a secondary index is an ordered file of data values that point to the records in a data file containing this value in one of their fields), the cost is:
  b(R) + t(R) * (L(indexname) + s(A=c, S))
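The four cost formulas can be compared side by side by writing them as functions. This is a sketch only: the function and parameter names are hypothetical, and every number plugged in below (block counts, index levels, selection size, blocking factor, h) is an assumed value chosen for illustration.

```python
def cost_primary_index(b_r, t_r, levels):
    """b(R) + t(R) * (L + 1): one index descent and one record per R tuple."""
    return b_r + t_r * (levels + 1)

def cost_clustering_index(b_r, t_r, levels, s, bf_s):
    """b(R) + t(R) * (L + s / bf(S)): matches are stored in adjacent blocks."""
    return b_r + t_r * (levels + s / bf_s)

def cost_hash(b_r, t_r, h):
    """b(R) + t(R) * h: h block accesses per lookup on average."""
    return b_r + t_r * h

def cost_secondary_index(b_r, t_r, levels, s):
    """b(R) + t(R) * (L + s): each match may sit in a different block."""
    return b_r + t_r * (levels + s)

# Assumed statistics.
b_r, t_r = 1_000, 50_000      # blocks and tuples of R
levels = 2                    # L(indexname)
s, bf_s = 400, 40             # selection size on S and blocking factor of S
h = 1.5                       # average accesses per hash lookup

costs = {
    "primary index": cost_primary_index(b_r, t_r, levels),
    "clustering index": cost_clustering_index(b_r, t_r, levels, s, bf_s),
    "hash key": cost_hash(b_r, t_r, h),
    "secondary index": cost_secondary_index(b_r, t_r, levels, s),
}
```

With a large selection size, the secondary index is by far the most expensive option here, since each matching record may cost its own block access.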
32
6- Selection of the Cheapest Plan
Having generated several access plans, the systematic optimizer then chooses the cheapest.
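Once each candidate plan has an estimated cost, the choice itself is a minimum over the candidates. A minimal sketch, with invented plan names and cost figures (in block accesses):

```python
# Hypothetical estimated costs for three candidate access plans.
plan_costs = {
    "nested loops": 2_001_000,
    "primary index": 151_000,
    "hash key": 76_000,
}

# Pick the plan whose estimated cost is lowest.
cheapest_plan = min(plan_costs, key=plan_costs.get)
cheapest_cost = plan_costs[cheapest_plan]
```

The choice is only as good as the estimates behind it, which is why the statistics and cost functions of the previous step matter.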
33
7- Code Generation for the Access Plan
The cheapest plan is then coded for execution at the appropriate time.
34
Thank you