Chapter 12 Query Processing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park
Outline Overview(already covered) Measures of Query Cost(already covered) Selection Operation(already covered) Sorting(already covered) Join Operation Other Operations Evaluation of Expressions
Join Operation Several different algorithms to implement joins Nested-loop join Block nested-loop join Indexed nested-loop join Merge-join Hash-join Choice based on cost estimate Examples use the following information Number of records of student5,000takes10,000 Number of blocks of student100takes400
Nested-Loop Join (1/2) To compute the theta join r s for each tuple t r in r do begin for each tuple t s in s do begin test pair (t r, t s ) to see if they satisfy the join condition if they do, add t r t s to the result end end r is called the outer relation and s the inner relation of the join Requires no indices and can be used with any kind of join condition Expensive since it examines every pair of tuples in the two relations
Nested-Loop Join (2/2) In the worst case, if the buffer can hold only one block of each relation, the estimated cost is n r b s + b r block accesses If the smaller relation fits entirely in memory, use that as the inner relation. The cost is reduced to b r + b s block accesses Assuming worst case memory availability, cost estimate is 5000 = 2,000,100 block accesses with student as outer relation = 1,000,400 block accesses with takes as outer relation If smaller relation (student) fits entirely in memory, the cost estimate will be 500 block accesses Block nested-loop join algorithm (next slide) is preferable
Block Nested-Loop Join Variant of nested-loop join in which every block of inner relation is paired with every block of outer relation for each block B r of r do begin for each block B s of s do begin for each tuple t r in B r do begin for each tuple t s in B s do begin Check if (t r, t s ) satisfy the join condition if they do, add t r t s to the result end end end end Worst case estimate: b r b s + b r block accesses
Indexed Nested-Loop Join Index lookups can replace file scans if Join is an equi-join or natural join and An index is available on the inner relation’s join attribute For each tuple t r in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple t r Cost of the join: b r + n r c Where c is the cost of a single selection using the join condition If indices are available on join attributes of both r and s, use the relation with fewer tuples as the outer relation
Merge-Join (1/2) Sort both relations on their join attribute (if not already sorted on the join attributes) Merge the sorted relations to join them
Merge-Join (2/2) Can be used only for equi-joins and natural joins Each block needs to be read only once Thus number of block accesses for merge-join is b r + b s + the cost of sorting if relations are unsorted
Hash-Join (1/2) Applicable for equi-joins and natural joins A hash function h is used to partition tuples of both relations h maps JoinAttrs values to {0,1,…,n} where JoinAttrs denotes the common attributes of r and s used in the natural join r 0, r 1,..., r n denote partitions of r tuples s 0, s 1..., s n denote partitions of s tuples r tuples in r i need only to be compared with s tuples in s i
Hash-Join (2/2)
Other Operations Duplicate elimination can be implemented via sorting or hashing On sorting, duplicates will come adjacent to each other Hashing is similar – duplicates will come into the same bucket Projection is implemented by performing projection on each tuple, which gives a relation that could have duplicate records, and then removing duplicate records Aggregation can be implemented in a manner similar to duplicate elimination Sorting or hashing can be used to bring tuples in the same group together, and then the aggregate functions can be applied on each group
Evaluation of Expressions So far, we have studied how individual operations are carried out; now we consider how to evaluate an expression containing multiple operations The obvious way is simply to evaluate one operation at a time, in an appropriate order; the result of each evaluation is materialized in a temporary relation for subsequent use A disadvantage to this method is the need to build the temporary relation, which (unless they are small) must be written to disk An alternative approach is to evaluate several operations simultaneously in a pipeline with the results of one operation passed on to the next, without the need to store a temporary relation
Materialization Evaluate one operation at a time, starting at the lowest-level; store intermediate results into temporary relations to evaluate next-level operations Consider the expression: ∏ name ( building = “Watson” (department) instructor)
Pipelining Evaluate several operations simultaneously, passing the results of one operation to the next Consider the previous example again: Don’t store result of building=“Watson” (department) Instead, pass tuples directly to the join Similarly, don’t store result of join, pass tuples directly to the projection Much cheaper than materialization Pipelining may not always be possible