Notions of clustering Clustered file: e.g. store movie tuples together with the corresponding studio tuple. Clustered relation: tuples are stored in blocks mostly devoted to that relation. Clustering index: tuples (of the relation) with same search key are stored together.
Index-based algorithms: selection To evaluate a= (R) use index on a, if it exists Cost: cost of index lookup (negligible) plus If index is clustering: B(R)/V(R,a) I/O’s (the fraction of the relation with some value for a) Otherwise, an approximation is that each tuple we retrieve is in a different block, so we get: T(R)/V(R,a) I/O’s
Example of index-based selection a= (R), and B(R) = 1000, T(R) =20,000 R is clustered, and no index on attribute a 1000 disk I/O’s R is unclustered, and no index on attribute a 20,000 I/O’s R has a clustering index on a, V(R,a) = 100 10 I/O’s R has a non-clustering index on a, V(R,a) = 100 20,000/100 = 200 disk I/O’s V(R,a) = 20,000 (i.e. attribute a is key) just 1 I/O
Index joins We want to compute R(X,Y) S(Y,Z) Suppose there is a Y-index on S. For each tuple t of R, lookup all tuples in S with key-value t[Y] and output the join of t. Cost: B(R) to read R (clustered case) --- We ignore this cost Each tuple of R joins with T(S)/V(S,Y) tuples of S, on average. S has a non-clustered index on Y: T(R)T(S)/V(S,Y) S has a clustered index on Y: T(R)B(S)/V(S,Y)
Example of index-join T(R) = 10,000, B(R) = 1000 T(S) = 5000, B(S) = 500, V(S,Y) = 100 To compute R(X,Y) S(Y,Z) using a clustered Y-index on S: ,000*(500/100) = 51,000 I/O’s Bad!!
However, things are not so bad in practice Suppose we have the relations: StarsIn(title, year, starName) MovieStar(name, address, gender, birthdate) And there is an index on MovieStar.name Consider the SQL query: SELECT birthdate FROM StarsIn, MovieStar WHERE title = 'King Kong' AND starName = name;
We can first do the selection of those tuples in StarsIn relation with title=‘King Kong’. Suppose they are 10 such tuples. Now, we know that stars take care to not have the same name with some other star. So, name is a key for the relation MovieStar. (V(MovieStar, name) = ?) Hence, V(MovieStar, name) = T(MovieStar) Finally the number of I/O’s is: B(StarsIn) + T( name=‘King Kong’ (StarsIn)) I/O’s for R clustered and T(StarsIn) + T( name=‘King Kong’ (StarsIn)) I/O’s for R non-clustered. Practice (Cont’d)
Joins using sorted indexes We want to compute R(X,Y) S(Y,Z) If S has a B-tree index on Y, Create sorted sublists of R only, and Do a sort join, extracting the S-tuples in order through the index Of both have B-tree index on Y, do a zigzag-join.
Example (B-Tree index on S[Y]) T(R) = 10,000, B(R) = 1000, T(S) = 5000, B(S) = 500, V(S,Y) = 100, S has a B-Tree index on Y Assume that both relations and the indexes are clustered. M = 101 buffers Create 10 sorted sublists of R. Cost: 2B(R) 10 buffers for sublists of R, 1 buffer for S (retrieved via index) Join tuples from input buffers Total cost: 2B(R) + B(R) + B(S) + index lookup = index lookup = index lookup
Zigzag Join Suppose we have B-Tree indexes on both S[Y] and R[Y]. We can jump back and forth between the indexes finding Y- values that they share in common. Tuples from R with Y-value that doesn’t appear in S need never be retrieved, and similarly tuples of S whose Y-value doesn’t appear in R need never be retrieved. Example. Let the Y-values for R be: 1,3,4,4,4,5,6 Let the Y-values for S be: 2,2,4,4,6,7 Start with the 1 and 2. Since 1<2 skip 1 in R. Since 2<3 skip the 2’s in S. Since 3<4 skip 3 in R. Join 4’s. …
Example (Zigzag Join) T(R) = 10,000, B(R) = 1000, T(S) = 5000, B(S) = 500, S and R both have clustered B-Tree indexes on Y There is no need to store either relation. We use just disk I/O’s to read the blocks of R and S through their indexes. We can determine from the indexes alone that a large fracion of R or S cannot match tuples of the other relation, so the cost might be considerably less than 1500 I/O’s.