Presentation is loading. Please wait.

Presentation is loading. Please wait.

CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections 14.1-14.3; 14.5-14.7)

Similar presentations


Presentation on theme: "CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections 14.1-14.3; 14.5-14.7)"— Presentation transcript:

1 CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections 14.1-14.3; 14.5-14.7)

2 CPSC 404, Laks V.S. Lakshmanan2 What you will learn from this lecture v General selections. v Union/intersection. v Group-by.

3 CPSC 404, Laks V.S. Lakshmanan3 Simple Selections v Of the form v Size of result approximated as size of R * reduction factor – reduction factor typically estimated with the uniformity assumption – or estimated with the help of a histogram [will discuss soon]. v With no index, unsorted: (worst case)  table scan only option; cost = M (#pages of R). v With an index on selection attribute: Use index to find qualifying data entries, then retrieve corresponding data records. (Hash index useful only for equality selections.) v What about a sorted indexless relation? Is binary search always superior to table scan? SELECT * FROM Songs S WHERE S.sname = ‘C%’

4 CPSC 404, Laks V.S. Lakshmanan4 Using an Index for Selections v Cost depends on #qualifying tuples, and clustering. – Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large w/o clustering). – For example, assuming uniform distribution of snames(!), about 1/26 of tuples qualify (20 pages, 1600 tuples). With a clustered index, cost is about 20 I/Os; if unclustered, upto 1600 I/Os! [assumes S has 500 pages.] v Important refinement for unclustered indexes: 1. Find qualifying data entries. 2. Sort the rid’s of the data records to be retrieved (or, if the records are not stored in rid-order, sort the page addresses). 3. Fetch rids in order. This ensures that each data page is looked at just once (though # of such pages likely to be higher than with clustering).

5 CPSC 404, Laks V.S. Lakshmanan5 General Selection Conditions v Such selection conditions are first converted to conjunctive normal form (CNF): (time<8/9/04 OR rating=5 OR sid=3 ) AND (uid=‘123 OR rating=5 OR sid=3). v We only discuss the case with no disjunctions (most common in practice!) v disjunctions may or may not be easy – e.g., (time<8/9/04 OR uid=‘123) index on uid only may still warrant a scan – e.g., (time<8/9/04 OR uid=‘123) AND sid = 3 index on sid helps tremendously - (time = 8/9/04 OR uid=`123’). Can you think of a clever strategy if given index on time and on uid? * (time<8/9/04 AND uid=123) OR rating=5 OR sid=3

6 CPSC 404, Laks V.S. Lakshmanan6 Two Approaches to General Selections v First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining conditions that don’t match the index: – Most selective access path: An index probe or index scan or table scan that we estimate will require the fewest page I/Os. – e.g., time.  2 options: u option 1: use the B+ tree index on time (check rating=5 and sid=3 for each retrieved tuple) u option 2: use the hash index on (check time<8/9/94) – Conditions that match this index reduce the number of tuples retrieved; other conditions are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.

7 CPSC 404, Laks V.S. Lakshmanan7 Intersection of Rids v Second approach (if we have 2 or more matching indexes): – Get sets of rids of data records using each matching index. – Then intersect these sets of rids – Retrieve the records and apply any remaining conditions. – e.g., time<8/9/04 AND rating=5 AND sid=3 u suppose we have a B+ tree index on time and an index on sid u retrieve rids of records satisfying time<8/9/04 using the first u retrieve rids of records satisfying sid=3 using the second u intersect the rids u finally, retrieve records and check rating=5 u How’d you implement intersection efficiently?

8 CPSC 404, Laks V.S. Lakshmanan8 The Projection Operation v An approach based on sorting: – Modify Phase 1 of external sort to eliminate unwanted fields. Thus, sorted sublists (runs) are produced, but tuples in SSLs are smaller than input tuples. (Size ratio depends on # and size of fields dropped.) – Modify merging passes to eliminate duplicates. Thus, number of result tuples smaller than input. (Difference depends on # of duplicates.) – What should we sort on? SELECT DISTINCT R.sid, R.uid FROM Ratings R

9 CPSC 404, Laks V.S. Lakshmanan9 Projection Based on Hashing v Partitioning phase: Read R using one input buffer. For each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buckets/buffers. – Result is B-1 partitions (of tuples with no unwanted fields). Any two tuples from different partitions guaranteed to be distinct. v Duplicate elimination phase: For each partition, read it and build an in-memory hash table, using hash fn h2 (<> h1) on all fields, while discarding duplicates. – If partition does not fit in memory, can apply hash-based projection algorithm recursively to this partition. (but what do you “project” on in recursive invocation(s)?) v Note the similarity to hash join.

10 CPSC 404, Laks V.S. Lakshmanan10 Set Operations v Intersection and cross-product special cases of join. v Union: sorting based and hash based v Sorting based approach: – Sort both relations (on combination of all attributes). – Scan sorted relations and merge them. – Alternative: Merge SSLs from Phase 1 for both relations. – An advantage: result is sorted. v Hash based approach to union: – Partition R and S using hash function h. – For each S-partition, build in-memory hash table (using h2), scan corr. R-partition and add tuples to table while discarding duplicates. – What is the buffer requirement for this?

11 CPSC 404, Laks V.S. Lakshmanan11 Aggregate Operations ( AVG, MIN, etc.) v Without group-by: – In general, requires scanning the relation. – Given index whose search key includes all attributes in the SELECT & WHERE clauses, may suffice to do index scan. (remember what index scan is?) v With group-by: – Sort on group-by attributes, then scan relation and compute aggregate for each group. (Can improve upon this by combining sorting and aggregate computation. In this case, the I/O cost is just that of sorting.) – Similar approach based on hashing on group-by attributes. u Maintain running info. for each group.

12 CPSC 404, Laks V.S. Lakshmanan12 Summary v A virtue of relational DBMSs: queries are composed of a few basic operators; the implementation of these operators can be carefully tuned (and it is important to do this!). v Indeed – relational algebra  gold standard for algebra design: simple ops; sophisticated ones derivable by composition: e.g., join, division. v Many alternative implementation techniques for each operator; no universally superior technique for most operators. v Must consider available alternatives for each operation in a query and choose best one based on system statistics, etc. This is part of the broader task of optimizing a query composed of several ops.


Download ppt "CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections 14.1-14.3; 14.5-14.7)"

Similar presentations


Ads by Google