Review for Midterm 2, Shahram Ghandeharizadeh

1 Review for Midterm 2 Shahram Ghandeharizadeh

2 Midterm 2
- Scheduled for April 30th.
- 4 papers:
  - Variant indexes.
  - Access path selection.
  - Overview of query optimization.
  - Mining Association Rules.
  - The paper on cache management is not included because it was covered by your project.
- Midterm 2 is worth 35% of your grade.

3 Variant Indexes
- A read-mostly database that is updated infrequently.
- Complex indexes to speed up queries.
- Focuses on physical designs to enhance performance.

4 Example Data Warehouse
Key observations:
- A handful of products: a PROD table with tens of rows.
- Many millions of rows for the SALES table.
Tables:
- SALES(Cid, Pid, Day, Amt, dollar_cost, Unit_sales)
- PROD(Pid, Name, Size, Weight, Package_type)
- TIME(Day, Week, Month, Year, Holiday, Weekday)

5-6 A B+-Tree on Major Holidays
A B+-tree index on the different holidays of the SALES table.
[Figure: SALES rows and a B+-tree leaf page.]
SALES rows: (Joe, Big Mac, Lab day, …), (Mary, Fries, Pres day, …), (Harry, Big Mac, Pres day, …), (Henry, Big Mac, Pres day, …), (Jane, Happy Meal, Pres day, …), (Shideh, Happy Meal, Pres day, …), (Kam, Happy Meal, Pres day, …), (Bob, Big Mac, Pres day, …).
B+-tree leaf page (Value-List entry): (Pres day, (1,2), (1,3), (1,4), (2,1), …), where the (page, slot) pairs form the RID-List.

7 Conjunctive Queries
Count the number of Big Mac sales on “President’s Day”, assuming B+-trees on product (pid) and day of SALES.
With RID-Lists:
- Get the Value-List for “Big Mac” using the B+-tree, obtain RID-List1.
- Get the Value-List for “President’s Day” using the B+-tree, obtain RID-List2.
- Compute the set intersection of RID-List1 and RID-List2.
- Count the number of RIDs in the intersection set.
Is there a better way?
- Yes, use bitmaps and logical bit-wise operations.
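The RID-List steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and the RIDs assume the eight SALES rows from the earlier figure are stored four per page with 1-based (page, slot) numbering.

```python
def count_conjunctive(rid_list1, rid_list2):
    """Count the rows whose RID appears in both RID-lists."""
    return len(set(rid_list1) & set(rid_list2))

# Assumed RIDs: rows 1-4 on page 1, rows 5-8 on page 2.
big_mac_rids = [(1, 1), (1, 3), (1, 4), (2, 4)]           # Joe, Harry, Henry, Bob
pres_day_rids = [(1, 2), (1, 3), (1, 4), (2, 1),
                 (2, 2), (2, 3), (2, 4)]                  # everyone except Joe

count = count_conjunctive(big_mac_rids, pres_day_rids)
print(count)  # Big Macs sold on President's Day: 3
```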

8 Bitmap Indexes
- Use a bitmap to represent the existence of a record with a certain attribute value.
- Example: If a record has the indexed attribute value “Big Mac”, then its corresponding entry in the bitmap is set to one. Otherwise, it is zero.

9-11 A Bitmap
- A bitmap B is defined on T as a sequence of M bits.
- For each row r with row number j that has the property P, we set bit j in B to one; all other bits are set to zero.
- Assuming fixed-size disk pages that hold p records, the RID of record j is (j/p, j%p): the page is j/p (integer division) and the slot number is j%p.
Example: (Pres day, 0100001100111111110000011001…). The figure points in turn at the bits for Record 0, Record 1, and Record 2.
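A small sketch of the definition above, assuming zero-based row numbers and p = 4 records per page (both are assumptions for illustration; the helper names are hypothetical). It builds the bitmap for the property "day is Pres day" over the eight example SALES rows and maps set bits back to RIDs with (j/p, j%p).

```python
def build_bitmap(rows, has_property):
    """Bit j is 1 if row j has the property, otherwise 0."""
    return [1 if has_property(row) else 0 for row in rows]

def rid_of(j, p):
    """RID of row j as (page, slot) when each page holds p records."""
    return (j // p, j % p)

# Day column of the eight example SALES rows, in row-number order.
days = ["Lab day"] + ["Pres day"] * 7

bitmap = build_bitmap(days, lambda d: d == "Pres day")
rids = [rid_of(j, 4) for j, bit in enumerate(bitmap) if bit]

print("".join(map(str, bitmap)))  # 01111111
print(rids[:3])                   # [(0, 1), (0, 2), (0, 3)]
```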

12 A B+-Tree on Major Holidays
A B+-tree index on the different holidays of the SALES table.
[Figure: the same SALES rows as on slide 5; the B+-tree leaf page now stores a bitmap instead of a RID-List: (Pres day, 01111111…).]

13 Conjunctive Queries
Count the number of Big Mac sales on “President’s Day”, assuming B+-trees on product (pid) and day of SALES.
With RID-Lists:
- Get the Value-List for “Big Mac” using the B+-tree, obtain RID-List1.
- Get the Value-List for “President’s Day” using the B+-tree, obtain RID-List2.
- Compute the set intersection of RID-List1 and RID-List2.
- Count the number of RIDs in the intersection set.
With bitmaps:
- Get the Value-List for “Big Mac” using the B+-tree, obtain bit-map1.
- Get the Value-List for “President’s Day” using the B+-tree, obtain bit-map2.
- Recall that the Existence Bitmap (EBM) identifies the rows that exist.
- Let RES = logical AND of bit-map1, bit-map2, and EBM.
- Count the number of bits set to one in RES to identify how many Big Macs were sold on “President’s Day”.
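The bitmap variant can be sketched with Python integers as bit vectors. The bit strings below are assumptions derived from the eight example SALES rows (leftmost character is row 0), the EBM assumes no row has been deleted, and `bits_to_int` is a hypothetical helper.

```python
def bits_to_int(bits):
    """Pack a bit string into an int; character j becomes bit j."""
    value = 0
    for j, ch in enumerate(bits):
        if ch == "1":
            value |= 1 << j
    return value

bitmap1 = bits_to_int("10110001")  # rows with product "Big Mac"
bitmap2 = bits_to_int("01111111")  # rows with day "Pres day"
ebm = bits_to_int("11111111")      # existence bitmap: all rows exist

res = bitmap1 & bitmap2 & ebm      # one bit-wise AND per pair of bitmaps
count = bin(res).count("1")        # number of bits set to one
print(count)                       # 3
```

Compared with intersecting RID-Lists, the AND works on whole machine words at a time, which is the reason the slide calls it the better way.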

14 Variant Indexes
Midterm 2 ignores:
- MEDIAN, N-TILE, Column-Product as aggregates.
- Section 5.

15 Access Path Selection
Formulates a cost prediction for each access plan, using the following cost formula:
COST = Page fetches + W * (RSI calls)
- W is an adjustable weighting factor between I/O and CPU.
- RSI calls is an approximation for CPU utilization.
Assumptions:
- The WHERE tree is considered to be in conjunctive normal form.
- Every conjunct is called a boolean factor.

16 How?
- Enumerate the different execution plans.
- Estimate the cost of performing each plan.
- Pick the cheapest plan.
Definition of cost is as follows:
COST = Page fetches + W * (RSI calls)
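The enumerate, estimate, pick loop can be sketched as below. The plan names, page-fetch counts, RSI-call counts, and the value of W are all made-up numbers for illustration; only the cost formula comes from the slide.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    page_fetches: float  # estimated I/O
    rsi_calls: float     # estimated tuples returned through the RSI (CPU proxy)

def cost(plan, w):
    """COST = Page fetches + W * (RSI calls)."""
    return plan.page_fetches + w * plan.rsi_calls

def cheapest(plans, w):
    """Pick the plan with the lowest estimated cost."""
    return min(plans, key=lambda p: cost(p, w))

# Hypothetical candidate plans for a single-relation query.
plans = [
    Plan("segment scan", page_fetches=1000, rsi_calls=50000),
    Plan("clustered index scan", page_fetches=120, rsi_calls=5000),
    Plan("non-clustered index scan", page_fetches=5100, rsi_calls=5000),
]
W = 0.01  # assumed weighting between I/O and CPU

best = cheapest(plans, W)
print(best.name)
```

Raising W shifts the choice toward plans that return fewer tuples through the RSI, even if they fetch more pages.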

17 Clustered B+-Tree
A B+-tree on the gpa attribute.
[Figure: data records (name, age, gpa, major), B+-tree leaf entries, and an internal node with key 3.6.]
Records: Bob, 21, 3.7, CS; Mary, 24, 3, ECE; Tom, 20, 3.2, EE; Kathy, 18, 3.8, LS; Kane, 19, 3.8, ME; Lam, 22, 2.8, ME; Chang, 18, 2.5, CS; Vera, 17, 3.9, EE; Louis, 32, 4, LS; Martha, 29, 3.8, CS; James, 24, 3.1, ME; Pat, 19, 2.8, EE; Chris, 22, 3.9, CS; Chad, 28, 2.3, LS; Leila, 20, 3.5, LS; Shideh, 16, 4, CS.
Leaf entries (gpa, (page, slot)): (2.3, (1,1)), (2.5, (1,2)), (2.8, (1,3)), (2.8, (1,4)), (3, (2,1)), (3.1, (2,2)), (3.2, (2,3)), (3.5, (2,4)), (3.7, (3,1)), (3.8, (3,2)), (3.8, (3,3)), (3.8, (3,4)), (3.9, (4,1)), (3.9, (4,2)), (4, (4,3)), (4, (4,4)).

18 Non-Clustered B+-Tree
A random I/O for every qualifying record.
[Figure: data records (name, age, gpa, major), B+-tree leaf entries, and an internal node with key 3.6.]
Records: Bob, 21, 3.7, CS; Mary, 24, 3, ECE; Tom, 20, 3.2, EE; Kathy, 18, 3.8, LS; Kane, 19, 3.8, ME; Lam, 22, 2.8, ME; Chang, 18, 2.5, CS; Vera, 17, 3.9, EE; Louis, 32, 4, LS; Martha, 29, 3.8, CS; James, 24, 3.1, ME; Pat, 19, 2.8, EE; Chris, 22, 3.9, CS; Chad, 28, 2.3, LS; Leila, 20, 3.5, LS; Shideh, 16, 4, CS.
Leaf entries (gpa, (page, slot)): (2.3, (4,2)), (2.5, (2,3)), (2.8, (2,2)), (2.8, (3,4)), (3, (1,2)), (3.1, (3,3)), (3.2, (1,3)), (3.5, (4,3)), (3.7, (1,1)), (3.8, (1,4)), (3.8, (2,1)), (3.8, (3,2)), (3.9, (2,4)), (3.9, (4,1)), (4, (3,1)), (4, (4,4)).

19 Questions
How are relations and segments related? In slide #4, you state that segments may contain more than one relation, and then the next bullet says “at most one relation per segment.” What is going on?

20 Questions
Best clarified with an example:
- Segment 1 may contain the Emp, Dept, and Revenues tables/relations.
- The Emp relation can be assigned to Segment 1 only. It may NOT be assigned to both Segments 1 and 2.

21 Questions
The cost of retrieving a range of records from a clustered B+-tree: should this not be (depth_of_B+-tree + F(pred) * TCARD), or something related to the depth of the tree rather than NINDX, since you only have to navigate to the leaf nodes once and do a record scan once you have reached the correct leaf node?
- NINDX(I): the number of pages in index I.
- TCARD(T): the number of pages in the segment that hold tuples of relation T.

22-23 Question
Should the cost for a non-clustered B+-tree not also involve the depth of the tree rather than NINDX?
- It should include the depth of the tree.
- NINDX must be included because the leaf pages of the B+-tree must be visited for the qualifying records.
Definitions:
- NINDX(I): the number of pages in index I.
- TCARD(T): the number of pages in the segment that hold tuples of relation T.
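For reference, here is a sketch of the index-scan cost estimates these questions refer to. The formulas are written as I recall them from the cost table of the access path selection paper, so treat them as an assumption to check against the paper; NCARD (number of tuples in the relation) and RSICARD (expected number of RSI calls) are that paper's terms and are not defined on these slides. All numbers are made up.

```python
def clustered_index_cost(f, nindx, tcard, rsicard, w):
    """Assumed form: F(preds) * (NINDX + TCARD) + W * RSICARD."""
    return f * (nindx + tcard) + w * rsicard

def nonclustered_index_cost(f, nindx, ncard, rsicard, w):
    """Assumed form: F(preds) * (NINDX + NCARD) + W * RSICARD.
    Each qualifying tuple may cost one page fetch, hence NCARD."""
    return f * (nindx + ncard) + w * rsicard

# Hypothetical statistics for one relation and one index.
F = 0.1          # selectivity factor of the matching predicates
NINDX = 50       # pages in the index
TCARD = 1000     # data pages holding tuples of the relation
NCARD = 40000    # tuples in the relation (assumed name, see lead-in)
RSICARD = F * NCARD
W = 0.01

clustered = clustered_index_cost(F, NINDX, TCARD, RSICARD, W)
nonclustered = nonclustered_index_cost(F, NINDX, NCARD, RSICARD, W)
print(clustered, nonclustered)
```

Neither form has a separate term for the depth of the tree, which is the point the questions raise; the F(preds) * NINDX term accounts for the fraction of index pages scanned for qualifying entries.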

24-25 Questions
Why is the cost of a merge-scan NINDX(R) + NINDX(S) rather than the sum of segment scans of R and S, since you still have to visit every page that contains tuples of R and S?
- The merge-scan employs the sorted order of the entries in the leaf pages of the B+-tree index structures.
Definitions:
- NINDX(I): the number of pages in index I.
- TCARD(T): the number of pages in the segment that hold tuples of relation T.
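A minimal sketch of a merge-scan over two sorted streams of index entries, which is what reading the leaf pages of the two B+-trees in order provides. The (key, rid) entries and the function name are hypothetical; fetching the matching tuples by RID is left out.

```python
def merge_join(left, right):
    """Join two lists of (key, rid) entries, each sorted by key.
    Returns (key, left_rid, right_rid) for every pair with equal keys."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left entry with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out

r_entries = [(1, "r1"), (2, "r2"), (2, "r3"), (4, "r4")]
s_entries = [(2, "s1"), (2, "s2"), (3, "s3"), (4, "s4")]
matches = merge_join(r_entries, s_entries)
print(len(matches))  # 5
```

Each input is read once from start to end, which is why the index page counts of R and S appear in the cost.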

26 Overview of Query Optimization
Extends discussion to:
- Correlation queries, use of outer-join to “flatten” nested queries.
Raises more questions than it answers.
- A good starting point for:
  - A practitioner who wants to build an optimizer for a relational DBMS.
  - A Ph.D. student interested in writing a dissertation in the area of query optimization techniques.

27 Mining Association Rules
Objective: Discover association rules over basket data.
Motivation: valuable for cross-marketing and attached mailing applications.
- Example: 98% of customers who purchase tires and auto accessories also get automotive services done.
Key contributions:
- Fast algorithms:
  - Apriori, AprioriTid, and AprioriHybrid.
Pay attention to terminology, definitions, and the general framework.
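To make the terminology concrete, here is a simplified sketch of Apriori-style frequent itemset mining: generate candidate k-itemsets from frequent (k-1)-itemsets, prune candidates that have an infrequent subset, then count support in one pass over the transactions. The join step below unions arbitrary pairs rather than using the paper's ordered prefix join, and the transactions and minimum support count are made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent_itemsets = apriori(baskets, min_support=2)
print(frequent_itemsets[frozenset({2, 3, 5})])  # 2
```

Association rules are then derived from the frequent itemsets; for example, the confidence of {2, 3} => {5} is support({2, 3, 5}) / support({2, 3}).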

28 Database Systems
This course has introduced you to:
- A storage manager and its use, BDB.
- Spatial indexing, R-Trees.
- Parallel DBMS.
- Alternative technologies for applications that do not require ACID transactions.
  - Google FS, MapReduce, etc.
- Optimization techniques for relational DBMSs.
- Knowledge Discovery.
- Efficient query processing techniques.
- RAID and use of flash memory in enterprises:
  - Steve Kleiman’s lecture.

29 Database Systems
Many important topics remain:
- Data mining, data cubes, data visualization techniques.
Papers are from the 1990s: numerous follow-on papers!

