CS 345: Topics in Data Warehousing Thursday, October 21, 2004.

Review of Tuesday’s Class
Database System Architecture
–Memory management
–Secondary storage (disk)
–Query planning process
Joins
–Nested Loop Join
–Merge Join
–Hash Join
Grouping
–Sort vs. Hash

Outline of Today’s Class
Indexes
–B-Tree and Hash Indexes
–Clustered vs. Non-Clustered
–Covering Indexes
Using Indexes in Query Plans
Bitmap Indexes
–Index intersection plans
–Bitmap compression

Indexes
Provide efficient access to relevant records
–Based on values of particular attribute(s)
Same idea as index in back of a book
“fact tables 16, 17, 49”
–Information about fact tables on pages 16, 17, and 49
–No information about fact tables on other pages
–Without an index, we’d have to look through the whole book page by page

Typical Index Structure
Indexes organized based on some search key
–Column (or set of columns) whose values are used to access the index
–Organization can be sorting or hashing
Index is built for some relation
–One index entry per record in the relation
Index consists of <Value, RID> pairs
–Value = value of the search key for this record
–RID = record identifier
    Tells the DBMS where the record is stored
    Usually (page number, offset in page)

Sorted Index
Index entries usually much smaller than records
–Record has many attributes besides search key
Build search tree on top of index entries
–Allows particular value to be located quickly
[Figure: sorted index entries 2, 4, 4, 5, 7, 8 with a search tree node (2, 5) above them]

B-Tree Index
By far the most common type of index
Sorted index with search tree
Good for point queries and range queries
–Point query: A = 5
–Range query: A BETWEEN 5 AND 10
Search tree nodes are page-sized
–Contain <Value, Pointer> pairs
–Each Pointer is to a node of the level below
Trade-off in choosing index page sizes
–Larger pages → fewer search tree levels → fewer page reads
–Larger pages → each page read takes longer
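The point and range queries above can be sketched on a plain sorted list of index entries. This is only an approximation of a B-Tree: binary search via Python's `bisect` module stands in for the search tree, and the relation, its values, and the integer RIDs are all made up for illustration.

```python
import bisect

# Hypothetical relation: RID -> record. Integer RIDs stand in for
# (page number, offset in page).
records = {0: {"A": 7}, 1: {"A": 2}, 2: {"A": 5},
           3: {"A": 4}, 4: {"A": 8}, 5: {"A": 4}}

# One (value, RID) entry per record, sorted by search key A.
index = sorted((rec["A"], rid) for rid, rec in records.items())
keys = [value for value, _ in index]

def point_query(value):
    """RIDs of records with A = value."""
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return [rid for _, rid in index[lo:hi]]

def range_query(low, high):
    """RIDs of records with low <= A <= high (inclusive, like BETWEEN)."""
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return [rid for _, rid in index[lo:hi]]

print(point_query(4))       # point query: A = 4
print(range_query(5, 10))   # range query: A BETWEEN 5 AND 10
```

Both queries touch only a contiguous slice of the sorted entries, which is why a sorted index serves range queries as well as point queries.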

Hash Indexes
Useful for point queries
–Slightly better performance than B-Trees
–Not useful for range queries
Less widely supported than B-Trees

Alternate B-Tree Organization
Many records with same search key cause redundancy
–<Value, RID1>, <Value, RID2>, <Value, RID3>, ...
Can store RID-lists instead
–<Value, RID-list>
–Each value occurs once in the index
–Index entry is <Value, RID-list> instead of <Value, RID>
–Saves space when search key has many repeated values
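A minimal sketch of the RID-list organization, assuming index entries are held as Python tuples. The entry values and RIDs are invented, and a dict stands in for the index structure.

```python
from collections import defaultdict

# Hypothetical (value, RID) entries; the value 4 is stored three times.
entries = [(2, 10), (4, 11), (4, 12), (4, 13), (5, 14)]

def to_rid_lists(entries):
    """Group RIDs by search key value so each value is stored once."""
    rid_lists = defaultdict(list)
    for value, rid in entries:
        rid_lists[value].append(rid)
    return dict(rid_lists)

compact = to_rid_lists(entries)
print(compact)
print(len(entries), "stored key values before,", len(compact), "after")
```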

Clustered Indexes
An index is clustered (or “clustering”) if records in the relation are organized based on index search key
Clustered indexes are good because:
–Records satisfying a range query are packed onto a small number of consecutive pages
In unclustered indexes, by contrast:
–Records satisfying a range query are spread across a large number of random pages
–Commingled with other records that do not satisfy the query
Only one clustered index allowed per relation
–A relation can’t be simultaneously sorted by 2 different attributes
–(Unless there are multiple copies of the relation)

Clustered vs. Unclustered
[Figure: the same sorted index shown over two record layouts. Clustered: sequential reads. Unclustered: random reads.]

Comparing Access Plans
Consider query “SELECT * FROM R WHERE A=5”
Three query plans:
–Scan relation R
    Sequential read of all pages in R
    Regardless of how many tuples have A=5
–Use clustered index on A
    Sequential read of relevant pages in R
    Num. relevant pages = (# of tuples with A=5) / (# of tuples per page)
    Plus overhead of accessing index pages
–Use unclustered index on A
    Random read of relevant pages in R
    Number of relevant pages = (# of tuples with A=5)
    –Less if A is highly correlated with sort order of relation
    Plus overhead of accessing index pages

Comparing Access Plans
Clustered index is always best
–Unless all tuples are being returned (then use scan)
–But clustered index may not be available
Unclustered index beats scan when fraction of tuples returned is small
–Depends on these factors:
    % of tuples being returned
    Cost ratio of random I/O vs. sequential I/O
    # of tuples per page
–Query returns >10% of rows → scan is almost certainly faster
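The page-count formulas for the three plans can be turned into a rough cost model. This is a sketch under stated assumptions: it ignores the overhead of reading index pages, charges one random read per matching tuple for the unclustered plan, and uses a random-to-sequential cost ratio of 10, which is an illustrative guess rather than a figure from the lecture. The table size and tuples-per-page values are also hypothetical.

```python
import math

def plan_costs(num_tuples, tuples_per_page, matching_tuples,
               seq_cost=1.0, random_cost=10.0):
    """Estimated I/O cost of each plan, ignoring index page overhead."""
    total_pages = math.ceil(num_tuples / tuples_per_page)
    return {
        # Scan: sequential read of every page in R.
        "scan": total_pages * seq_cost,
        # Clustered index: matching tuples are packed onto consecutive pages.
        "clustered": math.ceil(matching_tuples / tuples_per_page) * seq_cost,
        # Unclustered index: up to one random page read per matching tuple.
        "unclustered": matching_tuples * random_cost,
    }

num_tuples, tuples_per_page = 1_000_000, 100
for matching in (100, 10_000, 100_000):
    costs = plan_costs(num_tuples, tuples_per_page, matching)
    print(f"{matching / num_tuples:.2%} of tuples returned: {costs}")
```

Under these assumptions the unclustered index stops beating the scan once `matching_tuples * random_cost` exceeds `total_pages * seq_cost`, so the break-even fraction shrinks as the cost ratio or the number of tuples per page grows.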

Covering Indexes
Example using index in a book:
–“What does this book say about fact tables?”
    Look up “fact tables” in the index
    Turn to each page that is listed
    Read that page and see what it says
–“Which of these topics are discussed in this book: fact tables, bridge tables, B-trees?”
    Look up the three topics in the index
    See how many of them appear
    Don’t need to read any of the actual book

Covering Indexes
Sometimes an index has all the data you need
–Allows index-only query plan
–Not necessary to access the actual tuples
–Such an index is called a covering index
SELECT COUNT(*) FROM R WHERE A=5
–Use index on A
–Count number of entries
–No need to look up records referenced by RIDs
An index is a “thin” copy of a relation
–Not all columns from the relation are included
–The index is sorted in a particular way

Multi-Column Indexes
Multi-column indexes are very useful in data warehousing
–We say such an index has a composite key
Example: B-Tree index on (A,B)
–Search key is (A,B) combination
–Index entries sorted by A value
–Entries with same A value are sorted by B value
–Called a lexicographic sort
SELECT SUM(B) FROM R WHERE A=5
–Our (A,B) index covers this query!
Coverage vs. size trade-off
–More attributes in search key → index covers more queries
–More attributes in search key → index takes up more disk space

Fact and Dimension Indexes
Dimension table index
–Narrow version of table with only frequently-queried attributes
–Always include dimension key!
–Improve performance on large dimension tables
Fact table index
–Narrow version of fact that omits certain dimensions / measures
–Useful for queries that exclusively reference indexed dimensions / measures

Order of Composite Key
Index on (A,B) ≠ Index on (B,A)
–Can efficiently search based on leading terms
–No efficient search for trailing terms
SELECT SUM(B) FROM R WHERE A=5
–Index on (A,B) is sorted by A
    Search for records where A=5
    Scan only the relevant portion of the index
–Index on (B,A) is sorted by B
    Records with A=5 are scattered throughout index
    Need to scan the entire index
    Or else do one search for each distinct value of B
    –Oracle’s “index skip scans”
–Index on (A,B) is better for this query
–Either index is much faster than accessing relation!
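The difference between the two key orders can be sketched with sorted lists of tuples, which Python compares lexicographically. The relation contents are generated purely for illustration, and the "entries examined" counter is a hypothetical stand-in for index pages read. Both plans are index-only: neither touches the relation itself.

```python
import bisect

# Hypothetical relation R(A, B): 1000 rows, A takes the values 0..9.
R = [(i % 10, i) for i in range(1000)]

index_ab = sorted((a, b) for a, b in R)   # sorted by A, then B
index_ba = sorted((b, a) for a, b in R)   # sorted by B, then A

def sum_b_using_ab(a_value):
    """Seek to the first entry with A = a_value, then scan until A changes."""
    start = bisect.bisect_left(index_ab, (a_value,))
    total = examined = 0
    for a, b in index_ab[start:]:
        examined += 1
        if a != a_value:
            break
        total += b
    return total, examined

def sum_b_using_ba(a_value):
    """Matching entries are scattered, so every entry must be examined."""
    total = examined = 0
    for b, a in index_ba:
        examined += 1
        if a == a_value:
            total += b
    return total, examined

print(sum_b_using_ab(5))   # (SUM(B), entries examined) via index on (A,B)
print(sum_b_using_ba(5))   # same sum, but the whole index is scanned
```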

Index Summary
Indexes are useful in two ways:
–Indexes allow efficient search on some attributes due to the way they are organized
–Index-only plans use small indexes in place of large relations
For OLAP queries, the second use is generally more important
–Search via non-covering, non-clustered index leads to random I/O
–Analysis queries typically aggregate lots of tuples
–Doing one random I/O per tuple can be costly

Example
Sales(Date, Store, Product, Promotion, TransactionId, Quantity, DollarAmt)
–Index on (Date, Store, Quantity, DollarAmt)
–Index on (Date, Promotion, Product, Quantity, DollarAmt)
–Index on (Product, Date, Store, Quantity, DollarAmt)
Store
–Index on (Name, District, StoreKey)
Product
–Index on (Name, Brand, Dept, ProductKey)
–Index on (Brand, Dept, ProductKey)

Example Query
SELECT Brand, SUM(DollarAmt)
FROM Sales, Product, Store
WHERE Sales.ProductKey = Product.ProductKey
  AND Sales.StoreKey = Store.StoreKey
  AND Store.Name = 'Crystal Springs Safeway'
GROUP BY Brand
Non-key columns needed from each table:
–Product: Brand
–Store: Name
–Sales: DollarAmt

Selecting Indexes
Sales(Date, Store, Product, Promotion, TransactionId, Quantity, DollarAmt)
–Index on (Date, Store, Quantity, DollarAmt) [lacks Product]
–Index on (Date, Promotion, Product, Quantity, DollarAmt) [lacks Store]
–Index on (Product, Date, Store, Quantity, DollarAmt)
Store
–Index on (Name, District, StoreKey)
Product
–Index on (Name, Brand, Dept, ProductKey) [wider than needed]
–Index on (Brand, Dept, ProductKey)

Query Plan
Search Store(Name, District, StoreKey) index for Name=‘Crystal Springs Safeway’
Nested Loop Join
–Outer = Sales(Product,Date,Store,Quantity,DollarAmt) index
–Inner = Qualifying Store index entries
–Output preserves sort order of Sales index
Sort Product(Brand,Dept,ProductKey) index entries by ProductKey
Merge Join
–Result of Nested Loop Join (already sorted by ProductKey)
–Product(Brand,Dept,ProductKey)
Hash resulting tuples on Brand (for GROUP BY)
–Compute SUM(DollarAmt) for each Brand
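The plan steps can be walked through on toy data. Everything in this sketch is hypothetical: the index contents, the second store, the brand and department names, and the use of Python lists in place of real index structures. It is meant only to show the order of operations and why the nested loop join output is already sorted for the merge join.

```python
from collections import defaultdict

# Hypothetical index contents; every value below is invented for illustration.
# Store index entries: (Name, District, StoreKey)
store_index = [
    ("Crystal Springs Safeway", "D1", 1),
    ("Menlo Park Safeway", "D1", 2),
]
# Sales index entries: (Product, Date, Store, Quantity, DollarAmt), sorted by Product
sales_index = [
    (100, "2004-10-01", 1, 2, 6.0),
    (100, "2004-10-01", 2, 1, 3.0),
    (101, "2004-10-02", 1, 1, 4.5),
    (102, "2004-10-02", 1, 3, 9.0),
]
# Product index entries: (Brand, Dept, ProductKey), sorted by Brand
product_index = [
    ("Acme", "Dairy", 101),
    ("Acme", "Snacks", 100),
    ("Zest", "Produce", 102),
]

# Step 1: find qualifying Store index entries (linear search here for brevity).
store_keys = [key for name, _, key in store_index
              if name == "Crystal Springs Safeway"]

# Step 2: nested loop join, outer = Sales index, inner = qualifying Store entries.
# Iterating the outer in index order keeps the output sorted by Product.
sales_for_store = [s for s in sales_index for key in store_keys if s[2] == key]

# Step 3: sort Product index entries by ProductKey.
products_by_key = sorted(product_index, key=lambda p: p[2])

# Step 4: merge join on ProductKey (assumed unique on the Product side).
joined = []
j = 0
for product_key, _, _, _, dollar_amt in sales_for_store:
    while j < len(products_by_key) and products_by_key[j][2] < product_key:
        j += 1
    if j < len(products_by_key) and products_by_key[j][2] == product_key:
        joined.append((products_by_key[j][0], dollar_amt))

# Step 5: hash on Brand and compute SUM(DollarAmt) for each group.
totals = defaultdict(float)
for brand, dollar_amt in joined:
    totals[brand] += dollar_amt

print(dict(totals))
```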

Index Intersection
Suppose we have table R(A,B,C,D,E)
–B-Tree index on A
–B-Tree index on B
–No multi-column indexes
SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
Use an index intersection plan
–Search A index for A=5
    Index entries have <A, RID>
    Think of the index as a 2-column table with schema I1(A,RID)
–Search B index for B<10
    Index entries have <B, RID>
    Think of the index as a 2-column table with schema I2(B,RID)
–Join qualifying index entries on I1.RID = I2.RID

Index Intersection
Index intersection works well for conjunction of multiple, moderately selective filters
–SELECT SUM(C) FROM R WHERE A=5 AND B<10
–5% of rows have A=5
–5% of rows have B<10
–5% * 5% = 0.25% of rows have A=5 AND B<10 (assuming the two filters are independent)
–Retrieving rows matching A index alone, or B index alone, would be slow
–Only a few rows match both indexes
    Intersect indexes and retrieve rows that match both
–Overhead of joining indexes often small relative to cost of retrieving matching records from relation
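An index intersection plan can be sketched by treating each single-column index as a table of (value, RID) pairs and joining on RID. The relation contents are hypothetical, list positions serve as RIDs, and a set intersection stands in for the join.

```python
# Hypothetical relation R(A, B, C); the list position of a row is its RID.
R = [(5, 3, 10), (5, 12, 20), (2, 4, 30), (5, 9, 40), (7, 1, 50)]

# Single-column indexes viewed as 2-column tables I1(A, RID) and I2(B, RID).
index_a = sorted((a, rid) for rid, (a, _, _) in enumerate(R))
index_b = sorted((b, rid) for rid, (_, b, _) in enumerate(R))

# Search each index separately (linear filters here for brevity).
rids_a = {rid for a, rid in index_a if a == 5}
rids_b = {rid for b, rid in index_b if b < 10}

# Join the qualifying entries on RID.
matching = sorted(rids_a & rids_b)

# COUNT(*) needs only the indexes; SUM(C) fetches just the matching rows.
count = len(matching)
total_c = sum(R[rid][2] for rid in matching)
print(matching, count, total_c)
```

Only the rows that pass both filters are fetched from the relation, which is where the savings come from when each filter alone matches many rows.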

Bitmap Indexes
Earlier idea: use RID-lists in place of RIDs
–Save space when attribute values repeat
Bitmap indexes take this one step further
–Use Bitmap in place of RID-list
–Each RID in the entire relation is represented by 1 bit
    1 = RID is present in RID-list
    0 = RID is absent from RID-list
–Bitmaps are usually compressed
    E.g., using run-length encoding

Bitmap Index Example
ID  Name   Sex
1   Fred   M
2   Jill   F
3   Joe    M
4   Fran   F
5   Ellen  F
6   Kate   F
7   Matt   M
8   Bob    M
Bitmap index on Sex looks like this (one bit per row, in ID order):
M: 10100011
F: 01011100

Why Bitmap Indexes?
Index intersection plans with bitmap indexes are fast
–Just perform bitwise AND!
–Index intersection with B-Trees requires a join
SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
–Bitmap index on A
–Bitmap index on B
–OR together bitmaps for B values that are < 10
–AND the result with the bitmap for A=5
–Can be computed very quickly
    Assuming not too many distinct B values that are < 10
Save space for low-cardinality attributes
–As compared to a B-Tree or Hash index
–Particularly if compression is used
Most useful for attributes with low or medium cardinality
–Not good for something like LastName
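The OR-then-AND evaluation can be sketched with Python integers as uncompressed bitmaps, where bit i corresponds to RID i. The column values are hypothetical, and `build_bitmaps` is an illustrative helper, not part of any DBMS API.

```python
from collections import defaultdict

def build_bitmaps(column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    bitmaps = defaultdict(int)
    for rid, value in enumerate(column):
        bitmaps[value] |= 1 << rid
    return bitmaps

# Hypothetical columns of relation R; the list position is the RID.
A = [5, 5, 2, 5, 7, 5]
B = [3, 12, 4, 9, 1, 15]

bitmaps_a = build_bitmaps(A)
bitmaps_b = build_bitmaps(B)

# OR together the bitmaps of all B values that are < 10.
b_lt_10 = 0
for value, bitmap in bitmaps_b.items():
    if value < 10:
        b_lt_10 |= bitmap

# AND with the bitmap for A = 5, then count the set bits for COUNT(*).
result = bitmaps_a[5] & b_lt_10
count = bin(result).count("1")
print(count)
```

The loop over B values shows why the slide's caveat matters: the work grows with the number of distinct B values below 10, since each contributes one bitmap to the OR.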

Compressing Bitmaps
Consider a bitmap index on an attribute with 20 distinct values
Each row has 1 value for that attribute
20 different bitmaps
–i-th bit is set to 1 in one bitmap
–i-th bit is set to 0 in 19 bitmaps
Bitmaps consist mostly of zeros (95% of bits are zero)
–Good opportunity for compression
Compression via run-length encoding
–Just record number of zeros between adjacent ones
–00000001000010000000000001100000
–Store this as “7,4,12,0,5”
Compression Pros and Cons
–Reduce storage space → reduce number of I/Os required
–Need to compress/uncompress → increase CPU work required
–Each compression scheme negotiates this trade-off differently
–Operate directly on compressed bitmap → improved performance
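The run-length scheme from the slide can be sketched as follows, using the slide's own 32-bit example as input. Bitmaps are held as strings of '0' and '1' for readability, and the final run is taken to be the count of trailing zeros, which is an assumption inferred from that example. Real systems use more elaborate encodings.

```python
def rle_encode(bitmap):
    """Return zero-run lengths: one per 1 bit, plus the trailing run of zeros."""
    runs = []
    zeros = 0
    for bit in bitmap:
        if bit == "1":
            runs.append(zeros)   # zeros seen since the previous 1
            zeros = 0
        else:
            zeros += 1
    runs.append(zeros)           # zeros after the last 1
    return runs

def rle_decode(runs):
    """Invert rle_encode: every run except the last is followed by a 1."""
    *before_ones, trailing = runs
    return "".join("0" * n + "1" for n in before_ones) + "0" * trailing

bitmap = "00000001000010000000000001100000"
runs = rle_encode(bitmap)
print(runs)
print(rle_decode(runs) == bitmap)
```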
