CS 345: Topics in Data Warehousing Thursday, October 21, 2004.


Review of Tuesday’s Class
Database System Architecture
  – Memory management
  – Secondary storage (disk)
  – Query planning process
Joins
  – Nested Loop Join
  – Merge Join
  – Hash Join
Grouping
  – Sort vs. Hash

Outline of Today’s Class
Indexes
  – B-Tree and Hash Indexes
  – Clustered vs. Non-Clustered
  – Covering Indexes
Using Indexes in Query Plans
Bitmap Indexes
  – Index intersection plans
  – Bitmap compression

Indexes
Provide efficient access to relevant records
  – Based on values of particular attribute(s)
Same idea as the index in the back of a book
“fact tables 16, 17, 49”
  – Information about fact tables on pages 16, 17, and 49
  – No information about fact tables on other pages
  – Without an index, we’d have to look through the whole book page by page

Typical Index Structure
Indexes organized based on some search key
  – Column (or set of columns) whose values are used to access the index
  – Organization can be sorting or hashing
Index is built for some relation
  – One index entry per record in the relation
Index consists of <Value, RID> pairs
  – Value = value of the search key for this record
  – RID = record identifier
      Tells the DBMS where the record is stored
      Usually (page number, offset in page)
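As a rough illustration of the <Value, RID> pairs described above, the following Python sketch builds index entries for a tiny relation. The two-records-per-page layout, the sample records, and the names `RID` and `build_index_entries` are assumptions made for this example; they do not come from the lecture.

```python
from typing import NamedTuple

class RID(NamedTuple):
    page: int    # page number where the record is stored
    offset: int  # position of the record within that page

def build_index_entries(records, key, records_per_page=2):
    """Return one (value, RID) pair per record, sorted by search-key value."""
    entries = []
    for position, record in enumerate(records):
        rid = RID(position // records_per_page, position % records_per_page)
        entries.append((record[key], rid))
    entries.sort()
    return entries

# Hypothetical relation with a single attribute A, stored in arrival order.
records = [{"A": 7}, {"A": 5}, {"A": 9}, {"A": 5}]
entries = build_index_entries(records, "A")
```

Each entry is much smaller than a full record would be, which is what makes it cheap to sort the entries or build a search structure over them.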

Sorted Index
Index entries usually much smaller than records
  – Record has many attributes besides search key
Build search tree on top of index entries
  – Allows particular value to be located quickly

B-Tree Index
By far the most common type of index
Sorted index with search tree
Good for point queries and range queries
  – Point query: A = 5
  – Range query: A BETWEEN 5 AND 10
Search tree nodes are page-sized
  – Contain <Value, Pointer> pairs
  – Each Pointer is to a node of the level below
Trade-off in choosing index page sizes
  – Larger pages → fewer search tree levels → fewer page reads
  – Larger pages → each page read takes longer
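The point and range lookups can be sketched with a sorted list and binary search. This is not a real B-Tree (there are no pages or tree levels); it only shows why keeping entries sorted serves both query types. The sample entries and the function names are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

# Sorted (value, rid) index entries; RIDs are plain integers here for brevity.
entries = [(3, 10), (5, 11), (5, 14), (7, 12), (10, 13), (12, 15)]
keys = [value for value, _ in entries]

def point_query(value):
    """RIDs of entries whose search key equals value (A = value)."""
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return [rid for _, rid in entries[lo:hi]]

def range_query(low, high):
    """RIDs of entries with low <= search key <= high (A BETWEEN low AND high)."""
    lo, hi = bisect_left(keys, low), bisect_right(keys, high)
    return [rid for _, rid in entries[lo:hi]]
```

Here `point_query(5)` locates the two entries with A = 5, and `range_query(5, 10)` returns one contiguous slice of the sorted entries.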

Hash Indexes
Useful for point queries
  – Slightly better performance than B-Trees
  – Not useful for range queries
Less widely supported than B-Trees
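A minimal in-memory sketch of why a hash index suits point queries but not range queries, using a Python dictionary as the hash structure. The sample rows and the name `build_hash_index` are hypothetical; a disk-based hash index would use buckets of pages instead.

```python
from collections import defaultdict

def build_hash_index(rows, key):
    """Map each search-key value to the list of RIDs (row positions) holding it."""
    index = defaultdict(list)
    for rid, row in enumerate(rows):
        index[row[key]].append(rid)
    return index

rows = [{"A": 5}, {"A": 8}, {"A": 5}, {"A": 10}]
hash_index = build_hash_index(rows, "A")

# Point query A = 5: a single hash lookup.
point_matches = hash_index.get(5, [])

# Range query A BETWEEN 5 AND 10: hashing destroys order, so every key must be tested.
range_matches = sorted(
    rid
    for value, rids in hash_index.items()
    if 5 <= value <= 10
    for rid in rids
)
```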

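Alternate B-Tree Organization
Many records with the same search key cause redundancy
  – One <Value, RID> entry per record repeats the same Value many times
Can store RID-lists instead
  – One <Value, RID-list> entry per distinct value
  – Each value occurs once in the index
  – Index entry is <Value, RID-list> instead of <Value, RID>
  – Saves space when search key has many repeated values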
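The conversion from repeated <Value, RID> pairs to RID-lists can be sketched as a grouping step over the sorted entries. The sample pairs are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sorted index entries: the value 5 is stored three times, 7 twice.
pairs = [(5, 11), (5, 14), (5, 20), (7, 12), (7, 31)]

# Collapse runs of equal values into one (value, RID-list) entry each.
rid_lists = [
    (value, [rid for _, rid in group])
    for value, group in groupby(pairs, key=itemgetter(0))
]

values_stored_before = len(pairs)      # one copy of the value per record
values_stored_after = len(rid_lists)   # one copy of the value per distinct value
```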

Clustered Indexes
An index is clustered (or “clustering”) if records in the relation are organized based on the index search key
Clustered indexes are good because:
  – Records satisfying a range query are packed onto a small number of consecutive pages
In unclustered indexes, by contrast:
  – Records satisfying a range query are spread across a large number of random pages
  – Commingled with other records that do not satisfy the query
Only one clustered index allowed per relation
  – A relation can’t be simultaneously sorted by 2 different attributes
  – (Unless there are multiple copies of the relation)

Clustered vs. Unclustered
[Figure: clustered index → sequential reads; unclustered index → random reads]

Comparing Access Plans
Consider query “SELECT * FROM R WHERE A=5”
Three query plans:
  – Scan relation R
      Sequential read of all pages in R
      Regardless of how many tuples have A=5
  – Use clustered index on A
      Sequential read of relevant pages in R
      Number of relevant pages = (# of tuples with A=5) / (# of tuples per page)
      Plus overhead of accessing index pages
  – Use unclustered index on A
      Random read of relevant pages in R
      Number of relevant pages = (# of tuples with A=5)
        Less if A is highly correlated with sort order of relation
      Plus overhead of accessing index pages
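The page counts for the three plans can be worked through with a small helper. This sketch ignores the overhead of reading index pages and uses the worst case for the unclustered index (one page read per matching tuple); the function name and the table sizes are assumptions chosen for the example.

```python
import math

def pages_read(plan, total_tuples, matching_tuples, tuples_per_page):
    """Approximate number of data pages read, ignoring index-page overhead."""
    if plan == "scan":
        return math.ceil(total_tuples / tuples_per_page)      # sequential reads
    if plan == "clustered":
        return math.ceil(matching_tuples / tuples_per_page)   # sequential reads
    if plan == "unclustered":
        return matching_tuples                                 # random reads, worst case
    raise ValueError(f"unknown plan: {plan}")

# Hypothetical relation: 1,000,000 tuples, 100 per page, 1,000 tuples with A=5.
scan_pages = pages_read("scan", 1_000_000, 1_000, 100)
clustered_pages = pages_read("clustered", 1_000_000, 1_000, 100)
unclustered_pages = pages_read("unclustered", 1_000_000, 1_000, 100)
```

With these numbers the scan reads 10,000 pages sequentially, the clustered index reads 10, and the unclustered index reads up to 1,000 at random.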

Comparing Access Plans
Clustered index is always best
  – Unless all tuples are being returned (then use scan)
  – But clustered index may not be available
Unclustered index beats scan when fraction of tuples returned is small
  – Depends on these factors:
      % of tuples being returned
      Cost ratio of random I/O vs. sequential I/O
      # of tuples per page
  – Query returns >10% of rows → scan is almost certainly faster
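One way to see how the three factors interact is a simplified break-even estimate. Assume a relation of N tuples with t tuples per page, a random page read costing r times a sequential one, one random read per matching tuple, and no index overhead. The scan then costs about N / t and the unclustered index about f * N * r for a returned fraction f, so the index wins roughly when f < 1 / (t * r). This model and the function name are assumptions for illustration, not a formula from the lecture.

```python
def break_even_fraction(tuples_per_page, random_to_sequential_cost_ratio):
    """Fraction of tuples below which an unclustered index beats a full scan
    under the simplified cost model described above."""
    return 1.0 / (tuples_per_page * random_to_sequential_cost_ratio)

# Hypothetical setting: 100 tuples per page, random I/O 10x the cost of sequential I/O.
fraction = break_even_fraction(100, 10)
```

Under these assumed numbers the break-even point is about 0.1% of the tuples, far below the 10% level at which the slide says a scan is almost certainly faster.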

Covering Indexes
Example using index in a book:
  – “What does this book say about fact tables?”
      Look up “fact tables” in the index
      Turn to each page that is listed
      Read that page and see what it says
  – “Which of these topics are discussed in this book: fact tables, bridge tables, B-trees?”
      Look up the three topics in the index
      See how many of them appear
      Don’t need to read any of the actual book

Covering Indexes
Sometimes an index has all the data you need
  – Allows index-only query plan
  – Not necessary to access the actual tuples
  – Such an index is called a covering index
SELECT COUNT(*) FROM R WHERE A=5
  – Use index on A
  – Count number of entries
  – No need to look up records referenced by RIDs
An index is a “thin” copy of a relation
  – Not all columns from the relation are included
  – The index is sorted in a particular way

Multi-Column Indexes
Multi-column indexes are very useful in data warehousing
  – We say such an index has a composite key
Example: B-Tree index on (A,B)
  – Search key is (A,B) combination
  – Index entries sorted by A value
  – Entries with same A value are sorted by B value
  – Called a lexicographic sort
SELECT SUM(B) FROM R WHERE A=5
  – Our (A,B) index covers this query!
Coverage vs. size trade-off
  – More attributes in search key → index covers more queries
  – More attributes in search key → index takes up more disk space
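A small sketch of how an (A,B) index can answer SELECT SUM(B) FROM R WHERE A=5 without touching the relation. Python tuples already compare lexicographically, so a sorted list of (A, B) tuples stands in for the index; the data and the function name are hypothetical.

```python
from bisect import bisect_left

# Entries of a hypothetical (A, B) index, kept in lexicographic order.
index_ab = sorted([(5, 20), (3, 7), (5, 4), (8, 1), (5, 9), (3, 2)])

def sum_b_where_a_equals(index, a_value):
    """Answer SELECT SUM(B) FROM R WHERE A = a_value from the index alone."""
    # (a_value,) sorts before every (a_value, b), so this finds the first A = a_value entry.
    start = bisect_left(index, (a_value,))
    total = 0
    for a, b in index[start:]:
        if a != a_value:
            break  # past the contiguous run of entries with A = a_value
        total += b
    return total

total_for_a5 = sum_b_where_a_equals(index_ab, 5)
```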

Fact and Dimension Indexes
Dimension table index
  – Narrow version of table with only frequently-queried attributes
  – Always include dimension key!
  – Improve performance on large dimension tables
Fact table index
  – Narrow version of fact table that omits certain dimensions / measures
  – Useful for queries that exclusively reference indexed dimensions / measures

Order of Composite Key
Index on (A,B) ≠ Index on (B,A)
  – Can efficiently search based on leading terms
  – No efficient search for trailing terms
SELECT SUM(B) FROM R WHERE A=5
  – Index on (A,B) is sorted by A
      Search for records where A=5
      Scan only the relevant portion of the index
  – Index on (B,A) is sorted by B
      Records with A=5 are scattered throughout index
      Need to scan the entire index
      Or else do one search for each distinct value of B
        Oracle’s “index skip scans”
  – Index on (A,B) is better for this query
  – Either index is much faster than accessing relation!
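The difference between the two key orders can be made concrete by counting how many index entries each plan examines. The synthetic relation (A in 1..10, B in 1..100) and the function names are assumptions for this sketch.

```python
from bisect import bisect_left

# Hypothetical relation R with 1,000 rows: every combination of A in 1..10 and B in 1..100.
rows = [(a, b) for a in range(1, 11) for b in range(1, 101)]

index_ab = sorted(rows)                      # sorted by A, then B
index_ba = sorted((b, a) for a, b in rows)   # sorted by B, then A

def sum_b_using_ab(a_value):
    """Seek to the first A = a_value entry, then scan only that run."""
    start = bisect_left(index_ab, (a_value,))
    total = examined = 0
    for a, b in index_ab[start:]:
        examined += 1
        if a != a_value:
            break
        total += b
    return total, examined

def sum_b_using_ba(a_value):
    """Matching entries are scattered, so the whole index is scanned."""
    total = examined = 0
    for b, a in index_ba:
        examined += 1
        if a == a_value:
            total += b
    return total, examined
```

Both functions return the same sum, but the (A,B) version examines only the run of matching entries plus the one entry that ends the run, while the (B,A) version examines every entry.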

Index Summary
Indexes are useful in two ways:
  – Indexes allow efficient search on some attributes due to the way they are organized
  – Index-only plans use small indexes in place of large relations
For OLAP queries, the second use is generally more important
  – Search via non-covering, non-clustered index leads to random I/O
  – Analysis queries typically aggregate lots of tuples
  – Doing one random I/O per tuple can be costly

Example
Sales(Date, Store, Product, Promotion, TransactionId, Quantity, DollarAmt)
  – Index on (Date, Store, Quantity, DollarAmt)
  – Index on (Date, Promotion, Product, Quantity, DollarAmt)
  – Index on (Product, Date, Store, Quantity, DollarAmt)
Store
  – Index on (Name, District, StoreKey)
Product
  – Index on (Name, Brand, Dept, ProductKey)
  – Index on (Brand, Dept, ProductKey)

Example Query
SELECT Brand, SUM(DollarAmt)
FROM Sales, Product, Store
WHERE Sales.ProductKey = Product.ProductKey
  AND Sales.StoreKey = Store.StoreKey
  AND Store.Name = 'Crystal Springs Safeway'
GROUP BY Brand
Product: Brand
Store: Name
Sales: DollarAmt

Selecting Indexes
Sales(Date, Store, Product, Promotion, TransactionId, Quantity, DollarAmt)
  – Index on (Date, Store, Quantity, DollarAmt): lacks Product
  – Index on (Date, Promotion, Product, Quantity, DollarAmt): lacks Store
  – Index on (Product, Date, Store, Quantity, DollarAmt)
Store
  – Index on (Name, District, StoreKey)
Product
  – Index on (Name, Brand, Dept, ProductKey): wider than needed
  – Index on (Brand, Dept, ProductKey)

Query Plan
Search Store(Name, District, StoreKey) index for Name=‘Crystal Springs Safeway’
Nested Loop Join
  – Outer = Sales(Product, Date, Store, Quantity, DollarAmt) index
  – Inner = Qualifying Store index entries
  – Output preserves sort order of Sales index
Sort Product(Brand, Dept, ProductKey) index entries by ProductKey
Merge Join
  – Result of Nested Loop Join (already sorted by ProductKey)
  – Product(Brand, Dept, ProductKey)
Hash resulting tuples on Brand (for GROUP BY)
  – Compute SUM(DollarAmt) for each Brand
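The steps of this plan can be traced on toy data. The sketch below is only an in-memory illustration: the index contents, key values, and dollar amounts are invented, the Store search is a linear scan rather than a tree search, and the merge join assumes each ProductKey appears once in the Product index.

```python
from collections import defaultdict

# Hypothetical contents of the three covering indexes (tuples in index column order).
store_index = [                      # (Name, District, StoreKey), sorted by Name
    ("Crystal Springs Safeway", "D1", 1),
    ("Other Store", "D2", 2),
]
sales_index = [                      # (Product, Date, Store, Quantity, DollarAmt), sorted by Product
    (100, "2004-10-01", 1, 2, 6.0),
    (100, "2004-10-01", 2, 1, 3.0),
    (200, "2004-10-02", 1, 1, 4.5),
    (300, "2004-10-02", 1, 3, 9.0),
]
product_index = [                    # (Brand, Dept, ProductKey), sorted by Brand
    ("Acme", "Grocery", 100),
    ("Acme", "Grocery", 300),
    ("Zest", "Dairy", 200),
]

# 1. Search the Store index for the requested store name (linear scan for brevity).
store_keys = [key for name, _, key in store_index if name == "Crystal Springs Safeway"]

# 2. Nested loop join: Sales index entries are the outer input, qualifying stores the inner.
#    Output order follows the Sales index, so it stays sorted by product key.
sales_for_store = [
    (product_key, amount)
    for product_key, _, store_key, _, amount in sales_index
    for qualifying_key in store_keys
    if store_key == qualifying_key
]

# 3. Sort the Product index entries by ProductKey.
products_by_key = sorted(product_index, key=lambda entry: entry[2])

# 4. Merge join the two inputs, both now ordered by product key.
joined, j = [], 0
for product_key, amount in sales_for_store:
    while j < len(products_by_key) and products_by_key[j][2] < product_key:
        j += 1
    if j < len(products_by_key) and products_by_key[j][2] == product_key:
        joined.append((products_by_key[j][0], amount))

# 5. Hash on Brand and accumulate SUM(DollarAmt) per group.
totals = defaultdict(float)
for brand, amount in joined:
    totals[brand] += amount
```

Only index entries are read at every step; the base relations are never accessed.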

Index Intersection
Suppose we have table R(A,B,C,D,E)
  – B-Tree index on A
  – B-Tree index on B
  – No multi-column indexes
SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
Use an index intersection plan
  – Search A index for A=5
      Index entries have <A, RID>
      Think of the index as a 2-column table with schema I1(A,RID)
  – Search B index for B<10
      Index entries have <B, RID>
      Think of the index as a 2-column table with schema I2(B,RID)
  – Join qualifying index entries on I1.RID = I2.RID
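The intersection plan can be sketched by treating each single-column index as a list of (value, RID) pairs and joining the qualifying entries on RID. Here the join is done with a set intersection, which behaves like a simple hash join; the sample rows are hypothetical and the filters are applied by scanning the lists rather than by a tree search.

```python
# Hypothetical relation R, showing only columns A and B; the RID is the row position.
rows = [
    {"A": 5, "B": 3},
    {"A": 5, "B": 12},
    {"A": 7, "B": 4},
    {"A": 5, "B": 9},
    {"A": 2, "B": 15},
]

# I1(A, RID) and I2(B, RID): the two single-column indexes as sorted 2-column tables.
i1 = sorted((row["A"], rid) for rid, row in enumerate(rows))
i2 = sorted((row["B"], rid) for rid, row in enumerate(rows))

# Collect the RIDs of qualifying entries from each index.
rids_a = {rid for a, rid in i1 if a == 5}
rids_b = {rid for b, rid in i2 if b < 10}

# Join on I1.RID = I2.RID; COUNT(*) is the size of the intersection.
matching_rids = rids_a & rids_b
count = len(matching_rids)
```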

Index Intersection
Index intersection works well for a conjunction of multiple, moderately selective filters
  – SELECT SUM(C) FROM R WHERE A=5 AND B<10
  – 5% of rows have A=5
  – 5% of rows have B<10
  – 5% * 5% = 0.25% of rows have A=5 AND B<10 (assuming the two conditions are independent)
  – Retrieving rows matching A index alone, or B index alone, would be slow
  – Only a few rows match both indexes
Intersect indexes and retrieve rows that match both
  – Overhead of joining indexes often small relative to cost of retrieving matching records from relation

Bitmap Indexes
Earlier idea: use RID-lists in place of RIDs
  – Save space when attribute values repeat
Bitmap indexes take this one step further
  – Use Bitmap in place of RID-list
  – Each RID in the entire relation is represented by 1 bit
      1 = RID is present in RID-list
      0 = RID is absent from RID-list
  – Bitmaps are usually compressed
      E.g. using run-length encoding

Bitmap Index Example
ID  Name   Sex
1   Fred   M
2   Jill   F
3   Joe    M
4   Fran   F
5   Ellen  F
6   Kate   F
7   Matt   M
8   Bob    M
Bitmap index on Sex looks like this (one bit per row, in ID order):
M  10100011
F  01011100
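The bitmaps for the Sex column of the example relation can be derived mechanically: one bitmap per distinct value, with bit i set when row i holds that value. The function name `build_bitmap_index` and the string representation of the bitmaps are choices made for this sketch.

```python
# The example relation from the slide: (ID, Name, Sex).
people = [
    (1, "Fred", "M"), (2, "Jill", "F"), (3, "Joe", "M"), (4, "Fran", "F"),
    (5, "Ellen", "F"), (6, "Kate", "F"), (7, "Matt", "M"), (8, "Bob", "M"),
]

def build_bitmap_index(rows, column):
    """Return {value: bitmap string}, with one bit per row in storage order."""
    bitmaps = {}
    for position, row in enumerate(rows):
        value = row[column]
        if value not in bitmaps:
            bitmaps[value] = ["0"] * len(rows)
        bitmaps[value][position] = "1"
    return {value: "".join(bits) for value, bits in bitmaps.items()}

sex_index = build_bitmap_index(people, 2)
```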

Why Bitmap Indexes?
Index intersection plans with bitmap indexes are fast
  – Just perform bitwise AND!
  – Index intersection with B-Trees requires a join
SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
  – Bitmap index on A
  – Bitmap index on B
  – OR together bitmaps for B values that are < 10
  – AND the result with the bitmap for A=5
  – Can be computed very quickly
      Assuming not too many distinct B values that are < 10
Save space for low-cardinality attributes
  – As compared to a B-Tree or Hash index
  – Particularly if compression is used
Most useful for attributes with low or medium cardinality
  – Not good for something like LastName
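The OR-then-AND evaluation can be sketched with Python integers acting as uncompressed bitmaps, where bit i stands for row i. The sample rows and the helper name `build_bitmaps` are assumptions for illustration.

```python
# Hypothetical relation R, showing only (A, B); bit i of each bitmap stands for row i.
rows = [(5, 3), (5, 12), (7, 4), (5, 9), (2, 15), (5, 10)]

def build_bitmaps(values):
    """One integer bitmap per distinct value."""
    bitmaps = {}
    for position, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << position)
    return bitmaps

bitmaps_a = build_bitmaps(a for a, _ in rows)
bitmaps_b = build_bitmaps(b for _, b in rows)

# OR together the bitmaps of every distinct B value that is < 10.
b_less_than_10 = 0
for value, bitmap in bitmaps_b.items():
    if value < 10:
        b_less_than_10 |= bitmap

# AND with the bitmap for A=5, then count the set bits to get COUNT(*).
result = bitmaps_a.get(5, 0) & b_less_than_10
count = bin(result).count("1")
```

The cost of the OR step grows with the number of distinct B values below 10, which is the caveat noted on the slide.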

Compressing Bitmaps
Consider a bitmap index on an attribute with 20 distinct values
Each row has 1 value for that attribute
20 different bitmaps
  – i-th bit is set to 1 in one bitmap
  – i-th bit is set to 0 in the other 19 bitmaps
Bitmaps consist mostly of zeros (95% of bits are zero)
  – Good opportunity for compression
Compression via run-length encoding
  – Just record number of zeros between adjacent ones
  – E.g. a bitmap whose ones are preceded by runs of 7, 4, 12, 0, and 5 zeros
  – Store this as “7,4,12,0,5”
Compression Pros and Cons
  – Reduce storage space → reduce number of I/Os required
  – Need to compress/uncompress → increase CPU work required
  – Each compression scheme negotiates this trade-off differently
  – Operate directly on compressed bitmap → improved performance
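A minimal sketch of the zero-run encoding described above. It records only the zeros before each 1 bit, so any zeros after the last 1 are dropped and the bitmap length has to be kept separately; that detail, the function names, and the string representation are assumptions of this sketch, not something stated in the lecture. Real systems use more elaborate schemes.

```python
def rle_encode(bits):
    """Return the length of the zero run preceding each 1 bit."""
    runs, zeros = [], 0
    for bit in bits:
        if bit == "1":
            runs.append(zeros)
            zeros = 0
        else:
            zeros += 1
    return runs

def rle_decode(runs, length):
    """Rebuild the bitmap, padding with trailing zeros up to the known length."""
    bits = "".join("0" * run + "1" for run in runs)
    return bits.ljust(length, "0")

# Decoding the slide's code "7,4,12,0,5" gives a 33-bit bitmap with five 1 bits.
decoded = rle_decode([7, 4, 12, 0, 5], 33)
round_trip = rle_encode(decoded)
```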