Prof. Bayer, DWH, Ch.5, SS 2002
Chapter 5. Indexing for DWH
[Figure: D1, Facts, D2]



dimension Time with composite key K1 according to hierarchy:
  key K1 = (year int, month int, day int)
dimension Region with composite key K2 according to hierarchy:
  key K2 = (region string, nation string, state string, city string)
Facts.key = (K1; K2) = KF
create table Facts ( measure real, ...) key is (K1, K2)

Variant 1: Facts organized as compound B-tree
e.g. in TransBase (standard) or in Oracle as IOT (Index Organized Table),
i.e. the data (measures) are stored in the leaves of the tree and sorted according to the lexicographic order of KF
==> interval queries for K1 on D1 and on Facts
==> sorted reading according to the lexicographic order of KF is possible on Facts: tuple clustering!!
==> restrictions on K2 can be used on D2, but not on Facts
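A minimal sketch of why a compound key sorted lexicographically supports interval queries on K1 but not restrictions on K2. A sorted Python list stands in for the B-tree leaves; the rows, values and function names are hypothetical and not taken from TransBase or Oracle.

```python
from bisect import bisect_left, bisect_right

# Hypothetical fact rows (KF, measure) with KF = (K1, K2),
# K1 = (year, month, day), K2 = (region, nation, state, city).
# Sorting the rows by KF mimics the leaf order of the compound B-tree.
facts = sorted([
    (((2001, 12, 31), ("EU", "DE", "BY", "Munich")), 10.0),
    (((2002, 1, 15), ("EU", "DE", "BY", "Munich")), 20.0),
    (((2002, 1, 15), ("NA", "US", "CA", "San Jose")), 30.0),
    (((2002, 3, 1), ("EU", "FR", "IDF", "Paris")), 40.0),
    (((2002, 7, 4), ("NA", "US", "NY", "Albany")), 50.0),
])

# K1 is the prefix of KF, so the K1 values appear in nondecreasing order.
k1_values = [kf[0] for kf, _ in facts]


def k1_interval(lo, hi):
    """Rows with lo <= K1 <= hi form one contiguous slice of the leaves."""
    start = bisect_left(k1_values, lo)
    end = bisect_right(k1_values, hi)
    return facts[start:end]


def k2_restriction(region):
    """K2 is not a prefix of KF: matching rows are scattered, so scan all rows."""
    return [row for row in facts if row[0][1][0] == region]


q1 = [m for _, m in k1_interval((2002, 1, 1), (2002, 3, 31))]
eu = [m for _, m in k2_restriction("EU")]
print(q1)  # [20.0, 30.0, 40.0]
print(eu)  # [10.0, 20.0, 40.0]
```

The K1 interval maps to a single slice (tuple clustering), while the rows for region "EU" sit at positions 0, 1 and 3 and cannot be located by the sort order alone.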

Variant 2: Full Table Scan (FTS): page clustering!!
Without any index support; works well as soon as >10% of the data must be checked (retrieved from disk) to compute the answer.
This is an empirical observation with Oracle (similar in other relational DBMS), made with very large DBs (> 1 GB) in the MISTRAL project.
Reason:
  random access 9 ms + page transfer 1 ms = 10 ms
  time (20 pages sequential) = 29 ms
  time (20 pages random) = 200 ms
  ==> factor 7
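The disk timings can be reproduced with a small cost model. This is a sketch under the assumption, inferred from the numbers on the slide, that a sequential run of pages pays the 9 ms positioning cost once while every random page pays it again; the function names are hypothetical.

```python
SEEK_MS = 9.0      # random access (positioning), from the slide
TRANSFER_MS = 1.0  # transfer of one page, from the slide


def sequential_ms(pages):
    """One positioning step, then all pages are transferred in a row."""
    return SEEK_MS + pages * TRANSFER_MS


def random_ms(pages):
    """Every page needs its own positioning step plus the transfer."""
    return pages * (SEEK_MS + TRANSFER_MS)


print(sequential_ms(20))                              # 29.0
print(random_ms(20))                                  # 200.0
print(round(random_ms(20) / sequential_ms(20), 1))    # 6.9
```

The ratio of about 6.9 is the "factor 7" quoted for runs of 20 pages.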

Variant 3: Secondary indexes on Facts
Problem: no tuple clustering and no page clustering!!
  create index SI (Facts, K1)
  create index SI (Facts, K2)
  select SI (Facts, c1) = list of ROWIDs
  select SI (Facts, c2) = list of ROWIDs, intersect
  select SI (Facts, i1) = list of lists of ROWIDs for interval i1 = Set1 of ROWIDs
  select SI (Facts, i2) = list of lists of ROWIDs for interval i2 = Set2 of ROWIDs

QueryBox ~ set of tuples with ROWID ∈ Set1 ∩ Set2
This requires the following steps:
  1. Sort Set1
  2. Sort Set2
  3. Compute the intersection
  4. For all ROWIDs r in the intersection: fetch (Facts.r)
==> random access to disk for every tuple in the answer
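The four steps can be sketched as follows. The table contents, the ROWID sets and the `fetch` callback are hypothetical; in a real system each `fetch` would be one random disk access.

```python
def intersect_sorted(a, b):
    """Merge-intersect two sorted lists of unique ROWIDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def query_box(set1, set2, fetch):
    s1 = sorted(set1)                   # step 1: sort Set1
    s2 = sorted(set2)                   # step 2: sort Set2
    rowids = intersect_sorted(s1, s2)   # step 3: compute the intersection
    return [fetch(r) for r in rowids]   # step 4: one fetch per result tuple


# Hypothetical Facts table: ROWID -> tuple
facts = {1: "t1", 2: "t2", 3: "t3", 7: "t7", 9: "t9"}
result = query_box([7, 3, 9, 1], [9, 2, 3], facts.__getitem__)
print(result)  # ['t3', 't9']
```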

Speed Comparison
Assumptions: 8 KB pages, 50 tuples per page (~160 B/tuple), disk parameters as before
Variant 1: compound B-tree, tuple clustering:
  (10 ms/page)/(50 tuples/page) = 200 µs/tuple
Variant 2: FTS, tuple clustering and page clustering:
  (29 ms/20 pages)/(50 tuples/page) = 29 µs/tuple
Variant 3: secondary indexes, no clustering:
  (10 ms/page)/(1 tuple/page) = 10,000 µs/tuple
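The per-tuple costs follow directly from the stated assumptions. A short sketch of the arithmetic, with hypothetical constant and function names, working in microseconds so that the values stay exact:

```python
PAGE_RANDOM_US = 10_000   # 10 ms for one randomly accessed page
RUN_SEQ_US = 29_000       # 29 ms for a sequential run of pages
RUN_PAGES = 20            # pages per sequential run
TUPLES_PER_PAGE = 50      # 8 KB page at roughly 160 B per tuple


def us_per_tuple(us_per_page, useful_tuples_per_page):
    """Cost of one result tuple if each page read yields this many result tuples."""
    return us_per_page / useful_tuples_per_page


v1 = us_per_tuple(PAGE_RANDOM_US, TUPLES_PER_PAGE)          # compound B-tree
v2 = us_per_tuple(RUN_SEQ_US / RUN_PAGES, TUPLES_PER_PAGE)  # full table scan
v3 = us_per_tuple(PAGE_RANDOM_US, 1)                        # secondary indexes

print(v1, v2, v3)      # 200.0 29.0 10000.0
print(v3 / v1)         # 50.0
print(round(v3 / v2))  # 345
```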

Conclusions
Tuple clustering gains a factor of 50 (depending on page and tuple size) over no clustering.
Page and tuple clustering gains a factor of 345 over no clustering.
Secondary indexes are a bad idea, except for point queries resulting in a single tuple!!!

Variant 4: Bit-Map indexes
Facts with ROWIDs 1, 2, ..., k
Assume that attribute A has potential values a_1, a_2, a_3, ..., a_lA
BMI(A) is a set of Boolean vectors, one for each of a_1, a_2, a_3, ..., a_lA
BMI(A)[a_i] = Boolean array BMI(A)[a_i][1:k]
BMI(A)[a_i][j] = true iff Facts.j.A = a_i, false otherwise, for ROWID j

[Figure: BMI(A) as a Boolean matrix with one column per ROWID 1, 2, ..., k and one row per value a_1, a_2, ... of A; entries are 0 or 1]

Note: in every column of BMI there is exactly one entry with value true
==> extremely sparse matrix, compression?
Bitmaps: store the rows of BMI in compressed form!
Secondary indexes: the entry for a_i is the list of ROWIDs that have true in row a_i, usually sorted by ROWID; this makes intersection more efficient and avoids additional sorting.
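A small sketch of the BMI definition, assuming an uncompressed in-memory representation (a dict of Boolean lists). The column values and function names are hypothetical. It also shows the secondary-index view of the same row as a sorted ROWID list.

```python
def build_bmi(column):
    """column[j-1] holds Facts.j.A for ROWIDs j = 1..k.

    Returns {a_i: Boolean list of length k}; entry j-1 is True iff Facts.j.A == a_i.
    """
    k = len(column)
    bmi = {}
    for j, value in enumerate(column):
        bmi.setdefault(value, [False] * k)[j] = True
    return bmi


def to_rowid_list(bitmap):
    """Secondary-index representation: sorted ROWIDs whose entry is True."""
    return [j + 1 for j, bit in enumerate(bitmap) if bit]


colors = ["red", "blue", "red", "green", "blue"]   # hypothetical attribute A
bmi = build_bmi(colors)

print(bmi["red"])                  # [True, False, True, False, False]
print(to_rowid_list(bmi["red"]))   # [1, 3]

# Every column (ROWID) has exactly one True entry across all rows.
trues_per_column = [sum(row[j] for row in bmi.values()) for j in range(len(colors))]
print(trues_per_column)            # [1, 1, 1, 1, 1]
```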

Queries: A = V_A and B = V_B
==> BMI(A)[V_A] and BMI(B)[V_B] yields the set of ROWIDs r with Facts.r.A = V_A and Facts.r.B = V_B
==> these ROWIDs are already sorted, and the tuples may be read pseudosequentially from the disk
==> for small result sets this requires 1 page access per result tuple: very slow, a factor of 50 slower than tuple clustering; see the later performance results in chapters 6, 7, 8.
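The conjunctive query can be sketched with Python integers as bit vectors, so that the Boolean "and" of two bitmaps is a single `&`. The columns and function names are hypothetical, and no compression is modelled.

```python
def bitmap(column, value):
    """Bit j of the result is set iff the tuple with ROWID j+1 has this value."""
    bits = 0
    for j, v in enumerate(column):
        if v == value:
            bits |= 1 << j
    return bits


def rowids(bits):
    """Decode a bitmap into ROWIDs; the result is sorted by construction."""
    out = []
    j = 0
    while bits:
        if bits & 1:
            out.append(j + 1)
        bits >>= 1
        j += 1
    return out


# Hypothetical attributes A and B of six fact tuples (ROWIDs 1..6)
A = ["x", "y", "x", "x", "y", "x"]
B = ["p", "p", "q", "p", "q", "p"]

hits = rowids(bitmap(A, "x") & bitmap(B, "p"))
print(hits)  # [1, 4, 6]
```

The ROWIDs come out in ascending order without a separate sort step, which is what allows the pseudosequential read of the tuples.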

Note: bit map indexes and secondary indexes are very similar:
  Bit map: representation of BMI as Boolean vectors
  Secondary index: representation of BMI as the list of those ROWIDs with entry true
  column representation ~ enumeration type

Variant 5: multidimensional index on the Facts table
  Grid-file
  R-tree
  R*-tree
  UB-tree
Decisive aspects (see chapter on UB-trees):
  tuple clustering
  page clustering
  sorted reading and writing
  utilizing all restrictions of the query box

Variant 6: Hash indexes
  no tuple clustering
  no page clustering
  no sorted reading and writing
  depends very much on the quality of the hash functions
  utilizing all restrictions of the query box only with multiple hash indexes

Variant 7: Join-Indexes
Idea: partial materialization of a view for a join R join_A S
Starting point are SI(R, A) and SI(S, A):
  SI(R, a) = set of ROWIDs of relation R
  SI(S, a) = set of ROWIDs of relation S
Join-Index JI(R, S, A):
  JI(R, S, a) = set of ROWID pairs whose tuples are join partners
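A minimal sketch of how JI(R, S, A) can be derived from the two secondary indexes. The relations are hypothetical dicts from ROWID to tuple, and the function names are illustrative only.

```python
from collections import defaultdict


def secondary_index(relation, attr):
    """SI(relation, attr): value a -> list of ROWIDs whose tuple has attr == a."""
    si = defaultdict(list)
    for rowid, tup in relation.items():
        si[tup[attr]].append(rowid)
    return si


def join_index(si_r, si_s):
    """JI(R, S, A): value a -> list of ROWID pairs (r, s) that are join partners."""
    ji = {}
    for a, r_ids in si_r.items():
        if a in si_s:
            ji[a] = [(r, s) for r in r_ids for s in si_s[a]]
    return ji


# Hypothetical relations: R is a dimension table, S is a fact table
R = {1: {"A": "k1", "city": "Munich"}, 2: {"A": "k2", "city": "Paris"}}
S = {10: {"A": "k1", "measure": 5.0},
     11: {"A": "k2", "measure": 7.0},
     12: {"A": "k1", "measure": 9.0}}

ji = join_index(secondary_index(R, "A"), secondary_index(S, "A"))
print(ji["k1"])  # [(1, 10), (1, 12)]
print(ji["k2"])  # [(2, 11)]
```

Producing a result tuple from a pair (r, s) still needs one access to R and one to S.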

Note: Result presentation with join-indexes requires 2 random accesses (one to R, one to S) to produce 1 result tuple. This is very fast for producing the first result; additional results arrive at about 50 tuples per second (2 x 10 ms = 20 ms per tuple with the disk parameters as before), faster than a person can read on the screen.
Note: in a join (R join_A S) the attribute A is usually a primary key of one involved relation (causing tuple clustering) and a secondary key in the other. Then sequential access with tuple clustering on one relation can be exploited, which roughly doubles the performance.
Note: In DWH applications the relation with the primary key is the dimension table and the relation with the foreign key is the fact table, therefore a slow solution.

Note: JI(R,S,A) "belongs" to 2 relations; this causes a novel index-update problem every time either R or S is updated.
Question: Simulation of JI(R,S,A) by SI(R,A) and SI(S,A) and query rewriting, i.e. optimization??