1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Physical Data Warehouse Design Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential.

Presentation transcript:

1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Physical Data Warehouse Design Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business

2 It’s all about trading storage for speed! Fundamentals Aggregates (Ch. 16, pp ) Indexes (Ch. 16, p. 357)

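3 Fundamentals: the Storage Hierarchy [Figure: CPU (MIPS), Cache (512 KB), Memory (512 MB), Disk (512 GB). Storage capacity runs from small (cache) to large (disk); access speed runs from fast (cache) to slow (disk), with each level's access time given in seconds.]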

4 Fundamentals: the Storage Hierarchy [Figure: CPU, Cache, Bus, Memory, Disk Drive (I/O Channel), Disk.] How long does it take to query sales by city? How large is the Sales Fact table? How long does it take to access the Sales Fact table?

5 Fundamentals How large is the fact table? E.g., 1 million records/day × 0.2 KB/record → 0.2 GB/day

6 Fundamentals How long does it take to access all the fact records? E.g., even a small fact table is 1 Terabyte in size! –Reading one byte per access at 0.01 s per access: 0.01 s × 10^12 = 10^10 s ≈ 317 years. LONG!

7 Fundamentals: the Storage Hierarchy CPU Memory Disk Disk Drive (I/O Channel) Cache Bus The logical unit of data transferred between disk and memory is block (e.g., 4k bytes)

8 Fundamentals How long does it take to access all the fact records? E.g., the small fact table is 1 Terabyte in size! –Number of blocks: 1 TB / 4 KB = 2.5 × 10^8 (250 million) –Access time = 0.01 s × 2.5 × 10^8 = 2.5 × 10^6 s ≈ 29 days, down from centuries when reading byte by byte.
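The full-scan estimate can be recomputed directly from the stated figures. This is a minimal sketch, assuming a 1 TB table, 4 KB blocks counted in decimal units, and 0.01 s per disk access; the names `scan_seconds`, `TABLE_BYTES`, and so on are illustrative and not from the slides. With these figures, 1 TB / 4 KB works out to 2.5 × 10^8 blocks.

```python
# Figures from the slides: 1 TB fact table, 4 KB blocks, 0.01 s per access.
TABLE_BYTES = 10**12
BLOCK_BYTES = 4_000        # "4k bytes", decimal units (an assumption)
ACCESS_SECONDS = 0.01


def scan_seconds(unit_bytes: int) -> float:
    """Time to read the whole table when each access transfers unit_bytes."""
    accesses = TABLE_BYTES // unit_bytes
    return accesses * ACCESS_SECONDS


byte_at_a_time = scan_seconds(1)             # one byte per access
block_at_a_time = scan_seconds(BLOCK_BYTES)  # one block per access

SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

print(f"blocks          : {TABLE_BYTES // BLOCK_BYTES:,}")
print(f"byte at a time  : {byte_at_a_time / SECONDS_PER_YEAR:.0f} years")
print(f"block at a time : {block_at_a_time / SECONDS_PER_DAY:.1f} days")
```

Block transfer cuts the number of accesses by a factor equal to the block size, which is why the block is the unit that matters for every later estimate in this deck.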

9 Aggregate In data warehouse design, we choose the grain of the fact table to be the lowest possible level. Grain: order line

10 Aggregate The reason to choose the lowest level of detail for facts: –(X) Not because analysts want to query single records –(O) Because analysts want to flexibly slice and group records.

11 Aggregate However, keeping the most detailed fact records could result in –a huge fact table: terabytes?! (1 million records/day, 256 bytes/record → about 0.25 GB/day) –slow queries

12 Aggregate To keep a data warehouse flexible, fact tables need to store facts at their lowest level of detail. To improve query performance, it helps to add another type of fact table that stores pre-computed summaries of the detailed facts. Reduced to a logical design solution

13 Aggregate An aggregate fact table is a fact table that summarizes base-level fact table records along one or several dimensions. An aggregate dimension table is a dimension table that summarizes base-level dimension table records. E.g., marketing managers check daily product sales by city --- aggregate by city in customer dimension
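The daily-product-sales-by-city example can be sketched with an in-memory SQLite database. This is only an illustration under assumed names: the tables `sales_fact`, `customer`, `city_dim`, and `sales_by_city_fact`, their columns, and all sample rows are hypothetical, and the city name itself stands in for a surrogate key to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base-level dimension and fact tables (hypothetical schema and rows).
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, cname TEXT, city TEXT);
CREATE TABLE sales_fact (date_key TEXT, product_key INTEGER,
                         customer_key INTEGER, amount REAL);

INSERT INTO customer VALUES
  (1, 'Tom Jones', 'Salt Lake City'),
  (2, 'Ann Lee',   'Salt Lake City'),
  (3, 'Bo Chen',   'Provo');

INSERT INTO sales_fact VALUES
  ('2004-01-05', 10, 1, 100.0),
  ('2004-01-05', 10, 2,  50.0),
  ('2004-01-05', 10, 3,  70.0),
  ('2004-01-06', 11, 1,  30.0);

-- Aggregate dimension table: customers rolled up to one row per city.
CREATE TABLE city_dim AS
  SELECT DISTINCT city FROM customer;

-- Aggregate fact table: sales summarized along the customer dimension.
CREATE TABLE sales_by_city_fact AS
  SELECT f.date_key, f.product_key, c.city, SUM(f.amount) AS amount
  FROM sales_fact AS f
  JOIN customer AS c ON c.customer_key = f.customer_key
  GROUP BY f.date_key, f.product_key, c.city;
""")

rows = conn.execute(
    "SELECT date_key, product_key, city, amount "
    "FROM sales_by_city_fact ORDER BY date_key, product_key, city"
).fetchall()
cities = conn.execute("SELECT city FROM city_dim ORDER BY city").fetchall()

for row in rows:
    print(row)
```

Queries at city level then read the smaller `sales_by_city_fact` table, while the base-level `sales_fact` table remains available for any other way of slicing the data.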

14 Aggregate Aggregate fact table Aggregate dimension table

15 Aggregate

16 Indexes How long does it take to find out the total purchase Amt by Tom Jones?

17 Indexes Customer table –1M records, each record 0.2 Kbytes long –Block size is 4 Kbytes, block access time is 0.01 s –Number of records/block: 4/0.2 = 20 –Number of blocks: 1M/20 = 50K Sequential search –Reads half the blocks on average: 25K × 0.01 s = 250 s ≈ 4 min.

18 Indexes Binary search (file sorted on the search key) –Time: log2(50K) × 0.01 s ≈ 16 × 0.01 s = 0.16 s B+ tree index –Create index pn on customer(cname); –If each node (block) in the B+ tree has 117 keys, then # of accesses to the index: log117(1M) ≈ 3 (i.e., the height of the tree); # of accesses to the Customer Dimension: 1; total time = 4 × 0.01 s = 0.04 s
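The three lookup costs can be checked with a few lines of arithmetic. A minimal sketch, assuming the figures used in this example (1M customer records, 20 records per block, 117 keys per B+ tree node, 0.01 s per block access); the variable names are illustrative.

```python
import math

RECORDS = 1_000_000
RECORDS_PER_BLOCK = 20      # 4 KB block / 0.2 KB record
KEYS_PER_NODE = 117         # fan-out assumed for the B+ tree
BLOCK_ACCESS_S = 0.01

blocks = RECORDS // RECORDS_PER_BLOCK

# Sequential search: on average half of the blocks are read.
sequential_accesses = blocks // 2

# Binary search over the sorted file: about log2(blocks) block reads.
binary_accesses = math.ceil(math.log2(blocks))

# B+ tree: one block read per index level, plus one read for the record.
btree_height = math.ceil(math.log(RECORDS, KEYS_PER_NODE))
btree_accesses = btree_height + 1

for name, accesses in [
    ("sequential", sequential_accesses),
    ("binary", binary_accesses),
    ("B+ tree", btree_accesses),
]:
    print(f"{name:10s} {accesses:6d} accesses  {accesses * BLOCK_ACCESS_S:8.2f} s")
```

The B+ tree wins because its logarithm has base 117 rather than base 2, so the number of block reads grows far more slowly with table size.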

19 B+-trees, P=12 (11 key values, 12 pointers per node) [Figure: upper levels hold indexes to indexes; the lowest level holds indexes to customer records.]

20 Indexes How long does it take to find out the total sales of Desktop computers?

21 Performance Improvement Suppose there are only 4 product categories for 1M products Create a B+ tree index??? –Suppose the size of product category and block ID is 10 bytes –Size of index = 1M * 10 = 10 M bytes

22 Performance Improvement A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.

23 Bitmaps [Figure: the Product table with one bit position per record (record 1, record 2, record 3, ...), illustrating the bitmap index definition above.]
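The definition translates almost directly into code. The sketch below builds one bit vector per distinct value of an attribute; `build_bitmap_index` and the sample category values are hypothetical, and plain Python lists stand in for packed bit vectors to keep the positions easy to read.

```python
def build_bitmap_index(values):
    """Return {value: bit vector}; position i is 1 if record i holds that value."""
    n = len(values)
    index = {}
    for i, value in enumerate(values):
        # Create an all-zero vector the first time a value is seen.
        vector = index.setdefault(value, [0] * n)
        vector[i] = 1
    return index


# Product category of records 1..5 (hypothetical sample data).
categories = ["Desktop", "Laptop", "Desktop", "Printer", "Laptop"]
bitmap = build_bitmap_index(categories)

for value, vector in bitmap.items():
    print(f"{value:8s} {vector}")
```

Every record sets exactly one bit across all vectors, so the vectors for the different values never overlap and together cover every record.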

24 Performance Improvement A bitmap index is suitable for a low-cardinality attribute. –Cardinality(A) = # of possible values for A / # of records Compared with a B+ tree index, a bitmap index has the following advantages for low-cardinality attributes: –Storage space saving (1M × 4 / 8 = 500K bytes, versus 10M bytes for the B+ tree index) –Efficient for Boolean operations CREATE BITMAP INDEX bitpc ON PRODUCT (PCNAME);
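Both advantages can be illustrated briefly. This is a sketch under the stated figures (1M product records, 4 categories, 10-byte B+ tree entries); the `in_stock` attribute, the helper `to_bits`, and the sample flags are hypothetical, and Python integers are used as bit vectors so that AND and OR are single operations.

```python
RECORDS = 1_000_000
DISTINCT_VALUES = 4          # product categories
BTREE_ENTRY_BYTES = 10       # category value plus block ID

# Storage: one index entry per record vs. one bit per record per value.
btree_bytes = RECORDS * BTREE_ENTRY_BYTES
bitmap_bytes = RECORDS * DISTINCT_VALUES // 8
print(f"B+ tree: {btree_bytes:,} bytes, bitmap: {bitmap_bytes:,} bytes")


def to_bits(flags):
    """Pack a list of 0/1 flags into an int; bit i corresponds to record i."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i
    return bits


# Bit vectors over five sample records (hypothetical data).
desktop = to_bits([1, 0, 1, 0, 1])    # category = 'Desktop'
in_stock = to_bits([1, 1, 0, 0, 1])   # a second, assumed attribute

both = desktop & in_stock             # Desktop AND in stock
either = desktop | in_stock           # Desktop OR in stock

matching_both = [i for i in range(5) if (both >> i) & 1]
matching_either = [i for i in range(5) if (either >> i) & 1]
print(matching_both, matching_either)
```

Combining conditions becomes a bitwise operation over the vectors, and only the records whose bit survives need to be fetched from the table.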