Ahsan Abdullah 1 Data Warehousing Lecture-9 Issues of De-normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.


Ahsan Abdullah 1 Data Warehousing Lecture-9 Issues of De-normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research National University of Computers & Emerging Sciences, Islamabad

Ahsan Abdullah 2 Issues of De-normalization

Ahsan Abdullah 3 Why Issues?

Ahsan Abdullah 4 Issues of Denormalization
• Storage
• Performance
• Ease-of-use
• Maintenance

Ahsan Abdullah 5 Industry Characteristics: Master:Detail Ratios
• Health care 1:2 ratio
• Video rental 1:3 ratio
• Retail 1:30 ratio

Ahsan Abdullah 6 Storage Issues: Pre-joining Facts
• Assume a 1:2 record count ratio between claim master and detail for a health-care application.
• Assume 10 million members (20 million records in claim detail).
• Assume a 10-byte member_ID.
• Assume a 40-byte header for the master table and a 60-byte header for the detail table.

Ahsan Abdullah 7 Storage Issues: Pre-joining (Calculations)
With normalization: Total space used = 10 million x 40 bytes + 20 million x 60 bytes = 1.6 GB
After denormalization: Total space used = (60 + 40 - 10) bytes x 20 million = 1.8 GB
Net result is 12.5% additional space required in raw data table size for the database.
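The storage arithmetic can be checked with a short sketch. It uses only the sizes assumed on the facts slide; the variable names are illustrative, and "GB" is taken here to mean 10^9 bytes.

```python
# Sizes from the health-care example; GB is assumed to mean 10**9 bytes.
MEMBERS = 10_000_000        # rows in the claim master table
DETAIL_PER_MASTER = 2       # 1:2 master:detail record ratio
MASTER_ROW_BYTES = 40       # header (row) size of the master table
DETAIL_ROW_BYTES = 60       # header (row) size of the detail table
MEMBER_ID_BYTES = 10        # key that would otherwise be stored twice per pre-joined row
GB = 10**9

detail_rows = MEMBERS * DETAIL_PER_MASTER

# Normalized: master and detail tables stored separately.
normalized_bytes = MEMBERS * MASTER_ROW_BYTES + detail_rows * DETAIL_ROW_BYTES

# Pre-joined: every detail row carries the master columns, with member_ID kept once.
prejoined_row_bytes = MASTER_ROW_BYTES + DETAIL_ROW_BYTES - MEMBER_ID_BYTES
denormalized_bytes = detail_rows * prejoined_row_bytes

extra_fraction = (denormalized_bytes - normalized_bytes) / normalized_bytes

print(f"normalized:   {normalized_bytes / GB:.1f} GB")
print(f"denormalized: {denormalized_bytes / GB:.1f} GB")
print(f"extra space:  {extra_fraction:.1%}")
```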

Ahsan Abdullah 8 Performance Issues: Pre-joining
Consider the query "How many members were paid claims during last year?"
With normalization: Simply count the number of records in the master table.
After denormalization: The member_ID is repeated on every detail row, so the query needs a count distinct. This forces a sort on a larger table and degrades performance.

Ahsan Abdullah 9 Why Performance Issues: Pre-joining
Depending on the query, performance can actually deteriorate with denormalization! This is due to the following three reasons:
• Forcing a sort due to the count distinct.
• Using a table with 1.5 times the header size (90 bytes instead of 60).
• Using a table which is 2 times larger (20 million records instead of 10 million).
The last two factors combine (1.5 x 2) into a 3 times degradation in performance.
Bottom line: other than the 0.2 GB of additional space, also keep the 0.4 GB master table.
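A minimal sketch of the member-count query on both layouts, using Python's built-in sqlite3 module. The table and column names (claim_master, claim_prejoined, plan, amount) and the five sample members are made up for illustration; only the 1:2 ratio and the 1.5 and 2 times factors come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claim_master (member_id TEXT PRIMARY KEY, plan TEXT);
    -- Pre-joined table: master columns repeated on every detail row.
    CREATE TABLE claim_prejoined (
        claim_id INTEGER PRIMARY KEY, member_id TEXT, plan TEXT, amount REAL
    );
""")

# Hypothetical data: 5 members, 2 claims each (the 1:2 health-care ratio).
members = [(f"M{i:03d}", "basic") for i in range(5)]
conn.executemany("INSERT INTO claim_master VALUES (?, ?)", members)
claim_id = 0
for member_id, plan in members:
    for _ in range(2):
        claim_id += 1
        conn.execute(
            "INSERT INTO claim_prejoined VALUES (?, ?, ?, ?)",
            (claim_id, member_id, plan, 100.0),
        )

# Normalized: a plain row count on the small master table is enough.
normalized_count = conn.execute("SELECT COUNT(*) FROM claim_master").fetchone()[0]

# Denormalized: a plain count over-counts; COUNT(DISTINCT ...) is required.
naive_count = conn.execute("SELECT COUNT(*) FROM claim_prejoined").fetchone()[0]
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT member_id) FROM claim_prejoined"
).fetchone()[0]

# Scan-volume factor from the slides: wider rows times more rows.
row_width_factor = 90 / 60   # pre-joined row vs. detail row
row_count_factor = 2         # 20 million rows vs. 10 million
scan_factor = row_width_factor * row_count_factor

print(normalized_count, naive_count, distinct_count, scan_factor)
```

The scan factor covers only the extra bytes read; the cost of the sort forced by the count distinct comes on top of it.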

Ahsan Abdullah 10 Performance Issues: Adding redundant columns
Continuing with the previous health-care example, assume a 60-byte detail table row and a 10-byte Sale_Person column.
• Copying Sale_Person to the detail table results in all scans taking about 16.7% longer than before (70 bytes per row instead of 60).
• Justifiable only if a significant portion of queries benefit from accessing the denormalized detail table.
• Need to look at the cost-benefit trade-off for each denormalization decision.
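The scan overhead follows directly from the row widths. The sketch below also adds a toy break-even model for the cost-benefit question; that model, with its function names and the assumed saving per query, is a hypothetical illustration and not part of the lecture.

```python
def scan_overhead(row_bytes: int, added_bytes: int) -> float:
    """Fractional increase in bytes scanned when a row grows by added_bytes."""
    return added_bytes / row_bytes


def net_gain(benefit_fraction: float, saving_per_query: float, overhead: float) -> float:
    """Toy model (assumption): every query pays `overhead` extra scan cost,
    while a fraction of queries saves `saving_per_query` by avoiding a join.
    Costs are in units of one scan of the original detail table."""
    return benefit_fraction * saving_per_query - overhead


def break_even_fraction(overhead: float, saving_per_query: float) -> float:
    """Share of queries that must benefit before the redundant column pays off."""
    return overhead / saving_per_query


overhead = scan_overhead(row_bytes=60, added_bytes=10)

# Hypothetical: avoiding the join saves as much as one full scan.
needed = break_even_fraction(overhead, saving_per_query=1.0)

print(f"scan overhead: {overhead:.1%}")
print(f"break-even share of queries: {needed:.1%}")
```

Under this assumed saving, the redundant column only pays off once roughly one query in six benefits from it; a smaller saving per query raises that threshold.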

Ahsan Abdullah 11 Other Issues: Adding redundant columns
Other issues include an increase in table size, maintenance overhead, and loss of information:
• The size of the largest table (the transaction table) increases by the size of the Sale_Person key.
• For the example being considered, 20 million detail rows growing from 60 to 70 bytes take the detail table from 1.2 GB to 1.4 GB.
• If the Sale_Person key changes (e.g. a new 12-digit NID), the update must be propagated all the way to the transaction table.
• In the absence of a 1:M relationship, moving the column will actually result in loss of data.

Ahsan Abdullah 12 Ease of use Issues: Horizontal Splitting
Horizontal splitting is a divide-and-conquer technique that exploits parallelism. The conquer part of the technique is about combining the results. Let's see how it works for hash-based splitting/partitioning.
• Assuming uniform hashing, hash splitting supports even data distribution across all partitions in a pre-defined manner.
• However, hash-based splitting is not easily reversible to eliminate the split.
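A minimal sketch of hash-based splitting, assuming member IDs as the partitioning key, four partitions, and SHA-256 as a stand-in for whatever hash function a real DBMS would use. All names here are illustrative.

```python
import hashlib
from collections import Counter


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition number; the same key always lands in the same place."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


NUM_PARTITIONS = 4  # assumed number of partitions
member_ids = [f"M{i:07d}" for i in range(100_000)]  # made-up keys

# Even distribution: each partition receives close to 25% of the rows.
counts = Counter(hash_partition(m, NUM_PARTITIONS) for m in member_ids)

# Pre-defined placement: the partition of any key can be computed without searching.
target = hash_partition("M0000042", NUM_PARTITIONS)

for partition in sorted(counts):
    print(partition, counts[partition])
```

Because neighbouring keys scatter across partitions, a partition's contents say nothing about key ranges, which is one way to see why undoing the split is not straightforward.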

Ahsan Abdullah 13 Ease of use Issues: Horizontal Splitting ?

Ahsan Abdullah 14 Ease of use Issues: Horizontal Splitting
Round robin and random splitting:
• Guarantee good data distribution.
• Almost impossible to reverse (or undo).
• Not pre-defined.
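A small sketch of round-robin splitting under illustrative assumptions (ten made-up rows, three partitions). It shows both properties: partition sizes differ by at most one row, but a row's partition depends on its arrival position rather than its content, so placement is not pre-defined.

```python
def round_robin_split(rows, num_partitions):
    """Deal rows out to partitions in arrival order, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


def locate(partitions, row):
    """Finding a row means searching the partitions; nothing can be computed from the row itself."""
    for index, partition in enumerate(partitions):
        if row in partition:
            return index
    raise KeyError(row)


rows = [f"row-{i}" for i in range(10)]  # hypothetical rows

parts = round_robin_split(rows, 3)
sizes = [len(p) for p in parts]

# The same rows arriving in a different order end up placed differently.
parts_reversed = round_robin_split(list(reversed(rows)), 3)
first_load = locate(parts, "row-1")
second_load = locate(parts_reversed, "row-1")

print(sizes, first_load, second_load)
```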

Ahsan Abdullah 15 Ease of use Issues: Horizontal Splitting
Range and expression splitting:
• Can facilitate partition elimination with a smart optimizer.
• Generally lead to "hot spots" (uneven distribution of data).

Ahsan Abdullah 16 Performance Issues: Horizontal Splitting
[Figure: data split by year across processors (P1, P2, P3, ...). The dramatic cancellation of airline reservations after 9/11 results in a "hot spot".]
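A sketch of range splitting by year, illustrating both partition elimination and a hot spot. The per-year row counts are invented purely to mimic the airline example, and the function names are hypothetical.

```python
from collections import defaultdict


def range_split_by_year(rows):
    """Place each row in the partition for its year."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["year"]].append(row)
    return partitions


def partitions_to_scan(partitions, year_from, year_to):
    """Partition elimination: only partitions overlapping the queried range are read."""
    return [year for year in sorted(partitions) if year_from <= year <= year_to]


# Made-up transaction volumes; 2001 is deliberately inflated to create a hot spot.
rows_per_year = {1999: 100, 2000: 100, 2001: 400, 2002: 100}
rows = [
    {"year": year, "txn_id": f"{year}-{n}"}
    for year, count in rows_per_year.items()
    for n in range(count)
]

parts = range_split_by_year(rows)

# A query on 2001 touches one partition instead of four.
scanned = partitions_to_scan(parts, 2001, 2001)

# Skew: the busiest partition relative to the average partition size.
sizes = [len(p) for p in parts.values()]
skew = max(sizes) / (sum(sizes) / len(sizes))

print(scanned, sizes, round(skew, 2))
```

With one partition per processor, the processor holding the inflated year does more than twice the average work in this example while the others sit idle.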

Ahsan Abdullah 17 Performance issues: Vertical Splitting Facts
Example: Consider a 100-byte header for the member table such that 20 bytes provide complete coverage for 90% of the queries. Split the member table into two parts as follows:
1. Frequently accessed portion of the table (20 bytes), and
2. Infrequently accessed portion of the table (80+ bytes).
Why 80+? The primary key (member_id) must be present in both tables so that the split can be eliminated by joining the two parts back together.

Ahsan Abdullah 18 Performance issues: Vertical Splitting Good vs. Bad
Good: Scanning the frequently accessed portion of the table for the most frequently used queries will be about 5 times faster with vertical splitting (20 bytes per row instead of 100).
Bad: Ironically, for the "infrequently" accessed queries the performance will be inferior compared to the un-split table because of the join overhead.
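A sketch of the vertical-splitting trade-off. The 100-byte and 20-byte figures come from the example; the 10-byte member_id is borrowed from the earlier pre-joining example, and the assumption that the 20 frequently accessed bytes already include the key is mine. Column names in the sample row are made up.

```python
ROW_BYTES = 100   # un-split member table row
HOT_BYTES = 20    # frequently accessed portion (assumed to include member_id)
KEY_BYTES = 10    # member_id, assumed 10 bytes as in the earlier example

COLD_BYTES = ROW_BYTES - HOT_BYTES + KEY_BYTES  # the "80+": 80 bytes plus the repeated key
scan_speedup = ROW_BYTES / HOT_BYTES            # fewer bytes read by frequent queries


def vertical_split(rows, key, hot_columns):
    """Split each row into a hot part and a cold part, both carrying the key."""
    hot = [{c: r[c] for c in (key, *hot_columns)} for r in rows]
    cold = [{c: v for c, v in r.items() if c == key or c not in hot_columns} for r in rows]
    return hot, cold


def rejoin(hot, cold, key):
    """Eliminate the split: join the two parts back together on the key."""
    cold_by_key = {r[key]: r for r in cold}
    return [{**h, **cold_by_key[h[key]]} for h in hot]


members = [  # hypothetical rows
    {"member_id": "M1", "name": "A", "address": "X", "notes": "n1"},
    {"member_id": "M2", "name": "B", "address": "Y", "notes": "n2"},
]
hot, cold = vertical_split(members, "member_id", ("name",))
restored = rejoin(hot, cold, "member_id")

print(COLD_BYTES, scan_speedup, restored == members)
```

The rejoin step is the join overhead that queries touching the infrequently accessed columns have to pay.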