SEWI ZG514 Data Warehousing


SEWI ZG514 Data Warehousing
Performance Enhancing Techniques: Partitioning Strategy, Aggregation
Purushotham BV, utham74@gmail.com
Confidential © 2008 Wipro Ltd

Performance Enhancing Techniques
Partitioning Strategy: Introduction, Horizontal Partitioning, Vertical Partitioning, Hardware Partitioning, Which Key to Partition By?, Sizing the Partition
Aggregations: Why Aggregate?, What Is an Aggregation?, Designing Summary Tables, Which Summaries to Create?
Summary

Partitioning
Partitioning is performed for a number of performance-related and manageability reasons, and the strategy as a whole must balance all the various requirements. Partitioning is needed in any large data warehouse to ensure that performance and manageability are improved. It also helps query redirection send each query to the appropriate partition, reducing the overall query processing time. Three types: Horizontal Partitioning, Vertical Partitioning, Hardware Partitioning.

Horizontal Partitioning
The table is split row-wise: the first few thousand rows go into one partition, the next few thousand into another, and so on. This works because, in most cases, not all the information in the fact table is needed all the time. Horizontal partitioning therefore reduces query access time by directly cutting down the amount of data the queries must scan. Horizontally partitioning the fact table is a good way to speed up queries by minimizing the set of data to be scanned (without using an index). In short: partition the fact table into segments.

Horizontal Partitioning (Contd.)
The segments may be of different sizes, because the number of transactions in the business at a given point in the year may not be the same. Example: higher transaction volumes at peak periods such as Christmas, when the sales fact table is partitioned monthly.

Horizontal Partitioning
There are various ways in which fact data can be partitioned; before deciding on the optimum solution, we have to consider the manageability requirements of the data warehouse. The options are: Partitioning by Time into Equal Segments, Partitioning by Time into Different-Sized Segments, Partitioning on a Different Dimension, Partitioning by Size of Table, Using Round-Robin Partitions.

Partitioning by Time into Equal Segments
Partition the fact table on a time-period basis, for example into monthly segments, so that the number of tables does not exceed the order of 500. A few of the partitions will store transactions from a busy period in the business, and the rest may be substantially smaller. This is the most straightforward method: partitioning by month, year, etc.

Partitioning by Time into Equal Segments (Contd.)
This helps when queries often concern fortnightly or monthly performance, sales, etc.
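As a minimal sketch (all names here are illustrative, not from the slides), equal time-based segmentation can be mimicked by routing each fact row to a bucket keyed by year and month, so a monthly query scans only one segment:

```python
from collections import defaultdict
from datetime import date

# Route each fact row to a monthly partition keyed by "YYYY-MM".
def partition_key(txn_date: date) -> str:
    return f"{txn_date.year:04d}-{txn_date.month:02d}"

def load(rows):
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["txn_date"])].append(row)
    return partitions

rows = [
    {"txn_date": date(2008, 1, 15), "amount": 100},
    {"txn_date": date(2008, 1, 20), "amount": 250},
    {"txn_date": date(2008, 2, 3), "amount": 75},
]
parts = load(rows)
# A query about January now scans only parts["2008-01"], not the whole table.
print(sorted(parts))          # ['2008-01', '2008-02']
print(len(parts["2008-01"]))  # 2
```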

Advantages and Disadvantages
The advantage is that the slots are reusable. Suppose we are sure that we will no longer need data from 10 years back; then we can simply delete the data in that slot and use it again. The serious drawback of this scheme appears when the partitions differ too much in size. The number of visitors to a hill station, say, will be much larger in the summer months than in the winter months, so each segment must be big enough to take care of the summer rush. This, of course, means wasted space in the winter-month segments.

Partitioning by Time into Different-Sized Segments
Three monthly partitions for the last three months (including the current month), one quarterly partition for the previous quarter, and one half-year partition for the remainder of the year.
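The scheme above can be sketched as a small routing function (a hedged illustration; the bucket names and cut-offs are hypothetical) that maps a transaction date to a monthly, quarterly, or half-year bucket depending on its age:

```python
from datetime import date

# Different-sized segments: three monthly buckets for the most recent months,
# one bucket for the previous quarter, one for the remainder of the year.
def bucket(txn_date: date, today: date) -> str:
    months_back = (today.year - txn_date.year) * 12 + (today.month - txn_date.month)
    if months_back < 3:
        return f"month-{txn_date.year:04d}-{txn_date.month:02d}"
    if months_back < 6:
        return "previous-quarter"
    return "older-half-year"

today = date(2008, 6, 15)
print(bucket(date(2008, 5, 1), today))   # month-2008-05
print(bucket(date(2008, 2, 1), today))   # previous-quarter
print(bucket(date(2007, 9, 1), today))   # older-half-year
```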

Advantages and Disadvantages
Detailed information remains available online, without having to resort to aggregations, and the number of physical tables is kept relatively small, reducing operating costs. This technique is particularly appropriate in environments that require a mix of detailed dipping into recent history and aggregate queries over older data. On the other hand, the partitioning profile will change on a regular basis, and this repartitioning will increase the operational cost of the data warehouse.

Partitioning on a Different Dimension
Data need not always be partitioned on time, though time is a safe and relatively straightforward choice. The fact table can instead be partitioned on the different regions of operation, the different items under consideration, or any other such dimension. This works well when most queries are likely to concern region-wise performance, region-wise sales, etc.

Partitioning on a Different Dimension (Contd.)
If we are interested in the total performance of all regions, the total sales of a month, or the total sales of a product, then region-wise partitioning could be a disadvantage, since each such query has to scan several partitions.

Partitioning by Size of Table
Sometimes we cannot be sure of any dimension on which to partition, neither time nor products nor regions, nor of the type of queries we are likely to encounter frequently. In such cases it is appropriate to partition by size: load data until a pre-specified storage limit is consumed, then create a new partition. This, however, creates a rather unstructured situation, similar to simply dumping objects in a room, so metadata (data about data) is normally needed to keep track of what is stored in each partition.
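A minimal sketch of the idea (the row-count threshold below is a hypothetical stand-in for a storage-size limit): rows are appended to the current partition until it fills, a new partition is then opened, and a metadata list records which rows each partition holds:

```python
MAX_ROWS = 3  # illustrative stand-in for a storage-size threshold

partitions = [[]]

def append(row):
    # Open a fresh partition once the current one reaches the limit.
    if len(partitions[-1]) >= MAX_ROWS:
        partitions.append([])
    partitions[-1].append(row)

for row_id in range(7):
    append({"id": row_id})

# Metadata: (partition index, first row id, last row id) per partition.
metadata = [(i, p[0]["id"], p[-1]["id"]) for i, p in enumerate(partitions)]

print([len(p) for p in partitions])  # [3, 3, 1]
print(metadata)                      # [(0, 0, 2), (1, 3, 5), (2, 6, 6)]
```

Without the metadata list there is no way to tell which partition holds a given row, which is exactly why the slide insists on keeping it.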

Using Round-Robin Partitions
Once the warehouse is holding its full quota of data, a new partition can be created only by reusing the oldest one. Metadata is then needed to record where the retained historical data begins and ends. This method, though simple, can run into trouble if the partitions are not the same size; special techniques to hold the overflowing data may become necessary.
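Round-robin reuse can be sketched with a fixed ring of slots (an illustration under the assumption of equal-sized monthly slots; a bounded deque plays the role of the slot ring, with its contents serving as the metadata for the retained history):

```python
from collections import deque

N_SLOTS = 3
ring = deque(maxlen=N_SLOTS)  # each entry: (month_label, rows)

def load_month(label, rows):
    # When all slots are full, the oldest slot is dropped and reused.
    ring.append((label, rows))

for month in ["2008-01", "2008-02", "2008-03", "2008-04"]:
    load_month(month, [])

# The ring now records that history spans 2008-02..2008-04;
# the 2008-01 slot was reused for 2008-04.
print([label for label, _ in ring])  # ['2008-02', '2008-03', '2008-04']
```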

Vertical Partitioning
As the name suggests, a vertical partitioning scheme divides the table vertically, i.e. by columns, so that each row is split across two or more partitions. It can take the form of normalization or of row splitting.

Vertical Partitioning (Contd.)
Consider the following table: [example table from the original slide not reproduced in the transcript]

Normalization
The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated it does not lead to anomalies.

Row Splitting
This method involves identifying the less frequently used fields and moving them into another table. This ensures that the frequently used fields can be accessed more often, in much less time.
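Row splitting can be sketched as follows (a hedged illustration; the column names and the hot/cold split are hypothetical): frequently used "hot" columns stay in one table, rarely used "cold" columns move to another, with the key kept in both so the row can be reassembled:

```python
HOT_COLS = {"id", "product", "amount"}  # hypothetical frequently used fields

fact = [
    {"id": 1, "product": "tea", "amount": 10, "notes": "...", "audit_blob": "..."},
    {"id": 2, "product": "coffee", "amount": 20, "notes": "...", "audit_blob": "..."},
]

# Hot table keeps only the frequently used columns; cold table keeps the
# key plus everything else, so a join on "id" reconstructs the full row.
hot = [{k: r[k] for k in r if k in HOT_COLS} for r in fact]
cold = [{k: r[k] for k in r if k == "id" or k not in HOT_COLS} for r in fact]

print(sorted(hot[0]))   # ['amount', 'id', 'product']
print(sorted(cold[0]))  # ['audit_blob', 'id', 'notes']
```

Most queries now touch only the narrow hot table, which is the whole point of the technique.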

Hardware Partitioning
The data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design for the specific hardware architecture; the exact details of the optimization depend on the hardware platform. Normally the following guidelines are useful: maximize the available processing power, maximize disk and I/O operations, and reduce bottlenecks at the CPU and in I/O throughput.

Maximizing the Processing and Avoiding Bottlenecks
One way to ensure faster processing is to split a query into several parallel sub-queries, convert them into parallel threads, and run them in parallel. This works only when there are enough processors, or enough processing power, to ensure that the threads can actually run in parallel. Example: to run five threads it is not strictly necessary to have five processors; even fewer processors can do the job, provided they are fast enough to avoid becoming a bottleneck. Shared architectures are ideal for such situations, because one can be almost sure that sufficient processing power is available most of the time.
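The split-and-run-in-parallel idea can be sketched with a thread pool (an illustration, not the slide's method; the partition data and the sub-query are made up, and a real system would run each sub-query on a separate node or process):

```python
from concurrent.futures import ThreadPoolExecutor

# One sub-query per partition; partial results are merged at the end.
partitions = {
    "north": [10, 20], "south": [5], "east": [7, 8], "west": [1],
}

def sub_query(rows):
    return sum(rows)  # stand-in for scanning one partition

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sub_query, partitions.values()))

print(sum(partials))  # 51 — same answer as one serial scan of everything
```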

Maximizing the Processing and Avoiding Bottlenecks (Contd.)
In such a networked environment, where each processor can access data on several active disks, several problems of data contention and data integrity need to be resolved.

Striping Data Across MPP Nodes
This mechanism distributes the data by dividing a large table into several smaller units and storing one on each node's disks. These sub-tables need not be of equal size, but are distributed so as to ensure optimum query performance. The trick is to ensure that queries are directed to the respective processors, which access the corresponding data disks to service them.
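One common way to realize this striping, shown here as a hedged sketch (hash distribution is an assumption; the slide does not prescribe how rows are assigned to nodes), is to hash the key so a point query can be routed straight to the one node holding it:

```python
from zlib import crc32

N_NODES = 4
nodes = [[] for _ in range(N_NODES)]  # one sub-table per MPP node

def node_for(key: str) -> int:
    # Deterministic hash routing: the same key always maps to the same node.
    return crc32(key.encode()) % N_NODES

for key in ["cust-1", "cust-2", "cust-3", "cust-4"]:
    nodes[node_for(key)].append(key)

# Every row lands on exactly one node; a query on "cust-2" is directed
# only to nodes[node_for("cust-2")].
print(sum(len(n) for n in nodes))  # 4
```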

Striping Data Across MPP Nodes (Contd.)
The method is unsuitable for smaller data volumes.

Horizontal Hardware Partitioning
This technique spreads the processing load by horizontally partitioning the fact table into smaller segments and physically storing each segment on a different node. When a query needs to access several partitions, the access proceeds as in the methods above; if the query is parallelized, each sub-query can run on a different node.

Horizontal H/w Partitioning (Contd.)
This technique minimizes the traffic on the network.

Which Key to Partition By?
The choice is crucial: if the wrong key is chosen, you will eventually end up having to totally reorganize your fact data.

Which Key to Partition By? (Contd.)
We could choose to partition on any key, for example region or transaction_date. Suppose the business is organized into 20 geographical regions, each with a varying number of branches of different sizes. Partitioning by region leads to 20 partitions, which is reasonable, and it is a nice partitioning scheme because the vast majority of queries are restricted to the user's own business region.

Which Key to Partition By? (Contd.)
If we partition by transaction_date rather than region, the latest transactions from every region will be in one partition. This is terrible, because a user who wants data by region has to look across multiple partitions. So partitioning by region is better here.
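The comparison can be made concrete with a small sketch (illustrative data; 20 regions and two months of transactions as in the slide's scenario) counting how many partitions a one-region query must touch under each candidate key:

```python
rows = [
    {"region": f"R{r}", "month": m}
    for r in range(20) for m in ["2008-05", "2008-06"]
]

# Query: all of region R3's transactions.
# Under region partitioning, only R3's partition is scanned.
touched_region_scheme = {r["region"] for r in rows if r["region"] == "R3"}
# Under date partitioning, every month partition holds some R3 rows.
touched_date_scheme = {r["month"] for r in rows if r["region"] == "R3"}

print(len(touched_region_scheme))  # 1
print(len(touched_date_scheme))    # 2 (every date partition)
```

As the history grows, the date scheme touches ever more partitions for the same region query, while the region scheme still touches one.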

Sizing the Partition
Key decisions about partition size affect several considerations. The SLA acts as a limit on the size of any partitioning scheme: a partition will most likely become the unit of backup and recovery, so the availability stipulations in the SLA limit the size of a partition. The disk setup constrains the number of partitions you can use. Query performance is a major consideration throughout.

Summary
Partitioning: Horizontal Partitioning, Vertical Partitioning, Hardware Partitioning, Which Key to Partition By?, Sizing the Partition

Thank You