Granularity in the Data Warehouse Chapter 4

Raw Estimates The single most important design issue facing the data warehouse developer is determining the proper level of granularity of the data that will reside in the data warehouse. Granularity also matters to the warehouse architect because it affects every environment that depends on the warehouse for data: it determines how efficiently data can be shipped to those environments and what types of analysis can be done. The central challenge of granularity is getting it at the right level, neither too high nor too low.

Raw Estimates The starting point for determining the appropriate level of granularity is a raw estimate of the number of rows of data, and of the DASD (direct access storage device) space, that will be in the data warehouse. The raw estimate of the number of rows that will reside in the data warehouse tells the architect a great deal:
–If there are only 10,000 rows, almost any level of granularity will do.
–If there are 10 million rows, a low level of granularity is possible.
–If there are 10 billion rows, not only is a higher level of granularity needed, but a major portion of the data will probably go into overflow storage.
The figure shows an algorithmic path for calculating the space occupied by a data warehouse.
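As a rough illustration of the raw-estimate arithmetic, the sketch below multiplies an arrival rate by the time horizon and the row width. The arrival rate, row width, and 15 percent index overhead are illustrative assumptions, not figures from the text.

```python
def raw_estimate(rows_per_day: int, days: int, bytes_per_row: int,
                 index_overhead: float = 0.15) -> dict:
    """Rough row count and DASD estimate for one warehouse table.

    index_overhead is an assumed fraction of extra space for indexes.
    """
    rows = rows_per_day * days
    data_bytes = rows * bytes_per_row
    total_bytes = data_bytes * (1 + index_overhead)   # data plus index space
    return {"rows": rows, "gigabytes": total_bytes / 1024**3}

# One year of a hypothetical 200-byte detail record arriving at 1M rows/day:
est = raw_estimate(rows_per_day=1_000_000, days=365, bytes_per_row=200)
# 365 million rows, roughly 78 GB including the assumed index overhead
```

With 10 billion rows the same arithmetic lands in the terabyte range, which is the point at which the text says overflow storage becomes unavoidable.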

Input to the Planning Process The estimate of rows and DASD then serves as input to the planning process, as shown in the figure.

Data in Overflow Once the raw estimate of the size of the data warehouse is made, the next step is to compare the total number of rows in the warehouse environment to the charts shown in the figure. Depending on how many total rows will be in the warehouse environment, different approaches to design, development, and storage are necessary.

Data in Overflow

On the five-year horizon, the totals shift by about an order of magnitude, or perhaps even more. The theory is that after five years these factors will be in place:
–More expertise will be available in managing large volumes of warehouse data.
–Hardware costs will have dropped to some extent.
–More powerful software tools will be available.
–The end user will be more sophisticated.

Overflow Storage Data in the data warehouse environment grows at a rate never before seen by IT professionals. The combination of historical data and detailed data produces a growth rate that is phenomenal. As data grows large, a natural subdivision of data occurs between actively used data and inactively used data. Inactive data is sometimes called dormant data or infrequently used data. At some point in the life of the data warehouse, the vast majority of the data in the warehouse becomes stale and unused. At this point, it makes sense to start separating the data onto different storage media.

Overflow Storage The figure shows that a data monitor is needed to determine the usage of data. The data monitor tells where to place data by determining what data is and is not being used in the data warehouse. The movement between disk storage and near-line storage is controlled by software called a cross-media storage manager (CMSM). The data in alternate or near-line storage can be accessed directly by software intelligent enough to know where data is located in near-line storage. These three software components are the minimum required for alternate or near-line storage to be used effectively.
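A minimal sketch of the monitor/CMSM division of labor described above, assuming a simple last-access rule. The 180-day dormancy threshold, the segment names, and the use of plain dictionaries as storage tiers are all illustrative assumptions, not details from the text.

```python
import datetime as dt

# Assumed cutoff after which data counts as dormant; tunable in practice.
DORMANCY_THRESHOLD = dt.timedelta(days=180)

class DataMonitor:
    """Records last-access times so dormant data can be identified."""
    def __init__(self):
        self.last_access: dict[str, dt.datetime] = {}

    def record_access(self, segment: str, when: dt.datetime) -> None:
        self.last_access[segment] = when

    def dormant_segments(self, now: dt.datetime) -> list[str]:
        return [seg for seg, t in self.last_access.items()
                if now - t > DORMANCY_THRESHOLD]

def cross_media_move(monitor: DataMonitor, now: dt.datetime,
                     disk: dict, near_line: dict) -> None:
    """Toy CMSM: demote dormant segments from disk to near-line storage."""
    for seg in monitor.dormant_segments(now):
        if seg in disk:
            near_line[seg] = disk.pop(seg)
```

The third component the text mentions, direct access to near-line data, would sit on top of this: a query layer that consults both tiers so users need not know where a segment currently lives.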

What the Levels of Granularity Will Be

Some Feedback Loop Techniques
–Build the first parts of the data warehouse in very small, very fast steps, and carefully listen to the end users' comments at the end of each step of development. Be prepared to make adjustments quickly.
–If available, use prototyping, and allow the feedback loop to function using observations gleaned from the prototype.
–Look at how other people have built their levels of granularity and learn from their experience.
–Go through the feedback process with an experienced user who is aware of the process occurring. Under no circumstances should you keep your users in the dark as to the dynamics of the feedback loop.
–Look at whatever the organization has now that appears to be working, and use those functional requirements as a guideline.
–Execute joint application design (JAD) sessions and simulate the output to achieve the desired feedback.

Some Feedback Loop Techniques Granularity of data can be raised in many ways, such as the following:
–Summarize data from the source as it goes into the target.
–Average or otherwise calculate data as it goes into the target.
–Push highest and/or lowest set values into the target.
–Push only data that is obviously needed into the target.
–Use conditional logic to select only a subset of records to go into the target.
The ways that data may be summarized or aggregated are limitless. When building a data warehouse, keep one important point in mind: in classical requirements systems development, it is unwise to proceed until the vast majority of the requirements are identified. In building a data warehouse, however, it is unwise not to proceed once at least half of the requirements are identified.
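The first few techniques above (summarize, calculate, push the highest value) can be sketched as a single roll-up from daily detail to monthly target rows. The record layout and field names here are illustrative assumptions, not a layout from the text; amounts are assumed non-negative.

```python
from collections import defaultdict

# Hypothetical daily detail records: (account, date, amount)
detail = [
    ("A1", "2024-01-03", 25.00),
    ("A1", "2024-01-17", 40.00),
    ("A2", "2024-01-09", 10.00),
]

def raise_granularity(records):
    """Roll daily detail up to one row per account per month."""
    target = defaultdict(lambda: {"total": 0.0, "count": 0, "high": 0.0})
    for account, date, amount in records:
        key = (account, date[:7])               # 'YYYY-MM': coarser grain
        row = target[key]
        row["total"] += amount                  # summarize into the target
        row["count"] += 1                       # a simple calculated value
        row["high"] = max(row["high"], amount)  # push the highest value
    return dict(target)

monthly = raise_granularity(detail)
```

The trade-off the chapter describes is visible here: the monthly rows are far fewer and cheaper to store, but the individual transactions can no longer be recovered from them.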

Levels of Granularity — Banking Environment

The Data Warehouse and Technology Chapter 5

Some basic requirements of the technology supporting data warehousing

Managing Large Amounts of Data In the ideal case, the data warehouse developer builds a data warehouse under the assumption that the technology that houses the data warehouse can handle the volumes required. When the designer has to go to extraordinary lengths in design and implementation to map the technology to the data warehouse, then there is a problem with the underlying technology. When technology is an issue, it is normal to engage more than one technology. The ability to participate in moving dormant data to overflow storage is perhaps the most strategic capability that a technology can have.

Managing Multiple Media In conjunction with managing large amounts of data efficiently and cost-effectively, the technology underlying the data warehouse must handle multiple storage media. It is insufficient to manage a mature data warehouse on DASD alone. Storage media form a hierarchy in terms of speed of access and cost of storage.

Indexing and Monitoring Data Of course, the designer uses many practices to make data as flexible as possible, such as spreading data across different storage media and partitioning data. But the technology that houses the data must be able to support easy indexing as well. Unlike the monitoring of transaction processing, where the transactions themselves are monitored, data warehouse activity monitoring determines what data has and has not been used. Monitoring data warehouse data determines such factors as the following: –If a reorganization needs to be done –If an index is poorly structured –If too much or not enough data is in overflow –The statistical composition of the access of the data –Available remaining space

Interfaces to Many Technologies