1
Data Warehousing
Dale-Marie Wilson, Ph.D.
2
Evolution of Data Warehousing
Since the 1970s, organizations have gained competitive advantage through:
  Automated business processes
  More efficient and cost-effective services to customers
Resulted in accumulation of growing amounts of data in operational databases
3
Evolution of Data Warehousing
Increased focus on ways to use operational data to support decision-making
  Means of gaining competitive advantage
Operational systems not designed to support such business activities
  Typically numerous operational systems with overlapping and contradictory definitions
Organizations need to turn archives of data into source of knowledge
Goal: single integrated / consolidated view of organization’s data presented to user
Solution: Data Warehouse
  Provides system capable of supporting decision-making, receiving data from multiple operational data sources
4
Data Warehousing Concepts
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993)
5
Subject-oriented Data
Warehouse organized around major subjects of the enterprise (e.g. customers, products, and sales)
Not around major application areas (e.g. customer invoicing, stock control, and product sales)
Stores decision-support data, not application-oriented data
6
Integrated Data
Integrates corporate application-oriented data from different source systems
Source data is often inconsistent
Integrated data source must be made consistent
Presents unified view of data to users
7
Time-variant Data
Data accurate and valid only at some point in time or over some time interval
Time-variance shown in:
  Extended time over which data is held
  Implicit/explicit association of time with all data
  Data represents series of snapshots
8
Non-volatile Data
Data not updated in real time
Refreshed from operational systems on regular basis
New data added as supplement, not replacement
9
Data Webhouse
Web is source of behavioral data
Clickstream – user’s path through Website and Web history
Data webhouse is a distributed data warehouse, with no central data repository, that is implemented over the Web to harness clickstream data
10
Benefits of Data Warehouse
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
11
Comparison of OLTP Systems and Data Warehousing
12
Data Warehouse Queries
Range from relatively simple to highly complex
Dependent on end-user access tools used
End-user access tools:
  Reporting, query, and application development tools
  Executive information systems (EIS)
  OLAP tools
  Data mining tools
13
Examples of Typical Data Warehouse Queries
What was the total revenue for Scotland in the third quarter of 2004?
What was the total revenue for property sales for each type of property in Great Britain in 2003?
What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the figures for the previous two years?
What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?
What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?
Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?
What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?
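As a rough illustration, the first question above could be answered with a single aggregate query. The sketch below uses Python's built-in sqlite3 module; the table name property_sale, its columns, and all figures are hypothetical and are not taken from the DreamHome case study.

```python
import sqlite3

# Hypothetical table, columns, and figures, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property_sale (
    region    TEXT,
    sale_date TEXT,   -- ISO date string, e.g. '2004-08-15'
    revenue   REAL
);
INSERT INTO property_sale VALUES
    ('Scotland', '2004-07-10', 1200.0),
    ('Scotland', '2004-09-30',  800.0),
    ('Scotland', '2004-10-01',  500.0),  -- fourth quarter, excluded
    ('England',  '2004-08-01',  900.0);  -- other region, excluded
""")

def total_revenue(conn, region, start, end):
    """Sum revenue for one region over an inclusive ISO date range."""
    row = conn.execute(
        "SELECT COALESCE(SUM(revenue), 0) FROM property_sale "
        "WHERE region = ? AND sale_date BETWEEN ? AND ?",
        (region, start, end),
    ).fetchone()
    return row[0]

# Third quarter of 2004 runs from 1 July to 30 September.
q3_scotland = total_revenue(conn, "Scotland", "2004-07-01", "2004-09-30")
print(q3_scotland)
```

The later questions (rankings, rolling comparisons, what-if analysis) need progressively more than one aggregate, which is why the slide ties query complexity to the end-user access tools in use.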
14
Problems of Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
15
Typical Architecture of Data Warehouse
16
Operational Data Resources
Mainframe first-generation hierarchical and network databases
Departmental proprietary file systems (e.g. VSAM, RMS)
Relational DBMSs (e.g. Informix, Oracle)
Private workstations and servers
External systems:
  Internet
  Commercially available databases
  Databases associated with organization’s suppliers or customers
17
Operational Data Store (ODS)
Repository of current and integrated operational data used for analysis
Structured and supplied with data like data warehouse
May act as staging area for data to be moved into warehouse
Created when legacy operational systems incapable of achieving reporting requirements
Benefits:
  Provides users with ease-of-use of relational database
  Remains distant from decision support functions of data warehouse
18
Load Manager
Performs operations associated with extraction and loading of data
Size and complexity vary between data warehouses
Constructed using combination of vendor data loading tools and custom-built programs
19
Warehouse Manager
Performs operations associated with management of data
Constructed using vendor data management tools and custom-built programs
20
Warehouse Manager
Operations:
  Data analysis to ensure consistency
  Transformation and merging of source data from temporary storage
  Creation of indexes and views on base tables
  Generation of denormalizations (if necessary)
  Generation of aggregations (if necessary)
  Backing-up and archiving data
21
Warehouse Manager
Generates query profiles to determine which indexes and aggregations are appropriate
Query profile:
  Can be generated for each user, group of users, or the data warehouse
  Describes characteristics of queries:
    Frequency
    Target table(s)
    Size of results set
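One way to picture a query profile is as a small summary computed from a query log. The sketch below is an assumption about how such a profile might be represented; the QueryRecord fields, the table names, and build_profile are hypothetical and do not correspond to any particular vendor tool.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryRecord:
    """One logged query (hypothetical structure)."""
    user: str
    tables: tuple      # target table(s)
    result_rows: int   # size of results set

def build_profile(log):
    """Return (frequency per table set, average result size per table set)."""
    frequency = Counter(q.tables for q in log)
    avg_result_rows = {
        tables: sum(q.result_rows for q in log if q.tables == tables) / n
        for tables, n in frequency.items()
    }
    return frequency, avg_result_rows

query_log = [
    QueryRecord("ann", ("fact_sale", "dim_time"), 10),
    QueryRecord("bob", ("fact_sale", "dim_time"), 30),
    QueryRecord("ann", ("fact_sale",), 1000),
]

frequency, avg_result_rows = build_profile(query_log)
# The most frequently queried table combination is a natural candidate
# for an index or a pre-built aggregation.
most_common_target = frequency.most_common(1)[0][0]
```

Filtering query_log by user or by group before calling build_profile would give the per-user and per-group profiles mentioned on the slide.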
22
Query Manager
Performs operations associated with management of user queries
Constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs
Complexity determined by facilities provided by end-user access tools and database
Operations:
  Directing queries to appropriate tables
  Scheduling execution of queries
Can generate query profiles
  Allows warehouse manager to determine appropriate indexes and aggregations
23
Detailed Data
Detailed data stored in database schema
In most cases not stored online; made available by aggregating to next level of detail
Regularly added to warehouse to supplement aggregated data
24
Lightly and Highly Summarized Data
Stores pre-defined lightly and highly aggregated data generated by warehouse manager
Transient: changes to respond to changing query profiles
Purpose of summary information:
  Improve query performance
  Removes requirement to continually perform summary operations in answering user queries
Summary data updated continuously as new data loaded into warehouse
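The difference between lightly and highly summarized data can be sketched as two aggregation levels computed once from the same detailed rows and then reused by queries. The data, the choice of levels (monthly per branch, yearly overall), and the function name summarize are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical detailed rows: (branch, ISO date, amount).
detail = [
    ("B003", "2004-01-15", 100.0),
    ("B003", "2004-01-20", 50.0),
    ("B003", "2004-02-02", 70.0),
    ("B005", "2004-01-09", 30.0),
]

def summarize(rows, key_fn):
    """Aggregate amounts under the grouping key produced by key_fn."""
    totals = defaultdict(float)
    for branch, date, amount in rows:
        totals[key_fn(branch, date)] += amount
    return dict(totals)

# Lightly summarized: one total per branch per month.
monthly_by_branch = summarize(detail, lambda branch, date: (branch, date[:7]))

# Highly summarized: one total per year across all branches.
yearly_total = summarize(detail, lambda branch, date: date[:4])
```

A query for annual revenue can then read yearly_total directly instead of re-scanning the detailed rows each time.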
25
Archive/Backup Data
Stores detailed and summarized data for archiving and backup
Data transferred to storage archives (magnetic tape or optical disk)
26
Metadata
Stores metadata (data about data) definitions used by all processes in warehouse
Used for:
  Extraction and loading processes
    Used to map data sources to common view of information within warehouse
  Warehouse management process
    Used to automate production of summary tables
  Query management process
    Used to direct query to most appropriate data source
27
Metadata
Metadata structure differs between processes because each has a different purpose
Issues:
  Multiple copies of metadata describe same data item
  Vendor tools for copy management and end-user data access use own versions of metadata
    Copy management tools use metadata to understand mapping rules that are applied to convert source data into common form
    End-user access tools use metadata to understand how to build a query
Management of metadata within data warehouse is a very complex task that should not be underestimated
28
End-User Access Tools
Principal purpose of data warehousing: to provide information to business users for strategic decision-making
Users interact with warehouse using end-user access tools
Data warehouse must efficiently support ad hoc and routine analysis
High performance achieved by pre-planning requirements for joins, summations, and periodic reports by end-users (where possible)
Main groups of access tools:
  Data reporting and query tools
  Application development tools
  Executive information system (EIS) tools
  Online analytical processing (OLAP) tools
  Data mining tools
29
Data Warehouse Information Flows
30
Data Warehouse Information Flows
Inflow - Processes associated with extraction, cleansing, and loading of data from source systems
Upflow - Processes associated with adding value to data in warehouse through summarizing, packaging, and distribution
Downflow - Processes associated with archiving and backing-up/recovery of data
Outflow - Processes associated with making data available to end-users
Metaflow - Processes associated with management of metadata
31
Data Warehousing Tools and Technologies
Building data warehouse is complex task
No vendor provides an ‘end-to-end’ set of tools
Necessitates building data warehouse using multiple products from different vendors
Major challenge: ensuring products work well together and are fully integrated
32
Data Warehousing Tools and Technologies
Tasks of capturing data from source systems, cleansing and transforming it, and loading results into target system can be carried out either by separate products or by a single integrated solution
Integrated solutions include:
  Code generators
  Database data replication tools
  Dynamic transformation engines
33
Data Warehouse DBMS Requirements
Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality
34
Administration and Management Tools
Monitoring data loading from multiple sources
Data quality and integrity checks
Managing and updating metadata
Monitoring database performance to ensure efficient query response times and resource utilization
Auditing data warehouse usage to provide user chargeback information
35
Administration and Management Tools
Replicating, subsetting, and distributing data
Maintaining efficient data storage management
Purging data
Archiving and backing-up data
Implementing recovery following failure
Security management
36
Typical Data Warehouse and Data Mart Architecture
37
Data Mart
A subset of a data warehouse that supports the requirements of a particular department or business function
Characteristics:
  Focuses on requirements of one department or business function
  Unlike a data warehouse, does not normally contain detailed operational data
  More easily understood and navigated
38
Reasons for Creating a Data Mart
Give users access to data they need to analyze most often
Provide data in form that matches collective view of data by group of users in a department or business function area
Improve end-user response time
  Reduction in volume of data to be accessed
Provide appropriately structured data as dictated by requirements of end-user access tools
Building data mart is simpler than establishing corporate data warehouse
Cost of implementing data marts is less than that required to establish data warehouse
Potential users of data mart more clearly defined
  More easily targeted to obtain support for data mart project
39
Designing Data Warehouses
Initially, need answers for questions such as:
  Which user requirements are most important and which data should be considered first?
  Should the project be scaled down into something more manageable?
  Should the infrastructure for a scaled down project be capable of ultimately delivering a full-scale enterprise-wide data warehouse?
40
Designing Data Warehouses
Use of data marts avoids complexities associated with designing an enterprise-wide data warehouse
Difficult to commit to enterprise-wide design that must meet all user requirements
Interim solution: build data marts
Goal: creation of data warehouse that supports requirements of enterprise
41
Designing Data Warehouses
Requirements collection and analysis stage:
  Involves interviewing appropriate members of staff (such as marketing users, finance users, and sales users)
    Identify prioritized set of requirements data warehouse must meet
  Interviews conducted with members of staff responsible for operational systems
    Identify which data sources can provide clean, valid, and consistent data that will remain supported over next few years
  Interviews provide necessary information for top-down view (user requirements) and bottom-up view (available data sources)
Database component of data warehouse described using technique called dimensionality modeling
42
Dimensionality Modelling
Logical design technique that aims to present data in standard, intuitive form that allows for high-performance access
Uses Entity-Relationship modeling concepts with important restrictions:
  Every dimensional model (DM) composed of one table with a composite primary key, called fact table, and set of smaller tables called dimension tables
  Each dimension table has simple (non-composite) primary key that corresponds exactly to one component of composite key in fact table
Forms ‘star-like’ structure called star schema or star join
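The restrictions above can be made concrete with a very small star schema. This is a minimal sketch using Python's built-in sqlite3 module; the table names (fact_sale, dim_time, dim_branch), columns, and values are hypothetical and much simpler than the DreamHome schema shown on the following slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: each has a simple (non-composite) primary key.
CREATE TABLE dim_time   (time_id   INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE dim_branch (branch_id INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Fact table: composite primary key, one component per dimension.
CREATE TABLE fact_sale (
    time_id       INTEGER REFERENCES dim_time(time_id),
    branch_id     INTEGER REFERENCES dim_branch(branch_id),
    selling_price REAL,
    PRIMARY KEY (time_id, branch_id)
);

INSERT INTO dim_time   VALUES (1, 2004, 3), (2, 2004, 4);
INSERT INTO dim_branch VALUES (1, 'Glasgow', 'Scotland'), (2, 'London', 'England');
INSERT INTO fact_sale  VALUES (1, 1, 100.0), (2, 1, 150.0), (1, 2, 300.0);
""")

# A star join: the fact table joined to each dimension on its single key,
# with dimension attributes used to group and constrain the facts.
rows_by_region = conn.execute("""
    SELECT b.region, t.year, SUM(f.selling_price)
    FROM fact_sale AS f
    JOIN dim_time   AS t ON t.time_id   = f.time_id
    JOIN dim_branch AS b ON b.branch_id = f.branch_id
    GROUP BY b.region, t.year
    ORDER BY b.region, t.year
""").fetchall()
```

Every query follows the same join pattern regardless of which dimensions it uses, which is part of what makes access to a dimensional model predictable.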
43
Dimensionality Modelling
Natural keys replaced with surrogate keys
Every join between fact and dimension tables based on surrogate keys, not natural keys
Surrogate key – generalized structure based on integers
Makes data in warehouse independent of data used and produced by OLTP systems
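A minimal sketch of surrogate key assignment during loading is shown below. The class SurrogateKeyMap, the source system names, and the key values are hypothetical; a real load process would typically persist this mapping and reconcile matching entities across sources.

```python
from itertools import count

class SurrogateKeyMap:
    """Maps natural (source-system) keys to warehouse-generated integer keys."""

    def __init__(self):
        self._next_key = count(1)   # simple integer sequence
        self._keys = {}

    def key_for(self, source_system, natural_key):
        """Return the surrogate key, allocating a new one on first sight."""
        ident = (source_system, natural_key)
        if ident not in self._keys:
            self._keys[ident] = next(self._next_key)
        return self._keys[ident]

branch_keys = SurrogateKeyMap()
k_sales   = branch_keys.key_for("sales_db", "B003")
k_rentals = branch_keys.key_for("rentals_db", "B003")  # same natural key, other system
k_repeat  = branch_keys.key_for("sales_db", "B003")    # repeat lookup is stable
```

Because fact rows store only the integer surrogate, a change to the natural key format in an OLTP system affects the mapping but not the fact table.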
44
Star schema for property sales of DreamHome
45
Dimensionality Modelling
Star schema: logical structure
  Has fact table containing factual data in center
  Surrounded by dimension tables containing reference data, which can be denormalized
Facts generated by events that occurred in the past
  Unlikely to change, regardless of how they are analyzed
46
Dimensionality Modelling
Fact tables:
  Hold bulk of data in data warehouse
  Can be extremely large
  Important to treat fact data as read-only reference data that will not change over time
  Most useful fact tables contain one or more numerical measures, or ‘facts’, that occur for each record and are numeric and additive
47
Dimensionality Modelling
Dimension tables:
  Usually contain descriptive textual information
  Dimension attributes used as constraints in data warehouse queries
Star schemas speed up query performance by denormalizing reference information into single dimension table
48
Dimensionality Modelling
Snowflake schema
  Variant of the star schema where dimension tables do not contain denormalized data
Starflake schema
  Hybrid structure that contains mixture of star (denormalized) and snowflake (normalized) schemas
  Allows dimensions to be present in both forms to cater for different query requirements
49
Property sales with normalized version of Branch dimension table
50
Dimensionality Modelling
Advantages of predictable, standard form of underlying dimensional model:
  Efficiency
  Ability to handle changing requirements
    Star schema handles ad hoc user queries well
  Extensibility
    Supports changes, e.g. adding new dimensions or facts
  Ability to model common business situations
  Predictable query processing
51
Comparison of DM and ER models
ER models:
  Reduce data redundancy
  Beneficial to transaction processing
Single ER model normally decomposes into multiple DMs
Multiple DMs are associated through ‘shared’ dimension tables
52
Database Design Methodology for Data Warehouses
‘Nine-Step Methodology’:
  1. Choosing the process
  2. Choosing the grain
  3. Identifying and conforming the dimensions
  4. Choosing the facts
  5. Storing pre-calculations in the fact table
  6. Rounding out the dimension tables
  7. Choosing the duration of the database
  8. Tracking slowly changing dimensions
  9. Deciding the query priorities and the query modes
53
Step 1: Choosing the process
The process (function) refers to subject matter of particular data mart
First data mart built should be the one that:
  Is most likely to be delivered on time
  Is most likely to be delivered within budget
  Answers the most commercially important business questions
54
Business process of DreamHome case study
55
Example – Chosen Data Mart
56
Step 2: Choosing the grain
Decide what a record of fact table represents
Identify dimensions of fact table
Grain decision for fact table also determines grain of each dimension table
Include time as core dimension
  Always present in star schemas
57
Step 3: Identifying and Conforming dimensions
Dimensions set context for asking questions about the facts in fact table
If any dimension occurs in two data marts:
  Must be exactly same dimension
  Or one must be mathematical subset of other
Dimension used in more than one data mart referred to as being conformed
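The conformance rule can be expressed as a simple set check. The sketch below interprets "mathematical subset" as a subset of dimension rows, which is an assumption; the function name is_conformed and the sample Branch rows are hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """True if the two dimensions are identical or one is a subset of the other.

    Each dimension is given as an iterable of row tuples.
    """
    a, b = set(dim_a), set(dim_b)
    return a <= b or b <= a

# Hypothetical Branch dimension rows: (surrogate key, city).
branch_in_sales_mart = [(1, "Glasgow"), (2, "London"), (3, "Aberdeen")]
branch_in_advert_mart = [(1, "Glasgow"), (2, "London")]        # subset of the above
branch_in_rogue_mart = [(1, "Glasgow"), (2, "Edinburgh")]      # key 2 means something else

sales_vs_advert = is_conformed(branch_in_sales_mart, branch_in_advert_mart)
sales_vs_rogue = is_conformed(branch_in_sales_mart, branch_in_rogue_mart)
```

In the second comparison the same key describes different branches in the two marts, so results from those marts could not be safely combined.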
58
Star schemas for property sales and property advertising
59
Step 4: Choosing the facts
Grain of fact table determines which facts can be used in data mart
Facts should be numeric and additive
Unusable facts include:
  Non-numeric facts
  Non-additive facts
  Facts at different granularity from other facts in table
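A short worked example shows why additivity matters. Counts and revenue can be summed across fact rows, whereas a stored average cannot be combined correctly without its underlying totals. The rows and field names below are made up for illustration.

```python
# Hypothetical fact rows, one per branch.
fact_rows = [
    {"branch": "B003", "sales_count": 2, "revenue": 300.0, "avg_price": 150.0},
    {"branch": "B005", "sales_count": 8, "revenue": 400.0, "avg_price": 50.0},
]

# Additive facts roll up correctly by simple summation.
total_count = sum(r["sales_count"] for r in fact_rows)
total_revenue = sum(r["revenue"] for r in fact_rows)
true_avg_price = total_revenue / total_count

# A non-additive fact does not: averaging the stored averages ignores
# how many sales each average was based on.
avg_of_avgs = sum(r["avg_price"] for r in fact_rows) / len(fact_rows)
```

Here true_avg_price and avg_of_avgs disagree, so the average is better derived at query time from the additive facts than stored as a fact itself.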
60
Property rentals with a badly structured fact table
61
Property rentals with fact table corrected
62
Step 5: Storing pre-calculations in the fact table
Once facts selected, re-examine them to determine whether there are opportunities to use pre-calculations
63
Step 6: Rounding out the dimension tables
Text descriptions are added to dimension tables
Text descriptions should be intuitive and understandable to users
Usefulness of data mart determined by scope and nature of attributes of dimension tables
64
Step 7: Choosing the duration of the database
Duration measures how far back in time fact table goes
Very large fact tables raise two very significant data warehouse design issues:
  Often difficult to source increasingly old data
  Mandatory that old versions of important dimensions be used, not the most current versions (the ‘Slowly Changing Dimension’ problem)
65
Step 8: Tracking slowly changing dimensions
Slowly changing dimension problem:
  Proper description of old dimension data must be used with old fact data
  Generalized key assigned to important dimensions
    Allows distinction between multiple snapshots of dimensions over period of time
Three basic types of slowly changing dimensions:
  Type 1 - changed dimension attribute is overwritten
  Type 2 - changed dimension attribute causes new dimension record to be created
  Type 3 - changed dimension attribute causes alternate attribute to be created
    Both the old and new values of attribute simultaneously accessible in the same dimension record
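The three types can be contrasted on a single dimension record. The sketch below is illustrative only; the record layout (sk for the generalized/surrogate key, natural_key, branch), the function names, and the values are hypothetical.

```python
def type1_overwrite(record, attr, new_value):
    """Type 1: overwrite the attribute; the old value is lost."""
    updated = dict(record)
    updated[attr] = new_value
    return updated

def type2_new_record(records, natural_key, attr, new_value):
    """Type 2: keep the old record and add a new one with a new surrogate key."""
    versions = [r for r in records if r["natural_key"] == natural_key]
    latest = max(versions, key=lambda r: r["sk"])
    new_record = dict(latest)
    new_record["sk"] = max(r["sk"] for r in records) + 1
    new_record[attr] = new_value
    return records + [new_record]

def type3_alternate_attribute(record, attr, new_value):
    """Type 3: keep the old value in an alternate attribute of the same record."""
    updated = dict(record)
    updated["previous_" + attr] = record[attr]
    updated[attr] = new_value
    return updated

staff = {"sk": 1, "natural_key": "SG37", "branch": "Glasgow"}

t1 = type1_overwrite(staff, "branch", "Aberdeen")
t2 = type2_new_record([staff], "SG37", "branch", "Aberdeen")
t3 = type3_alternate_attribute(staff, "branch", "Aberdeen")
```

With Type 2, old fact rows keep pointing at surrogate key 1 (Glasgow) while new fact rows use key 2 (Aberdeen), which is how old facts stay paired with the old dimension description.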
66
Step 9: Deciding the query priorities and the query modes
Most critical physical design issues affecting end-user’s perception include:
  Physical sort order of fact table on disk
  Presence of pre-stored summaries or aggregations
Additional physical design issues:
  Administration
  Backup
  Indexing performance
  Security
67
Database Design Methodology for Data Warehouses
Methodology designs data mart that:
  Supports requirements of particular business process
  Allows easy integration with other related data marts to form enterprise-wide data warehouse
A dimensional model that contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation
68
Fact and dimension tables for each business process of DreamHome
69
Dimensional model (fact constellation) for the DreamHome data warehouse
70
Chapters 31 & 32
Omit material specific to Oracle