1
Principles of Data Warehousing
Lecturer: Dr. Bo Yuan
2
Outline
- Data Warehouse
- Data Marts
- ETL
- Metadata
- Multidimensional Data
- OLAP
3
A Manager’s Questions …
- Who are our lowest or highest margin customers?
- Who are my customers and what products are they buying?
- What is the most effective distribution channel?
- What promotions have the biggest impact on revenue?
- Which customers are most likely to go to the competition?
- What impact will new products/services have on revenue and margins?
4
Tourists, Farmers and Explorers
- Tourists: Browse information harvested by farmers.
- Farmers: Harvest information from known access paths.
- Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data.
5
History & Evolution
- 60’s: Batch Reports
  - Hard to find and analyze information
  - Inflexible and expensive, reprogram for every new request
- 70’s: Terminal-Based DSS and EIS
  - Still inflexible, not integrated with desktop tools
- 80’s: Desktop Data Access and Analysis Tools
  - Query tools, spreadsheets, GUIs
  - Easier to use, but only access operational databases
- 90’s: Data Warehousing
  - OLAP engines and tools
6
Data Everywhere
- I cannot find the data I need.
  - Data are scattered over the network.
  - Many versions
- I cannot get the data I need.
  - May need experts to get the data.
- I cannot understand the data I found.
  - Poorly documented
  - Domain knowledge
- I cannot use the data I found.
  - Quality
  - Transformation
7
What is a data warehouse?
“A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way that they can understand and use in a business context.”
8
What is data warehousing?
- Data warehousing: techniques for assembling and managing data from various sources for the purpose of answering business questions and making decisions.
- A data warehouse is a collection of data that is used primarily in organizational decision making.
- A data warehouse is:
  - Subject-oriented
  - Integrated
  - Time-variant
  - Non-volatile
- Data, Information, Knowledge
9
Data Warehouse Architecture
[Figure: data warehouse architecture. Sources: relational databases, legacy data, purchased data, ERP systems. Components: optimized loader (extraction, cleansing), data warehouse engine, metadata repository. Front end: analyze, query.]
10
Data Warehouse is …
Subject-Oriented
- The data warehouse is organized around subjects of the enterprise (e.g., customers, products, sales) rather than application areas (e.g., customer invoicing, stock control, product sales).
- This is reflected in the need to store decision-support data instead of application-oriented or operational data.
Integrated
- The data warehouse integrates corporate application-oriented data from different sources, which often include inconsistent data.
- The integrated data sources must be made consistent to present a unified view of the data to the users.
11
Data Warehouse is …
Time-Variant
- Data warehouses are time variant in the sense that they maintain both historical and (nearly) current data.
- Historical information is of high importance to decision makers, who often want to understand trends and relationships between data.
Non-Volatile
- After the data are loaded into the data warehouse, there are no changes, inserts, or deletes performed against the historical data.
- This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
12
Operational Systems
- Run the business in real time.
- Based on up-to-the-second data.
- Optimized to handle large numbers of simple read/write transactions.
- Optimized for fast response to predefined transactions.
- Used by people who deal with customers, products.
- Database systems have been used traditionally for OLTP.
Online Transaction Processing
- Clerical data processing tasks
- Detailed, up-to-date data
- Structured repetitive tasks
Examples of Operational Data
- Customer Files
- Account Balance, Call Record
- Point of Sale Data, Production Record
13
Data Warehousing vs. OLTP
Workload
- Data warehouses are designed to accommodate ad hoc queries. A data warehouse should be optimized to perform well for a wide variety of possible query operations.
- OLTP systems support only predefined operations and might be specifically tuned or designed to support only these operations.
Data Modifications
- A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The users of a data warehouse do not directly update the data warehouse.
- In OLTP systems, users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
14
Data Warehousing vs. OLTP
Schema Design
- Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
- OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
Typical Operations
- A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."
- A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
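The two kinds of operation can be contrasted with a small sketch. It uses Python's built-in sqlite3 module with a hypothetical `orders` table and made-up rows; the table, column names, and values are assumptions for illustration and do not come from the lecture.

```python
import sqlite3

# Hypothetical table and synthetic rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "order_month TEXT, amount REAL)"
)
rows = [
    (i, i % 100, "2024-05" if i % 2 else "2024-06", 10.0)
    for i in range(1, 1001)
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# OLTP-style operation: fetch one customer's most recent order.
# It touches only a handful of records.
current_order = conn.execute(
    "SELECT order_id, amount FROM orders "
    "WHERE customer_id = ? ORDER BY order_id DESC LIMIT 1",
    (42,),
).fetchone()

# Warehouse-style query: total sales for all customers in one month.
# It aggregates over every row of that month.
total_sales = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE order_month = ?",
    ("2024-05",),
).fetchone()[0]

print(current_order, total_sales)
```

The first query is the kind an OLTP schema and its indexes are tuned for; the second is the kind a warehouse schema is designed to answer quickly.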
15
Data Warehousing vs. OLTP
Historical Data
- Data warehouses usually store months or years of data to support historical analysis.
- OLTP systems usually store data from only a few weeks or months to meet the requirements of the current transaction.
Number of Users
- Data Warehouses: hundreds of users
- OLTP Systems: tens of thousands of users
Database Size
- Data Warehouses: 10GB - 1TB
- OLTP Systems: MB - GB
16
In summary … Data warehousing helps optimize the business.
OLTP systems actually run the business.
17
Data Marts
- A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.
- Departmental Data Warehouse
- A data warehouse tends to be a strategic but somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an immediate need.
- The smaller-scale data mart is typically easier to build than the enterprise-wide warehouse, can be quickly implemented, and offers tremendous, fast payback for the users.
- The downside comes when several department-focused data marts are implemented with no forethought for a future data warehouse that serves the entire enterprise.
18
Independent Data Marts
19
Dependent Data Marts
20
Data Granularity
- Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation.
- A key factor to consider in the design of data warehouses.
- It affects the amount of data to be stored in the data warehouse.
Operational Databases
- Transaction oriented
- Detailed records
- Lowest level of granularity
- Example: the details of the phone call made by Tom at 2:40pm yesterday
Data Warehouses
- Decision making
- Summarized data
- High levels of granularity
- Example: the number of phone calls made by Tom last month
21
Data Granularity
22
Data Granularity
High Levels of Granularity
- Reduce storage costs.
- Reduce CPU usage.
- Cannot answer certain queries.
  - Did Tom call Mary last week?
- A tradeoff between the volume and the usage of data.
Dual Levels of Granularity
- Store summarized data on disks.
  - Cover 95% of decision making queries.
  - Data access is cheap and convenient.
- Store detailed data on tapes.
  - Cover 5% of decision making queries.
  - Many records need to be involved to process a query.
  - Data access is expensive and complicated.
- Many levels of granularity may be necessary in practice.
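The tradeoff can be made concrete with a small sketch. The call records below are invented for illustration; only the two questions (how many calls Tom made in a month, and whether Tom called Mary in a given week) come from the slides.

```python
from collections import Counter

# Detailed data (lowest level of granularity): one record per call.
# These records are made up for illustration.
calls = [
    {"caller": "Tom", "callee": "Mary", "date": "2024-05-03"},
    {"caller": "Tom", "callee": "Ann", "date": "2024-05-10"},
    {"caller": "Tom", "callee": "Mary", "date": "2024-05-21"},
    {"caller": "Sue", "callee": "Tom", "date": "2024-05-21"},
]

# Summarized data (high level of granularity): calls per caller per month.
monthly_summary = Counter((c["caller"], c["date"][:7]) for c in calls)

# The summary answers "how many calls did Tom make last month?" cheaply.
tom_calls = monthly_summary[("Tom", "2024-05")]


def called(records, caller, callee, start, end):
    """Answer 'did caller call callee between start and end?'.

    This needs the detailed records; the monthly summary has already
    discarded the callee and the exact date.
    """
    return any(
        r["caller"] == caller
        and r["callee"] == callee
        and start <= r["date"] <= end
        for r in records
    )


print(tom_calls, called(calls, "Tom", "Mary", "2024-05-01", "2024-05-07"))
```

The summary holds one counter per caller and month regardless of how many calls were made, which is where the storage and CPU savings come from.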
23
Data Partition
- Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
- Frequently accessed: Acct. No, Balance
- Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address
- Result: smaller table and less I/O
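A minimal sketch of this column-wise (vertical) split, assuming the account number is the key shared by both partitions. The sample account and the lowercase field names are hypothetical.

```python
# Columns follow the slide; the sample values are made up.
HOT_COLUMNS = ("acct_no", "balance")
COLD_COLUMNS = ("acct_no", "name", "date_opened", "interest_rate", "address")

accounts = [
    {
        "acct_no": 1001,
        "name": "Tom",
        "balance": 250.0,
        "date_opened": "2020-01-15",
        "interest_rate": 0.02,
        "address": "1 Main St",
    },
]


def partition(records):
    """Split each record into a frequently accessed and a rarely accessed part."""
    hot = [{c: r[c] for c in HOT_COLUMNS} for r in records]
    cold = [{c: r[c] for c in COLD_COLUMNS} for r in records]
    return hot, cold


def rejoin(hot, cold):
    """Rebuild full records by matching the two partitions on acct_no."""
    cold_by_key = {r["acct_no"]: r for r in cold}
    return [{**cold_by_key[h["acct_no"]], **h} for h in hot]


hot, cold = partition(accounts)
print(hot)
```

Balance lookups now read only the narrow partition, while the full record stays recoverable through the shared key.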
24
Data Quality Data warehouses are based on existing data sources.
Data quality matters! Creating a data warehouse is not a straightforward process. Warehouse data are from disparate and questionable sources. Legacy systems are no longer documented. Corporate wide standards are not well implemented. Advanced techniques and tools are needed to do the job.
25
10 Minutes …
26
Extract, Transform & Load (ETL)
- The interface between external sources and data warehouses
- ETL may take around 70% of the total workload.
- Can be implemented manually in any programming language.
- Commercial ETL tools are widely available.
Extract
- To consolidate data from different source systems:
  - Flat files
  - Relational databases
  - Customized applications
  - Point of sale devices
  - Web pages
- To locate the sources for each data item in the data warehouse.
- Not all data are to be extracted.
27
Extract, Transform & Load
Transform
- To apply a series of rules or functions to the extracted data to derive the data for loading into the end target.
- Typical functions: formatting, encoding, aggregating, splitting, deriving, converting, integrating
Load
- To load the extracted, cleaned and validated data into the end target.
- Online vs. offline loads
- Incremental vs. full loads
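The difference between a full and an incremental load can be sketched as follows. This is only an illustration under assumptions: the warehouse is modeled as a dict keyed by a hypothetical `id` field, and changes are detected with a hypothetical `updated_at` timestamp, neither of which is specified in the lecture.

```python
def full_load(target, source_rows):
    """Discard the target contents and reload everything from the source."""
    target.clear()
    for row in source_rows:
        target[row["id"]] = dict(row)
    return len(source_rows)


def incremental_load(target, source_rows, last_load_time):
    """Load only the rows that changed after the previous load."""
    changed = [r for r in source_rows if r["updated_at"] > last_load_time]
    for row in changed:
        target[row["id"]] = dict(row)
    return len(changed)


# Made-up source data; ISO date strings compare correctly as text.
source = [
    {"id": 1, "balance": 100, "updated_at": "2024-05-01"},
    {"id": 2, "balance": 250, "updated_at": "2024-05-03"},
]
warehouse = {}
full_load(warehouse, source)

# Later, one row changes and one row is added in the source system.
source[1] = {"id": 2, "balance": 300, "updated_at": "2024-05-05"}
source.append({"id": 3, "balance": 75, "updated_at": "2024-05-06"})
loaded = incremental_load(warehouse, source, last_load_time="2024-05-03")
print(loaded, len(warehouse))
```

A full load is simpler but moves every row each time; an incremental load moves less data but depends on a reliable way to detect changes.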
28
ETL --- Challenges
[Figure: four source systems (Savings, Loans, Trust, Credit Card) illustrating same data with different names, different data with the same name, and inconsistent names or data.]
29
ETL --- Challenges
External Sources to Data Warehouse

Issue            appl A    appl B    appl C    appl D
encoding         m,f       1,0       x,y       male, female
unit (pipeline)  cm        in        feet      yds
field            balance   bal       currbal   balcurr
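A transform step that reconciles these three differences could look like the sketch below. Several choices are assumptions made for illustration: which of `1`/`0` and `x`/`y` stands for which gender is not stated on the slide, centimetres are picked arbitrarily as the target unit, and the input field names `gender` and `pipeline` are hypothetical.

```python
# Per-application gender codes. The B and C mappings are assumptions:
# the slide does not say which code stands for which value.
GENDER_CODES = {
    "A": {"m": "male", "f": "female"},
    "B": {1: "male", 0: "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}

# Unit each application uses for the pipeline measurement (from the slide).
APP_UNIT = {"A": "cm", "B": "in", "C": "feet", "D": "yds"}

# Conversion factors to centimetres (target unit chosen for illustration).
UNIT_TO_CM = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}

# Name each application uses for the balance field (from the slide).
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def transform(app, record):
    """Map one source record onto a single warehouse convention."""
    return {
        "gender": GENDER_CODES[app][record["gender"]],
        "pipeline_cm": record["pipeline"] * UNIT_TO_CM[APP_UNIT[app]],
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = transform("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = transform("D", {"gender": "female", "pipeline": 2, "balcurr": 80.0})
print(row_b, row_d)
```

Keeping the mappings in lookup tables rather than in branching code makes it easier to add a fifth source application later.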
30
ETL --- Challenges
- Same person, different spellings
  - 吕: LV, LUI, LYU
- Multiple ways to denote company name
  - Global Systems, GSPL, Global Pty. LTD.
- Use of different names for the same object/concept
  - Holland vs. Netherlands
- Inconsistent data values
  - Age, Marital Status …
- Required fields left blank
  - Missing values
- Invalid product codes collected at point of sale
  - Manual entry leads to mistakes.
- Different conventions: using “-1” or “99999” to indicate an error
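Two of these problems, error sentinels and alternative names for the same concept, can be handled with simple cleaning rules. The sketch below is illustrative only: the function names are hypothetical, and a real cleansing step would need a far larger synonym table.

```python
# Values that some sources use to signal an error (from the slide).
ERROR_SENTINELS = {-1, 99999}

# Alternative names mapped to one canonical name (illustrative entries).
CANONICAL_COUNTRY = {
    "holland": "Netherlands",
    "netherlands": "Netherlands",
}


def clean_age(value):
    """Turn missing values and error sentinels into None."""
    if value is None or value in ERROR_SENTINELS:
        return None
    return value


def clean_country(name):
    """Map known synonyms to one canonical spelling; keep unknown names."""
    stripped = name.strip()
    return CANONICAL_COUNTRY.get(stripped.lower(), stripped)


print(clean_age(99999), clean_age(34), clean_country(" Holland "))
```

Mapping every sentinel to one explicit missing marker keeps later aggregates, such as an average age, from being distorted by values like 99999.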
31
Metadata
- Metadata is information about data.
- Metadata is used to facilitate the understanding, use, and management of data.
- Metadata can document data about data attributes & structure.
- Metadata may include descriptive information about the context, quality and condition, or characteristics of the data.
Metadata for a Book
- Title, Author, Subject, ISBN, Number of Pages …
Metadata for a data warehouse
- The data defining warehouse objects
- A roadmap telling users what is in there and how to find it
- Far more sophisticated than a data dictionary
32
Metadata Repository
- Data definition and mapping metadata
  - The meaning of each attribute and where the data come from
- Data structure metadata
  - The structure of the tables (the data type of each column, primary/foreign key)
- Source system metadata
  - The data structure of all the source systems feeding into the warehouse
- ETL process metadata
  - The description of each data flow (source, target, transformation, schedule)
- Data quality metadata
  - Data quality rules, where they apply, their risk level and actions
- Audit metadata
  - The results of all processes (ETL, security log, indexing) in the warehouse
- Usage metadata
  - Records about which reports and cubes are used, by whom and when
33
Data Models in Data Warehouses
- In OLTP systems, data are stored in two-dimensional tables.
- Data warehouses are subject-oriented.
  - Profits, Sales …
  - Data need to be reorganized to better reflect the subjects.
- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
- A data cube allows data to be modeled and viewed in multiple dimensions.
- Fact tables contain measures of interest (such as dollars sold) and keys to each of the related dimension tables.
- Dimension tables provide the context of the measures, such as item (item name, brand), product, location or time (day, week, month, quarter, year).
34
From Tables to Data Cubes
ID Product Country Date Sales 1 TV US 1Qtr 100 2 PC Canada 4Qtr 500 3 CAR 2Qtr 30 4 UK 3Qtr 200 5 20 6 15 7 80
35
From Tables to Data Cubes
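[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, CAR, sum), and Country (U.S.A., Canada, U.K., sum). Highlighted cell: total annual sales of TV in U.S.A. Apex cell: (All, All, All).]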
36
Cube: A Lattice of Cuboids
- 0-D cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
- 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
- 4-D cuboid: time, item, location, supplier
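The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions without hierarchies give 2^n cuboids. A short sketch using the four dimensions from the slide:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# lattice[k] holds every k-dimensional cuboid, i.e. every subset of
# k dimensions. The empty tuple at level 0 is the apex cuboid "all".
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
print("total:", total_cuboids)
```

The exponential growth in the number of cuboids is one reason a warehouse usually materializes only some of them in advance.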
37
Data Warehouse Schemas
- Star Schema: a fact table in the middle connected to a set of dimension tables.
- Snowflake Schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
- Fact Constellations: multiple fact tables sharing dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation.
38
The Star Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
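A cut-down version of this star schema can be built with Python's sqlite3 module. The sketch keeps only the time and item dimensions and a few of their attributes; the table names `time_dim`, `item_dim`, and `sales_fact` and all inserted values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member, identified by a key.
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);

    -- Fact table: foreign keys to the dimensions plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        units_sold INTEGER,
        dollars_sold REAL);

    -- Made-up sample data.
    INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
    INSERT INTO item_dim VALUES (1, 'TV', 'BrandA'), (2, 'PC', 'BrandB');
    INSERT INTO sales_fact VALUES
        (1, 1, 2, 100.0), (1, 2, 1, 500.0), (2, 1, 3, 150.0);
    """
)

# A typical star query: join the fact table to its dimensions,
# then aggregate a measure by dimension attributes.
result = conn.execute(
    """
    SELECT i.item_name, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY i.item_name, t.quarter
    ORDER BY i.item_name, t.quarter
    """
).fetchall()
print(result)
```

Every query against a star schema has this shape: one join per dimension involved, followed by grouping on dimension attributes.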
39
The Star Schema: An Example
40
The Snowflake Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
- supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
- city: city_key, city, province_or_state, country
41
The Galaxy Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
  - Measures: dollars_cost, units_shipped
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
- shipper: shipper_key, shipper_name, location_key, shipper_type
42
Concept Hierarchy: Location
- all: all
- region: Europe, North_America, ...
- country: Germany, Spain, Canada, ...
- city: Frankfurt, Vancouver, Toronto, ...
- office: L. Chan, M. Wind, ...
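A concept hierarchy is what makes roll-up possible: each level maps onto its parent level, and measures are summed along that mapping. The sketch below uses the cities, countries, and regions named on the slide; the sales figures are invented.

```python
# Child-to-parent mappings for two adjacent pairs of levels.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Frankfurt": "Germany",
}
COUNTRY_TO_REGION = {
    "Canada": "North_America",
    "Germany": "Europe",
    "Spain": "Europe",
}

# Made-up sales totals at the city level.
sales_by_city = {"Vancouver": 120, "Toronto": 80, "Frankfurt": 50}


def roll_up(totals, parent_of):
    """Aggregate totals one level up the hierarchy."""
    rolled = {}
    for member, value in totals.items():
        parent = parent_of[member]
        rolled[parent] = rolled.get(parent, 0) + value
    return rolled


sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)
print(sales_by_country, sales_by_region)
```

Drill-down goes the other way and cannot be computed from the rolled-up totals alone; it needs the lower-level data to have been kept.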
43
Set-Grouping Hierarchy
[$0 - $1000] inexpensive [$0 - $150] moderate expensive
44
View of Hierarchies
45
Bitmap Index
- Index on a particular column.
- Each value in the column corresponds to a bit vector.
- The length of the bit vector: # of records in the base table.
- Not suitable for high cardinality domains
[Figure: Base Table, Index on Region, Index on Type]
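A bitmap index can be sketched in a few lines. The base table below is hypothetical (the slide's own table did not survive extraction), and plain Python lists stand in for the packed bit vectors a real system would use.

```python
def build_bitmap_index(rows, column):
    """Return {value: bit vector}, with one bit per row of the base table."""
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[position] = 1
    return index


# Hypothetical base table with two low-cardinality columns.
base_table = [
    {"region": "Asia", "type": "Retail"},
    {"region": "Europe", "type": "Dealer"},
    {"region": "Asia", "type": "Dealer"},
    {"region": "America", "type": "Retail"},
    {"region": "Europe", "type": "Dealer"},
]

region_index = build_bitmap_index(base_table, "region")
type_index = build_bitmap_index(base_table, "type")

# A conjunctive selection becomes a bitwise AND of two bit vectors.
europe_dealers = [
    r & t for r, t in zip(region_index["Europe"], type_index["Dealer"])
]
matching_rows = [i for i, bit in enumerate(europe_dealers) if bit]
print(region_index["Europe"], type_index["Dealer"], matching_rows)
```

The index holds one bit vector per distinct value, which is why it stops paying off when a column has many distinct values.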
46
OLAP: Online Analytical Processing
- Fast Analysis of Shared Multidimensional Information (FASMI)
- Slice and dice: project and select
- Roll up (drill-up): summarize data
  - By climbing up hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - From higher level summary to lower level summary or detailed data, or introducing new dimensions
- Pivot (rotate): reorient the cube
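Three of these operations can be sketched on a tiny cube held as a Python dict keyed by (date, product, country). The cell values are invented, and the helpers are simplified illustrations rather than how an OLAP engine is implemented; in particular, the slice here keeps the fixed dimension in the key instead of dropping it.

```python
from collections import defaultdict

DIMS = ("date", "product", "country")

# A tiny cube with made-up sales values.
cube = {
    ("1Qtr", "TV", "US"): 100,
    ("1Qtr", "PC", "Canada"): 40,
    ("2Qtr", "TV", "US"): 60,
    ("2Qtr", "PC", "US"): 30,
}


def slice_cube(cells, dim, value):
    """Slice: select the cells where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


def dice_cube(cells, **allowed):
    """Dice: select cells whose coordinates fall in the given value sets."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cells, drop_dim):
    """Roll up by dimension reduction: sum the measure over one dimension."""
    i = DIMS.index(drop_dim)
    rolled = defaultdict(int)
    for key, value in cells.items():
        rolled[key[:i] + key[i + 1:]] += value
    return dict(rolled)


tv_slice = slice_cube(cube, "product", "TV")
first_quarter_us = dice_cube(cube, date={"1Qtr"}, country={"US"})
by_product_country = roll_up(cube, "date")
print(tv_slice, first_quarter_us, by_product_country)
```

Rolling up over every dimension in turn ends at the apex cell, the grand total of the measure.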
47
Browsing a Data Cube
48
Slicing and dicing
[Figure: a cube with dimensions Product (Household, Telecomm, Video, Audio), Regions (Europe, Far East, India), and Sales Channel (Retail, Direct, Special), with the Telecomm slice highlighted.]
49
Roll-Up & Drill-Down
[Figure: roll-up leads to a higher level of aggregation (Sales Channel, Region, Country, State); drill-down leads to low-level details (Sales Channel, Region, Country, State, Location Address, Sales Representative).]
51
Pivot
[Figure: a cube with dimensions Date (3/1, 3/2, 3/3, 3/4), Region (NY, LA, SF), and Product (Juice, Cola, Milk, Cream), with sample values 10, 47, 30, 12, shown before and after rotating the axes.]
52
OLAP Server Architectures
Relational OLAP (ROLAP)
- Use relational DBMS to store and manage warehouse data.
- ROLAP tools access the data in a relational database and generate SQL queries to calculate information at the appropriate level as required.
- Greater scalability
Multidimensional OLAP (MOLAP)
- Fast query performance due to optimized storage and indexing
- Automated computation of higher level aggregates of the data
- Very compact for low dimension data sets
- Array model provides natural indexing
Hybrid OLAP (HOLAP)
- User flexibility
- Low level: relational
- High level: array
53
Warehouse Products
- Computer Associates -- CA-Ingres
- Hewlett-Packard -- Allbase/SQL
- Informix -- Informix, Informix XPS
- Microsoft -- SQL Server
- Oracle -- Oracle 7, Oracle Parallel Server
- Red Brick -- Red Brick Warehouse
- SAS Institute -- SAS
- Software AG -- ADABAS
- Sybase -- SQL Server, IQ, MPP
54
Data Warehouse Vendors
- Scalability: Can your infrastructure not just store the required data, but also scale out to service the business with information at the required performance, both today and next year? Can you scale out in terms of users, and can your infrastructure handle the needs of these users even if they are working concurrently on these systems?
- Agility: Can your system deal with changing requirements, with mixed workloads that include loading data while querying, and returning the right answers? Can you easily switch to real-time data loading for some of your data and support operational needs in your business?
- Enterprise Readiness: Does your infrastructure provide the functionality to always keep your business running? Is the infrastructure delivering the security you need, in terms of who can see what data, but also in terms of disaster recovery and fraudulent manipulation of data? Is your system running when you need it, and does the infrastructure deliver maximum availability to run your business 24*7?
55
Data Warehouse Vendors
Overall technology ratings comprise user-supplied ratings of functionality, performance, scalability, maintainability, usability, security, portability, ease of integration, ease of implementation, satisfaction and value. Overall vendor ratings comprise user scores of the vendor's credibility, responsiveness, ingenuity, support, vitality, sales process, marketing, legal & accounting functions, licensing practices, and services & training capabilities.
56
Review
- What is a data warehouse?
- What is data warehousing?
- What is the difference between OLTP and data warehousing?
- What does ETL stand for?
- What is the meaning of Metadata?
- What is the star schema?
- What is the snowflake schema?
- What is an OLAP cube?
- What are the most common OLAP operations?
57
Next Week’s Class Talk
- Volunteers are required for next week’s class talk.
- Topic: Business Intelligence
- Length: 20 minutes plus question time
Suggested Points of Interest
- Aim & Scope
- Techniques involved
- Market
  - Vendors & Products
- Typical applications
  - Supermarkets, Airlines, Financial Institutions …
- Prospect of employment
  - Major BI companies
- The future of BI
  - Development trends
58
Project Option --- Data Warehousing
Aim
- To gain hands-on experience in data warehousing.
- To get familiar with popular data warehousing software.
- To build up teamwork and interpersonal skills.
Deliverables
- Reports
- Oral Presentation or Poster
Due
- Reports must be submitted before Week 14.
- Oral presentations and posters are scheduled for Week 15.
Software
- PowerOLAP
- InstantOLAP
- Pentaho