Download presentation
1
Intro. to Data Warehouse
รศ.ดร. วรพจน์ กรีสุระเดช Worapoj Kreesuradej, Ph.D. Associate Professor Data Mining & Data Exploration Laboratory (DME Lab), Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Web:
2
Book Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001. Ralph Kimbal and Margy Ross, The Data Warehouse Toolkit, John Wiley and Sons, 2002.
3
Definition of DW “A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.” - W. Inmon A decision support database that is maintained separately from the organization’s operational databases. A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format –E. Turban and etc.
4
R. Kimball’s definition of a DW
A data warehouse is a copy of transactional data specifically structured for querying and analysis.
5
Problem: Data Management in Large Enterprises
Vertical fragmentation of informational systems Result of application (user)-driven development of operational systems Sales Administration Finance Manufacturing ... Sales Planning Stock Mngmt Suppliers Debt Mngmt Num. Control Inventory
6
Problem: Data Management in Large Enterprises
Two Approaches for accessing data: Query-Driven (Lazy) Warehouse (Eager) ? Source Source
7
The Need for DW . . . . . . Query-driven (lazy, on-demand) Clients
Integration System Metadata . . . Wrapper Wrapper Wrapper . . . Source Source Source
8
Disadvantages of Query-Driven Approach
Delay in query processing Inefficient and potentially expensive for frequent queries Competes with local processing at sources
9
The Warehousing Approach
Information integrated in advance Stored in wh for direct querying and analysis Clients Data Warehouse Integration System Metadata . . . Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor . . . Source Source Source
10
Advantages of Warehousing Approach
High query performance Doesn’t interfere with local processing at sources Information copied at warehouse Can modify, annotate, summarize, restructure, etc. Can store historical information Security, no auditing
11
Characteristics of DW Subject oriented
Data are organized by how users refer to it Integrated Inconsistencies are removed in both nomenclature and conflicting information; (i.e. data are ‘clean’) Non-volatile Read-only data. Data do not change over time. Time variant Data are time series, not current status
12
Subject Oriented Data Warehouse is designed around
“subjects” rather than processes A company may have Retail Sales System Outlet Sales System Catalog Sales System DW will have a Sales Subject Area
13
Subject Oriented OLTP Systems Data Warehouse Catalog Sales
Retail Sales System Outlet Sales System Catalog Sales Sales Subject Area Subject-Oriented Sales Information Data Warehouse OLTP Systems
14
Integrated Heterogeneous Source Systems Need to Integrate source data
For Example: Product codes could be different in different systems Arrive at common code in DW
15
Integrated Information integrated in advance
Stored in DW for direct querying and analysis Clients Data Warehouse Source . . . Extractor/ Monitor Integration System Metadata
16
Non-Volatile Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data.
17
Non-Volatile(Read-Mostly)
Write OLTP USER Read DW USER Read
18
Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
19
Time Variant Most business analysis has a time component
Trend Analysis (historical data is required) Sales
20
Data Warehousing Process Overview
21
Data Warehousing Process Overview
The major components of a data warehousing process Data sources Data extraction Data loading Comprehensive Database /Data Store Data Mart Metadata Middleware tools /information delivery tools
22
ETL Data Extraction Data Cleaning and Transformation
Convert from legacy/host format to warehouse format Load Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
23
The ETL Process Source Systems Staging Area DW Database Transform
Extract Load
24
Data Staging Area A storage area where extracted data is cleaned, transformed and deduplicated. Initial storage for data Need not be based on Relational model Mainly sorting and Sequential processing Does not provide data access to users Analogy – kitchen of a restaurant
25
ETL Process Issues & Challenges Consumes 70-80% of project time
Heterogeneous Source Systems Little or no control over source systems Source systems scattered Different currencies, measurement units Ensuring data quality
26
Comprehensive Database /Data Store
Mostly a relational DB Oracle, DB2, Sybase, SQL Server New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)
27
Data Warehouse Design OLTP Systems are Data Capture Systems
“DATA IN” systems DW are “DATA OUT” systems OLTP DW
28
Dimensional Modeling Facts are stored in FACT Tables
Dimensions are stored in DIMENSION tables Dimension tables contains textual descriptors of business Fact and dimension tables form a Star Schema “BIG” fact table in center surrounded by “SMALL” dimension tables
29
Star Schema
30
Star Schema
31
Data mart Data mart = subset of DW for community users, e.g. accounting department Sometimes exist as Multidimensional Database Info mart = summarized data + report for community users
32
Meta Data Data about data
Needed by both information technology personnel and users IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.
33
Information Delivery Tools
Query & reporting OLAP Data mining, visualization, segmentation, clustering New developments: text mining, web mining & personalization Mining multimedia data
34
Information Delivery Tools
Commercial tools Crystal Report, Impromptu, WebFocus Increasingly common mode of delivery: Web-enabled
35
Data Warehouse Architecture
Data Flow Architecture System Architecture
36
Data Flow Architecture
37
Data Flow Architecture
38
Data Flow Architecture
Operational data stores (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files MDB=Multidimensional databases
39
System Architectures Three parts of the data warehouse
The data warehouse that contains the data and associated software Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse Client (front-end) software that allows users to access and analyze data from the warehouse
40
System Architectures
41
System Architectures
42
System Architecture
43
System Architecture
44
Data Warehouse Development
Data warehouse development approaches Inmon Model: EDW approach, Enterprise-wide warehouse, top down Kimball Model: Data mart approach, Data mart, bottom up Which model is best? There is no one-size-fits-all strategy to data warehousing When properly executed, both result in an enterprise-wide data warehouse, but with different architectures
45
The Data Mart Strategy The most common approach
Begins with a single mart and architected marts are added over time for more subject areas Relatively inexpensive and easy to implement Can be used as a proof of concept for data warehousing Can perpetuate the “silos of information” problem Can postpone difficult decisions and activities Requires an overall integration plan
46
The Enterprise-wide Strategy
A comprehensive warehouse is built initially An initial dependent data mart is built using a subset of the data in the warehouse Additional data marts are built using subsets of the data in the warehouse Like all complex projects, it is expensive, time consuming, and prone to failure When successful, it results in an integrated, scalable warehouse
47
DW Lifecycle (Ralph Kimball )
48
Data Warehouse Development
Some best practices for implementing a data warehouse (Weir, 2002): Project must fit with corporate strategy and business objectives There must be complete buy-in to the project by executives, managers, and users It is important to manage user expectations about the completed project The data warehouse must be built incrementally Build in adaptability
49
Data Warehouse Development
Some best practices for implementing a data warehouse (Weir, 2002): The project must be managed by both IT and business professionals Develop a business/supplier relationship Only load data that have been cleansed and are of a quality understood by the organization Do not overlook training requirements Be politically aware
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.