Presentation is loading. Please wait.

Presentation is loading. Please wait.

Intro. to Data Warehouse

Similar presentations


Presentation on theme: "Intro. to Data Warehouse"— Presentation transcript:

1 Intro. to Data Warehouse
รศ.ดร. วรพจน์ กรีสุระเดช Worapoj Kreesuradej, Ph.D. Associate Professor Data Mining & Data Exploration Laboratory (DME Lab), Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Web:

2 Book Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001. Ralph Kimbal and Margy Ross, The Data Warehouse Toolkit, John Wiley and Sons, 2002.

3 Definition of DW “A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.” - W. Inmon A decision support database that is maintained separately from the organization’s operational databases. A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format –E. Turban and etc.

4 R. Kimball’s definition of a DW
A data warehouse is a copy of transactional data specifically structured for querying and analysis.

5 Problem: Data Management in Large Enterprises
Vertical fragmentation of informational systems Result of application (user)-driven development of operational systems Sales Administration Finance Manufacturing ... Sales Planning Stock Mngmt Suppliers Debt Mngmt Num. Control Inventory

6 Problem: Data Management in Large Enterprises
Two Approaches for accessing data: Query-Driven (Lazy) Warehouse (Eager) ? Source Source

7 The Need for DW . . . . . . Query-driven (lazy, on-demand) Clients
Integration System Metadata . . . Wrapper Wrapper Wrapper . . . Source Source Source

8 Disadvantages of Query-Driven Approach
Delay in query processing Inefficient and potentially expensive for frequent queries Competes with local processing at sources

9 The Warehousing Approach
Information integrated in advance Stored in wh for direct querying and analysis Clients Data Warehouse Integration System Metadata . . . Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor . . . Source Source Source

10 Advantages of Warehousing Approach
High query performance Doesn’t interfere with local processing at sources Information copied at warehouse Can modify, annotate, summarize, restructure, etc. Can store historical information Security, no auditing

11 Characteristics of DW Subject oriented
Data are organized by how users refer to it Integrated Inconsistencies are removed in both nomenclature and conflicting information; (i.e. data are ‘clean’) Non-volatile Read-only data. Data do not change over time. Time variant Data are time series, not current status

12 Subject Oriented Data Warehouse is designed around
“subjects” rather than processes A company may have Retail Sales System Outlet Sales System Catalog Sales System DW will have a Sales Subject Area

13 Subject Oriented OLTP Systems Data Warehouse Catalog Sales
Retail Sales System Outlet Sales System Catalog Sales Sales Subject Area Subject-Oriented Sales Information Data Warehouse OLTP Systems

14 Integrated Heterogeneous Source Systems Need to Integrate source data
For Example: Product codes could be different in different systems Arrive at common code in DW

15 Integrated Information integrated in advance
Stored in DW for direct querying and analysis Clients Data Warehouse Source . . . Extractor/ Monitor Integration System Metadata

16 Non-Volatile Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data.

17 Non-Volatile(Read-Mostly)
Write OLTP USER Read DW USER Read

18 Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

19 Time Variant Most business analysis has a time component
Trend Analysis (historical data is required) Sales

20 Data Warehousing Process Overview

21 Data Warehousing Process Overview
The major components of a data warehousing process Data sources Data extraction Data loading Comprehensive Database /Data Store Data Mart Metadata Middleware tools /information delivery tools

22 ETL Data Extraction Data Cleaning and Transformation
Convert from legacy/host format to warehouse format Load Sort, summarize, consolidate, compute views, check integrity, build indexes, partition

23 The ETL Process Source Systems Staging Area DW Database Transform
Extract Load

24 Data Staging Area A storage area where extracted data is cleaned, transformed and deduplicated. Initial storage for data Need not be based on Relational model Mainly sorting and Sequential processing Does not provide data access to users Analogy – kitchen of a restaurant

25 ETL Process Issues & Challenges Consumes 70-80% of project time
Heterogeneous Source Systems Little or no control over source systems Source systems scattered Different currencies, measurement units Ensuring data quality

26 Comprehensive Database /Data Store
Mostly a relational DB Oracle, DB2, Sybase, SQL Server New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)

27 Data Warehouse Design OLTP Systems are Data Capture Systems
“DATA IN” systems DW are “DATA OUT” systems OLTP DW

28 Dimensional Modeling Facts are stored in FACT Tables
Dimensions are stored in DIMENSION tables Dimension tables contains textual descriptors of business Fact and dimension tables form a Star Schema “BIG” fact table in center surrounded by “SMALL” dimension tables

29 Star Schema

30 Star Schema

31 Data mart Data mart = subset of DW for community users, e.g. accounting department Sometimes exist as Multidimensional Database Info mart = summarized data + report for community users

32 Meta Data Data about data
Needed by both information technology personnel and users IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.

33 Information Delivery Tools
Query & reporting OLAP Data mining, visualization, segmentation, clustering New developments: text mining, web mining & personalization Mining multimedia data

34 Information Delivery Tools
Commercial tools Crystal Report, Impromptu, WebFocus Increasingly common mode of delivery: Web-enabled

35 Data Warehouse Architecture
Data Flow Architecture System Architecture

36 Data Flow Architecture

37 Data Flow Architecture

38 Data Flow Architecture
Operational data stores (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files MDB=Multidimensional databases

39 System Architectures Three parts of the data warehouse
The data warehouse that contains the data and associated software Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse Client (front-end) software that allows users to access and analyze data from the warehouse

40 System Architectures

41 System Architectures

42 System Architecture

43 System Architecture

44 Data Warehouse Development
Data warehouse development approaches Inmon Model: EDW approach, Enterprise-wide warehouse, top down Kimball Model: Data mart approach, Data mart, bottom up Which model is best? There is no one-size-fits-all strategy to data warehousing When properly executed, both result in an enterprise-wide data warehouse, but with different architectures

45 The Data Mart Strategy The most common approach
Begins with a single mart and architected marts are added over time for more subject areas Relatively inexpensive and easy to implement Can be used as a proof of concept for data warehousing Can perpetuate the “silos of information” problem Can postpone difficult decisions and activities Requires an overall integration plan

46 The Enterprise-wide Strategy
A comprehensive warehouse is built initially An initial dependent data mart is built using a subset of the data in the warehouse Additional data marts are built using subsets of the data in the warehouse Like all complex projects, it is expensive, time consuming, and prone to failure When successful, it results in an integrated, scalable warehouse

47 DW Lifecycle (Ralph Kimball )

48 Data Warehouse Development
Some best practices for implementing a data warehouse (Weir, 2002): Project must fit with corporate strategy and business objectives There must be complete buy-in to the project by executives, managers, and users It is important to manage user expectations about the completed project The data warehouse must be built incrementally Build in adaptability

49 Data Warehouse Development
Some best practices for implementing a data warehouse (Weir, 2002): The project must be managed by both IT and business professionals Develop a business/supplier relationship Only load data that have been cleansed and are of a quality understood by the organization Do not overlook training requirements Be politically aware


Download ppt "Intro. to Data Warehouse"

Similar presentations


Ads by Google