Data Conversion to a Data Warehouse
Presented by Sanjay Gunasekaran
Main Topics
- Brief overview of the Data Warehouse
- Concept of Data Conversion
- Importance of Data Conversion and the steps involved
- Common industry methodology
- Outline and analysis done in the Alternate Plan paper
Data Warehousing
- It is a concept, not a product.
- A method for analyzing massive amounts of data to make better business decisions.
- Helpful, for example, in analyzing sales data to make decisions that affect the company’s performance.
- A Data Warehouse in general contains summarized, de-normalized and replicated data that is infrequently updated and is optimized for decision support applications.
Comparison between Operational Environment and Data Warehouse

Operational Environment     Data Warehouse
Detailed                    Summarized
Current                     Variable over time
Transaction driven          Analysis driven
Minimum redundancy          Some redundancy
Static structure            Flexible structure
Small amount of data        Huge volumes of data
Constantly updated          Infrequently updated
Data Warehouse Concepts
- Multidimensional Model
  a) Facts - Table containing aggregate information required for analysis.
  b) Dimensions - Classes of descriptors of the facts.
  c) Hierarchies - Levels of aggregation of the data.
- Databases
  a) Relational
     i) Oracle
  b) Multi-Dimensional
     i) Oracle Express
     ii) Essbase
     iii) Gentium
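The three terms on the slide above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than Oracle, and the table names, column names and sample rows are all hypothetical. The dimension tables describe the facts, the fact table holds the measure, and the day, month and year columns of the time dimension form a hierarchy along which the facts can be rolled up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: descriptors of the facts. Its columns form a
    -- hierarchy (day -> month -> year). All names are illustrative.
    CREATE TABLE time_dim (
        time_id INTEGER PRIMARY KEY,
        day     TEXT,
        month   TEXT,
        year    INTEGER
    );
    CREATE TABLE product_dim (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );
    -- Fact: the numeric measure, keyed by the dimensions.
    CREATE TABLE sales_fact (
        time_id    INTEGER REFERENCES time_dim(time_id),
        product_id INTEGER REFERENCES product_dim(product_id),
        amount     REAL
    );
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", [
    (1, "2001-01-15", "2001-01", 2001),
    (2, "2001-02-03", "2001-02", 2001),
    (3, "2002-01-20", "2002-01", 2002),
])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)", [
    (1, "Widget", "Hardware"),
    (2, "Gadget", "Hardware"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    (1, 1, 100.0), (2, 1, 50.0), (2, 2, 25.0), (3, 2, 200.0),
])


def sales_by(level):
    """Roll the fact table up to one level of the time hierarchy."""
    if level not in ("day", "month", "year"):
        raise ValueError(f"unknown hierarchy level: {level}")
    return conn.execute(
        f"SELECT t.{level}, SUM(f.amount) "
        f"FROM sales_fact f JOIN time_dim t ON t.time_id = f.time_id "
        f"GROUP BY t.{level} ORDER BY t.{level}"
    ).fetchall()


monthly = sales_by("month")
yearly = sales_by("year")
```

Calling `sales_by("year")` instead of `sales_by("month")` answers the same question at a coarser level of aggregation, which is what a hierarchy provides.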
Implementation Steps
- Analyze user requirements for the Data Warehouse.
- Analyze existing transaction processing data.
- Design the Data Warehouse (multi-dimensional model).
- Create the Data Warehouse (relational or multi-dimensional).
- Extract and clean the operational data.
- Migrate and load the data into the warehouse.
- Do decision support analysis on the warehouse data using OLAP tools.
- Create reports for reporting purposes.
Data Warehouse Architecture
- Terminology
  a) OLTP systems
  b) Metadata
  c) Data Warehouse
  d) Staging Area
  e) Extraction, Loading & Migration
  f) External Data
Data Warehouse Architecture (Contd.)
- OLTP Systems
  - Online Transaction Processing systems, also called production systems.
  - Systems used to manage and run the business.
- Metadata
  - Information about the data that feeds into, is transformed in, and resides in the Data Warehouse.
- Data Warehouse
  - Core of the architecture.
  - Supports informational processing by providing a solid platform of integrated, historical data from which to do analysis.
Data Warehouse Architecture (Contd.)
- Staging Area
  - The Data Warehouse workbench.
  - The place where raw data is brought in, cleaned, combined, archived and eventually exported to either the Data Warehouse or to one or more Data Marts.
- Extraction, Cleaning & Loading
  - Known as the Data Conversion process.
  - The process by which data from the operational systems is moved to the warehouse.
  - One of the most important steps in the implementation of a Data Warehouse.
- External Data
Data Conversion
- Loading of data from the operational system into the Data Warehouse.
- Process in which data is extracted, cleaned, combined, archived and eventually loaded into the Data Warehouse.
- Complex, time-consuming and unglamorous.
- Comprises the following processes:
  a) Extraction
  b) Cleaning
  c) Loading
- A very important part of the data warehousing process.
Importance of Data Conversion
- The Data Warehouse holds the information that is key to a corporation’s decision-making process.
- Unreliable and “dirty” data can affect the performance of the corporation.
- Examples
  a) Marketing communications
  b) Retail sales
  c) Medical records
Steps in Data Conversion
- Extract data from the operational systems to an intermediate schema (staging area).
  - The staging area is the Data Warehouse workbench where the data is cleaned, combined, archived and eventually exported to the Data Warehouse.
  - It has the same schema structure as the operational system.
- Convert the intermediate schema to “load data”.
- Aggregate the “load data”.
- Migrate the “load data” from the staging area to the Data Warehouse server (if the staging area is not on the same server as the warehouse).
- Load the data into the Data Warehouse.
Data Conversion Process
Data Conversion
- Extraction
  - Routines are created to read source data and move it to an intermediate staging area.
  - The staging area has the same schema as the source. It matters because the data is cleaned there before it is loaded into the warehouse.
- Convert intermediate schemas to “load data”
  - This is the data cleaning process. It comprises:
    - Data examination
    - Data parsing
    - Data correction
    - Record matching
    - Data transformation
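A minimal sketch of the five cleaning activities listed above, assuming customer rows with name, phone and state fields. The field names, the sample rows, the correction table and the matching rule (same name and phone number) are all hypothetical and stand in for whatever rules a real conversion would define.

```python
import re

# Hypothetical raw rows as they might sit in the staging area.
raw_rows = [
    {"name": "  john  SMITH ", "phone": "(612) 555-0101", "state": "Minn."},
    {"name": "John Smith", "phone": "612.555.0101", "state": "MN"},
    {"name": "Ann Lee", "phone": "555-0199", "state": "WI"},
]

STATE_FIXES = {"MINN.": "MN", "WISC.": "WI"}  # assumed correction table


def clean(row):
    """Data parsing and correction for a single row."""
    name = " ".join(row["name"].split()).title()  # collapse spaces, fix case
    phone = re.sub(r"\D", "", row["phone"])       # keep digits only
    state = row["state"].strip().upper()
    state = STATE_FIXES.get(state, state)         # apply known corrections
    return {"name": name, "phone": phone, "state": state}


def examine(row):
    """Data examination: list the problems found in a cleaned row."""
    issues = []
    if len(row["phone"]) != 10:
        issues.append("phone is not 10 digits")
    if len(row["state"]) != 2:
        issues.append("state is not a 2-letter code")
    return issues


def match_records(rows):
    """Record matching: keep the first row for each (name, phone) pair."""
    seen = {}
    for row in rows:
        seen.setdefault((row["name"].lower(), row["phone"]), row)
    return list(seen.values())


def transform(row, customer_key):
    """Data transformation: reshape a row into the load-data layout."""
    return {
        "customer_key": customer_key,
        "customer_name": row["name"],
        "state_code": row["state"],
    }


cleaned = [clean(r) for r in raw_rows]
rejected = [(r, examine(r)) for r in cleaned if examine(r)]
accepted = [r for r in cleaned if not examine(r)]
load_rows = [
    transform(r, key) for key, r in enumerate(match_records(accepted), start=1)
]
```

In this sketch the two spellings of the same customer collapse into one load row, and the row with an incomplete phone number is set aside in `rejected` instead of being loaded.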
Data Conversion (Contd.)
- Aggregate “load data”
  - “Load data” is aggregated by executing a series of sorts externally.
- Move the “load data” from the staging area onto the Data Warehouse server
  - Done if the Data Warehouse server is different.
- Load the data onto the Data Warehouse
  - Done using SQL routines or bulk-load utilities.
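The aggregate-then-load steps above might look like the following sketch. An in-memory sort stands in for the external sorts mentioned on the slide, and a batched insert through Python's sqlite3 module stands in for an SQL routine or bulk-load utility. The table layout and sample rows are assumptions made for illustration.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

# Hypothetical cleaned "load data": (product_key, location_key, amount).
load_data = [
    (2, 10, 30.0),
    (1, 10, 20.0),
    (2, 10, 12.5),
    (1, 11, 5.0),
    (1, 10, 7.5),
]


def aggregate(rows):
    """Sort on the dimension keys, then sum each run of equal keys."""
    key = itemgetter(0, 1)
    ordered = sorted(rows, key=key)  # stand-in for an external sort
    return [
        (product, location, sum(r[2] for r in group))
        for (product, location), group in groupby(ordered, key=key)
    ]


aggregated = aggregate(load_data)

# Load step: one batched insert inside a single transaction.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact "
    "(product_key INTEGER, location_key INTEGER, amount REAL)"
)
with warehouse:
    warehouse.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?)", aggregated
    )

loaded = warehouse.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales_fact"
).fetchone()
```

Sorting first means that rows sharing the same keys sit next to each other, so a single pass over the sorted data is enough to produce the totals.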
Paper Outline
- Brief explanation of the data warehousing concept
- Data Warehouse architecture
- Data Conversion
- Importance of Data Conversion
- Common industry methodology
- Analysis of the Data Conversion process using an example:
  - Sales Order System
Overall Analysis
- The aim of the paper was to outline the Data Conversion process.
- Design a relational database, a staging area and a Data Warehouse.
- Move data from the relational database to the staging area.
- Move data from the staging area to the warehouse.
In-depth Analysis
- Designed the relational database to reflect the transaction processing system of a typical organization.
- Designed the staging area to reflect only the Sales system.
- Designed the Data Warehouse for the Sales system.
- Built the relational database (source system) for the Sales system example in Oracle.
- Built the staging area in Oracle.
- Built the Data Warehouse in Oracle (multi-dimensional design in a relational database).
- Created views for the source tables (for transparency).
- Created synonyms for the views (as the source tables were on a different server).
In-depth Analysis (Contd.)
- Wrote SQL scripts to first move data from the synonyms to the staging area.
- Wrote SQL scripts and procedures to move data from the staging area to the Data Warehouse.
  - Data was moved first from the staging area tables to the dimension tables, namely Product, Location and Customer.
  - The Time dimension table was populated with 10 years of data. Additional scripts were written to populate the Time dimension with data every year.
  - Data was then moved from the staging area to the fact table (core table).
- Wrote scripts to check the consistency of the data.
  - These scripts checked the total number of records moved from the source system to the staging area and from the staging area to the Data Warehouse.
  - Additionally, they checked the total amount moved from the database to the Data Warehouse.
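The paper's scripts were written for Oracle and are not reproduced here. As a rough illustration only, the sketch below uses Python's sqlite3 module to show two of the ideas above: populating a time dimension with 10 years of rows, and reconciling record counts and total amounts between two tables. The one-row-per-day granularity, the table names, the column names and the sample rows are all assumptions.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (
        day TEXT PRIMARY KEY, month INTEGER, quarter INTEGER, year INTEGER
    );
    CREATE TABLE staging_sales (order_id INTEGER, amount REAL);
    CREATE TABLE sales_fact (order_id INTEGER, amount REAL);
""")


def populate_time_dim(conn, start_year, years):
    """Insert one row per calendar day (assumed granularity)."""
    day = date(start_year, 1, 1)
    end = date(start_year + years, 1, 1)
    rows = []
    while day < end:
        quarter = (day.month - 1) // 3 + 1
        rows.append((day.isoformat(), day.month, quarter, day.year))
        day += timedelta(days=1)
    with conn:
        conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def reconcile(conn, source, target):
    """Compare record counts and total amounts of two tables.

    Table names are interpolated into the SQL, so pass only trusted names.
    """
    query = "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {}"
    src_count, src_total = conn.execute(query.format(source)).fetchone()
    tgt_count, tgt_total = conn.execute(query.format(target)).fetchone()
    return {
        "rows_match": src_count == tgt_count,
        "amount_match": abs(src_total - tgt_total) < 1e-9,
    }


days_loaded = populate_time_dim(conn, 2000, 10)

with conn:
    conn.executemany(
        "INSERT INTO staging_sales VALUES (?, ?)",
        [(1, 100.0), (2, 250.5), (3, 40.0)],
    )
    conn.execute("INSERT INTO sales_fact SELECT * FROM staging_sales")
after_load = reconcile(conn, "staging_sales", "sales_fact")

# Simulate a row lost during the move to show the check failing.
with conn:
    conn.execute("DELETE FROM sales_fact WHERE order_id = 3")
after_loss = reconcile(conn, "staging_sales", "sales_fact")
```

Extending the time dimension each year, as the slide describes, would amount to calling `populate_time_dim` again with the next start year and a span of one year.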
Conclusion
- The value of the Data Warehouse can only be realized through OLAP analysis and Data Mining.
- Data Conversion is one of the most critical processes in implementing a Data Warehouse.
- The warehouse holds information that is of great value to the enterprise.
- The Data Conversion process must be done effectively and efficiently.