Published by Gordon Eugene Webb. Modified over 9 years ago.
1
Data Staging: Data Loading and Cleaning
Marakas pg. 25
BCIS 4660 Spring 2012
2
Basic Processes
Building the data warehouse involves extracting, transforming, and loading (ETL) data from source systems to the target databases. This includes the identification, selection, and transformation mapping of source data to target data.
3
Data Loading
The source-to-target mapping includes the specification of a process model that covers the many tough issues of data acquisition. Among the difficult data acquisition challenges are detection of source data changes, data extraction techniques, timing of data extracts, data transformation techniques, frequency of database loads, and levels of data summary.
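Detection of source data changes can be illustrated with a small sketch. It assumes one particular technique, comparing fingerprints of two successive full extracts; timestamps or database logs are other common options. The function names, the `id` key, and the sample rows are hypothetical.

```python
import hashlib

def row_fingerprint(row):
    """Hash every column of a row so two versions can be compared cheaply."""
    joined = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def detect_changes(previous, current, key="id"):
    """Classify rows as inserted, deleted, or updated between two extracts."""
    prev = {r[key]: row_fingerprint(r) for r in previous}
    curr = {r[key]: row_fingerprint(r) for r in current}
    return {
        "inserted": sorted(k for k in curr if k not in prev),
        "deleted": sorted(k for k in prev if k not in curr),
        "updated": sorted(k for k in curr if k in prev and curr[k] != prev[k]),
    }

# Hypothetical extracts taken on two consecutive days.
yesterday = [{"id": 1, "city": "Denton"}, {"id": 2, "city": "Dallas"}]
today = [
    {"id": 1, "city": "Denton"},
    {"id": 2, "city": "Plano"},
    {"id": 3, "city": "Austin"},
]
changes = detect_changes(yesterday, today)
```

Only the rows listed in `changes` would need to move on to the transformation step, which is one way to shorten the load window.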
4
Processing Steps
Extract, Transform, Load (ETL)
  – Extracting
  – Data transformation
  – Loading the data
Data cleanup
Index creation
  – Performance requirement
Aggregation creation and maintenance
Backup
Data archiving
Data mart refresh
5
Sample Dimensional Schema
Sales Date Dim: Sales date key, Sales data, Sales date month, Sales date year
Sales Summary Fact: Sales date key, Sales dept key, Cat mgr key, Product key, Qty, Dollars, Cost, Net
Category Manager Dim: Cat mgr key, Category mgr name, Distribution center name
Store Dept Dim: Store dept key, Store, Store size, Store mgr, Dept, Dept size, Dept mgr, District, Region
Product Dim: Product key, Product id, Product desc, Product sub-category, Product category
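The sample schema could be declared roughly as follows. This is a sketch using Python's built-in `sqlite3` module with an in-memory database. The column types are assumptions, as are two readings of the slide: "Sales data" is treated as a sales date column, and the fact table's "Sales dept key" is treated as the foreign key to Store Dept Dim.

```python
import sqlite3

# Column types are assumed; the slide lists names only.
DDL = """
CREATE TABLE sales_date_dim (
    sales_date_key INTEGER PRIMARY KEY,
    sales_date TEXT, sales_date_month INTEGER, sales_date_year INTEGER
);
CREATE TABLE category_manager_dim (
    cat_mgr_key INTEGER PRIMARY KEY,
    category_mgr_name TEXT, distribution_center_name TEXT
);
CREATE TABLE store_dept_dim (
    store_dept_key INTEGER PRIMARY KEY,
    store TEXT, store_size TEXT, store_mgr TEXT,
    dept TEXT, dept_size TEXT, dept_mgr TEXT,
    district TEXT, region TEXT
);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_id TEXT, product_desc TEXT,
    product_sub_category TEXT, product_category TEXT
);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_summary_fact (
    sales_date_key INTEGER REFERENCES sales_date_dim,
    store_dept_key INTEGER REFERENCES store_dept_dim,
    cat_mgr_key INTEGER REFERENCES category_manager_dim,
    product_key INTEGER REFERENCES product_dim,
    qty INTEGER, dollars REAL, cost REAL, net REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
```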
6
Extracting
Reading and understanding the source data and copying the parts that are needed to the data staging layer for further work.
7
Transforming
Cleansing the data by correcting misspellings and resolving domain conflicts (city vs. zip)
Purging fields that are not useful
Combining data sources, matching exactly on key values or attributes
Creating surrogate keys for dimensions
Building aggregates (totals) for boosting performance of common queries
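Two of these transformations, correcting misspellings and creating surrogate keys, can be sketched in a few lines. The lookup table, class name, and sample rows below are hypothetical and only illustrate the idea.

```python
from itertools import count

# Hypothetical lookup of known misspellings and their corrections.
CITY_FIXES = {"dalas": "Dallas", "huston": "Houston"}

def clean_city(value):
    """Trim whitespace, fix known misspellings, and normalize capitalization."""
    trimmed = value.strip()
    return CITY_FIXES.get(trimmed.lower(), trimmed.title())

class SurrogateKeyGenerator:
    """Hand out a stable integer key for each distinct natural key."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, natural_key):
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next)
        return self._keys[natural_key]

source_rows = [
    {"product_id": "P-100", "city": " dalas "},
    {"product_id": "P-200", "city": "HOUSTON"},
    {"product_id": "P-100", "city": "Dallas"},
]
keys = SurrogateKeyGenerator()
transformed = [
    {"product_key": keys.key_for(r["product_id"]), "city": clean_city(r["city"])}
    for r in source_rows
]
```

The same natural key (`P-100`) always receives the same surrogate key, so fact rows can be joined to the dimension without depending on the source system's identifiers.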
8
Loading and Indexing
Replicating the dimension tables and fact tables
Bulk loading of each recipient data mart
Bulk loading is an important capability, in contrast to record-at-a-time loading
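The contrast between bulk and record-at-a-time loading can be sketched with `sqlite3` as a stand-in. Production warehouses typically rely on a vendor bulk loader instead, so treat this only as an analogy. The table name and the generated rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT)"
)

# Hypothetical staged rows waiting to be loaded.
rows = [(i, f"product {i}") for i in range(1, 1001)]

# Record-at-a-time loading would issue one INSERT and one commit per row.
# Bulk style: hand the whole batch over in a single transaction.
with conn:
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)", rows)

# Build the index after the load so it is not maintained row by row.
conn.execute("CREATE INDEX idx_product_desc ON product_dim (product_desc)")

loaded = conn.execute("SELECT COUNT(*) FROM product_dim").fetchone()[0]
```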
9
Quality Assurance Checking
Run comprehensive exception reports over newly loaded data
All counts and totals must be satisfactory [data audit]
Reported values must be consistent with similar values that preceded them before loading new data
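A data audit of counts and totals might look like the sketch below. The function name, the `dollars` field, and the tolerance are assumptions chosen for illustration. A real exception report would cover many more checks.

```python
def audit_load(source_rows, loaded_rows, amount_field="dollars", tolerance=0.01):
    """Return a list of discrepancies between the source extract and the load."""
    problems = []
    if len(source_rows) != len(loaded_rows):
        problems.append(
            f"row count: source={len(source_rows)} loaded={len(loaded_rows)}"
        )
    source_total = sum(r[amount_field] for r in source_rows)
    loaded_total = sum(r[amount_field] for r in loaded_rows)
    if abs(source_total - loaded_total) > tolerance:
        problems.append(
            f"{amount_field} total: source={source_total:.2f} loaded={loaded_total:.2f}"
        )
    return problems

# Hypothetical data: one complete load and one that dropped a row.
source = [{"dollars": 10.0}, {"dollars": 20.5}]
clean_report = audit_load(source, list(source))
short_report = audit_load(source, source[:1])
```

An empty report means the load passes this audit; any entry would block release until it is explained.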
10
Release (e.g., Version 3.1) Publishing
User community notification
Communicates the nature of any changes in dimensions or facts
Updates to meta data
11
Updating
Incorrect data must be corrected.
Changes to the meta data, etc., must be made.
12
Querying
The end goal is to allow access by all authorized users
Takes place on the data warehouse presentation server
13
Important Concepts
The requirements for placing extract, transform, and load (ETL) processes into a stable production environment.
The technical requirements for these processes, including support considerations with purchased ETL software.
The challenges of supporting the data warehouse with custom code.
14
The Analyst Must
Identify, assess, select, and map source data to target data stores
Identify and specify kinds of data transformations (keys, totals, omits, etc.)
Manage ETL schedules, including frequency of extract and latency of load
Understand the role of meta data (data about data)
Identify the classes of technology useful in warehouse data acquisition
15
Who Else Needs to Know this Information?
IT designers, developers, and data administrators new to DW
Business and technical data warehouse team members
Technical business users interested in building sound decision support systems
16
SUMMARY: The Processes
Plan the process
Identify the tools to be used
Clean the data
Back up the data and process the data
Populate dimension tables
17
Source data
Enterprise data
B2B data
Web harvesting – the ultimate data store
See The Data Webhouse Toolkit by Kimball
18
Identifying data sources
Source data assessment and qualification
Understanding and modeling source data
Triage of source data
19
Source-to-target movement
Source-to-target mapping
Data transformations
Timing considerations
Levels of detail
Processes and flows
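A source-to-target mapping can be written down as data rather than buried in code. The sketch below assumes a hypothetical source layout (`PROD_NO`, `PROD_NAME`, `QTY_SOLD`) and pairs each target column with its source field and a transformation.

```python
# Hypothetical mapping: target column -> (source field, transformation).
MAPPING = {
    "product_id": ("PROD_NO", str.strip),
    "product_desc": ("PROD_NAME", str.title),
    "qty": ("QTY_SOLD", int),
}

def apply_mapping(source_row, mapping=MAPPING):
    """Build a target row; source fields absent from the mapping are dropped."""
    return {
        target: transform(source_row[source])
        for target, (source, transform) in mapping.items()
    }

source_row = {
    "PROD_NO": " P-100 ",
    "PROD_NAME": "blue widget",
    "QTY_SOLD": "3",
    "UNUSED": "x",
}
target_row = apply_mapping(source_row)
```

Keeping the mapping in one table-like structure also makes it easy to document as meta data.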
20
Meta data considerations
Data structure layouts and data element documentation
Required meta data
Support of meta data propagation
21
Requirements for stable production processing
Scheduling
Logging
Recovery
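Logging and recovery can be sketched together: each step is logged, and a checkpoint file records the steps that finished so a rerun resumes after the last good one. The checkpoint format, the step names, and `run_pipeline` are assumptions for illustration. Scheduling is left to an external scheduler.

```python
import json
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def run_pipeline(steps, checkpoint_path):
    """Run named steps in order, skipping those a previous run completed."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)
    for name, func in steps:
        if name in done:
            log.info("skipping %s (already completed)", name)
            continue
        log.info("starting %s", name)
        func()  # an exception here leaves the checkpoint at the last good step
        done.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(done, fh)
    return done

# Hypothetical run: the load fails once, then succeeds on the rerun.
calls = []

def failing_load():
    raise RuntimeError("target database unavailable")

checkpoint = os.path.join(tempfile.mkdtemp(), "etl_checkpoint.json")
try:
    run_pipeline(
        [("extract", lambda: calls.append("extract")), ("load", failing_load)],
        checkpoint,
    )
except RuntimeError:
    log.error("load failed; the rerun resumes after 'extract'")

completed = run_pipeline(
    [("extract", lambda: calls.append("extract")), ("load", lambda: calls.append("load"))],
    checkpoint,
)
```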
22
Extract, Transform, and Load technology
Extraction - Buy versus build
Matching needs to technology
23
Software
XML (eXtensible Markup Language)
Used in moving data around among applications
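As an illustration of moving data between applications with XML, the sketch below serializes rows to an XML document and reads them back using Python's standard `xml.etree.ElementTree` module. The element names and the sample row are hypothetical.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(rows, root_tag="products", row_tag="product"):
    """Serialize a list of dicts as one XML element per row."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, row_tag)
        for field, value in row.items():
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

def xml_to_rows(document):
    """Parse the document back into a list of dicts (all values as text)."""
    root = ET.fromstring(document)
    return [{child.tag: child.text for child in item} for item in root]

rows = [{"product_id": "P-100", "product_desc": "Blue Widget"}]
document = rows_to_xml(rows)
round_trip = xml_to_rows(document)
```

Values come back as text, so the receiving application still has to apply its own type conversions.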
24
ETL activities