1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)
Data Staging
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
2 The Business Dimensional Lifecycle
[Figure: lifecycle diagram. Stages shown: Project Planning; Business Requirement Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; Data Staging Design & Development; End-User Application Specification; End-User Application Development; Deployment; Maintenance and Growth; Project Management.]
3 Data Staging
[Figure: data flow from the sources (DB2, Access, Excel, Legacy System) through Data Staging into the Data Warehouse (Oracle).]
4 Extraction, Data Cleansing, Data Integration, Transformation, Transportation (Loading), Maintenance
5 Extraction
Extract source data from legacy systems and place it in a staging area.
To reduce the impact on the performance of legacy systems, source data is extracted without any cleansing, integration or transformation operations.
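A minimal sketch of this kind of raw extraction, assuming the legacy source can be queried with SQL. The in-memory SQLite database, the `customer` table and the `extract_table` helper are all hypothetical stand-ins, not part of the course material. The point is that rows are copied to the staging area exactly as found, with no cleansing.

```python
import csv
import io
import sqlite3


def extract_table(conn, table, out):
    """Copy every row of `table` to `out` as CSV, unchanged."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    count = 0
    for row in cur:
        writer.writerow(row)  # no cleansing, integration or transformation here
        count += 1
    return count


# Hypothetical legacy source, simulated with an in-memory SQLite database.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE customer (cid INTEGER, cname TEXT, gender TEXT)")
legacy.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "M"), (2, None, "m")],  # dirty values are extracted as they are
)

staging = io.StringIO()  # stands in for a file in the staging area
rows_extracted = extract_table(legacy, "customer", staging)
```

The missing name and the inconsistent gender code are left alone on purpose; they are dealt with later, in the staging area, so that the legacy system only pays for a plain read.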
6 Extraction
Legacy systems hold data in a variety of formats
  –Relational database: DB2, Oracle, SQL Server, Informix, Access …
  –Flat file: Excel file, text file
Commercial data extraction tools are very helpful in data extraction.
  –Ex: Oracle Data Mart Builder
7 Data Preparation (Cleansing)
It’s all about data quality!!!
8 Outline
Measures for Data Quality
Causes for data errors
Common types of data errors
Common error checks
Correcting missing values
Timing for error checks and corrections
Steps of data preparation
9 Measures for Data Quality
Correctness/Accuracy – w.r.t. the real data
Consistency/Uniqueness – data values, references, measures and interpretations
Completeness – scope of data & values
Relevancy – w.r.t. the requirements
Current data – relevant to the requirements
10 Causes for Data Errors
Data entry errors
Correct data not available at the time of data entry
Entries made by different users at the same time, or by the same users over time
  –Inconsistent or incorrect use of “codes”
  –Inconsistent or incorrect interpretation of “fields”
Transaction processing errors
System and recovery errors
Data extract/transformation errors
11 Common Data Errors
Missing (null) values
Incorrect use of default values (e.g., zero)
Data domain integrity violation (e.g., a 0/1 field holding any other value)
Data value (dependency) integrity violation (e.g., if MM=02 then DD<30)
Data referential integrity violation (e.g., a customer’s order record cannot exist unless the customer record already exists)
12 Common Data Errors, Cont’d
Data retention integrity violation (e.g., old inventory snapshots should not be stored)
Data derivation/transformation/aggregation integrity violation (e.g., profit ≠ sales – costs)
Inconsistent data values for the same data (e.g., M versus m for male)
Inconsistent use of the same data value (e.g., DM for both Data Mining and Data Marts)
13 Error Checks
Domain value validation
Value dependency validation
Referential integrity validation
Identify missing-value or default-value records
Identify outliers
Cross-footing
  –Check aggregates and derivations across different levels and against common sense
Eyeballs!
Process validation
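Several of these checks can be sketched in a few lines. The example below is only an illustration under stated assumptions: the record layout (`cid`, `gender`, `year`, `month`, `day`, `sales`, `costs`, `profit`), the allowed gender codes and the function names are all hypothetical, and amounts are integers so that equality tests are exact.

```python
import calendar

VALID_GENDER = {"M", "F"}  # assumed domain for the gender code


def check_order(order, customer_ids):
    """Return the names of the checks that one staged order record fails."""
    errors = []
    # Domain value validation: the code must come from the allowed set.
    if order["gender"] not in VALID_GENDER:
        errors.append("domain")
    # Value dependency validation: the day must exist in the given month.
    year, month, day = order["year"], order["month"], order["day"]
    if not 1 <= month <= 12 or not 1 <= day <= calendar.monthrange(year, month)[1]:
        errors.append("dependency")
    # Referential integrity validation: the customer record must already exist.
    if order["cid"] not in customer_ids:
        errors.append("referential")
    # Derivation check: profit must equal sales minus costs.
    if order["profit"] != order["sales"] - order["costs"]:
        errors.append("derivation")
    return errors


def cross_foot(detail_rows, reported_total):
    """Cross-footing: detail-level sales must add up to the stored total."""
    return sum(r["sales"] for r in detail_rows) == reported_total


good = {"cid": 1, "gender": "M", "year": 2023, "month": 2, "day": 28,
        "sales": 100, "costs": 60, "profit": 40}
bad = {"cid": 9, "gender": "m", "year": 2023, "month": 2, "day": 30,
       "sales": 100, "costs": 60, "profit": 50}
known_customers = {1, 2}
```

Here `check_order(bad, known_customers)` flags all four problems, while `check_order(good, known_customers)` returns an empty list. Outlier detection and process validation are not covered by this sketch.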
14 Data Cleaning: Missing Values
1. Exclude the record
2. Exclude the attribute/field
3. Replace with a global constant
4. Replace with the attribute mean
5. Replace with the most probable value
6. Apply 3–5 by class/segment of records
7. Manual correction
8. Application-specific algorithm
1–6 are less practical for OLAP-bound data
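As an illustration of replacing a missing value with the attribute mean, either globally or per class/segment, here is a small sketch. The field names (`segment`, `income`) and the `impute_mean` helper are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def impute_mean(records, field, by=None):
    """Return copies of `records` with None in `field` replaced by a mean.

    With `by=None` the global attribute mean is used; otherwise the mean of
    the records that share the same value of the `by` field.
    """
    known = defaultdict(list)
    for r in records:
        if r[field] is not None:
            known[r[by] if by else None].append(r[field])
    means = {group: mean(values) for group, values in known.items()}

    filled = []
    for r in records:
        r = dict(r)  # copy, so the staged input is left untouched
        if r[field] is None:
            # Stays None when the group has no known value to average.
            r[field] = means.get(r[by] if by else None)
        filled.append(r)
    return filled


customers = [
    {"segment": "A", "income": 10},
    {"segment": "A", "income": None},
    {"segment": "B", "income": 30},
    {"segment": "B", "income": 50},
    {"segment": "B", "income": None},
]
global_fill = impute_mean(customers, "income")
segment_fill = impute_mean(customers, "income", by="segment")
```

With the global mean both gaps become 30; by segment they become 10 for A and 40 for B. Either way the filled values are estimates, which is one reason such replacements may be a poor fit for data that will be aggregated in OLAP.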
15 Timing for Error Checking
During Data Staging
During Data Loading
Others
  –Before data extraction (data entries, transaction processing, recovery, audits, etc.)
  –After data loading
16 Steps of Data Preparation
Identify data sources
Extract and analyze source data
Standardize data
Correct and complete data
Match and consolidate data
Analyze data defect types
Transform and enhance data into target
Calculate derivations and summary data
Audit and control data extract, transformation and loading
17 Data Integration
Data from different data sources with different formats need to be integrated into one data warehouse
  –Ex: three customer tables, from the sales department, the marketing department and an acquired company
    Customer (cid, cname, city …)
    Customer (customerid, customername, city …)
    Customer (custid, custname, cname, …)
18 Data Integration
Same attribute with different names: cid, customerid, custid
Different attributes with the same name:
  –cname -> customer name
  –cname -> city name
Same attribute with different formats
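One simple way to resolve these naming conflicts is a column map per source. The sketch below uses the three customer tables from the example; the source labels (`sales`, `marketing`, `acquired`), the target column names and the `to_target` helper are assumptions for illustration.

```python
# Hypothetical per-source column maps onto one assumed target schema.
COLUMN_MAPS = {
    "sales": {"cid": "customer_id", "cname": "customer_name", "city": "city"},
    "marketing": {"customerid": "customer_id", "customername": "customer_name",
                  "city": "city"},
    # In the acquired company's table, cname means city name.
    "acquired": {"custid": "customer_id", "custname": "customer_name",
                 "cname": "city"},
}


def to_target(source, row):
    """Rename the columns of one source row to the target schema."""
    mapping = COLUMN_MAPS[source]
    return {mapping[col]: value for col, value in row.items() if col in mapping}


from_sales = to_target("sales", {"cid": 1, "cname": "Ann", "city": "Salt Lake City"})
from_acquired = to_target(
    "acquired", {"custid": 1, "custname": "Ann", "cname": "Salt Lake City"}
)
```

Both rows come out as the same target record, even though `cname` means different things in the two sources. Format differences in the values themselves still need separate handling.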
19 Data Integration
How to integrate
  –Get the schemas of all data sources
  –Get the schema of the data warehouse
  –Integrate source schemas with help from commercial tools and domain experts
20 Transformation
Prepare data for loading into the data warehouse
  –Change the data format
  –Create derived attributes and tables
  –Aggregate
  –Create warehouse keys
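The four transformation tasks can be sketched together on a toy sales feed. Everything specific here is an assumption for the example: the input layout (`cid`, `date` as MM/DD/YYYY, `sales`, `costs`), the integer warehouse keys and the function names.

```python
from collections import defaultdict
from datetime import datetime
from itertools import count


def transform(rows):
    """Turn staged sales rows into fact rows ready for loading."""
    key_for = {}  # natural key from the source -> warehouse (surrogate) key
    next_key = count(1)
    facts = []
    for r in rows:
        # Create warehouse keys: one new integer per distinct customer.
        if r["cid"] not in key_for:
            key_for[r["cid"]] = next(next_key)
        # Change the data format: MM/DD/YYYY text becomes an ISO date.
        date = datetime.strptime(r["date"], "%m/%d/%Y").date()
        facts.append({
            "customer_key": key_for[r["cid"]],
            "date": date.isoformat(),
            "sales": r["sales"],
            "profit": r["sales"] - r["costs"],  # derived attribute
        })
    return facts, key_for


def monthly_sales(facts):
    """Aggregate: total sales per YYYY-MM month."""
    totals = defaultdict(int)
    for f in facts:
        totals[f["date"][:7]] += f["sales"]
    return dict(totals)


staged = [
    {"cid": "C7", "date": "01/15/2024", "sales": 100, "costs": 60},
    {"cid": "C9", "date": "01/20/2024", "sales": 50, "costs": 20},
    {"cid": "C7", "date": "02/01/2024", "sales": 30, "costs": 10},
]
facts, key_for = transform(staged)
summary = monthly_sales(facts)
```

In a real warehouse the key map would be kept between loads so that a customer keeps the same warehouse key; this sketch rebuilds it on every run.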
21 Transportation
Use bulk load tools, such as Oracle SQL*Loader, instead of SQL commands
Create indexes
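SQL*Loader itself is not shown here. As a rough stand-in for the same idea, the sketch below loads a whole batch in one call and one transaction with Python's `sqlite3` module, then builds the index once after the load, a common ordering for bulk loads. The `sales_fact` table, the index name and the `bulk_load` helper are hypothetical.

```python
import sqlite3


def bulk_load(conn, rows):
    """Insert all rows in one batch and one transaction, then index."""
    with conn:  # commits on success, rolls back if the batch fails
        conn.executemany(
            "INSERT INTO sales_fact (customer_key, date, sales) VALUES (?, ?, ?)",
            rows,
        )
    # Build the index once, after the data is in place.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sales_customer "
        "ON sales_fact (customer_key)"
    )


dw = sqlite3.connect(":memory:")  # stands in for the data warehouse
dw.execute(
    "CREATE TABLE sales_fact (customer_key INTEGER, date TEXT, sales INTEGER)"
)
bulk_load(dw, [(1, "2024-01-15", 100), (2, "2024-01-20", 50), (1, "2024-02-01", 30)])
```

A dedicated bulk loader can typically go further than batched inserts, so this only illustrates the batching and the ordering of load and index creation.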
22 Maintenance
Maintenance frequency: daily, weekly, monthly
Identify changed records and new records in legacy systems
  –Create timestamps for changes and new records in legacy systems
  –Compare data between legacy systems and DW
Load changes and new records into DW
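The two ways of finding changes can be sketched as follows. The `updated_at` field, the `cid` key and both function names are assumptions for the example; timestamps are ISO date strings, which sort chronologically as plain text.

```python
def changed_since(source_rows, last_load):
    """Timestamp approach: rows created or modified after the previous load."""
    return [r for r in source_rows if r["updated_at"] > last_load]


def diff_against_dw(source_rows, dw_rows, key="cid"):
    """Comparison approach: split source rows into new and changed records."""
    dw_by_key = {r[key]: r for r in dw_rows}
    new, changed = [], []
    for r in source_rows:
        old = dw_by_key.get(r[key])
        if old is None:
            new.append(r)      # not in the warehouse yet
        elif old != r:
            changed.append(r)  # present, but some value differs
    return new, changed


# Timestamp approach on hypothetical legacy rows.
legacy_rows = [
    {"cid": 1, "city": "Ogden", "updated_at": "2024-02-10"},
    {"cid": 2, "city": "Provo", "updated_at": "2024-03-05"},
]
recent = changed_since(legacy_rows, last_load="2024-03-01")

# Comparison approach on rows without timestamps.
source = [{"cid": 1, "city": "Ogden"}, {"cid": 2, "city": "Provo"},
          {"cid": 3, "city": "Logan"}]
warehouse = [{"cid": 1, "city": "Ogden"}, {"cid": 2, "city": "Orem"}]
new_rows, changed_rows = diff_against_dw(source, warehouse)
```

The timestamp approach only reads recent rows but requires the legacy system to record change times; the comparison approach needs no change to the source but has to read both sides in full.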