Published by Raymundo Smallidge. Modified over 10 years ago.
1
Extract, Transform, Load
2
Agenda
- Review
  - Analysis (Bus Matrix, Info Package)
  - Logical Design (Dimensional Modeling)
  - Physical Design (Spreadsheet)
  - Implementation (Data Mart Relational Tables)
- ETL Process
  - Overview
  - ETL Components
    - Staging Area
    - Extraction
    - Transformation
    - Loading
  - Documenting High-Level ETL Requirements
  - Documenting Detailed ETL Flows
  - Example ETL
3
Review: Dimensional Modeling
4
Review: DM Implementation
DimStudent, FactEnrollment

    CREATE TABLE DimStudent(
        student_sk int identity(1,1),   -- surrogate key, generated automatically
        student_id varchar(9),          -- natural key from the source system
        firstname varchar(30),
        lastname varchar(30),
        city varchar(20),
        state varchar(2),
        major varchar(6),
        classification varchar(25),
        gpa numeric(3, 2),
        club_name varchar(25),
        undergrad_school varchar(25),
        gmat int,
        undergrad_or_grad varchar(10),
        CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk));
    GO

    CREATE TABLE FactEnrollment(
        student_sk int,
        class_sk int,
        date_sk int,
        professor_sk int,
        course_grade numeric(2, 1),
        CONSTRAINT factenrollment_pk PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk),
        CONSTRAINT factenrollment_student_fk FOREIGN KEY (student_sk) REFERENCES dimstudent (student_sk),
        CONSTRAINT factenrollment_class_fk FOREIGN KEY (class_sk) REFERENCES dimclass (class_sk),
        CONSTRAINT factenrollment_date_fk FOREIGN KEY (date_sk) REFERENCES dimtime (date_sk),
        CONSTRAINT factenrollment_professor_fk FOREIGN KEY (professor_sk) REFERENCES dimprofessor (professor_sk));
    GO
5
Review: Physical DW Design
6
ETL Overview
Reshaping relevant data from source systems into useful information stored in the DW
- Extract: Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW
- Transform: Cleaning and converting data to prepare it for loading into the DW
- Load: Putting cleansed and converted data into the DW
7
ETL Process
Not really new, BUT…
- Much more data
- Includes rearranging, summarizing
- Data used for strategic decision-making
Characteristics:
- Process AND technology
- Detailed, highly dependent tasks
- Consumes an average of 75% of DW development
- An ongoing process for the life of the DW
Requirements:
- Well-documented
- Automated
- Flexible
8
ETL Process
1. Determine target data
2. Determine data sources
3. Prepare data mapping
4. Organize data staging area
5. Establish data extraction rules
6. Establish data transformation rules
7. Plan aggregate tables
8. Establish data load procedures
9. Load dimension tables
10. Load fact tables
9
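ETL Process Flow
[Figure: process flow diagram; each step number is annotated with the artifact or tool used]
- 1: Dim Model
- 2: Spreadsheet
- 3: Spreadsheet
- 4
- 5: SSIS
- 6, 7: Map & SSIS
- 8, 9, 10: SSIS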
10
ETL Staging Area
Information hub, facilitating the enriching stages that data goes through to populate a DW
Advantages:
- Separates source systems and DW
- Minimizes ETL impact on source AND DW systems
Can consist of multiple “hubs”:
- “upload” area
- “staging” area
- “DW load images”
11
ETL Staging Area, cont…
12
High Level Design of ETL Process
Initial documentation of:
- What data do we need and where is it coming from?
  - Physical DW Design Spreadsheet shown previously
- What are the major transformation/cleansing needs?
  - “Extend” Physical DW Design Spreadsheet OR ETL Map
- What’s the sequence of activities for ETL?
  - ETL Map
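One way to picture an ETL map is as a small table-driven structure in which each target column records its source column and the transformation it needs. The sketch below is illustrative only: the `ETL_MAP` layout and the `apply_map` helper are hypothetical, the sample row is made up, and the column names are borrowed from the DimProfessor example later in the deck.

```python
# Hypothetical ETL map: target column -> (source column, transformation).
# The structure is an assumption; column names follow the DimProfessor example.
RANK_CODES = {"asst": "assistant prof", "assc": "associate prof", "prof": "full prof"}

ETL_MAP = {
    "professor_id": ("prof_id", lambda v: v),                   # straight copy
    "firstname":    ("firstname", lambda v: v),
    "lastname":     ("lastname", lambda v: v),
    "rank":         ("rank", lambda v: RANK_CODES.get(v, v)),   # decode, keep unknown codes
    "department":   ("dept", lambda v: v),
}

def apply_map(source_row, etl_map):
    """Build one target row by applying each mapped transformation."""
    return {target: fn(source_row[src]) for target, (src, fn) in etl_map.items()}

# Made-up sample row for illustration.
source_row = {"prof_id": 101, "firstname": "Pat", "lastname": "Doe",
              "rank": "assc", "dept": "MIS"}
target_row = apply_map(source_row, ETL_MAP)
print(target_row)
```

Keeping the map as data rather than code means the same document can serve as both the high-level design and the input to the load routine.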
13
Common Transformations
- Format Revisions
- Key Restructuring, Lookup
- Handling of Null Values
- Decoding fields
- Calculated, Derived values
- Merging of Data
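A few of the transformations listed above can be sketched as small functions. This is a minimal illustration, not a prescribed implementation: the function names, the `STATE_NAMES` lookup, and the phone format are assumptions, and `format_phone` assumes the input holds at least 10 digits.

```python
# Illustrative helpers; names and lookup values are hypothetical.
STATE_NAMES = {"TX": "Texas", "OK": "Oklahoma"}

def decode(value, lookup, default="unknown"):
    """Decoding fields: replace a source code with a readable value."""
    return lookup.get(value, default)

def handle_null(value, default="unknown"):
    """Handling of null values: substitute a default for None or blank strings."""
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return default
    return value

def format_phone(raw):
    """Format revision: keep digits only, render the last 10 as ###-###-####."""
    digits = "".join(ch for ch in raw if ch.isdigit())[-10:]
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"

def full_name(first, last):
    """Merging of data / derived value: build one field from two."""
    return f"{handle_null(first, '')} {handle_null(last, '')}".strip()

print(decode("TX", STATE_NAMES), format_phone("(500) 555-0131"), full_name("Jon", "Yang"))
```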
14
Common Transformations, cont…
- Splitting of single fields
- Character set conversion
- Units of measurement conversion
- Date/time conversion
- Summarization
- Deduplication
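Several of these transformations are short enough to sketch directly. The helpers below are illustrative assumptions (the names, the "Last, First" input layout, and the US-style source date format are not from the slides), shown only to make each bullet concrete.

```python
from datetime import datetime

def split_name(full):
    """Splitting of single fields: 'Last, First' -> (first, last)."""
    last, _, first = full.partition(",")
    return first.strip(), last.strip()

def pounds_to_kg(pounds):
    """Units of measurement conversion (1 lb = 0.45359237 kg)."""
    return round(pounds * 0.45359237, 2)

def to_iso_date(text, source_format="%m/%d/%Y"):
    """Date/time conversion: parse the source format and emit ISO 8601."""
    return datetime.strptime(text, source_format).date().isoformat()

def deduplicate(rows, key):
    """Deduplication: keep the first row seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def summarize(rows, group_key, value_key):
    """Summarization: total a numeric field per group."""
    totals = {}
    for row in rows:
        totals[row[group_key]] = totals.get(row[group_key], 0) + row[value_key]
    return totals

print(split_name("Yang, Jon"), to_iso_date("01/31/2015"))
```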
15
Common Transformations, cont…
Other Data Quality Issues
- Standardize values
- Validate values
- Identify mismatches, misspellings
- Etc…
Suggestions:
- Appoint “Data Stewards”
- Ensure ETL programs have control checks
- Data Profiling…
16
Comparison of Models
17
Transformations Example
Tables: DimTime, DimProfessor, DimClass, DimStudent, FactEnrollment
Steps whose table column is not preserved in the slide text:
- Create table
- Generate SK
- Insert row w/SK = -1
DimProfessor:
- Expand rank values (use SQL case)
- Expand department values (join prof to departments)
DimClass:
- Get coursename & cred hrs from section tbl (join section to course)
DimStudent:
- Expand classification values (use SQL case)
- Expand state values (needs lookup table but use SQL case instead)
- Get gmat, undergrad school from grad table (join student to grad)
- Get club name from club (join student to undergrad; left join them to club)
- Create undergrad_or_grad values (if stud_id in undergrad or stud_id in grad)
FactEnrollment:
- Add SKs: student, section, prof (join registration to student, time, and section dims; left join them to prof)
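Two of the DimStudent steps (expanding classification codes with a SQL CASE, and deriving undergrad_or_grad from membership in the undergrad or grad tables) can be sketched as follows. This uses SQLite through Python so it runs anywhere; the table layouts, the `class_code` column, and the 'FR'/'GR' codes are assumptions, not the actual OLTP schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed, simplified OLTP tables; real column names may differ.
    CREATE TABLE student (stud_id TEXT PRIMARY KEY, class_code TEXT);
    CREATE TABLE undergrad (stud_id TEXT, club_id INTEGER);
    CREATE TABLE grad (stud_id TEXT, gmat INTEGER);
    INSERT INTO student VALUES ('S1', 'FR'), ('S2', 'GR');
    INSERT INTO undergrad VALUES ('S1', 7);
    INSERT INTO grad VALUES ('S2', 650);
""")

rows = conn.execute("""
    SELECT s.stud_id,
           CASE s.class_code WHEN 'FR' THEN 'freshman'
                             WHEN 'GR' THEN 'graduate'
                             ELSE s.class_code END AS classification,
           g.gmat,
           CASE WHEN u.stud_id IS NOT NULL THEN 'undergrad'
                WHEN g.stud_id IS NOT NULL THEN 'grad' END AS undergrad_or_grad
    FROM student s
    LEFT JOIN undergrad u ON u.stud_id = s.stud_id   -- left joins keep every student
    LEFT JOIN grad g ON g.stud_id = s.stud_id
    ORDER BY s.stud_id
""").fetchall()
print(rows)
```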
18
Data Profiling
Systematic analysis of the content of a data source
Goals:
- Anticipate potential data quality issues upfront
- Build quality corrections and controls into ETL process
Manual and/or tool-assisted
19
Profiling Example: Manual
CustID | Account Number | Customer Type | Title | First Name | Last Name | Gender | Email | Phone | Address Line1 | Address Line2 | State | Postal Code | Country
11000 | AW00011000 | I | Mr. | Jon | Yang | F | jon24@adventure-works.com. | 1(11) 500 555-0162 | 3761 N. 14th St | (blank) | Queensland | 4700 | AU
11001 | AW00011001 | I | (blank) | Eugene | Huang | F | eugene10@adventure-works.com. | 500-555-0110 | 2243 W St. | (blank) | Victoria | 3198 | AU
11002 | AW00011002 | I | (blank) | Ruben | Torres | F | ruben35@advanture-works.com. | 1(11) 500 555-0184 | 5844 Linden Dr | (blank) | New South Wales | 7001 | AU
11003 | AW00011003 | I | (blank) | Christy | Zhu | F | christy12@adventure-works.com. | 1(11) 500 555-0162 | 1825 Village Pl. | (blank) | Queensland | 2113 | (blank)
11004 | AW00011004 | I | Mrs. | Elizabeth | Johnson | F | elizabeth5@adventure-works.com. | (500) 555-0131 | 7553 Harness Circle | (blank) | (blank) | 2500 | AU
11005 | AW00011005 | I | (blank) | Julio | Ruiz | M | julio1@adventure-works.com. | 1(11) 500 555-0151 | 7305 Humphrey Drive | (blank) | New South Wales | 4169 | OZ
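The same inspection can be partly automated. The sketch below profiles a few columns of the sample rows above by counting blanks, distinct values, and value "shapes"; the `profile` and `pattern` helpers are hypothetical names, and only a subset of columns is included to keep it short.

```python
from collections import Counter
import re

# Subset of the sample customer rows above (None marks a blank cell).
rows = [
    {"title": "Mr.",  "phone": "1(11) 500 555-0162", "state": "Queensland",      "country": "AU"},
    {"title": None,   "phone": "500-555-0110",       "state": "Victoria",        "country": "AU"},
    {"title": None,   "phone": "1(11) 500 555-0184", "state": "New South Wales", "country": "AU"},
    {"title": None,   "phone": "1(11) 500 555-0162", "state": "Queensland",      "country": None},
    {"title": "Mrs.", "phone": "(500) 555-0131",     "state": None,              "country": "AU"},
    {"title": None,   "phone": "1(11) 500 555-0151", "state": "New South Wales", "country": "OZ"},
]

def pattern(value):
    """Reduce a value to its shape: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def profile(rows, column):
    """Count blanks, distinct values, and distinct shapes for one column."""
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "distinct": Counter(present),
        "patterns": Counter(pattern(v) for v in present),
    }

for col in ("title", "country", "phone"):
    print(col, profile(rows, col))
```

On this sample the country column surfaces both a blank and the nonstandard code OZ, and the phone column shows three different formats, which are the kinds of issues profiling is meant to catch before the ETL is written.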
20
Profiling Example: SSIS
21
Documenting ETL High Level Design
Add to existing DW Physical Design Spreadsheet
22
Documenting ETL High Level Design
23
Low Level Design of ETL Process
Detailed documentation of:
- What data do we need and where is it coming from?
- What are the major transformation/cleansing needs?
- What’s the sequence of activities for ETL?
Can use a tool like SSIS
24
Extracting Source Data
Two forms:
1. Static Data Capture
   - Point-in-time snapshot
   - Initial loads and periodic refreshes
2. Revised Data Capture
   - Only data that has been added, updated, or deleted since the last load
   - Ongoing incremental loads
   - Two timeframes: Immediate, Deferred
25
Static Data Capture
- (T)SQL Scripts: e.g., small number of tables/rows
- Export/Import Tables: e.g., database or non-database sources
- Backup/Restore Database: e.g., copying a SQL Server source database for initial load ETL
- Detach/Attach Database: e.g., copying an older SQL Server version to a newer SQL Server version for initial load ETL
26
Revised Data Capture
Immediate / Real-time
- ETL side: procs get changed data from the log in real time and update ETL staging tables
- OLTP side: triggers update ETL staging tables
- OLTP side: apps write to OLTP AND ETL staging tables
Deferred
- ETL side: procs get changed data from OLTP tables based on timestamps
- ETL side: procs do file comparison
- OLTP side: Change Data Capture (SQL Server 2008)
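The deferred "file comparison" approach can be sketched as a diff between the previous and current snapshots of a source table, keyed by primary key. This is a minimal illustration with made-up rows; the `compare_snapshots` name and the in-memory dictionaries standing in for extract files are assumptions.

```python
def compare_snapshots(previous, current):
    """Deferred capture by file comparison: diff two snapshots keyed by primary key."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserted, updated, deleted

# Made-up snapshots: key -> (last name, country).
previous = {1: ("Yang", "AU"), 2: ("Huang", "AU"), 3: ("Torres", "AU")}
current = {1: ("Yang", "AU"), 2: ("Huang", "NZ"), 4: ("Zhu", "AU")}

inserted, updated, deleted = compare_snapshots(previous, current)
print("inserted:", inserted)
print("updated:", updated)
print("deleted:", deleted)
```

Unlike timestamp-based capture, comparison also detects deletes, at the cost of reading both full snapshots on every run.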
27
Documenting ETL Low Level Design: SSIS
- Comes with SQL Server
- Helps document and automate ETL process
- Based on defining
  - Packages
  - Tasks
- One approach
  - A package for each target table
  - A "master" package
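The "one package per target table plus a master package" approach amounts to running table loads in an order that respects their dependencies (dimensions before facts). The sketch below models that idea in plain Python rather than SSIS; the `run_master` helper is hypothetical, and the package names simply mirror the load scripts used later in the deck.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each target table's package and the packages that must finish first.
dependencies = {
    "load_dimclass": set(),
    "load_dimprofessor": set(),
    "load_dimstudent": set(),
    "load_dimtime": set(),
    "load_factenrollment": {"load_dimclass", "load_dimprofessor",
                            "load_dimstudent", "load_dimtime"},
}

def run_master(dependencies, run_package):
    """Master package: run every table package in a dependency-safe order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_package(name)
    return order

executed = []
order = run_master(dependencies, executed.append)  # stand-in for running a package
print(order)
```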
28
SSIS Package Examples: Master
29
SSIS Package Examples: Extract All
30
SSIS Package Examples: Extract Changed using CDC
E.g., SELECT * FROM cdc-customer WHERE cdc_chg_date > etl_last_capture_date;
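The query above compares a change date against the last capture date. A runnable approximation of that pattern is sketched below with SQLite through Python; it illustrates the watermark idea only and is not SQL Server's Change Data Capture feature. The `customer` and `etl_control` tables, their columns, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed tables: a source table with a change date and an ETL control table.
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, chg_date TEXT);
    CREATE TABLE etl_control (table_name TEXT PRIMARY KEY, last_capture_date TEXT);
    INSERT INTO customer VALUES
        (1, 'Yang',  '2015-01-05'),
        (2, 'Huang', '2015-02-10'),
        (3, 'Ruiz',  '2015-03-15');
    INSERT INTO etl_control VALUES ('customer', '2015-02-01');
""")

def extract_changed_customers(conn):
    """Return rows changed since the last capture, then advance the watermark."""
    (last_capture,) = conn.execute(
        "SELECT last_capture_date FROM etl_control WHERE table_name = 'customer'"
    ).fetchone()
    changed = conn.execute(
        "SELECT cust_id, name, chg_date FROM customer "
        "WHERE chg_date > ? ORDER BY chg_date", (last_capture,)).fetchall()
    if changed:
        # Newest change date seen becomes the next run's starting point.
        conn.execute(
            "UPDATE etl_control SET last_capture_date = ? WHERE table_name = 'customer'",
            (changed[-1][2],))
    return changed

first_run = extract_changed_customers(conn)
second_run = extract_changed_customers(conn)
print(first_run, second_run)
```

ISO-formatted date strings are used so that plain string comparison orders them correctly in SQLite.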
31
SSIS Package Examples: Transforms
32
SSIS Package Examples: Load
33
Class Performance DW Example
- Create ClassPerformanceDW database
- Using ClassPerformanceDW database…
  - Create ClassPerformanceDW tables using SQL Script
    http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables/create_class_performance_dw_tables.sql
34
ETL Example using SQL Scripts
- One "Master Script"
- Calls five "table" scripts
35
"Master" Script

    --be sure to turn on Query, SQLCMD mode in order to run this script
    Use ClassPerformanceDW
    print 'loading dimclass table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql"
    print 'loading dimprofessor table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql"
    print 'loading dimstudent table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql"
    print 'loading dimtime table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql"
    print 'loading factenrollment table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql"
    Print 'class performance DW data transformation and loading is complete'
    Go
36
Load "DimProfessor" Script (pg. 1 of 3)

    set nocount on
    print 'remove existing data from dimprofessor'
    delete from dimprofessor;
    go
    print 'reseeding SK identity value back to 1'
    -- with reseed 0, the next SK is 1 once the table has held rows;
    -- on a table that has never held rows, the first SK is 0
    dbcc checkident ('dimprofessor', reseed, 0);
    go
    print 'adding oltp prof data to dimprofessor'
    print 'professor_sk will be automatically inserted'
    insert into dimprofessor (professor_id, firstname, lastname, rank, department)
    select prof_id, firstname, lastname, rank, dept
    from regnOLTP.dbo.prof;
    go
37
Load "DimProfessor" Script (pg. 2 of 3)

    print 'decoding rank field'
    UPDATE dimprofessor
    SET dimprofessor.rank =
        case dimprofessor.rank
            when 'asst' then 'assistant prof'
            when 'assc' then 'associate prof'
            when 'prof' then 'full prof'
            else dimprofessor.rank  -- leave unrecognized codes unchanged instead of setting them to NULL
        end;
    Go
    print 'decoding department field using imported excel spreadsheet'
    UPDATE dimprofessor
    SET dimprofessor.department = regnOLTP.dbo.departments.department
    FROM dimprofessor, regnOLTP.dbo.departments
    WHERE dimprofessor.department = regnOLTP.dbo.departments.prefix;
    Go
38
Load "DimProfessor" Script (pg. 3 of 3)

    print 'adding SK -1 row'
    set identity_insert dimprofessor on
    Go
    insert into dimprofessor (professor_sk, professor_id, firstname, lastname, rank, department)
    Values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown');
    GO
    set identity_insert dimprofessor off
    Go
    Set nocount off
39
Load "FactEnrollment" Script

    print 'adding oltp registration data to factenrollment'
    INSERT INTO factenrollment (student_sk, class_sk, date_sk, professor_sk, course_grade)
    SELECT student_sk, class_sk, datekey,
           ISNULL(professor_sk, -1),  -- unmatched professors map to the 'unknown' row (SK -1); a NULL would violate the primary key
           final_grade
    FROM ((((regnOLTP.dbo.registration
        INNER JOIN dimstudent
            ON regnOLTP.dbo.registration.stud_id = dimstudent.student_id)
        INNER JOIN dimclass
            ON regnOLTP.dbo.registration.callno = dimclass.crn)
        INNER JOIN dimtime
            ON CONVERT(varchar(10), regnOLTP.dbo.registration.regn_date, 101) = actualdatekey)
        INNER JOIN regnOLTP.dbo.section
            ON dimclass.crn = regnOLTP.dbo.section.callno)
        LEFT JOIN dimprofessor
            ON regnOLTP.dbo.section.prof_id = dimprofessor.professor_id;
    Go
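The interplay between the SK -1 "unknown" row and the fact load's left join can be seen in a small runnable sketch. It uses SQLite through Python instead of SQL Server, with heavily simplified tables and made-up rows; the column subsets and sample values are assumptions chosen only to show the surrogate-key lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the OLTP and DW tables (assumed columns).
    CREATE TABLE prof (prof_id INTEGER, lastname TEXT);
    CREATE TABLE section (callno INTEGER, prof_id INTEGER);
    CREATE TABLE registration (stud_id TEXT, callno INTEGER, final_grade REAL);
    CREATE TABLE dimprofessor (
        professor_sk INTEGER PRIMARY KEY AUTOINCREMENT,
        professor_id INTEGER, lastname TEXT);
    CREATE TABLE factenrollment (
        stud_id TEXT, callno INTEGER, professor_sk INTEGER NOT NULL, course_grade REAL);

    INSERT INTO prof VALUES (10, 'Lee'), (20, 'Park');
    INSERT INTO section VALUES (500, 10), (501, NULL);  -- section 501 has no professor
    INSERT INTO registration VALUES ('S1', 500, 4.0), ('S2', 501, 3.0);

    -- Dimension load: surrogate keys are generated, then the 'unknown' row is added.
    INSERT INTO dimprofessor (professor_id, lastname)
        SELECT prof_id, lastname FROM prof ORDER BY prof_id;
    INSERT INTO dimprofessor (professor_sk, professor_id, lastname)
        VALUES (-1, -1, 'unknown');

    -- Fact load: look up the surrogate key; unmatched rows fall back to -1.
    INSERT INTO factenrollment
    SELECT r.stud_id, r.callno, COALESCE(p.professor_sk, -1), r.final_grade
    FROM registration r
    JOIN section s ON s.callno = r.callno
    LEFT JOIN dimprofessor p ON p.professor_id = s.prof_id;
""")

facts = conn.execute(
    "SELECT stud_id, professor_sk FROM factenrollment ORDER BY stud_id").fetchall()
print(facts)
```

Because every fact row ends up with a valid professor_sk, queries can inner join to the dimension without silently dropping enrollments that lack a professor.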
40
Entire Transform/Load "Package"
http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables.zip