1
ETL – Extract, Transform, Load
BCHB697
2
Outline
  Data Warehouse
  Data sources
  Extract, Transform, Load
BCHB697 - Edwards
3
Online Transaction Processing (OLTP) Databases
  Many small create, update, delete operations over time (fast)
  Capture of detailed data elements associated with a (business) process
  Often tied to data-entry forms, accounts, web-sites, or automated monitoring/logging
  Atomic (transactional) create, update, delete operations on (a few) tables
  Not (very) good for high-level analytics
4
Data Warehouse
  Integration, aggregation of many data-sources: OLTP, flat files, reports
  Used for business analytics, decision making, summarizing, clustering, etc.
  Lots of data-cleanup, data-wrangling issues to transform source data to a coherent data-model (ETL)
  Batch/bulk loads, so transactions not needed
  Large scale queries, report generation
5
Data-Sources (Extract)
  Data-dump from OLTP database
  Aggregated and/or denormalized table from OLTP database (SQL-query)
  Excel-spreadsheets
  URLs, XML, web-service responses
  CSV, TSV files
  “Flat” files (UniProt, GenBank?)
6
Data-Sources (Extract)
  Prototype a process for obtaining a realization (file) of the information from the data-source
  Check whether the information is in an easily handled (file) format
  Check whether all desired data-elements are present
  Iterate until the data-source process with the desired data-elements can be automated
  Establish update process:
    New rows/records only, since last date?
    Entire data-source realized each time?
    What if the data-source changes?
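The update questions above can be sketched in Python. This is a minimal illustration, not a prescribed method: it assumes the data-source realization is a CSV file, and the column names (`id`, `value`, `modified`) and the function `extract_new_rows` are hypothetical.

```python
import csv
import io
from datetime import date

# Hypothetical set of desired data-elements for this data-source.
REQUIRED_COLUMNS = {"id", "value", "modified"}

def extract_new_rows(source, last_extract):
    """Return rows modified after last_extract; fail loudly if columns vanished."""
    reader = csv.DictReader(source)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        # The data-source changed: stop rather than load partial records.
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        row for row in reader
        if date.fromisoformat(row["modified"]) > last_extract
    ]

# In-memory stand-in for a downloaded file.
sample = io.StringIO(
    "id,value,modified\n"
    "1,a,2024-01-05\n"
    "2,b,2024-02-10\n"
    "3,c,2024-03-01\n"
)
new_rows = extract_new_rows(sample, date(2024, 2, 1))
print([row["id"] for row in new_rows])
```

Filtering on a modification date only works if the source records one; otherwise the entire data-source has to be realized each time and compared against the previous copy.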
7
Transform
  Manipulate data-sources’ realization to match data warehouse logical data-model
  Resolve individual data value issues:
    Numbers: data-type, precision, units, format
    Dates: syntax/format, time-zones, precision
    Missing values: format, keywords/symbols?
    Strings: White-space, capitalization, punctuation
    Multi-values: delimiter(s), order, duplicates
    Free-text: misspelling, controlled vocab, semantic terms, natural language processing
8
Transform
  Explore each data-source realization value to determine potential issues
  Prototype resolution strategy and check whether issues remain
  Iterate until data-values can be cleaned in an automated fashion
  Retain exploratory tools for update process:
    New values might have additional issues
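One simple exploratory tool is a frequency table of the distinct values in a column. The sketch below assumes rows are already available as dictionaries; the column name `species` and the function `profile_column` are made up for illustration.

```python
from collections import Counter

def profile_column(rows, column):
    """Count each distinct raw value in a column, without any cleaning."""
    return Counter(row[column] for row in rows)

rows = [
    {"species": "Human"},
    {"species": "human "},
    {"species": "Human"},
    {"species": "N/A"},
]
profile = profile_column(rows, "species")

# repr() makes stray white-space visible when eyeballing the values.
for value, count in profile.most_common():
    print(repr(value), count)
```

Keeping a script like this around means it can be re-run on each update to spot new variants.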
9
Transform: Numbers
  Check: Do the numbers have a valid min/max?
  Check: Is the number represented as a string, or vice-versa?
  Precision: Do all numbers have a consistent (and the desired) precision? Integer/float?
  Units: Are all numbers in the same units?
  Format: scientific notation?
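A minimal sketch of these checks for one numeric field. The field (a weight), the valid range, the target precision, and the function name `clean_weight` are all assumptions chosen for the example.

```python
# Conversion factors to the target unit (kilograms).
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

def clean_weight(raw, unit="kg", lo=0.0, hi=500.0, ndigits=2):
    """Return the weight in kg rounded to ndigits, or None if unusable."""
    try:
        # float() accepts numbers, numeric strings, and scientific notation.
        value = float(str(raw).strip())
    except ValueError:
        return None
    value_kg = value * UNIT_TO_KG[unit]
    # Range check; NaN fails both comparisons and is rejected too.
    if not (lo <= value_kg <= hi):
        return None
    return round(value_kg, ndigits)

raw_values = ["72.5", " 1.2e2 ", 80, "abc", "-4", "1,250"]
weights_kg = [clean_weight(v) for v in raw_values]
print(weights_kg)
print(clean_weight("150", unit="lb"))
```

Values that cannot be parsed or fall outside the range come back as `None`, so they can be reviewed rather than silently loaded.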
10
Transform: Time and Date
  What date is 1/2/12? Is it Jan 2nd, 2012 or Feb 1st, 1912?
  Many equivalent formats for date and time
    Even when the information is clear, lots of code
    Various countries use specific conventions, month names
    24-hour vs 12-hour, AM/PM
  Precision: Include seconds, microseconds?
  How to recognize the year alone? What are reasonable values for year?
  What about timezone? Different records have local times?
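The ambiguity of 1/2/12 is easy to demonstrate with Python's standard `datetime` module. The helper `parse_date` below is a hypothetical sketch: the caller must state which formats a given data-source uses, because the string alone does not say.

```python
from datetime import datetime

def parse_date(text, formats):
    """Try each format in order; return a date, or None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

# Same string, two conventions, two different dates.
us_style = parse_date("1/2/12", ["%m/%d/%y"])   # month first
eu_style = parse_date("1/2/12", ["%d/%m/%y"])   # day first
print(us_style, eu_style)

# An unambiguous ISO 8601 string, and a value that matches nothing.
print(parse_date("2012-01-02", ["%Y-%m-%d"]))
print(parse_date("sometime in 2012", ["%Y-%m-%d", "%m/%d/%y"]))
```

Note that `%y` also has to guess the century: Python maps two-digit years 00 to 68 onto 2000 to 2068, which may not be a reasonable range for, say, dates of birth.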
11
Transform: Missing Values
  Some missing values are so important the record/instance must be removed.
  Determine what values indicate missing:
    Blank, “N/A”, “NA”, “-”, “?”, “ ”, “Unknown”
  Sometimes unknown is different from missing
  Blank rows or missing values sometimes indicate issues with Extract or other transform steps.
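A sketch of one way to handle these cases, assuming string-valued records. The token sets, the `"UNKNOWN"` sentinel, and the choice of `id` as the field that is too important to be missing are illustrative assumptions.

```python
# Tokens treated as "missing" versus "recorded as unknown" (compared lowercase).
MISSING_TOKENS = {"", "n/a", "na", "-", "?"}
UNKNOWN_TOKENS = {"unknown"}

def normalize_missing(value):
    """Map missing markers to None and unknown markers to a single sentinel."""
    token = value.strip().lower()
    if token in MISSING_TOKENS:
        return None
    if token in UNKNOWN_TOKENS:
        return "UNKNOWN"
    return value.strip()

rows = [
    {"id": "1", "age": "34"},
    {"id": "2", "age": "N/A"},
    {"id": " ", "age": "51"},
    {"id": "4", "age": "Unknown"},
]

cleaned = []
for row in rows:
    row = {key: normalize_missing(value) for key, value in row.items()}
    if row["id"] is None:
        # Required field is missing: the record must be removed.
        continue
    cleaned.append(row)
print(cleaned)
```

Counting how many records were dropped is a cheap way to notice problems in the Extract step.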
12
Transform: Strings
  Often string values have multiple variants for the same thing:
    Leading, trailing white-space
    Capitalization and punctuation
    Spelling mistakes, synonyms, abbreviations
  Detect, decide on a canonical “term” and replace variant values
  Will often need updating for new data extracted from the data-sources
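A minimal sketch of replacing variants with a canonical term. The lookup table `CANONICAL` and its entries are invented for the example; in practice such a table would be built up while exploring the data.

```python
import string

# Hypothetical mapping from normalized variant to canonical term.
CANONICAL = {
    "homo sapiens": "Homo sapiens",
    "h sapiens": "Homo sapiens",
    "human": "Homo sapiens",
}

# Translation table that deletes all ASCII punctuation characters.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def canonical_term(raw):
    """Return the canonical term, or None for a variant not seen before."""
    key = raw.lower().translate(STRIP_PUNCTUATION)
    key = " ".join(key.split())  # trim and collapse runs of white-space
    return CANONICAL.get(key)

for raw in ["  H. sapiens ", "HUMAN", "Homo  Sapiens.", "mouse"]:
    print(repr(raw), "->", canonical_term(raw))
```

Returning `None` for unseen variants, instead of passing them through, makes it obvious when the table needs updating for new data.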
13
Transform: Multi-Values
  Single strings holding multiple values:
    e.g. Keywords: protein, kinase, enzyme
    e.g. Accessions: >sp|A0PJZ0|A20A5_HUMAN
  The delimiter delineates each token
    Beware double delimiters: what does || mean?
  Does the order of tokens matter? Is this a set or a list?
  Tokens themselves may have the same issues as strings
  Normalize sets by unique-ing, sorting, and then re-joining tokens
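The set-normalization recipe can be sketched as follows. Treating empty tokens from double delimiters as noise is an assumption; in some sources an empty position carries meaning. The function name `normalize_set` is hypothetical.

```python
def normalize_set(raw, delimiter=","):
    """Split, clean, de-duplicate, sort, and re-join a set-valued string."""
    tokens = [token.strip().lower() for token in raw.split(delimiter)]
    tokens = [token for token in tokens if token]  # drop empty tokens
    return (delimiter + " ").join(sorted(set(tokens)))

keywords = normalize_set("Kinase, protein,, enzyme , kinase")
print(keywords)

# For a list, order matters: keep positions, do not sort or de-duplicate.
header = ">sp|A0PJZ0|A20A5_HUMAN"
database, accession, entry_name = header.lstrip(">").split("|")
print(database, accession, entry_name)
```

The two examples differ only in whether position is meaningful, which is why the set-or-list question has to be answered before normalizing.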
14
Transform: Free-Text
  Free-text is sometimes left alone, kept for presentation, browsing, debugging
  …but sometimes it contains gold!
  Picking out the nuggets:
    Natural language parsing
    Pattern-recognition (dates?)
    Controlled vocabularies
    Semantically constrained strings
  Reliable delimiters are rare
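Two of the simpler techniques, pattern-recognition and controlled-vocabulary matching, can be sketched with the standard `re` module. The note text, the ISO-style date pattern, and the vocabulary are invented for illustration; real free-text usually needs several patterns.

```python
import re

# Matches dates written as YYYY-MM-DD only (an assumption about the source).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Hypothetical controlled vocabulary of terms worth extracting.
VOCABULARY = {"kinase", "enzyme", "protein"}

note = ("Sample collected 2012-01-02; protein shows kinase activity, "
        "re-run on 2012-02-15.")

dates = DATE_PATTERN.findall(note)
words = set(re.findall(r"[a-z]+", note.lower()))
terms = sorted(VOCABULARY & words)

print(dates)
print(terms)
```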
15
Transform: Free-Text
16
Transform: Tools
  Excel
    Auto-filter, text-to-columns, remove duplicates, pivot tables
  OpenRefine
    For messy data-tables, reconciliation
  Python
    Many (simple) scripts, automated
    Store configuration/vocab/etc. in human editable files
17
Transform: Deidentification
  Some extracted fields are “identifying”
    Social-security number (US)
    Date-of-birth
    Address
  Actions:
    Delete the column
    Transform: DOB + Event Date -> Event Age
    Loss of precision: Address -> Zip Code or State; Date -> Year
  Be careful, release of personally identifiable information has legal consequences…
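The three actions can be sketched on a single record. The field names and the record itself are made up, and this is an illustration of the transformations only, not a statement of what any regulation requires.

```python
from datetime import date

def age_at_event(dob, event_date):
    """Whole years between date-of-birth and the event."""
    years = event_date.year - dob.year
    # Subtract one if the birthday has not yet occurred in the event year.
    if (event_date.month, event_date.day) < (dob.month, dob.day):
        years -= 1
    return years

def deidentify(record):
    """Build a new record that omits or coarsens identifying fields."""
    return {
        # Delete: ssn and street are simply not copied.
        # Transform: DOB + Event Date -> Event Age.
        "event_age": age_at_event(record["dob"], record["event_date"]),
        # Loss of precision: Date -> Year, Address -> State.
        "event_year": record["event_date"].year,
        "state": record["state"],
        "diagnosis": record["diagnosis"],
    }

record = {
    "ssn": "000-00-0000",  # placeholder, not a real number
    "dob": date(1980, 6, 15),
    "event_date": date(2012, 1, 2),
    "street": "1 Example St",
    "state": "DC",
    "diagnosis": "example",
}
safe = deidentify(record)
print(safe)
```

Building the output from an explicit list of allowed fields means a new identifying column in the source is dropped by default rather than leaked.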
18
Load: Entity Resolution
  For each instance (row) to be loaded to the data warehouse:
  How is the instance identified?
    External identifier, name, resolve ambiguity?
  Is it already in the DW?
    Check for existence of external identifier, ID
  Get surrogate (internal?) primary key for instance
    Lookup table based on external identifier
  For initial load, can sometimes assume unique instances, but not true for update…
19
Load: Entity Resolution
  Related information from different sources
  Need to figure out which students are already in the person table(s), and if the phone numbers are already present (or inconsistent)
    Person (PersonID, FullName)
    Phone (PersonID, Type, Number)
    Student (NetID, FirstName, FamilyName, WorkPhone, HomePhone)
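A small in-memory sketch of this resolution step, using dictionaries in place of the Person and Phone tables. The lookup table from NetID to PersonID, the sample data, and the function `load_student` are assumptions; the slide does not say how students are matched to persons.

```python
# Stand-ins for the warehouse tables.
person = {1: "Ada Lovelace"}                # PersonID -> FullName
phone = {(1, "Work"): "202-555-0100"}       # (PersonID, Type) -> Number
netid_to_person = {"al123": 1}              # external NetID -> surrogate PersonID

def load_student(student):
    """Resolve a Student row to a PersonID; report inconsistent phone numbers."""
    person_id = netid_to_person.get(student["NetID"])
    if person_id is None:
        # Not in the DW yet: assign the next surrogate key.
        person_id = max(person, default=0) + 1
        person[person_id] = f'{student["FirstName"]} {student["FamilyName"]}'
        netid_to_person[student["NetID"]] = person_id
    conflicts = []
    for phone_type, field in (("Work", "WorkPhone"), ("Home", "HomePhone")):
        number = student.get(field)
        if not number:
            continue
        existing = phone.get((person_id, phone_type))
        if existing is None:
            phone[(person_id, phone_type)] = number
        elif existing != number:
            conflicts.append((phone_type, existing, number))
    return person_id, conflicts

known = load_student({"NetID": "al123", "FirstName": "Ada", "FamilyName": "Lovelace",
                      "WorkPhone": "202-555-0199", "HomePhone": "202-555-0142"})
new = load_student({"NetID": "gh456", "FirstName": "Grace", "FamilyName": "Hopper",
                    "WorkPhone": "", "HomePhone": "202-555-0177"})
print(known, new)
```

Conflicting numbers are reported rather than overwritten, leaving the decision about which source to trust outside the load step.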
20
Load: DBMS
  Load each DW table with tabular data
    TSV, CSV format
    SQL-format insertions
  Can be very time consuming (days)
    Large data volume, writes slower than reads
  Carry out bulk load as one step, not lots of small inserts.
  During load, turn off features designed to make queries faster (e.g. indexes), since these make inserts slower.
  Plan for incremental update, not just initial load
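The bulk-load advice can be illustrated with Python's built-in `sqlite3` module. SQLite, the table layout, and the generated rows are assumptions for the sketch; a production DBMS would normally offer its own bulk-load utility.

```python
import sqlite3

# Generated stand-in for rows parsed from a TSV/CSV file.
rows = [(i, f"Person {i}") for i in range(1, 1001)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, full_name TEXT)")

# One transaction and one executemany call, instead of many small inserts.
with conn:
    conn.executemany("INSERT INTO person VALUES (?, ?)", rows)

# Build the index only after the data is in, so inserts are not slowed by it.
conn.execute("CREATE INDEX idx_person_name ON person (full_name)")

count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(count)
```

For an incremental update the same pattern applies, but the index already exists, so the trade-off between dropping and rebuilding it depends on the size of the update.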
21
Exercise
  Install OpenRefine
  Tutorial: Cleaning data with OpenRefine