Taming Data Logistics: The Hardest Part of Data Science. February 1, 2011. Ken Farmer, kenfar@us.ibm.com, kenfar@gmail.com
Data Logistics is the Management of Data in Motion: while handling problems, and all of this for many, many feeds.
Top 3 Data Logistics Problems: dependability problems; productivity problems; data quality problems.
Data Logistics is Surprisingly Difficult: it's not sexy; lack of dramatic improvement; few best practices; non-intuitive challenges; little-known tools & methods; tools != methods.
Data Science is not alone: similar activities; similar technologies; similar challenges; similar results; different heritages (corporate DBA heritage, academic CS heritage). Nothing on the Data Science side maps to ETL.
Top 3 Fundamental Mistakes: Consistency vs Adaptability (or incorrect requirements & objectives); Non-Linear Scalability Problems (or misunderstood dynamics); Magical Thinking (or over-estimated capabilities).
Architectural Decisions – Buy vs Build. Considerations include: feed complexity & number (see chart at left, with risk on the vertical axis and complexity on the horizontal axis); developer interest & skills; organizational culture. It's really (buy+build) vs build.
Architectural Decisions – Scheduling & Control. Spectrum: synchronous steps <-> asynchronous stations. Unit of work: batches; microbatches; streaming.
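The three units of work on this slide can be seen as points on one dial. The sketch below is an illustration only: the `microbatches` helper and its `size` parameter are hypothetical names, not something from the talk. It groups any record stream into small lists.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def microbatches(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group a (possibly unbounded) record stream into lists of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    while True:
        # Pull up to `size` records; an empty list means the stream is exhausted.
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Under these assumptions, a `size` of 1 behaves like record-at-a-time streaming, while a very large `size` behaves like a classic batch.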
Stages: extract, transform, and load (ETL) are both activities AND stages. Treating them as stages: enables deployment flexibility; enables different tools & technology; adds structure to the process.
Extract Stage: get transformation-ready data; changed data capture; minimal transformation; auditing; potential colocation.
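One simple way to approach changed data capture is to compare the previous extract with the current one. The sketch below assumes that approach, which the slide does not prescribe; the function names, the `key` parameter, and the row layout (dictionaries of strings) are hypothetical.

```python
import hashlib
from typing import Dict, List

Row = Dict[str, str]


def row_digest(row: Row) -> str:
    """Hash a row's fields in a stable order so equal rows give equal digests."""
    payload = "\x1f".join(f"{name}={row[name]}" for name in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def capture_changes(previous: List[Row], current: List[Row], key: str) -> Dict[str, List[Row]]:
    """Classify rows as inserts, updates, or deletes by comparing two snapshots."""
    prev = {row[key]: row_digest(row) for row in previous}
    curr = {row[key]: row for row in current}
    inserts = [row for k, row in curr.items() if k not in prev]
    updates = [row for k, row in curr.items() if k in prev and prev[k] != row_digest(row)]
    # Deleted rows no longer exist in the source, so only their keys are reported.
    deletes = [{key: k} for k in prev if k not in curr]
    return {"insert": inserts, "update": updates, "delete": deletes}
```

Snapshot comparison keeps the extract independent of source-side triggers or logs, at the cost of reading the full source each run; log-based capture is the usual alternative when that cost is too high.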
Transformation Stage: heavy transformations (lookups, validations, remapping, business rules); heavy auditing; post-transform delta-processing.
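A minimal sketch of how remapping, validation, lookup, and auditing might sit together in one transform step. Everything named here (`FIELD_REMAP`, `COUNTRY_LOOKUP`, the field names, the audit counter names) is a hypothetical example, not taken from the talk.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical lookup table and field mapping, for illustration only.
COUNTRY_LOOKUP = {"US": "United States", "DE": "Germany"}
FIELD_REMAP = {"cust_nm": "customer_name", "ctry": "country_code"}

Row = Dict[str, str]


def transform(rows: Iterable[Row]) -> Tuple[List[Row], List[Row], Counter]:
    """Return (good rows, rejected raw rows, audit counts)."""
    good: List[Row] = []
    rejects: List[Row] = []
    audit: Counter = Counter()
    for raw in rows:
        audit["read"] += 1
        # Remapping: rename source fields to target names and trim whitespace.
        row = {FIELD_REMAP.get(name, name): value.strip() for name, value in raw.items()}
        # Validation: a customer name is required.
        if not row.get("customer_name"):
            audit["rejected_missing_name"] += 1
            rejects.append(raw)
            continue
        # Lookup: resolve the country code against the reference table.
        country = COUNTRY_LOOKUP.get(row.get("country_code", ""))
        if country is None:
            audit["rejected_unknown_country"] += 1
            rejects.append(raw)
            continue
        row["country_name"] = country
        good.append(row)
        audit["written"] += 1
    return good, rejects, audit
```

The audit counts are designed to balance: rows read should equal rows written plus rows rejected, which gives a cheap check that nothing was silently dropped.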
Load Stage: speed vs concurrency; double-duty as backup/recovery; auditing; delta-processing; insert vs insert-update vs replace.
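The three load modes named on this slide can be contrasted in a few lines. This is a sketch under stated assumptions: it uses Python's standard `sqlite3` module as a stand-in target, and the `customer(id, name)` table and the `load` function are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[int, str]
INSERT_SQL = "INSERT INTO customer (id, name) VALUES (?, ?)"


def load(conn: sqlite3.Connection, rows: Iterable[Row], mode: str) -> None:
    """Load rows into the hypothetical customer(id, name) table in one transaction."""
    rows = list(rows)
    with conn:  # commits on success, rolls back if any statement fails
        if mode == "insert":
            # Fails on a duplicate key: suited to append-only feeds.
            conn.executemany(INSERT_SQL, rows)
        elif mode == "insert-update":
            # Update the row if the key exists, otherwise insert it.
            for row_id, name in rows:
                cur = conn.execute("UPDATE customer SET name = ? WHERE id = ?", (name, row_id))
                if cur.rowcount == 0:
                    conn.execute(INSERT_SQL, (row_id, name))
        elif mode == "replace":
            # Throw away the old contents and reload everything.
            conn.execute("DELETE FROM customer")
            conn.executemany(INSERT_SQL, rows)
        else:
            raise ValueError(f"unknown load mode: {mode}")
```

Replace is the simplest to reason about and to rerun after a failure; insert-update handles deltas but does more work per row, which is one face of the speed vs concurrency trade-off.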
File Transportation Process: an autonomous utility allows components to “fire & forget”; like rsync, but with pre & post actions for renaming or moving files; automatically retries failures; commodity interface. Alternative: network file system.
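A rough sketch of what such a transport utility might do for a single file, assuming a local or mounted destination. The `transport` function, its parameters, and the `.tmp` / `.sent` suffixes are hypothetical choices made for illustration; a real utility would more likely wrap rsync or a similar tool.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def transport(
    src: Path,
    dest_dir: Path,
    retries: int = 3,
    delay: float = 1.0,
    copy: Callable[[Path, Path], object] = shutil.copy2,
) -> Path:
    """Copy `src` into `dest_dir`, retrying on failure, then rename both ends."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    dest_dir.mkdir(parents=True, exist_ok=True)
    temp = dest_dir / (src.name + ".tmp")
    final = dest_dir / src.name
    for attempt in range(1, retries + 1):
        try:
            # Copy under a temporary name so consumers never see a partial file.
            copy(src, temp)
            break
        except OSError:
            if attempt == retries:
                raise
            time.sleep(delay)
    # Post-action at the destination: publish the finished file under its real name.
    temp.replace(final)
    # Post-action at the source: mark the file as sent so it is not picked up again.
    src.rename(src.with_name(src.name + ".sent"))
    return final
```

The caller only drops a file and moves on; retries and renames stay inside the utility, which is the "fire & forget" property the slide describes.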
Metadata: active documentation that drives integration and automation. May include the audit subsystem: process-audit results; rule-audit results.
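To make "active documentation" concrete, the sketch below lets a small metadata description drive validation and produce rule-audit counts. The metadata layout (`FEED_METADATA`), the type names, and the result keys are all hypothetical; the talk does not specify a format.

```python
from typing import Dict, Iterable

# Hypothetical feed description: it documents the feed AND drives the checks below.
FEED_METADATA = {
    "feed": "customer",
    "fields": [
        {"name": "id", "type": "int", "required": True},
        {"name": "email", "type": "str", "required": False},
    ],
}

CASTS = {"int": int, "str": str}


def audit_rows(metadata: dict, rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Apply the rules declared in `metadata` and return rule-audit counts."""
    results = {"rows": 0, "missing_required": 0, "bad_type": 0}
    for row in rows:
        results["rows"] += 1
        for field in metadata["fields"]:
            value = row.get(field["name"], "")
            if value == "":
                if field["required"]:
                    results["missing_required"] += 1
                continue
            try:
                # Type rule: the value must be convertible to the declared type.
                CASTS[field["type"]](value)
            except ValueError:
                results["bad_type"] += 1
    return results
```

Because the rules live in data rather than code, adding a field or tightening a rule for one feed is a metadata change, which matters when there are many feeds.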
ID Generation: especially important for relational databases. Options: reuse source ids (don't do this); assigned in database; assigned in ETL. Consider recoverability.
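For the "assigned in ETL" option, one possible shape is a key map from source (natural) keys to generated ids that can be saved and restored, so a rerun after a failure hands out the same ids. The `SurrogateKeyGenerator` class and its methods are hypothetical names used only for this sketch.

```python
from typing import Dict, Optional


class SurrogateKeyGenerator:
    """Assign stable integer ids to natural keys inside the ETL process."""

    def __init__(self, existing: Optional[Dict[str, int]] = None) -> None:
        # Seed from a saved key map so a restarted run continues where it left off.
        self._keys: Dict[str, int] = dict(existing or {})
        self._next = max(self._keys.values(), default=0) + 1

    def get(self, natural_key: str) -> int:
        """Return the id for `natural_key`, assigning a new one on first sight."""
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]

    def snapshot(self) -> Dict[str, int]:
        """Return a copy of the key map, suitable for persisting between runs."""
        return dict(self._keys)
```

Persisting the snapshot alongside each load is one way to address the recoverability concern: the key map, not the database sequence, becomes the record of which id belongs to which source key.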
Recap. These problems will happen: productivity issues; data quality problems; dependability issues. Unless you avoid these mistakes: non-linear scalability; magical thinking; wrong consistency vs adaptability trade-offs. And pick the right architecture: an ETL tool for a large number of simple feeds in the corporate world; a custom solution for a small number of complex feeds in the startup world, if you have a great team. Think carefully about the grey area in between. Finally, stick to a consistent extract, transform, load breakdown whenever possible, for best maintenance and adaptability.
Thanks to the following: Old Shoe: http://www.flickr.com/photos/freefoto/3037402633/sizes/o/in/photostream/ ; Cabling Nightmare: http://www.flickr.com/photos/alq666/2248613780/sizes/z/in/photostream/ ; Auto Repair: http://www.flickr.com/photos/michelhrv/2545226437/sizes/z/in/photostream/ ; Chain: http://www.flickr.com/photos/pratanti/5359581911/sizes/z/in/photostream/ ; Assembly Line: http://www.flickr.com/photos/gblakeley/5583120966/sizes/z/in/photostream/ ; Cottingley Fairies: http://www.google.com/imgres?q=fairies+cottingley&hl=en&gbv=2&biw=1280&bih=575&tbm=isch&tbnid=yTeYdw-LVEELiM:&imgrefurl=http://www.unmuseum.org/fairies.htm&docid=DIOvMkBlWc67GM&w=401&h=267&ei=Unh8Tt_AH4ro0QHNmOQH&zoom=1&iact=hc&vpx=968&vpy=227&dur=7451&hovh=183&hovw=275&tx=152&ty=69&page=1&tbnh=159&tbnw=218&start=0&ndsp=11&ved=1t:429,r:4,s:0 ; Scale: http://www.flickr.com/photos/sugarcubevintage/5663606146/ ; Giant Gnome: http://www.flickr.com/photos/tobyleah/2897960941/sizes/z/in/photostream/ ; Ticket: http://www.flickr.com/photos/rrrrred/2686239220/sizes/z/in/photostream/
Ken Farmer has twenty years of experience in delivering innovations through data logistics: the unglamorous part of data science involved in acquiring, standardizing, validating, transforming, and integrating vast amounts of data, and in making that data available and accessible. Ken is a senior data architect at IBM, where he leads their security & compliance data warehouse. Prior to this role, Ken consulted on search engines and data warehouses in the insurance, telecom, entertainment, and retail industries.