Published by Brianna Cross. Modified over 6 years ago.
1
Taming Data Logistics: The Hardest Part of Data Science
Ken Farmer
February 1, 2011
2
Data Logistics is the Management of Data in Motion
While handling problems
And all of this for many, many feeds
3
Top 3 Data Logistics Problems
Dependability problems
Productivity problems
Data quality problems
4
Data Logistics is Surprisingly Difficult
It's not sexy
Lack of dramatic improvement
Few best practices
Non-intuitive challenges
Little-known tools & methods
Tools != methods
5
Data Science is not alone
Similar activities
Similar technologies
Similar challenges
Similar results
Different heritages:
  Corporate DBA heritage
  Academic CS heritage
Nothing on the Data Science side maps to ETL
6
Top 3 Fundamental Mistakes
Consistency vs adaptability (or incorrect requirements & objectives)
Non-linear scalability problems (or misunderstood dynamics)
Magical thinking (or over-estimated capabilities)
7
Architectural Decisions – Buy vs Build
Considerations include:
  Feed complexity & number (see left)
  Developer interest & skills
  Organizational culture
It's really (buy+build) vs build
Chart: risk (vertical axis) vs complexity (horizontal axis)
8
Architectural Decisions – Scheduling & Control
Spectrum: synchronous steps <-> asynchronous stations
Unit of work: batches, microbatches, streaming
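The unit-of-work choice can be illustrated with a minimal Python sketch. The function name `microbatches` and the sample feed are hypothetical, not from the slides; the point is only that one grouping routine covers the range, since a size of 1 approximates streaming and a size equal to the whole feed approximates a classic batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def microbatches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a record stream into lists of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical feed of seven records, processed three at a time.
feed = [{"id": n} for n in range(7)]
batch_sizes = [len(batch) for batch in microbatches(feed, 3)]
```

Here `batch_sizes` ends up as `[3, 3, 1]`: the last microbatch is simply whatever is left over.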
9
Stages
Extract, transform, and load are both activities AND stages
Enables deployment flexibility
Enables different tools & technology
Adds structure to process
10
Extract Stage
Get transformation-ready data
Changed data capture
Minimal transformation
Auditing
Potential colocation
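The slides do not say how changed data capture is performed. One common approach, assumed here, is to compare the previous extract with the current one by primary key. The names `capture_changes`, `yesterday`, and `today` are made up for this sketch.

```python
from typing import Dict, Tuple

Row = Tuple[str, ...]          # a row is a tuple of column values
Snapshot = Dict[str, Row]      # primary key -> row


def capture_changes(previous: Snapshot, current: Snapshot) -> Dict[str, Snapshot]:
    """Classify rows as inserted, updated, or deleted between two snapshots."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical snapshots of a small customer table.
yesterday = {"a1": ("Ann", "NY"), "b2": ("Bob", "LA"), "c3": ("Cy", "SF")}
today = {"a1": ("Ann", "NY"), "b2": ("Bob", "SD"), "d4": ("Dee", "TX")}
changes = capture_changes(yesterday, today)
```

Only the three changed rows move downstream; the unchanged row for `a1` is dropped at the extract stage, which keeps the later stages small.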
11
Transformation Stage
Heavy transformations:
  Lookups
  Validations
  Remapping
  Business rules
Heavy auditing
Post-transform delta-processing
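As a rough illustration of how these transformation activities fit together, the sketch below runs a validation, a lookup, a remapping, and one business rule over each record while keeping audit counts. Everything in it is an assumption for illustration: the field names, the `COUNTRY_CODES` lookup table, and the "large order" rule do not come from the slides.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup table: source country names -> standard codes.
COUNTRY_CODES = {"united states": "US", "canada": "CA"}


def transform(record: Dict[str, str]) -> Tuple[Optional[Dict[str, object]], List[str]]:
    """Return (transformed record or None, list of rule violations)."""
    errors: List[str] = []
    # Validation: the amount must parse as a number.
    try:
        amount = float(record.get("amt", ""))
    except ValueError:
        amount = None
        errors.append("amt is not numeric")
    # Lookup: translate the country name to a code.
    country = COUNTRY_CODES.get(record.get("country", "").strip().lower())
    if country is None:
        errors.append("unknown country")
    if errors:
        return None, errors
    # Remapping to target field names, plus a made-up business rule.
    return {"amount": amount, "country_code": country, "is_large": amount >= 1000}, errors


rows = [
    {"amt": "1500.00", "country": "Canada"},
    {"amt": "n/a", "country": "United States"},
    {"amt": "20", "country": "Atlantis"},
]
audit = {"read": 0, "written": 0, "rejected": 0}
output = []
for row in rows:
    audit["read"] += 1
    result, problems = transform(row)
    if result is None:
        audit["rejected"] += 1
    else:
        audit["written"] += 1
        output.append(result)
```

The audit counts should always reconcile (read = written + rejected), which is the kind of check a heavy-auditing step can enforce automatically.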
12
Load Stage
Speed vs concurrency
Double-duty as backup/recovery
Auditing
Delta-processing
Insert vs Insert-Update vs Replace
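The three load strategies can be compared with a small sketch using Python's built-in `sqlite3` module. The `customer` table and the `load` function are hypothetical, and "Replace" is interpreted here as emptying the table and reloading it in full, which is an assumption about what the slide means.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[int, str]


def load(conn: sqlite3.Connection, rows: Iterable[Row], mode: str) -> None:
    """Load rows into the customer table using one of three strategies."""
    if mode == "insert":
        sql = "INSERT INTO customer (id, name) VALUES (?, ?)"
    elif mode == "insert-update":
        # Existing keys are overwritten, new keys are added.
        sql = "INSERT OR REPLACE INTO customer (id, name) VALUES (?, ?)"
    elif mode == "replace":
        sql = "INSERT INTO customer (id, name) VALUES (?, ?)"
    else:
        raise ValueError(f"unknown mode: {mode}")
    with conn:  # one transaction: commit on success, roll back on error
        if mode == "replace":
            conn.execute("DELETE FROM customer")
        conn.executemany(sql, list(rows))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

load(conn, [(1, "Ann"), (2, "Bob")], "insert")
load(conn, [(2, "Robert"), (3, "Cy")], "insert-update")
after_upsert = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()

load(conn, [(9, "Zed")], "replace")
after_replace = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
```

Plain insert is the simplest but fails on a repeated key, insert-update tolerates reruns of the same delta, and a full replace doubles as a recovery path because the target can always be rebuilt from the load file.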
13
File Transportation Process
Autonomous utility allows components to “fire & forget”
Like rsync, but with pre & post actions for renaming or moving files
Automatically retries failures
Commodity interface
Alternative: network file system
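A minimal sketch of such a transport utility, assuming a local copy stands in for the network transfer. The `transport` function, its parameters, and the `mark_done` post action are hypothetical names for illustration; this is not the utility described in the talk.

```python
import shutil
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional

Action = Callable[[Path], Path]


def transport(source: Path, dest_dir: Path, retries: int = 3,
              delay_seconds: float = 0.0,
              pre: Optional[Action] = None,
              post: Optional[Action] = None) -> Path:
    """Copy a file, retrying on failure, with optional pre and post actions."""
    if pre is not None:
        source = pre(source)
    last_error: Optional[OSError] = None
    for _attempt in range(retries):
        try:
            dest_dir.mkdir(parents=True, exist_ok=True)
            target = Path(shutil.copy2(source, dest_dir / source.name))
            break
        except OSError as error:
            last_error = error
            time.sleep(delay_seconds)
    else:
        raise RuntimeError(f"gave up after {retries} attempts") from last_error
    return post(target) if post is not None else target


def mark_done(path: Path) -> Path:
    """Hypothetical post action: rename so consumers know the file is complete."""
    done = path.with_name(path.name + ".done")
    path.rename(done)
    return done


workdir = Path(tempfile.mkdtemp())
outbound = workdir / "feed.csv"
outbound.write_text("id,name\n1,Ann\n")
delivered = transport(outbound, workdir / "inbox", post=mark_done)
```

Renaming only after the copy finishes is one way to let downstream components poll the directory without ever seeing a partially written file.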
14
Metadata
Active documentation that drives:
  Integration
  Automation
May include the Audit Subsystem:
  Process-audit results
  Rule-audit results
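One way to read "active documentation" is a single feed description that both generates human-readable documentation and drives automated checks. The sketch below assumes that reading; `FEED_METADATA`, `describe`, and `rule_audit` are invented names, and the field definitions are illustrative only.

```python
from typing import Any, Dict, List

# Hypothetical feed description: the single source for docs and checks.
FEED_METADATA: Dict[str, Any] = {
    "name": "customer_feed",
    "fields": [
        {"name": "id", "type": int, "required": True, "doc": "Customer key"},
        {"name": "name", "type": str, "required": True, "doc": "Full name"},
        {"name": "region", "type": str, "required": False, "doc": "Sales region"},
    ],
}


def describe(metadata: Dict[str, Any]) -> str:
    """Render the metadata as plain-text documentation."""
    lines = [f"Feed: {metadata['name']}"]
    for field in metadata["fields"]:
        lines.append(f"{field['name']} ({field['type'].__name__}): {field['doc']}")
    return "\n".join(lines)


def rule_audit(record: Dict[str, Any], metadata: Dict[str, Any]) -> List[str]:
    """Check one record against the metadata and return rule-audit failures."""
    failures: List[str] = []
    for field in metadata["fields"]:
        value = record.get(field["name"])
        if value is None:
            if field["required"]:
                failures.append(f"{field['name']}: missing")
            continue
        if not isinstance(value, field["type"]):
            failures.append(f"{field['name']}: expected {field['type'].__name__}")
    return failures


good = rule_audit({"id": 1, "name": "Ann"}, FEED_METADATA)
bad = rule_audit({"name": 5}, FEED_METADATA)
```

Because the documentation and the checks come from the same structure, they cannot drift apart the way a separate spec document and hand-written validation code can.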
15
ID Generation
Especially important for relational databases
Options:
  Reuse source ids (don't do this)
  Assigned in database
  Assigned in ETL
Consider recoverability
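For the "assigned in ETL" option, a minimal sketch is a key map that hands out surrogate ids and persists the mapping, so that a rerun after a failure assigns the same ids again. The `KeyMap` class, the JSON file, and the source system names are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


class KeyMap:
    """Map (source system, source id) pairs to stable surrogate ids."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.ids: Dict[str, int] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get_id(self, source_system: str, source_id: str) -> int:
        key = f"{source_system}:{source_id}"
        if key not in self.ids:
            # Next id is one past the highest id assigned so far.
            self.ids[key] = max(self.ids.values(), default=0) + 1
        return self.ids[key]

    def save(self) -> None:
        self.path.write_text(json.dumps(self.ids))


store = Path(tempfile.mkdtemp()) / "keys.json"
keys = KeyMap(store)
first = keys.get_id("crm", "A-17")
second = keys.get_id("billing", "A-17")   # same source id, different system
keys.save()

# A later run reloads the map and gets the same surrogate id back.
recovered = KeyMap(store).get_id("crm", "A-17")
```

The two source systems reuse the id `A-17` for different things, which is one reason reusing source ids directly is risky; the surrogate ids keep them apart.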
16
Recap
These problems will happen:
  Productivity issues
  Data quality problems
  Dependability issues
Unless you avoid these mistakes:
  Non-linear scalability
  Magical thinking
  Wrong consistency vs adaptability trade-offs
And pick the right architecture:
  ETL tool for a large number of simple feeds in the corporate world.
  Custom solution for a small number of complex feeds in the startup world, if you have a great team.
  Think carefully about the grey area in between.
Finally, stick to a consistent extract, transform, load breakdown whenever possible, for best maintenance and adaptability.
17
Thanks to the following:
Image credits: Old Shoe, Cabling Nightmare, Auto Repair, Chain, Assembly Line, Cottingley Fairies, Scale, Giant Gnome, Ticket
18
Ken Farmer has twenty years of experience in delivering innovations through data logistics: the unglamorous part of data science involved in acquiring, standardizing, validating, transforming, and integrating vast amounts of data, and in making that data available and accessible. Ken is a senior data architect at IBM, where he leads their security & compliance data warehouse. Prior to this role, Ken consulted on search engines and data warehouses in the insurance, telecom, entertainment, and retail industries.