Published by Madeline Lewis. Modified over 5 years ago.
1
Data Pipeline Best Practices for an Increasingly Cloudy World
Adam Machanic SQL Saturday Boston 2019
2
Data Pipeline Best Practices Architecture for an Increasingly Cloudy World
3
ETL Data Pipeline Best Practices Architecture for an Increasingly Cloudy World
4
ETL Data Pipeline Best Practices Architecture for an Increasingly Cloudy World Overpriced, Underpowered Servers
5
ADAM MACHANIC A BRIEF TIMELINE
Shifted Focus to OSS Early Stuff The SQL Years Birth Discovered SQL Server SQL MVP Contact: SQL Saturday Boston
6
WHY MIGRATE TO “THE CLOUD?”
Reduce Spending
Eliminate Data Center Costs
Reduce Management Overhead
CapEx Becomes OpEx
Improve Scalability!
Decrease Deployment Time
Infrastructure As Code
Because That’s What We’re Told To Do
The CTO mandated that we migrate.
7
THE CLOUD, PROPERLY DEFINED
“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
- National Institute of Standards and Technology
8
THE 6 R’S OF CLOUD MIGRATION
RETIRE
RE-PURCHASE
RETAIN
RE-HOST (a.k.a. “lift-and-shift”) (a.k.a. “someone else’s server”)
RE-PLATFORM
REFACTOR
9
LIFT-AND-SHIFT: SERVER COST REDUCTION?
Dell PowerEdge R740xd2
16 cores (2.3 GHz), 64 GB RAM
One-Time Retail Cost: $7,803
Data Center Est.: $50-300/mo
3 years…
Maximum TCO: $18,603
Realistic* TCO: $9,842
*20% Server Discount, $100/mo Data Center

Amazon AWS EC2 m4.4xlarge
16 cores (2.4 GHz), 64 GB RAM
$0.80 /hour
$19.20 /day
$7,008 /year
$21,024 /3 years

[Chart labels: 100%, CapEx, CapEx, OpEx]
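The totals on this slide can be reproduced with a few lines of arithmetic. The sketch below uses only the prices shown on the slide; the function names are illustrative, and it assumes the EC2 instance runs around the clock for the full three years.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

def on_prem_tco(server_price, discount, data_center_per_month, years=3):
    """Discounted purchase price plus data center fees over the period."""
    return server_price * (1 - discount) + data_center_per_month * 12 * years

def cloud_tco(hourly_rate, years=3):
    """An always-on instance billed by the hour for the whole period."""
    return hourly_rate * HOURS_PER_DAY * DAYS_PER_YEAR * years

# Figures from the slide.
maximum = on_prem_tco(7803, discount=0.0, data_center_per_month=300)
realistic = on_prem_tco(7803, discount=0.20, data_center_per_month=100)
ec2 = cloud_tco(0.80)

print(f"Maximum on-prem TCO:   ${maximum:,.2f}")
print(f"Realistic on-prem TCO: ${realistic:,.2f}")
print(f"EC2 m4.4xlarge, 3 yrs: ${ec2:,.2f}")
```

Under these assumptions the re-hosted instance costs more than either on-premises estimate, which is the point the slide is making.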
10
CLOUD REFACTORING CONSIDERATIONS
Everything is a remote network service.
FAILURES HAPPEN. A LOT.
Availability is less important than scalability.
FAILURES HAPPEN. A LOT.
Availability and elasticity are more important than raw performance.
THE “OLD WAY” MIGHT SEEM SLOW.
RETRY LOGIC
PROCESS RE-ENTRY AND IDEMPOTENCY
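Retry logic and idempotency work as a pair: retries make a step survive transient failures, and idempotency makes it safe to run the step more than once. A minimal sketch follows, assuming a hypothetical `TransientError` exception and an in-memory dictionary standing in for a durable record of completed work; neither comes from the slides.

```python
import random
import time

class TransientError(Exception):
    """Stands in for a timeout, a throttle response, or a dropped connection."""

def retry(operation, attempts=5, base_delay=0.01, max_delay=1.0):
    """Call operation(); on TransientError, back off exponentially with jitter and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

processed = {}  # work item id -> result; a real pipeline would use durable storage

def process_once(item_id, payload):
    """Idempotent step: re-entering with the same id does not repeat the work."""
    if item_id not in processed:
        processed[item_id] = payload.upper()
    return processed[item_id]

calls = {"count": 0}

def flaky_step():
    """Fails twice with a simulated network error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated network failure")
    return process_once("file-001", "hello")

result = retry(flaky_step)
print(result, "after", calls["count"], "calls")
```

Because `process_once` is keyed by the work item id, a worker that crashes after finishing the step but before reporting success can simply be re-run.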
11
LINCHPIN CLOUD-NATIVE SERVICES
Queuing and Messaging
LOB Storage
Transient Computing Infrastructure
12
QUEUES FOR THE WIN: GATING AND EFFICIENCY
13
QUEUES FOR THE WIN: ROUTING, DECOUPLING, SCALE CONTROL
[Diagram labels: FILES, SERVICE; FILES, SERVICE, QUEUE]
14
QUEUES FOR THE WIN: HARDENING AND RETRY
Hardening of Work Item Information (i.e. small packets of metadata, not actual work)
Visibility and Timeouts
One-at-a-Time Delivery
Timeout? Return the Item to the Queue
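The visibility-and-timeout behavior described above can be illustrated with a small in-memory model. This is a sketch of the idea only, not any provider's API: the `VisibilityQueue` class, its method names, and the example path are all hypothetical. A received item is hidden from other consumers until it is deleted or its timeout expires, at which point it returns to the queue.

```python
import time
from collections import deque

class VisibilityQueue:
    """In-memory model of a work item queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._ready = deque()
        self._in_flight = {}  # item -> time at which it becomes visible again

    def send(self, item):
        self._ready.append(item)

    def receive(self, now=None):
        """Hand out one item and hide it until the timeout passes."""
        now = time.monotonic() if now is None else now
        self._requeue_expired(now)
        if not self._ready:
            return None
        item = self._ready.popleft()
        self._in_flight[item] = now + self.visibility_timeout
        return item

    def delete(self, item):
        """Called by a worker once the work has really finished."""
        self._in_flight.pop(item, None)

    def _requeue_expired(self, now):
        expired = [i for i, deadline in self._in_flight.items() if deadline <= now]
        for item in expired:
            del self._in_flight[item]
            self._ready.append(item)

# Work items are small metadata packets (here, a made-up path), not the data itself.
q = VisibilityQueue(visibility_timeout=30)
q.send("incoming/file-001.csv")
first = q.receive(now=0)      # worker A takes the item
hidden = q.receive(now=10)    # worker B sees nothing while A is working
retried = q.receive(now=31)   # A never deleted it, so it is delivered again
q.delete(retried)
empty = q.receive(now=100)
```

The explicit `now` argument exists only to make the example deterministic; omitting it falls back to the monotonic clock.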
15
SCALABLE STORAGE, THE LOB WAY
Optimized for Scale and Availability
NOT NECESSARILY OPTIMIZED FOR RAW PERFORMANCE
Consider: Network Latency vs. Throughput
Eventually Consistent, Maybe
16
LATENCY MATTERS! TEST: WRITE 1,000,000,000 BYTES: LOCAL VS S3
WRITES x SIZE               LOCAL   S3
10,000 x 100,000 BYTES      5s      890s
1,000 x 1,000,000 BYTES     5s      112s
100 x 10,000,000 BYTES      5s      32s
10 x 100,000,000 BYTES      5s      13s
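One way to read the S3 column is as a fixed per-request cost plus a fixed cost to move the bytes. The sketch below fits that two-parameter model to the first and last rows of the test. The linear model is an assumption made here for illustration, not something stated in the talk, and the function name is hypothetical.

```python
def fit_latency_model(n_a, t_a, n_b, t_b):
    """Solve t = n * per_request + transfer for both unknowns from two measurements."""
    per_request = (t_a - t_b) / (n_a - n_b)
    transfer = t_b - n_b * per_request
    return per_request, transfer

# Serial S3 timings from the slide: (number of writes, seconds).
measurements = [(10_000, 890), (1_000, 112), (100, 32), (10, 13)]

per_request, transfer = fit_latency_model(10_000, 890, 10, 13)
print(f"per-request overhead: {per_request * 1000:.0f} ms")
print(f"time to move 1,000,000,000 bytes: {transfer:.1f} s")

for n, measured in measurements:
    predicted = n * per_request + transfer
    print(f"{n:>6} writes: measured {measured:>3} s, model {predicted:6.1f} s")
```

The fit gives a per-request overhead of a little under 90 ms. It underestimates the two middle rows, so treat it as a rough guide: with many small objects, request latency dominates the total rather than bandwidth.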
17
… BUT LOB STORAGE CAN SCALE! 1,000,000,000 BYTES USING 64 THREADS
WRITES x SIZE               LOCAL   S3
10,000 x 100,000 BYTES      11s     23s
1,000 x 1,000,000 BYTES     6s      7s
100 x 10,000,000 BYTES      6s      5s
10 x 100,000,000 BYTES      8s      2s
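The speedup comes from overlapping the waiting: while one request is in flight, other threads issue theirs. Here is a minimal sketch using a thread pool, with `time.sleep` standing in for the remote call. The `put_object` function and the latency value are made up for illustration and are not the code used for the benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY = 0.01  # seconds per request; an arbitrary value, not a measurement

def put_object(key, data):
    """Stand-in for a remote write whose cost is mostly waiting on the network."""
    time.sleep(SIMULATED_LATENCY)
    return key, len(data)

def write_serial(objects):
    return [put_object(key, data) for key, data in objects]

def write_parallel(objects, threads=64):
    # pool.map returns results in input order, so both paths produce the same list.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda kv: put_object(*kv), objects))

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

objects = [(f"part-{i:04d}", b"x" * 1000) for i in range(64)]
serial_result, serial_seconds = timed(write_serial, objects)
parallel_result, parallel_seconds = timed(write_parallel, objects)
print(f"serial: {serial_seconds:.2f} s, 64 threads: {parallel_seconds:.2f} s")
```

Threads suit this workload because the time is spent waiting on I/O, not computing; a real client would also need to handle throttling once concurrency gets high enough.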
18
LOB REFACTOR: STORAGE COST REDUCTION? 50 TB OF DATA
Pure Storage 57.2 TB FlashArray
One-Time Retail Cost: $349,700 + Data Center Costs

Amazon S3, 50 TB
Storage ($0.023/GB/month): $1,150/month; $13,800/year; $41,400/3 years
Transfer (10 TB/month, US East) ($0.01/GB): $100/month; $1,200/year; $3,600/3 years
Operations
Writes (1,000,000/month) ($0.005/1000): $5.00/month; $60.00/year; $180/3 years
Reads (10,000,000/month) ($0.0004/1000): $4.00/month; $48.00/year; $144/3 years
Total (3 years): $45,324
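The S3 total can be checked by adding up the four monthly line items and multiplying by 36. The sketch below uses the unit prices from the slide and assumes decimal terabytes (1 TB = 1,000 GB), which is what reproduces the slide's numbers; the variable names are illustrative.

```python
GB_PER_TB = 1000   # decimal terabytes; assumed because it matches the slide's figures
MONTHS = 36        # three years

monthly = {
    "storage": 50 * GB_PER_TB * 0.023,       # 50 TB at $0.023 per GB per month
    "transfer": 10 * GB_PER_TB * 0.01,       # 10 TB per month at $0.01 per GB
    "writes": 1_000_000 / 1000 * 0.005,      # $0.005 per 1,000 write requests
    "reads": 10_000_000 / 1000 * 0.0004,     # $0.0004 per 1,000 read requests
}

for name, cost in monthly.items():
    print(f"{name:<9} ${cost:>9,.2f}/month  ${cost * MONTHS:>10,.2f}/3 years")

three_year_total = sum(monthly.values()) * MONTHS
flash_array_price = 349_700  # one-time retail cost from the slide, before data center costs
print(f"S3 total over 3 years: ${three_year_total:,.2f}")
print(f"FlashArray purchase:   ${flash_array_price:,.2f}")
```

Storage accounts for most of the bill; request charges are negligible at these volumes.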
19
TRANSIENT COMPUTING
Resources Appear on Demand
Resources Disappear When Demand Ends
Server-Oriented or “Serverless”
SERVERLESS RESOURCES
No Server to Manage
4x-10x More Expensive Per Cycle
Can be Slower Than Server-Oriented Resources
20
LEGACY MONOLITH ELT ARCHITECTURE
BEGIN TRANSACTION;
UPDATE DIM TABLE 1;
UPDATE DIM TABLE 2;
UPDATE DIM TABLE 3;
UPDATE FACT TABLE 1;
UPDATE FACT TABLE 2;
…
COMMIT;
[Diagram labels: INTEGRATION SERVER; Basic Transformation; More Transformation; File Watcher; File Store; Staging Database; Destination Database]
21
LET’S REFACTOR!
ACTUALLY SAVE MONEY!
ACTUALLY BENEFIT FROM CLOUD SERVICES!
MAKE IT SCALE!
22
HOW TO SCALE?
Cloud Services Appear to Have Endless Resources
REALITY: They Have Way More Servers Than You (But Not Infinite)
Throttling
Queuing
Eventual Consistency
23
TO SCALE WE MUST BUILD ON SCALE
Local Hard Drive
Network Attached Storage Device
Cloud Provider LOB Storage
An Average Database
Cloud Provider Queue
Cloud Provider Serverless Offering
24
A CLOUD-NATIVE PIPELINE TEMPLATE: CHEAP, SCALABLE, AND (RELATIVELY) FAIL-SAFE
Initial File Container Destination Once Per Target Set Serverless Event Trigger Initial Work Item Queue Transient Worker(s) Intermediate Results File Container Transient Worker Secondary Transformation Work Item Queue Basic Transformation File Container Serverless Event Trigger
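Only the labels of this diagram survive, so the flow below is one plausible reading of it rather than the author's design: a file lands in a container, a trigger turns that event into a small work item on a queue, a transient worker applies the basic transformation and writes an intermediate file, and the same pattern repeats for the secondary transformation. Everything here is an in-memory stand-in (dictionaries for containers, `queue.Queue` for the provider's queue), and all function names and the sample file are hypothetical.

```python
import queue

initial_files = {}                 # initial file container
intermediate_files = {}            # intermediate results file container
destination = {}                   # destination store
initial_queue = queue.Queue()      # initial work item queue
secondary_queue = queue.Queue()    # secondary transformation work item queue

def on_file_created(container_name, key, work_queue):
    """Event trigger: enqueue a small metadata packet, never the file contents."""
    work_queue.put({"container": container_name, "key": key})

def upload(key, text):
    initial_files[key] = text
    on_file_created("initial", key, initial_queue)

def basic_worker():
    """Transient worker for the basic transformation; drains the queue, then exits."""
    while not initial_queue.empty():
        item = initial_queue.get()
        lines = initial_files[item["key"]].splitlines()
        rows = [line.strip().lower() for line in lines if line.strip()]
        # Output is keyed by the source file, so a re-run overwrites instead of duplicating.
        intermediate_files[item["key"]] = rows
        on_file_created("intermediate", item["key"], secondary_queue)

def secondary_worker():
    """Transient worker for the secondary transformation and the load to the destination."""
    while not secondary_queue.empty():
        item = secondary_queue.get()
        rows = intermediate_files[item["key"]]
        destination[item["key"]] = {"row_count": len(rows), "rows": sorted(rows)}

upload("orders-001.txt", "B\nA\n\n")
basic_worker()
secondary_worker()
print(destination)
```

Each stage communicates only through storage and queues, so a stage can be scaled or restarted without touching the others.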
25
SUMMARY
Re-hosting in the Cloud is Probably a Waste of Time and Money (Don’t Tell Your CTO)
Refactoring in the Cloud Brings a Variety of Benefits
Building on Highly Scalable Components Yields a Highly Scalable End Result