Data Pipeline Best Practices for an Increasingly Cloudy World

1 Data Pipeline Best Practices for an Increasingly Cloudy World
Adam Machanic SQL Saturday Boston 2019

2 Data Pipeline Best Practices Architecture for an Increasingly Cloudy World

3 ETL Data Pipeline Best Practices Architecture for an Increasingly Cloudy World

4 ETL Data Pipeline Best Practices Architecture for an Increasingly Cloudy World Overpriced, Underpowered Servers

5 ADAM MACHANIC: A BRIEF TIMELINE
Shifted Focus to OSS
Early Stuff
The SQL Years
Birth
Discovered SQL Server
SQL MVP
Contact:

6 WHY MIGRATE TO “THE CLOUD”?
Reduce Spending
Eliminate Data Center Costs
Reduce Management Overhead
CapEx Becomes OpEx
Improve Scalability!
Decrease Deployment Time
Infrastructure As Code
Because That’s What We’re Told To Do
The CTO mandated that we migrate.

7 THE CLOUD, PROPERLY DEFINED
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. - National Institute of Standards and Technology

8 THE 6 R’S OF CLOUD MIGRATION
RETIRE
RE-PURCHASE
RETAIN
RE-HOST (a.k.a. “lift-and-shift”) (a.k.a. “someone else’s server”)
RE-PLATFORM
REFACTOR

9 LIFT-AND-SHIFT: SERVER COST REDUCTION?
Dell PowerEdge R740xd2: 16 cores (2.3 GHz), 64 GB RAM
One-Time Retail Cost: $7,803
Data Center Est.: $50-300/mo
3 years…
Maximum TCO: $18,603
Realistic* TCO: $9,842
*20% Server Discount, $100/mo Data Center
Amazon AWS EC2 m4.4xlarge: 16 cores (2.4 GHz), 64 GB RAM
$0.80/hour; $19.20/day; $7,008/year; $21,024/3 years
100% CapEx CapEx OpEx
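
The three-year totals on this slide can be reproduced with a few lines of arithmetic. The sketch below uses only the prices quoted on the slide. The function names are illustrative, and it assumes the "maximum" case pairs full retail price with the top of the data center range ($300/mo) and that the EC2 instance runs around the clock.

```python
MONTHS = 36  # three-year horizon used on the slide

def on_prem_tco(server_price, discount, data_center_per_month, months=MONTHS):
    """One-time server cost (after any discount) plus monthly data center fees."""
    return server_price * (1 - discount) + data_center_per_month * months

def ec2_tco(hourly_rate, years=3):
    """On-demand cost of an instance left running 24x7, using 365-day years."""
    return hourly_rate * 24 * 365 * years

maximum = on_prem_tco(7803, discount=0.0, data_center_per_month=300)
realistic = on_prem_tco(7803, discount=0.20, data_center_per_month=100)
cloud = ec2_tco(0.80)

print(f"Maximum on-prem TCO:   ${maximum:,.0f}")    # $18,603
print(f"Realistic on-prem TCO: ${realistic:,.0f}")  # $9,842
print(f"EC2 m4.4xlarge TCO:    ${cloud:,.0f}")      # $21,024
```

Under these assumptions the re-hosted instance costs more than even the worst-case on-premises estimate, which is the point the slide's question mark is making.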

10 CLOUD REFACTORING CONSIDERATIONS
Everything is a remote network service.
FAILURES HAPPEN. A LOT.
Availability is less important than scalability.
Availability and elasticity are more important than raw performance.
THE “OLD WAY” MIGHT SEEM SLOW.
RETRY LOGIC
PROCESS RE-ENTRY AND IDEMPOTENCY
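
Retry logic and idempotency tend to go together: if a call is retried, or a whole process is re-entered after a crash, the same work item may arrive twice. The following is a minimal sketch of both ideas, not code from the talk. `TransientError`, `call_with_retry`, and `IdempotentSink` are hypothetical names, and the flaky service is simulated.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout or throttling response from a remote service."""

def call_with_retry(operation, max_attempts=5, base_delay=0.01):
    """Run operation(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Back off 1x, 2x, 4x, ... the base delay, scaled by random jitter
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))

class IdempotentSink:
    """Applies each work item at most once, keyed by item id, so re-entry is safe."""
    def __init__(self):
        self.rows = {}

    def apply(self, item_id, payload):
        if item_id in self.rows:
            return False  # already applied; replaying is a no-op
        self.rows[item_id] = payload
        return True

# Demo: the simulated service fails twice, then succeeds on the third attempt.
attempts = {"count": 0}
sink = IdempotentSink()

def flaky_write():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise TransientError("simulated network failure")
    return sink.apply("file-001", {"rows": 42})

first = call_with_retry(flaky_write)           # True: applied after two retries
second = sink.apply("file-001", {"rows": 42})  # False: replay is ignored
```

Keying writes by a stable item id is one common way to get idempotency; an upsert or a keyed overwrite in the destination achieves the same effect.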

11 LINCHPIN CLOUD-NATIVE SERVICES
Queuing and Messaging
LOB Storage
Transient Computing Infrastructure

12 QUEUES FOR THE WIN: GATING AND EFFICIENCY

13 QUEUES FOR THE WIN: ROUTING, DECOUPLING, SCALE CONTROL
[Diagram labels: FILES, SERVICE, QUEUE]

14 QUEUES FOR THE WIN: HARDENING AND RETRY
Hardening of Work Item Information (i.e. small packets of metadata, not actual work)
Visibility and Timeouts
One-at-a-Time Delivery
Timeout? Return the Item to the Queue
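
The visibility-timeout behaviour described on this slide can be illustrated with a toy in-memory queue. This is a sketch of the general mechanism only, not any provider's API. `VisibilityQueue` and its methods are hypothetical, and time is passed in explicitly as `now` so the example is deterministic.

```python
import itertools

class VisibilityQueue:
    """Toy queue: a received item is hidden until it is deleted or its timeout expires."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._visible = []    # work items waiting for a consumer
        self._in_flight = {}  # receipt id -> (item, deadline)
        self._receipts = itertools.count(1)

    def send(self, item):
        self._visible.append(item)

    def receive(self, now):
        """Hand one item to one consumer and hide it from everyone else."""
        self._expire(now)
        if not self._visible:
            return None
        item = self._visible.pop(0)
        receipt = next(self._receipts)
        self._in_flight[receipt] = (item, now + self.visibility_timeout)
        return receipt, item

    def delete(self, receipt, now):
        """Acknowledge completed work; False means the receipt expired first."""
        self._expire(now)
        return self._in_flight.pop(receipt, None) is not None

    def _expire(self, now):
        # Timeout? Return the item to the queue.
        for receipt, (item, deadline) in list(self._in_flight.items()):
            if now >= deadline:
                del self._in_flight[receipt]
                self._visible.append(item)

q = VisibilityQueue(visibility_timeout=30)
q.send({"file": "incoming/a.csv"})  # metadata about the work, not the work itself

r1, item = q.receive(now=0)        # worker 1 takes the item
hidden = q.receive(now=10)         # None: nobody else can see it yet
r2, again = q.receive(now=31)      # worker 1 timed out, so the item came back
stale_ack = q.delete(r1, now=32)   # False: worker 1's receipt is no longer valid
acked = q.delete(r2, now=35)       # True: worker 2 finished in time
empty = q.receive(now=100)         # None: the item is gone for good
```

Because a slow worker's item can be redelivered to another worker, the work itself still needs to be idempotent.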

15 SCALABLE STORAGE, THE LOB WAY
Optimized for Scale and Availability
NOT NECESSARILY OPTIMIZED FOR RAW PERFORMANCE
Consider: Network Latency vs. Throughput
Eventually Consistent, Maybe

16 LATENCY MATTERS! TEST: WRITE 1,000,000,000 BYTES: LOCAL VS S3
WRITES                     LOCAL   S3
10,000 x 100,000 BYTES     5s      890s
1,000 x 1,000,000 BYTES    5s      112s
100 x 10,000,000 BYTES     5s      32s
10 x 100,000,000 BYTES     5s      13s
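
To see why latency dominates here, it helps to convert the S3 timings into seconds per object and effective throughput. The sketch below works only from the timings quoted on this slide. The variable and function names are illustrative.

```python
TOTAL_BYTES = 1_000_000_000

# (object count, object size in bytes, local seconds, S3 seconds), as quoted on the slide
results = [
    (10_000, 100_000, 5, 890),
    (1_000, 1_000_000, 5, 112),
    (100, 10_000_000, 5, 32),
    (10, 100_000_000, 5, 13),
]

def s3_summary(count, size, local_s, s3_s):
    """Derive per-object time, throughput, and slowdown for one test case."""
    assert count * size == TOTAL_BYTES  # every case writes the same total volume
    return {
        "seconds_per_object": s3_s / count,
        "mb_per_second": TOTAL_BYTES / s3_s / 1_000_000,
        "slowdown_vs_local": s3_s / local_s,
    }

summaries = [s3_summary(*row) for row in results]

for (count, size, _, _), s in zip(results, summaries):
    print(
        f"{count:>6,} x {size:>11,} B: "
        f"{s['seconds_per_object']:.3f} s/object, "
        f"{s['mb_per_second']:.1f} MB/s, "
        f"{s['slowdown_vs_local']:.1f}x slower than local"
    )
```

With 100,000-byte objects each write takes roughly 0.09 seconds no matter how little data it carries, so throughput is only about 1 MB/s. With 100,000,000-byte objects the same total volume moves at roughly 77 MB/s.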

17 … BUT LOB STORAGE CAN SCALE! 1,000,000,000 BYTES USING 64 THREADS
WRITES                     LOCAL   S3
10,000 x 100,000 BYTES     11s     23s
1,000 x 1,000,000 BYTES    6s      7s
100 x 10,000,000 BYTES     6s      5s
10 x 100,000,000 BYTES     8s      2s
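
The reason threads help is that each write spends most of its time waiting on the network, so many requests can be in flight at once. The sketch below shows the pattern with Python's standard `concurrent.futures.ThreadPoolExecutor`. `put_object` is a hypothetical stand-in that only sleeps, and the 10 ms latency is an assumed value rather than a measurement, so the timings it produces say nothing about real S3 performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY = 0.01  # seconds per request; assumed for illustration only

def put_object(key, data):
    """Hypothetical stand-in for an object-store write; it only simulates latency."""
    time.sleep(SIMULATED_LATENCY)
    return key, len(data)

def write_all(objects, threads):
    """Write every (key, data) pair using a pool of worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        written = list(pool.map(lambda kv: put_object(*kv), objects))
    return written, time.perf_counter() - start

objects = [(f"part-{i:05d}", b"x" * 1_000) for i in range(128)]

_, serial_seconds = write_all(objects, threads=1)
written, parallel_seconds = write_all(objects, threads=64)

print(f"1 thread:   {serial_seconds:.2f}s")
print(f"64 threads: {parallel_seconds:.2f}s")
```

A real client would replace `put_object` with the provider SDK's upload call and add the retry handling discussed earlier, since a highly parallel writer is also more likely to be throttled.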

18 LOB REFACTOR: STORAGE COST REDUCTION? 50 TB OF DATA
Pure Storage 57.2 TB FlashArray
One-Time Retail Cost: $349,700 + Data Center Costs
Amazon S3, 50 TB
Storage ($0.023/GB/month): $1,150/month; $13,800/year; $41,400/3 years
Transfer (10 TB/month, US East) ($0.01/GB/month): $100/month; $1,200/year; $3,600/3 years
Operations
Writes (1,000,000/month) ($0.005/1000): $5.00/month; $60.00/year; $180/3 years
Reads (10,000,000/month) ($0.0004/1000): $4.00/month; $48.00/year; $144/3 years
Total: $45,324
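
The S3 total can be checked by summing the four line items over 36 months. This sketch uses only the unit prices and volumes quoted on the slide. It assumes decimal terabytes (1 TB = 1,000 GB), which is what reproduces the slide's figures.

```python
MONTHS = 36
GB_PER_TB = 1_000  # decimal terabytes; an assumption that matches the slide's numbers

monthly = {
    "storage": 50 * GB_PER_TB * 0.023,      # 50 TB stored at $0.023 per GB-month
    "transfer": 10 * GB_PER_TB * 0.01,      # 10 TB transferred at $0.01 per GB
    "writes": 1_000_000 / 1_000 * 0.005,    # $0.005 per 1,000 write requests
    "reads": 10_000_000 / 1_000 * 0.0004,   # $0.0004 per 1,000 read requests
}

three_year = {name: cost * MONTHS for name, cost in monthly.items()}
total = sum(three_year.values())

for name, cost in three_year.items():
    print(f"{name:<9} ${monthly[name]:>9,.2f}/month  ${cost:>10,.2f}/3 years")
print(f"{'total':<9} ${total:,.2f}")
```

Storage dominates the bill and request charges are almost negligible, so the three-year S3 figure is roughly an eighth of the array's retail price before data center costs are added.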

19 TRANSIENT COMPUTING
Resources Appear on Demand
Resources Disappear When Demand Ends
Server-Oriented or “Serverless”
SERVERLESS RESOURCES
No Server to Manage
4x-10x More Expensive Per Cycle
Can be Slower Than Server-Oriented Resources

20 LEGACY MONOLITH ELT ARCHITECTURE
BEGIN TRANSACTION;
UPDATE DIM TABLE 1;
UPDATE DIM TABLE 2;
UPDATE DIM TABLE 3;
UPDATE FACT TABLE 1;
UPDATE FACT TABLE 2;
COMMIT;
[Diagram labels: INTEGRATION SERVER, Basic Transformation, More Transformation, File Watcher, File Store, Staging Database, Destination Database]

21 LET’S REFACTOR!
ACTUALLY SAVE MONEY!
ACTUALLY BENEFIT FROM CLOUD SERVICES!
MAKE IT SCALE!

22 HOW TO SCALE?
Cloud Services Appear to Have Endless Resources
REALITY: They Have Way More Servers Than You (But Not Infinite)
Throttling
Queuing
Eventual Consistency

23 TO SCALE WE MUST BUILD ON SCALE
Local Hard Drive
Network Attached Storage Device
Cloud Provider LOB Storage
An Average Database
Cloud Provider Queue
Cloud Provider Serverless Offering

24 A CLOUD-NATIVE PIPELINE TEMPLATE: CHEAP, SCALABLE, AND (RELATIVELY) FAIL-SAFE
[Diagram labels: Initial File Container; Destination; Once Per Target Set; Serverless Event Trigger; Initial Work Item Queue; Transient Worker(s); Intermediate Results File Container; Transient Worker; Secondary Transformation Work Item Queue; Basic Transformation File Container; Serverless Event Trigger]
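
The diagram itself did not survive in this transcript, so the sketch below is one assumed reading of how the labelled parts connect: a file lands in a container, an event trigger enqueues a small work item, a transient worker performs the basic transformation and writes an intermediate file, and a second trigger, queue, and worker load the destination. Everything is simulated in memory. `Container`, `basic_worker`, and `secondary_worker` are hypothetical names, and the transformations are placeholders.

```python
from collections import deque

class Container:
    """Toy file container that fires a trigger whenever an object is written."""
    def __init__(self, on_write=None):
        self.objects = {}
        self.on_write = on_write

    def put(self, key, data):
        self.objects[key] = data
        if self.on_write:
            self.on_write(key)

initial_queue = deque()    # initial work item queue
secondary_queue = deque()  # secondary transformation work item queue
destination = {}           # stands in for the destination store

# Triggers enqueue small work items (keys), never the file contents themselves.
initial = Container(on_write=lambda key: initial_queue.append({"key": key}))
intermediate = Container(on_write=lambda key: secondary_queue.append({"key": key}))

def basic_worker():
    """Transient worker: basic transformation of each queued raw file."""
    while initial_queue:
        item = initial_queue.popleft()
        lines = initial.objects[item["key"]].splitlines()
        cleaned = [line.strip().upper() for line in lines if line.strip()]
        intermediate.put(item["key"], cleaned)  # fires the second trigger

def secondary_worker():
    """Transient worker: secondary transformation, then load the destination."""
    while secondary_queue:
        item = secondary_queue.popleft()
        # Keyed write: reprocessing an item overwrites instead of duplicating
        destination[item["key"]] = len(intermediate.objects[item["key"]])

initial.put("sales/2019-01.csv", "a,1\n\n b,2 \n")
initial.put("sales/2019-02.csv", "c,3\n")
basic_worker()
secondary_worker()

print(destination)  # {'sales/2019-01.csv': 2, 'sales/2019-02.csv': 1}
```

In a real deployment each in-memory piece would be replaced by a managed service, and the queues would supply the visibility timeouts and retries that make the template relatively fail-safe.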

25 SUMMARY
Re-hosting in the Cloud is Probably a Waste of Time and Money (Don’t Tell Your CTO)
Refactoring in the Cloud Brings a Variety of Benefits
Building on Highly Scalable Components Yields a Highly Scalable End Result