Scientific Computing at Amazon
Disruptive Innovations in Distributed Computing
Dave Ward, Principal Product Manager
Adam Gray, Senior Product Manager
Innovation #1:
42
Building your own virtual programmable datacenter
ec2-run-instances
On Demand Global Infrastructure
Programmable
Elastic
Instance Types
Standard (m1), High Memory (m2), High CPU (c1)
High Performance
“Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds.” Mike Driscoll, Metamarkets
Cluster Computing
MPI
Bandwidth Intensive
Cluster Compute Instance
cc1.4xlarge: 2 * Intel Xeon cores w/ HT, 23 GB RAM, 1.7 TB disk, HVM
linpack
Top500 rank 231, November 2010
Top500 rank 451, June 2011
New Cluster Compute Instances
cc2.8xlarge: 2 * Intel Xeon (16 cores w/ HT), 60.5 GB RAM, 3.4 TB disk, HVM
linpack
Top500 rank 42, November 2011
Innovation #2:
Lowering the cost of developing a distributed system
Case Study: Amazon’s Associates Program
Text Links Enhanced Links
how much to pay each associate?
orders c++ app bi-hourly flat files bi-hourly flat files
orders c++ app bi-hourly flat files bi-hourly flat files c++ app daily aggregations daily aggregations
orders c++ app bi-hourly flat files bi-hourly flat files c++ app daily aggregations daily aggregations c++ app to payments service…
“just one more Q4”
distributed computing
[Chart: Difficulty vs. Number of Machines]
distributed computing is hard
distributed computing requires god-like engineers
Hadoop is… The MapReduce computational paradigm
Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
Person   Start     End       Duration
Bob      00:44:48  00:45:11  23
Charlie  02:16:02  02:16:18  16
Charlie  11:16:59  11:17:17  18
Charlie  11:17:24  11:17:38  14
Bob      11:23:10  11:23:25  15
Alice    16:26:46  16:26:54  8
David    17:20:28  17:20:45  17
Alice    18:16:53  18:17:00  7
Charlie  19:33:44  19:33:59  15
Bob      21:13:32  21:13:43  11
David    22:36:22  22:36:34  12
Alice    23:42:01  23:42:11  10
map
Person   Duration
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10
reduce
Person   Duration
Alice    8
Alice    7
Alice    10
Bob      23
Bob      15
Bob      11
Charlie  16
Charlie  18
Charlie  14
Charlie  15
David    12
David    17

Person   Total
Alice    25
Bob      49
Charlie  63
David    29
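The map and reduce steps in this example can be sketched in a few lines of plain Python. This is a single-process illustration of the paradigm, not Hadoop code: the function names (map_record, shuffle, reduce_group, run_job) are hypothetical, and treating Duration as whole seconds is an assumption inferred from the table values.

```python
from collections import defaultdict
from datetime import datetime

# Input records from the slides: (person, start, end), times as HH:MM:SS.
RECORDS = [
    ("Bob", "00:44:48", "00:45:11"),
    ("Charlie", "02:16:02", "02:16:18"),
    ("Charlie", "11:16:59", "11:17:17"),
    ("Charlie", "11:17:24", "11:17:38"),
    ("Bob", "11:23:10", "11:23:25"),
    ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"),
    ("Alice", "18:16:53", "18:17:00"),
    ("Charlie", "19:33:44", "19:33:59"),
    ("Bob", "21:13:32", "21:13:43"),
    ("David", "22:36:22", "22:36:34"),
    ("Alice", "23:42:01", "23:42:11"),
]

TIME_FORMAT = "%H:%M:%S"


def map_record(record):
    """Map step: turn one (person, start, end) record into (person, duration)."""
    person, start, end = record
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return person, int(delta.total_seconds())


def shuffle(pairs):
    """Group mapped values by key and sort the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_group(person, durations):
    """Reduce step: collapse all durations for one person into a total."""
    return person, sum(durations)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = [map_record(record) for record in records]
    grouped = shuffle(mapped)
    return dict(reduce_group(person, durations) for person, durations in grouped.items())


if __name__ == "__main__":
    for person, total in run_job(RECORDS).items():
        print(person, total)
```

Each call to map_record depends on one record only, and each call to reduce_group depends on one person's values only. That independence is what lets a framework spread the same two functions across many machines.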
Hadoop is… The MapReduce computational paradigm
Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
distributed computing requires god-like engineers
distributed computing (with Hadoop) requires talented engineers
how much to pay each associate?
orders c++ app bi-hourly flat files bi-hourly flat files c++ app daily aggregations daily aggregations c++ app to payments service…
orders c++ app bi-hourly flat files bi-hourly flat files c++ app daily aggregations daily aggregations c++ app to payments service…
Person   Total
Alice    25
Bob      49
Charlie  63
David    29
Orders Filter S3 Other Services
Orders Filter S3 Hadoop Cluster
[Chart: Difficulty vs. Number of Machines]
More data? Smarter engineers.
More data? More boxes.
Hadoop lowers the cost of developing a distributed system.
What about the cost of operating a distributed system?
November traffic at amazon.com
76% 24%
Orders Filter S3 Hadoop Cluster
Amazon Elastic Compute Cloud “provides resizable compute capacity in the cloud.”
Amazon Elastic MapReduce = Amazon EC2 + Hadoop
Orders Filter S3 Hadoop Cluster
Filter S3 EMR Cluster Orders
Filter S3 Orders
Amazon EC2 lowers the cost of operating a distributed system.
Hadoop lowers the cost of developing a distributed system.
Amazon Elastic MapReduce changes the economics of data processing.
Amazon Elastic MapReduce
Managed Apache Hadoop service
Removes the muck from big data processing
Provides tight integration with AWS services
> elastic-mapreduce --create --instance-type m1.large \
    --instance-count <count> --name "My Hadoop Cluster" \
    --jar s3://elasticmapreduce/samples/cloudburst/cloudburst.jar
What is big data?
[Chart: Number of datasets vs. Dataset size. Regions labeled: fits on a single machine, Big Data, Extremely Big Data]
[Chart: Difficulty vs. Dataset size. Labels: Extremely valuable, Marginally valuable]
[Chart: Number of datasets vs. Dataset size. Region labeled: Extremely Big Data]
[Chart: Difficulty vs. Dataset size]
cheap experimentation
Innovation #3:
Lowering the cost of accessing data
Over 50 free data sets
Nearly 1 PB of free data
Stored at no cost to providers; also free access to consumers
1000 Genomes Project (110 TB)
Common Crawl Corpus (60 TB)
Sloan Digital Sky Survey (180 GB)
United States Census (200 GB)
Million Song Dataset (500 GB)
Google Books Corpus (2.2 TB)
Marvel Universe Social Graph (50 GB)
aws.amazon.com/datasets
Innovation #4: Creating a Market for Capacity
Finding Research Dollars for AWS
Educators
Up to $100 per Student in AWS Credits for intro courses
Researchers
Infrastructure Credits (EC2, S3, …)
4 Grant Review Cycles Per Year
February 10, 2012
Students
Student Organizations, Self Learning, Entrepreneurial Projects
aws.amazon.com/education
Stretching your Research Dollars (even further) on AWS
On-Demand
Reserved
Spot
Unused EC2 Capacity
Bid
July 2011
Interruption
July 2011
Manage Interruption
Grid Computing
MIT StarCluster
Harvard Medical School Lab of Personalized Medicine
Temple University Spot MPI
Elastic MapReduce
Save Time and Money
Scenario #1: Job Flow, allocate 4 instances. Duration: 14 hours
Cost without Spot: 4 instances * 14 hrs * $0.50 = $28
Scenario #2: Job Flow, add 5 Spot Instances. Duration: 7 hours
Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75. Total = $22.75
Time Savings: 50%
Cost Savings: ~19%
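The comparison can be worked through directly from the instance counts, durations, and hourly rates on the slide. This is a minimal sketch under two assumptions: the $0.25 Spot rate holds for the whole run, and no Spot instance is interrupted. The helper name job_cost is hypothetical.

```python
def job_cost(instances, hours, hourly_rate):
    """Cost of running a fixed number of instances for a fixed number of hours."""
    return instances * hours * hourly_rate


ON_DEMAND_RATE = 0.50  # dollars per instance-hour, from the slide
SPOT_RATE = 0.25       # dollars per instance-hour, from the slide (assumed constant)

# Scenario #1: 4 On-Demand instances for 14 hours.
cost_without_spot = job_cost(4, 14, ON_DEMAND_RATE)

# Scenario #2: the same 4 instances plus 5 Spot instances, finishing in 7 hours.
cost_with_spot = job_cost(4, 7, ON_DEMAND_RATE) + job_cost(5, 7, SPOT_RATE)

time_savings = 1 - 7 / 14
cost_savings = 1 - cost_with_spot / cost_without_spot

if __name__ == "__main__":
    print(f"Without Spot: ${cost_without_spot:.2f}")
    print(f"With Spot:    ${cost_with_spot:.2f}")
    print(f"Time savings: {time_savings:.0%}")
    print(f"Cost savings: {cost_savings:.2%}")
```

In practice the Spot price moves with demand, so the savings depend on the rate actually paid over the life of the job.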
Queue-Based Architecture
[Diagram: Applications, Queue, Amazon EC2 Spot, Amazon EC2 On-Demand / Reserved]
Checkpointing
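One way to picture checkpointing for interruptible work: save progress after each unit so that a replacement worker resumes where the interrupted one stopped. The sketch below simulates this locally with a JSON file. Every name in it (SpotInterruption, load_checkpoint, save_checkpoint, process, demo) is hypothetical, and a real deployment would presumably keep the checkpoint in durable storage outside the instance.

```python
import json
import os
import tempfile


class SpotInterruption(Exception):
    """Stands in for the instance being reclaimed mid-job (simulation only)."""


def load_checkpoint(path):
    """Return saved progress, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash never leaves a partial checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def process(items, path, interrupt_at=None):
    """Sum the items, checkpointing after each one; optionally simulate an interruption."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        if interrupt_at is not None and i == interrupt_at:
            raise SpotInterruption(f"interrupted before item {i}")
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)
    return state["total"]


def demo():
    """Run once with a simulated interruption, then resume from the checkpoint."""
    items = list(range(1, 11))
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "checkpoint.json")
        try:
            process(items, path, interrupt_at=6)
        except SpotInterruption:
            pass
        resumed_from = load_checkpoint(path)["next_index"]
        total = process(items, path)
    return resumed_from, total


if __name__ == "__main__":
    print(demo())
```

Checkpointing after every item keeps lost work to a minimum at the cost of more writes. A coarser interval trades the other way.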
30,000+ Cores 95,078 Instance Hours
$1,279/hour
We are Hiring!
FT/Interns: amazon.com/careers
Experienced: aws.amazon.com/jobs