Scientific Computing on Amazon Web Services Dave Cuthbert Solutions Architect
Two Facets (That I’ll Mention Today)
Facet 1: Availability of scientific applications
General purpose analysis: Python (SciPy, NumPy, IPython notebooks), Octave, R, …; C, C++, Fortran, …
Databases/data formats: NetCDF, HDF, …; Cassandra, MongoDB, CouchDB, Redis, Berkeley DB, …; MySQL/MariaDB, PostgreSQL, …
Commercial applications are widely available. Licensing can be thorny.
Two Facets (That I’ll Mention Today)
Facet 2: Cycles
What everyone thinks: HPC.
Mental trap 1: It’s not “real” science if it’s not running on an HPC cluster.
Mental trap 2: If your lab has an HPC cluster, you should be coding for it.
So everyone demands cluster time, and…
A Typical HPC Cluster Workload
But What Is HPC, Anyway? If I wanted to start a flame war: “What is ‘real’ HPC?”
HPC Is Not A Panacea!
Diagram labels: Hadoop, GPU, Low Latency
It’s A Trap!
Facet 2: Cycles
What everyone thinks: HPC.
Mental trap 1: It’s not “real” science if it’s not running on an HPC cluster.
Mental trap 2: If your lab has an HPC cluster, you should be coding for it.
The right systems for the job.
HOW AWS IS ATTACKING THE PROBLEM
Architecture diagram:
Region: us-west-2 (Oregon)
VPC (address space: /16) with an Internet gateway
Availability Zones us-west-2a, us-west-2b, us-west-2c, each with a /24 subnet
Instances: controller and node-0 through node-11, launched from the Amazon Linux with SLURM AMI
VBL S3 bucket: scripts, code, input decks, output files, CloudFormation template
SQS queues: Work Request Queue, Work Response Queue
CloudFormation (bootstrap controller)
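The diagram names a Work Request Queue and a Work Response Queue but does not say how they are used. One common pattern, assumed here, is that a controller posts jobs to the request queue, workers pull and process them, and results come back on the response queue. The sketch below illustrates only that pattern: Python's in-process `queue.Queue` stands in for SQS, threads stand in for compute nodes, and the names `worker` and `run_jobs` are hypothetical.

```python
import queue
import threading


def worker(requests, responses, process):
    """Pull jobs from the request queue until a None sentinel arrives."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more work for this worker
            break
        job_id, payload = item
        responses.put((job_id, process(payload)))


def run_jobs(payloads, process, n_workers=4):
    """Fan jobs out to workers and collect results in submission order."""
    requests = queue.Queue()   # stand-in for the Work Request Queue
    responses = queue.Queue()  # stand-in for the Work Response Queue

    threads = [
        threading.Thread(target=worker, args=(requests, responses, process))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()

    # Tag each payload with an id so results can be matched to requests.
    for job_id, payload in enumerate(payloads):
        requests.put((job_id, payload))
    # One sentinel per worker tells it to exit once the jobs are drained.
    for _ in threads:
        requests.put(None)
    for t in threads:
        t.join()

    results = {}
    while not responses.empty():
        job_id, result = responses.get()
        results[job_id] = result
    return [results[i] for i in range(len(payloads))]
```

Results can arrive in any order, which is why each job carries an id. A real SQS-based version would also need to handle message visibility timeouts and duplicate deliveries, which this sketch ignores.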
Latency figures shown on the three-Availability-Zone architecture diagram (same layout as above):
min 229 µs, p50 239 µs, p90 258 µs, p99 280 µs, max 472 µs
min 329 µs, p50 340 µs, p90 354 µs, p99 377 µs, max 611 µs
min 1048 µs, p50 … µs, p90 … µs, p99 … µs, max 2125 µs
Architecture diagram, single Availability Zone with a placement group:
Region: us-west-2 (Oregon)
VPC (address space: /16) with an Internet gateway
Availability Zone us-west-2a with one /24 subnet
Placement Group A
Instances: controller and node-0 through node-11, launched from the Amazon Linux with SLURM AMI
VBL S3 bucket: scripts, code, input decks, output files, CloudFormation template
SQS queues: Work Request Queue, Work Response Queue
CloudFormation (bootstrap controller)
Latency figures shown on the diagram:
min 85 µs, p50 96 µs, p90 106 µs, p99 189 µs, max 233 µs
min 87 µs, p50 99 µs, p90 174 µs, p99 189 µs, max 246 µs
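The latency slides report each measurement as min, p50, p90, p99, and max. The slides do not say how those figures were computed; as a minimal sketch, assuming a list of round-trip samples in microseconds and the nearest-rank percentile definition, a summary of that shape could be produced as follows. The function names `percentile` and `summarize` are hypothetical.

```python
def percentile(ordered, p):
    """Nearest-rank percentile of an ascending list, for integer p in 1..100."""
    if not ordered:
        raise ValueError("no samples")
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples_us):
    """Return min, p50, p90, p99, and max for latency samples in microseconds."""
    ordered = sorted(samples_us)
    return {
        "min": ordered[0],
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }
```

Reporting the tail (p99 and max) next to the median matters for tightly coupled jobs, because a single slow exchange can stall every node waiting on it.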
Is AWS The Silver Bullet?
No silver bullets – Fred Brooks
Commonly heard latency number: 10 µs
Proximity to other resources might be an issue.
People-hours are more expensive than core-hours.
Enable facilities like NERSC to focus on harder problems not served (or currently served) by COTS.
THANK YOU!