HTCondor workflows at Utility Supercomputing Scale: How?
Ian D. Alderman, Cycle Computing

Case Studies
- This is a presentation of some work we've recently completed with customers.
- Focus on two particular customer workflows:
  - Part 1: 10,000-core cluster "Tanuki."
  - Part 2: Easily running the same workflow "locally" and in the cloud, including data synchronization.
- We will discuss the technical challenges.

Case 1: 10,000 Cores "Tanuki"
- Run time = 8 hours
- 1.14 compute-years of computing executed every hour
- Cluster time = 80,000 hours = 9.1 compute-years
- Total run-time cost = ~$8,500
- 1250 c1.xlarge EC2 instances (8 cores / 7 GB RAM each)
- 10,000 cores, 8.75 TB RAM, 2 PB of disk space
- Would weigh in at number 75 on the Top 500 supercomputing list
- Cost to run = ~$1,060/hour
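The headline numbers above are internally consistent; a quick sketch to check the arithmetic (all figures taken from the slide itself, nothing here is measured):

```python
# Sanity-check the "Tanuki" cluster arithmetic from the slide.
instances = 1250
cores_per_instance = 8          # c1.xlarge
ram_gb_per_instance = 7
run_hours = 8
cost_per_hour = 1060            # ~$/hour, from the slide

cores = instances * cores_per_instance            # 10,000 cores
ram_tb = instances * ram_gb_per_instance / 1000   # 8.75 TB
core_hours = cores * run_hours                    # 80,000 compute-hours
compute_years = core_hours / (24 * 365)           # ~9.1 compute-years
years_per_hour = compute_years / run_hours        # ~1.14 compute-years/hour
total_cost = cost_per_hour * run_hours            # $8,480, i.e. the slide's ~$8,500

print(cores, ram_tb, core_hours,
      round(compute_years, 1), round(years_per_hour, 2), total_cost)
```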

Customer Goals
- Genentech: "Examine how proteins bind to each other in research that may lead to medical treatments."
- Customer wants to test the scalability of CycleCloud: "Can we run 10,000 jobs at once?"
- The same workflow would take weeks or months on existing internal infrastructure.

Customer Goals (cont.)
- They can't get answers to their scientific questions quickly enough on their internal cluster.
- Need to know "What do we do next?"
- Genentech wanted to find out what Cycle can do:
  - How much compute time can we get simultaneously?
  - How much will it cost?

Run Timeline
- 12:35 – 10,000 jobs submitted; requests for batches of cores are initiated
- 12:45 – 2,000 cores acquired
- 1:18 – 10,000 cores acquired
- 9:15 – Cluster shut down

System Components
- Condor (& workflow)
- Chef
- CycleCloud custom CentOS AMIs
- CycleCloud.com
- AWS

Technical Challenges: AWS
- Rare problems:
  - If a particular problem occurs 0.4% of the time and you run 1254 instances, it will happen about 5 times on average.
- Rare problem examples:
  - DOA instances (unreachable).
  - Disk problems: can't mount the "ephemeral" disk.
- Need chunking: "thundering herd" both for us and for AWS.
- Almost certain to fill up availability zones.
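The point about rare problems is worth making concrete: at this scale a 0.4% failure is not rare at all. A back-of-envelope sketch, assuming failures are independent across instances:

```python
# Failure math for the slide's numbers (independence assumed).
p = 0.004   # a problem that occurs 0.4% of the time, per instance
n = 1254    # instances in a run of this size

expected = p * n                     # expected occurrences per run
p_at_least_one = 1 - (1 - p) ** n    # chance of seeing it at least once

print(round(expected, 2))        # 5.02 occurrences on average
print(round(p_at_least_one, 3))  # ~0.993: essentially guaranteed
```

So any problem class that happens at all must be detected and handled automatically, not treated as an exceptional case.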

Technical Challenges: Chef
- Chef works by periodic client pulls.
- If all the pulls occur at the same time, the server runs out of resources.
- That's exactly what happens in our infrastructure when you start 1250 machines at once; machines do a 'fast converge' initially.
- So we need to use beefy servers, configure them appropriately, and then stagger launches at a rate the server can handle.
- This was the bottleneck in the system.
- Future work: pre-stage/cache Chef results locally to reduce initial impact on the server.
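For the steady-state pulls (as opposed to the initial fast converge), chef-client's own `interval` and `splay` settings implement the standard fix: jitter each node's check-in so the fleet spreads out instead of arriving as one thundering herd. A minimal sketch of the idea (illustrative values, not the deployment's actual settings):

```python
import random

def next_pull_delay(interval_s=1800, splay_s=300):
    """Seconds until this node's next chef-client pull: a fixed interval
    plus a random splay, so a 1250-node fleet spreads its check-ins
    across a window instead of hitting the server simultaneously."""
    return interval_s + random.uniform(0, splay_s)

delays = [next_pull_delay() for _ in range(1250)]
# Every pull lands somewhere in a 5-minute window after the base interval.
print(min(delays) >= 1800 and max(delays) <= 2100)  # True
```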

Technical Challenges: CycleCloud
- Implemented monitoring and detection of classes of rare problems.
- Batching of requests, with delays between successive requests.
  - Testing: is it better to request 256 instances every 5 minutes or 128 every 2.5 minutes? What's the best rate to make requests?
- Set chunk size to 256 EC2 instances at a time:
  - Did not overwhelm the AWS/CycleCloud/Chef infrastructure.
  - 2048 cores got the job work stream running immediately.
  - 1250 EC2 requests launched in 25 minutes (12:35 am – 12:56 am Eastern Time).
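The batching question is easy to make concrete. A small sketch (a hypothetical helper, not CycleCloud's actual code) of the two candidate schedules, which sustain the same average rate but differ in burst size:

```python
def launch_schedule(total, chunk, interval_s):
    """(time_offset_seconds, count) pairs for launching `total`
    instances in chunks of `chunk` every `interval_s` seconds."""
    schedule, t, remaining = [], 0, total
    while remaining > 0:
        n = min(chunk, remaining)
        schedule.append((t, n))
        remaining -= n
        t += interval_s
    return schedule

# The two policies under test: same average rate, different burst size.
big_bursts = launch_schedule(1250, 256, 300)     # 256 every 5 minutes
small_bursts = launch_schedule(1250, 128, 150)   # 128 every 2.5 minutes

print(len(big_bursts), big_bursts[-1])      # 5 batches, last starts at t=1200s
print(len(small_bursts), small_bursts[-1])  # 10 batches, last starts at t=1350s
```

Either way the full fleet is requested in roughly 20-25 minutes; the difference is how hard each burst hits AWS, CycleCloud, and Chef at once.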

Technical Challenges: Images + FS
- Images not affected by the scale of this run.
- Filesystem: large NFS server.
  - We didn't have problems with it on this run (we were worried!).
- Future work: parallel filesystem in the cloud.

Technical Challenges: Condor …

- Basic configuration changes:
  - See condor-wiki.cs.wisc.edu
  - Make sure the user job log isn't on NFS.
- Really a workflow problem…
- Used 3 scheduler instances (could have used one):
  - 1 main scheduler
  - 2 auxiliary schedulers
- Worked very well and handled the queue of 10,000 jobs just fine:
  - Scheduler instance 1: 3333 jobs
  - Scheduler instance 2: 3333 jobs
  - Scheduler instance 3: 3334 jobs
- Very nice, steady, and even stream of Condor job distribution (see graph).
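The 3333/3333/3334 split above is just a near-even partition of the queue; a sketch of the bookkeeping:

```python
def split_jobs(n_jobs, n_schedulers):
    """Near-even partition of a job queue across scheduler instances;
    any remainder lands on the last scheduler."""
    base = n_jobs // n_schedulers
    counts = [base] * n_schedulers
    counts[-1] += n_jobs - base * n_schedulers
    return counts

print(split_jobs(10000, 3))  # [3333, 3333, 3334], as on the slide
```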

So how did it go?
- Set up the cluster using the CycleCloud.com web interface.
- All the customer had to do was submit jobs, go to bed, and check results in the morning.

Cores used in Condor

Cores Used over Cluster Lifetime

Condor Queue

Case 2: Synchronized Cloud Overflow
- Insurance companies often have periodic (quarterly) compute demand spikes: they want results ASAP, but don't want to pay for hardware that sits idle the other 11+ weeks of the quarter.
- Replicate the internal filesystem to the cloud.
- Run jobs (on Windows).
- Replicate results back.

Customer Goals
- Step outside the fixed cluster size/speed trade-off.
- Replicate the internal Condor/Windows pipeline using a shared filesystem.
- Make it look the same to the user: a minor UI change to run jobs in the cloud vs. locally.
- Security policy constraints: only outgoing connections to the cloud.
- Continue to monitor job progress using CycleServer.

System Components
- Condor (& workflow)
- CycleServer
- Chef
- CycleCloud custom Windows AMIs / filesystem

Technical Challenges: Images and File System
- Customized Windows images (Condor).
  - Init scripts were a particular problem.
  - Machines reboot to get their hostname.
- Configured SAMBA on the Linux Condor scheduler.

Technical Challenges: Chef
- Implemented support for a number of Chef features that were missing on Windows.
- Lots of special-case recipes because of the platform differences.
- The first restart caused problems.

Technical Challenges: Condor
- DAGMan submission has three stages:
  - Ensure input gets transferred.
  - Submit the actual work (compute-hours).
  - Transfer results.
- Cross-platform submission.
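The three stages map naturally onto a DAGMan file: a transfer-in node, the work nodes, and a transfer-out node that only runs once all work completes. A sketch that generates such a file (node and submit-file names are hypothetical, not the customer's actual ones):

```python
def three_stage_dag(n_work_jobs):
    """Emit DAGMan text for transfer-in -> work -> transfer-out.
    DAGMan runs transfer_in first, then the work nodes, and only
    starts transfer_out once every work node has finished."""
    lines = ["JOB transfer_in transfer_in.sub"]
    lines += [f"JOB work_{i} work_{i}.sub" for i in range(n_work_jobs)]
    lines.append("JOB transfer_out transfer_out.sub")
    for i in range(n_work_jobs):
        lines.append(f"PARENT transfer_in CHILD work_{i}")
        lines.append(f"PARENT work_{i} CHILD transfer_out")
    return "\n".join(lines)

print(three_stage_dag(2))
```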

Technical Challenges: CycleServer
- CycleServer is used:
  - As a job submission point for cloud jobs.
  - To monitor the progress of jobs.
  - To monitor other components (collectl/Ganglia).
  - To coordinate file synchronization on both ends.
- Custom plugins:
  - DAG jobs initiate file synchronization and wait until it completes.
  - Plugins push/pull from local to remote due to the firewall.

Cloud vs. On-demand Clusters

Cloud cluster, actions taken to provision:
- Button pushed on website
- Duration to start: 45 minutes
- Copying software: minutes
- Copying data: minutes to hours
- Ready for jobs

Physical cluster, actions taken to provision:
- Budgeting
- Evaluating vendors
- Picking hardware options
- Procurement process (4-5 months)
- Waiting for servers to ship
- Getting datacenter space
- Upgrading power/cooling
- Unpacking, stacking, racking the servers
- Plugging in networking equipment and storage
- Installing images/software
- Testing, burn-in, and network addresses
- Ready for jobs