HTCondor at Cycle Computing: Better Answers. Faster. Ben Cotton Senior Support Engineer.

We believe utility access to technical computing power accelerates discovery & invention

The Innovation Bottleneck: Scientists/Engineers forced to size their work to the infrastructure their organization bought

Our slogan Better Answers. Faster. We want our customers to get the resources they need when they need them

Better Answers…

Measure woody biomass across the southern Sahara at cm scale A NASA project in partnership with –Intel –Amazon Web Services –Cycle Computing

Project goals Estimate carbon stored in trees and bushes in the arid and semi-arid southern Sahara Establish a baseline for future CO2 studies of the region

The input data Images collected from satellites ~ 20 terabytes total

The workflow Pleasantly parallel Each task takes minutes 0.5 million CPU hours total

The workflow Tasks have two parts –Orthorectification and cloud detection –Feature detection Each task uses 2-20 GB of RAM

AWS setup Spot instances C3 and M3 instance families Data staged into S3

Job Submission DataMan uploads data from local Lustre filer to S3 When transfers complete, DataMan creates a record in CycleCloud CycleCloud batches records and builds HTCondor submit files
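The batching step above can be sketched in a few lines. DataMan and CycleCloud are proprietary, so the record format, bucket name, batch size, and executable name here are assumptions for illustration, not the actual product API:

```python
# Minimal sketch: group completed-transfer records into batches and
# render one HTCondor submit file per batch. All names (process_image.sh,
# project-bucket, the record dicts) are hypothetical.

def batch_records(records, size=100):
    """Group completed-transfer records into fixed-size batches."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def make_submit_file(batch):
    """Render one HTCondor submit file for a batch of records."""
    lines = [
        "universe       = vanilla",
        "executable     = process_image.sh",  # assumed task wrapper
        "request_memory = 20GB",              # tasks use 2-20 GB of RAM
    ]
    for record in batch:
        lines.append(f"arguments = s3://project-bucket/{record['key']}")
        lines.append("queue")
    return "\n".join(lines) + "\n"
```

One submit file per batch keeps the scheduler's queue manageable while still expressing each image as its own job.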

Job Submission Easy for the scientist

What’s next? Proof-of-concept is wrapping up Operational project expected to take approximately 1 month

Faster….

Improve hard drive design HGST runs an in-house drive-head simulation suite An in-house Grid Engine cluster runs the simulations in 30 days (~620K compute hours)

We can make this faster! On Wednesday: “Hey, guys! Can we have this done by this weekend?”

We can make this faster! Un-batch the sweeps: 1.1M jobs with 5-10 minute per-job runtimes

Enter the cloud Used 10 AWS availability zones, spanning 3 regions Spot instances from the m3, c3, and r3 families

Pool setup One pool per availability zone Two schedulers per pool

How we did it CycleCloud autoscaled multiple instance types and multiple availability zones CycleServer spread jobs across multiple schedulers/pools based on load Used Amazon S3 instead of a shared filer

HTCondor configuration Very little! NEGOTIATOR_CYCLE_DELAY and NEGOTIATOR_INTERVAL set to 1 CLAIM_WORKLIFE set to 1 hour *_QUERY_WORKERS set to 10

HTCondor configuration SHADOW_WORKLIFE set to 1 hour JOB_START_COUNT set to 100 Disabled authentication
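Taken together, the tuning on these two slides amounts to a configuration fragment along these lines. This is a sketch: the slides give the values, while the seconds units for the worklife knobs, the daemon-specific spellings of `*_QUERY_WORKERS`, and the authentication setting are my assumptions:

```
NEGOTIATOR_CYCLE_DELAY  = 1
NEGOTIATOR_INTERVAL     = 1
CLAIM_WORKLIFE          = 3600   # 1 hour, in seconds
SHADOW_WORKLIFE         = 3600   # 1 hour, in seconds
JOB_START_COUNT         = 100
SCHEDD_QUERY_WORKERS    = 10
COLLECTOR_QUERY_WORKERS = 10
# "Disabled authentication" presumably means something like:
SEC_DEFAULT_AUTHENTICATION = NEVER
```

The short negotiator cycle and large JOB_START_COUNT trade scheduling overhead for ramp-up speed, which matters when the pool must grow to tens of thousands of cores in minutes.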

We did it! Went from 0 to 50K cores in 23 minutes Peaked at ~70K cores across 5,689 instances Simulation completed in 8 hours Infrastructure cost: $5,594

Where do we go from here?

Better-er answers. Faster-er.

If you build it, they will come Large financial institution actuarial modeling –Originally just wanted to do Federal Reserve stress tests –Then month-end actuarial runs –Now regularly use 8000 cores in AWS

Coming concerns Data movement Multi-provider cloud usage Seamless burst to cloud

We write software to do this… Cycle Computing easily orchestrates workloads and data access across local and cloud technical computing –Scales to 10,000's of cores –Handles errors, reliability –Schedules data movement –Secures, encrypts and audits –Provides reporting and chargeback –Automates spot bidding –Supports enterprise operations

Does this resonate with you? We’re hiring support engineers, solutions architects, sales, etc. cyclecomputing.com