Introduction to DHTC Brian Lin OSG Software Team University of Wisconsin - Madison
Local High Throughput Computing resources: local UW - Madison compute
How do you get more computing resources?
#1: Buy Hardware. Great for specific hardware/privacy requirements, but costs $$$: initial cost, maintenance, management, power and cooling, rack/floor space, and obsolescence. Plan for peak usage, pay for all usage. Delivery and installation take time.
#2: Use the Cloud - Pay per cycle. Amazon Web Services, Google Compute Engine, Microsoft Azure, etc. Fast spin-up, but costs $$$ and still needs expertise + management (easier than in the past with the condor_annex tool). Does payment fit with your institutional or grant policies?
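To make that concrete, a minimal sketch of requesting cloud execute nodes with condor_annex; the instance count and annex name here are illustrative, and AWS credentials must already be configured:

    # Ask for 10 cloud instances that join your pool as execute slots
    condor_annex -count 10 -annex-name MyAnnex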
#2: Use the Cloud - 'Managed' clouds. Cycle Computing, Globus Genomics: pay someone to manage your cloud resources; still costs $$$. Researchers and industry have used this to great success, e.g., "Using Docker, HTCondor, and AWS for EDA Model Development"; "Optimizations in running large-scale Genomics workloads in Globus Genomics using HTCondor"; "HTCondor in the enterprise"; "HTCondor at Cycle Computing: Better Answers. Faster."
#3: Share Resources - Distributed HTC University of Chicago University of Nebraska - Lincoln UW - Madison
Manual Job Split: obtain login access, query each cluster for idle resources, then split and submit jobs based on resource availability. Photo by Denys Nevozhai on Unsplash
Manual Job Split - Shortcomings. Fewer logins = fewer potential resources; more logins = more account management. Why would they give you accounts? Are your friends going to want CHTC accounts? Not all clusters use HTCondor; others run different job schedulers (e.g., Slurm, PBS). Querying clusters and splitting jobs is tedious and inaccurate.
Automatic Job Split - Shortcomings. Homer: "Kids, there's three ways to do things: the right way, the wrong way, and the Max Power way!" Bart: "Isn't that the wrong way?" Homer: "Yeah, but faster!" Groening, M. (Writer), Michels, P. (Director). (1999). Homer to the Max [Television series episode]. In Scully, M. (Executive Producer), The Simpsons. Los Angeles, CA: Gracie Films.
Automatic Partitions - Shortcomings Source: https://xkcd.com/1319/
#3: Share Resources - Requirements: minimal account management; no job splitting; HTCondor only!; no resource sharing requirements.
The OSG Model OSG Submit and CM OSG Cluster
The OSG Model Pilot Jobs OSG Submit and CM OSG Cluster
Job Matching: On a regular basis, the central manager reviews job and machine attributes and matches jobs to slots. [Diagram: one submit node, a central manager, and several execute nodes]
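As a hedged sketch of what the central manager matches on: a job's submit file declares resource requests and requirements, which are compared against each slot's advertised machine attributes (the values below are illustrative):

    # Fragment of an HTCondor submit file: the central manager matches
    # these job attributes against machine (slot) ClassAds
    request_cpus   = 1
    request_memory = 1GB
    requirements   = (OpSys == "LINUX")
    queue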
The OSG Model - Jobs in Jobs Photo Credit: Shereen M, Untitled, Flickr https://www.flickr.com/photos/shereen84/2511071028/ (CC BY-NC-ND 2.0)
The OSG Model - Details. Pilot jobs (or pilots) are special jobs. Pilots are sent to sites with idle resources. Pilot payload = HTCondor execute node software. The pilot's execute node reports back to your OSG pool. Pilots lease resources: the lease expires after a set amount of time or lack of demand, and leases can be revoked!
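One way to watch this happen, assuming you have a login on the pool's submit host: condor_status lists the slots currently reporting to your central manager, so slots added by pilots appear there as pilots start up at remote sites:

    # Slots that pilots have added to your pool show up here
    condor_status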
#3: Share Resources - Requirements Minimal account management: only one submit server No job splitting: only one HTCondor pool HTCondor only: pilots report back as HTCondor slots, you’ll be using an HTCondor submit host No resource sharing requirements: the OSG doesn’t require that users “pay into” the OSG
The OSG Model - Collection of Pools Your OSG pool is just one of many Separate pools for each Virtual Organization (VO) Your jobs will run in the OSG VO pool Photo by Martin Sanchez on Unsplash
The OSG Model - Getting Access. During the school: learn and training submit hosts (exercises). After the school: learn.chtc.wisc.edu for 1 year, training.osgconnect.net for 1 month. Also: register for OSG Connect, or use an institution-hosted or VO-hosted submit node.
Quick Break: Questions?
Pilot jobs are awesome! Photo by Zachary Nelson on Unsplash
What’s the Catch? Requires more infrastructure, software, set-up, management, troubleshooting...
“You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” - Leslie Lamport
#1: Heterogeneous Resources. Accounting for differences between the OSG and your local cluster
Sites of the OSG Source: http://display.opensciencegrid.org/
Heterogeneous Resources - Software Different operating systems (Red Hat, CentOS, Scientific Linux; versions 6 and 7) Varying software versions (e.g., at least Python 2.6) Varying software availability (e.g., no BLAST*) Solution: Make your jobs more portable: OASIS, containers, etc (more in talks later this week)
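For example, a hedged submit-file fragment that steers jobs toward one OS major version instead of assuming it (OpSysMajorVer is a standard machine attribute in recent HTCondor versions; the value 7 is illustrative):

    # Only match slots running a version 7 operating system
    requirements = (OpSysMajorVer == 7)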
Heterogeneous Resources - Hardware. CPU: mostly single core. RAM: mostly < 8 GB. GPU: limited numbers, but more being added. Disk: no shared file system (more in Thursday's talks). Solution: split up your workflow to make your jobs more high throughput
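A minimal sketch of a submit file sized for these constraints, with explicit file transfer since there is no shared file system (file names are illustrative):

    # Request modest, single-core resources and transfer files explicitly
    request_cpus   = 1
    request_memory = 2GB
    request_disk   = 2GB
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = my_input.tar.gz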
#2: With Great Power Comes Great Responsibility How to be a good netizen
Resources You Don’t Own Primary resource owners can kick you off for any reason No local system administrator relationships No sensitive data! Photo by Nathan Dumlao on Unsplash
Be a Good Netizen! Use of shared resources is a privilege Only use the resources that you request Be nice to your submit nodes Solution: Test jobs on local resources with condor_submit -i
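For example (the submit file name is illustrative), condor_submit -i starts an interactive job on a local slot and drops you into a shell inside the job sandbox, where you can try your commands by hand before submitting at scale:

    # Interactive test job: same environment your real jobs will see
    condor_submit -i test.sub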
#3: Slower Ramp Up. Leasing resources takes time!
Slower Ramp Up Adding slots: pilot process in the OSG vs slots already in your local pool Not a lot of time (~minutes) compared to most job runtimes (~hours) Small trade-off for increased availability Tip: If your jobs only run for < 10min each, consider combining them so each job runs for at least 30min
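A hedged sketch of that tip, assuming a hypothetical run_analysis executable: a wrapper script runs several short tasks back-to-back inside one job, so the job outlives the pilot start-up cost:

    #!/bin/bash
    # Run several short tasks in a single job (task names illustrative)
    for task in task_001 task_002 task_003; do
        ./run_analysis "$task"
    done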
Job Robustification. Test small, test often. Specify output, error, and log files, at least while you develop your workflow. In your own code: self-checkpointing; defensive troubleshooting (hostname, ls -l, pwd, condor_version in your wrapper script); simple logging (e.g., print, echo).
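Putting those tips together, a hedged sketch of a defensive wrapper script (my_science_code and its input file are hypothetical placeholders; everything echoed here lands in the files named by output and error in your submit file):

    #!/bin/bash
    # Simple logging plus defensive troubleshooting information
    echo "Host: $(hostname)"
    echo "Working directory: $(pwd)"
    condor_version          # which HTCondor is running here?
    ls -l                   # did my input files arrive?
    ./my_science_code input.dat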
Hands-On. Questions? Dynamic pool demo! Exercises 4.1-4.3: Submitting jobs in the OSG. Exercises 4.4-4.5: Identifying differences in the OSG. Remember, if you don't finish, that's OK! You can make up work later or during evenings, if you'd like.