What’s Different About Overlay Systems?

What’s Different About Overlay Systems? Brian Lin OSG Software Team University of Wisconsin - Madison

Overlay Systems are Awesome! Photo by Zachary Nelson on Unsplash

What’s the Catch? Requires more infrastructure, software, set-up, management, troubleshooting...

“You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” - Leslie Lamport

#1: Heterogeneous Resources Accounting for differences between the OSG and your local cluster

Sites of the OSG Source: http://display.opensciencegrid.org/

Heterogeneous Resources - Software
- Different operating systems (Red Hat, CentOS, Scientific Linux; versions 6 and 7)
- Varying software versions (e.g., at least Python 2.6)
- Varying software availability (e.g., no BLAST*)
Solution: Make your jobs more portable: OASIS, containers, etc. (more in Wednesday's talks)

Heterogeneous Resources - Hardware
- CPU: mostly single core
- RAM: mostly < 8 GB
- GPU: limited numbers, but more are being added
- Disk: no shared file system (more in Thursday's talks)
Solution: Split up your workflow to make your jobs more high-throughput

#2: With Great Power Comes Great Responsibility How to be a good netizen

Resources You Don't Own
- Primary resource owners can kick you off for any reason
- No local system administrator relationships
- No sensitive data (again)!
Photo by Nathan Dumlao on Unsplash

Be a Good Netizen!
- Use of shared resources is a privilege
- Only use the resources that you request
- Be nice to your submit nodes
Solution: Test jobs on local resources with condor_submit -i (see the sketch below)
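A minimal sketch of such an interactive test (test.sub is a hypothetical submit file describing one of your jobs' resource requests; it is not named in the slides):

  $ condor_submit -i test.sub
  # an interactive shell opens on a matching execute slot; try your
  # commands there by hand, then exit before submitting at full scale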

#3: Slower Ramp Up Leasing resources takes time!

Slower Ramp Up
- Adding slots: pilot process in the OSG vs. slots already in your local pool
- Not a lot of time (~minutes) compared to most job runtimes (~hours)
- Small trade-off for increased availability
Tip: If your jobs only run for < 10 min each, consider combining them so each job runs for at least 30 min

Robustify Your Jobs: Succeeding in the face of failure

Job Robustification
- Test small, test often
- Specify output, error, and log files, at least while you develop your workflow
- Use on_exit_hold to catch different failure modes:
  on_exit_hold = (ExitCode =?= 3)
  on_exit_hold = (time() - JobCurrentStartDate < 1 * $(HOUR))
- For jobs that run too long:
  periodic_hold = (time() - JobCurrentStartDate > 4 * $(HOUR))
  periodic_release = (HoldReasonCode == 3) && (NumJobStarts < 3)
- HoldReasonCode is 3 for any job where on_exit_hold or periodic_hold evaluates to True
(These expressions come together in the submit-file sketch below.)
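A minimal submit-file sketch combining the expressions above (the executable, the output/error/log file names, and the HOUR macro definition are placeholder assumptions, and the two on_exit_hold examples are joined with ||; the slides do not spell the file out this way):

  # hypothetical wrapper script and per-job output/error/log files
  executable       = my_job.sh
  output           = job_$(Cluster)_$(Process).out
  error            = job_$(Cluster)_$(Process).err
  log              = job_$(Cluster).log

  # convenience macro so the expressions below can use $(HOUR)
  HOUR             = 3600

  # hold jobs that exit with code 3 or finish suspiciously fast (< 1 hour)
  on_exit_hold     = (ExitCode =?= 3) || (time() - JobCurrentStartDate < 1 * $(HOUR))

  # hold jobs that run too long (> 4 hours)
  periodic_hold    = (time() - JobCurrentStartDate > 4 * $(HOUR))

  # release jobs held by the policies above (HoldReasonCode 3), at most 3 starts
  periodic_release = (HoldReasonCode == 3) && (NumJobStarts < 3)

  queue 1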

Job Robustification
In your own code:
- Self-checkpointing
- Different exit codes for use with on_exit_hold
- Defensive troubleshooting (hostname, ls -l, pwd, condor_version in your wrapper script)
- Add simple logging (e.g., print, echo, etc.)
A wrapper-script sketch follows below.
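A wrapper-script sketch along these lines (my_job.sh, my_analysis, and the file names are hypothetical; substitute your real payload):

  #!/bin/bash
  # my_job.sh - defensive wrapper around the real payload

  # defensive troubleshooting: record where and with what we are running
  echo "Running on host: $(hostname)"
  echo "Working directory: $(pwd)"
  ls -l
  condor_version

  # simple logging around the payload
  echo "Starting analysis at $(date)"
  ./my_analysis input.dat > results.out
  status=$?
  echo "Analysis finished at $(date) with exit code $status"

  # distinct exit codes for use with on_exit_hold
  if [ ! -s results.out ]; then
      echo "ERROR: results.out is missing or empty" >&2
      exit 3    # matches the (ExitCode =?= 3) hold expression
  fi
  exit $status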

Questions?