Condor and GridShell: How to Execute 1 Million Jobs on the TeraGrid. Jeffrey P. Gardner - PSC, Edward Walker - TACC, Miron Livny - U. Wisconsin, Todd Tannenbaum - U. Wisconsin.

Presentation transcript:

Condor and GridShell: How to Execute 1 Million Jobs on the TeraGrid. Jeffrey P. Gardner - PSC, Edward Walker - TACC, Miron Livny - U. Wisconsin, Todd Tannenbaum - U. Wisconsin, and many others!

Scientific Motivation Astronomy increasingly relies on large surveys containing hundreds of millions of objects. Analyzing these datasets frequently means performing the same analysis task on more than 100,000 objects. Each object may take several hours of computing, and the amount of computing time required may vary, sometimes dramatically, from object to object.

Solution: PBS? In theory, PBS should provide the answer: submit 100,000 single-processor PBS jobs. In practice, this does not work. TeraGrid nodes are multiprocessor, but PBS schedules only one job per node. TeraGrid machines also frequently restrict the number of jobs a single user may run. Chad might get really mad if I submitted 100,000 PBS jobs!
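
For concreteness, here is a minimal sketch of that naive approach (the script, object IDs, and walltime are hypothetical, and real TeraGrid sites would also require queue and allocation directives): one single-processor PBS job per object, submitted 100,000 times.

# analyze_one.pbs -- hypothetical single-object PBS job (sketch)
#PBS -l nodes=1                # one whole node per object, even on multiprocessor nodes
#PBS -l walltime=12:00:00      # must be generous, since per-object runtime varies widely
cd $PBS_O_WORKDIR
./analyze_object $OBJECT_ID

# Submitting it once per object -- the part that breaks per-user job limits:
for id in $(seq 1 100000); do
    qsub -v OBJECT_ID=$id analyze_one.pbs
done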

Solution: mprun? We could instead submit a single job that uses many processors. Now we have a reasonable number of PBS jobs (Chad will be happy), and scheduling priority would reflect our actual resource usage. This still has problems: each work unit takes a different amount of time to run, so fast processors sit idle waiting for the slowest one and we use resources inefficiently.
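
A minimal sketch of this alternative, assuming a hypothetical 256-processor allocation and program name (exact mprun options differ between installations): one big PBS job launches every task at once, which is exactly where the load imbalance comes from.

# analyze_all.pbs -- hypothetical "one big job" sketch
#PBS -l nodes=64:ppn=4         # 256 processors held by a single PBS job
#PBS -l walltime=12:00:00      # must cover the *slowest* work unit
cd $PBS_O_WORKDIR
# Every processor gets the same reservation, so the ones that finish their
# work unit early sit idle until the slowest one completes.
mprun -np 256 ./analyze_batch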

The Real Solution: Condor+GridShell The real solution is to submit one large PBS job, then use a private scheduler to manage serial work units within each PBS job. We can even submit large PBS jobs to multiple TeraGrid machines, then farm out serial work units as resources become available. Vocabulary: JOB (n.): a thing that is submitted via Globus or PBS. WORK UNIT (n.): an independent unit of work (usually serial), such as the analysis of a single astronomical object.

Condor Overview Condor was originally designed as a CPU-cycle harvester for workstations sitting on people’s desks. Today it schedules large numbers of jobs across a distributed, heterogeneous, and dynamic set of computational resources.

Condor: The User Experience
1. User writes a simple Condor submit script:
   # my_job.submit: A simple Condor submit script
   Universe   = vanilla
   Executable = my_program
   Queue
2. User submits the job:
   % condor_submit my_job.submit
   Submitting job(s).
   1 job(s) submitted to cluster 1.
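
The same mechanism scales to survey-sized workloads. Below is a minimal sketch (the executable and file names are hypothetical, not from the slides); $(Process) expands to 0, 1, 2, ... for each queued work unit, so one submit file describes all 100,000 of them.

# sweep.submit -- hypothetical parametric-sweep submit file (sketch)
Universe   = vanilla
Executable = analyze_object
Arguments  = $(Process)                  # work-unit index: 0 .. 99999
Output     = out/object_$(Process).out   # one output file per work unit
Error      = err/object_$(Process).err
Queue 100000                             # queue 100,000 work units at once

Submission is still a single condor_submit sweep.submit; Condor tracks the work units as one cluster with 100,000 procs.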

Condor: The User Experience
3. User watches the job run:
   % condor_q
   -- Submitter: perdita.cs.wisc.edu : ... : ...
    ID   OWNER   SUBMITTED     RUN_TIME    ST  PRI  SIZE  CMD
    1.0  Jeff    6/16 06:...   0+00:01:21  R   ...  ...   my_program
   1 jobs; 0 idle, 1 running, 0 held
   %
4. Job completes. User is happy.

Advantages of Condor The Condor user experience is simple. Condor is flexible: resources can be any mix of architectures, and they need neither a common filesystem nor common user accounting. Condor is dynamic: resources can disappear and reappear. Condor is fault-tolerant: jobs are automatically migrated to new resources if existing ones become unavailable.

Condor Daemons
condor_startd (runs on the execution node): advertises the specs and availability of the execution node via ClassAds; starts jobs on the execution node.
condor_schedd (runs on the submit node): handles job submission and tracks job status.
condor_collector (runs on the central manager): collects system information from the execution nodes.
condor_negotiator (runs on the central manager): matches schedd jobs to machines.
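
A quick way to watch these daemons from the command line (standard Condor tools, shown here without output): condor_status queries the collector for machine ClassAds, while condor_q queries the schedd for job state.

% condor_status              # machine ClassAds each startd reported to the collector
% condor_status -long        # the full ClassAd for every execution slot
% condor_q                   # jobs known to the local schedd
% condor_q -analyze 1.0      # why a given cluster.proc is (or is not) being matched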

Condor Daemon Layout (animation): the Central Manager runs the collector and negotiator, the Submission Machine runs the schedd, and the Execution Machine runs a startd.
1. The startd sends its system specifications (ClassAds) and status to the collector.
2. The user submits a Condor job; the schedd sends the job info to the negotiator.
3. The negotiator uses information from the collector to match schedd jobs to available startds.
4. The schedd sends the job to the startd on the assigned execution node.

“Personal” Condor on a TeraGrid Platform Condor daemons can be run as a normal user, and Condor’s “GlideIn”™ capability supports launching condor_startd’s on the nodes within an LSF or PBS job.
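
As a rough illustration of what a glide-in amounts to, here is a hand-rolled sketch (host names, paths, and resource requests are hypothetical; the actual condor_glidein and GridShell tooling automates these steps): a PBS job run as a normal user starts a condor_startd that reports to the user's own central manager.

# glidein.pbs -- hypothetical hand-rolled glide-in job (sketch only)
#PBS -l nodes=1:ppn=4
#PBS -l walltime=08:00:00
cd $PBS_O_WORKDIR
# _CONDOR_* environment variables override condor_config settings,
# so no root-owned configuration is required.
export _CONDOR_CONDOR_HOST=tg-login.psc.teragrid.org   # hypothetical central manager
export _CONDOR_LOCAL_DIR=$HOME/condor_local            # scratch space writable by the user
export _CONDOR_DAEMON_LIST="MASTER, STARTD"            # execute-side daemons only
# condor_master runs in the foreground and starts the startd, which advertises
# this node's processors to the collector until PBS ends the job.
./condor/sbin/condor_master -f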

“Personal” Condor on a TeraGrid Platform (Condor runs with normal user permissions): the login node can serve as both the Central Manager (collector, negotiator) and the Submission Machine (schedd), while a GlideIn PBS job runs a startd on each Execution PE.

GridShell Overview GridShell allows users to interact with distributed grid computing resources from a simple shell-like interface. It extends TCSH version 6.12 to incorporate grid-enabled features: parallel inter-script message passing and synchronization, output redirection to remote files, and parametric sweeps.

GridShell Examples
Redirecting the standard output of a command to a remote file location using GlobusFTP:
   a.out > gsiftp://tg-login.ncsa.teragrid.org/data
Message passing between 2 parallel tasks:
   if ( $_GRID_TASKID == 0 ) then
      echo "hello" > task_1
   else
      set msg=`cat < task_0`
   endif
Executing 256 instances of a job:
   a.out on 256 procs

Merging GridShell with Condor Use GridShell to launch Condor GlideIn jobs at multiple grid sites. All Condor GlideIn jobs report back to a central collector. This converts the entire TeraGrid into your own personal Condor pool!
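
From the submit host, the resulting pool behaves like any other Condor pool. A minimal sketch, assuming the hypothetical sweep.submit from earlier and standard Condor tools (no output shown):

% condor_status -format "%s\n" Machine | sort -u   # glide-in startds from PSC, SDSC, NCSA show up here
% condor_submit sweep.submit                       # queue the 100,000 work units centrally
% condor_q                                         # watch them drain across all three sites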

Merging GridShell with Condor (animation across PSC, SDSC, and NCSA):
1. The user starts a GridShell session on the PSC login node.
2. The GridShell session starts an event monitor on the remote login nodes via Globus.
3. The local event monitor starts the Condor daemons (collector, negotiator, schedd) on the PSC login node.
4. All event monitors submit Condor GlideIn PBS jobs at their sites.
5. The Condor startd’s inside those PBS jobs tell the collector that they have started.
6. The Condor schedd distributes independent work units to the compute nodes.

GridShell in a NutShell Using GridShell coupled with Condor, one can easily harness the power of the TeraGrid to process large numbers of independent work units. Scheduling is done dynamically from a central Condor queue to multiple grid sites as clusters of processors become available. All of this fits into the existing TeraGrid software.
