Heterogeneous Grid Design and Implementation
Thesis Presentation by Jeffrey Wells
State University of New York Institute of Technology
May 7, 2008
CSC 599
Outline
Purpose
Overview
Intro to Globus Toolkit and Condor
Interoperability Experiments
Results
Conclusion
Purpose
This thesis investigates the extent to which two open source approaches to Grid computing achieve interoperability. The Globus Alliance's Globus Toolkit and the University of Wisconsin-Madison's Condor scheduler are used as the worked example of interoperability.
Overview
What is a Grid?
Condor Scheduler
Globus Toolkit
BITS Regional Grid
SUNYIT Local Grid Network
Grid Security
What is a Grid?
By the definition given by Ian Foster of the University of Chicago, a Grid is a system that:
coordinates resources that are not subject to centralized control;
uses standardized, open, general-purpose protocols and interfaces;
delivers nontrivial qualities of service.
Examples of Grids: TeraGrid (20 teraflops of computing power and 1 petabyte of storage), Access Grid (used for scheduling and conducting meetings), and eDiaMoND (used for medical research in England).
Condor Scheduler
Condor is a High Throughput Computing (HTC) system: it ties idle resources together and harnesses their spare capacity in a distributed fashion.
Condor was developed by the University of Wisconsin-Madison.
Other distributed schedulers: PBS (Portable Batch System), LSF (Load Sharing Facility), CSF (Community Scheduler Framework), SETI (Search for Extraterrestrial Intelligence).
Globus Toolkit
The Globus Toolkit is an open source software toolkit used for building Grid systems and applications.
It is under continuous development by the Globus Alliance at the University of Chicago and by many other contributors around the world.
Another Grid toolkit: the Virtual Data Toolkit (VDT).
BITS Regional Grid
[Figure: regional grid diagram; labels: bitsgw, qw.cs.sunyit.edu, Corning Community College, SUNY Geneseo, SUNYIT]
SUNYIT Local Grid Network
[Figure: local network diagram of Globus and Condor machines; labels include 405, 605, and bitsgw]
Grid Security
The Grid Security Infrastructure (GSI) uses public key cryptography as the backbone of its functionality.
The reasons behind GSI are:
to provide secure communication between the resources of a Grid;
to avoid the need for a centrally managed security system;
to allow "single sign-on" for users of the Grid, including delegation of credentials for jobs that require more than one resource and/or site.
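The delegation idea behind single sign-on can be illustrated with a toy model. The sketch below is not GSI: it performs no cryptography, and `Credential`, `delegate`, and `chain_is_valid` are hypothetical names invented for this illustration. It models only two properties of a delegated (proxy) credential, assuming that each proxy is issued by the credential before it in the chain and that a proxy may not outlive its issuer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Credential:
    subject: str
    issuer: Optional[str]  # None marks the user's own identity (root of the toy chain)
    not_after: int         # expiry time, in seconds on an arbitrary clock


def delegate(parent: Credential, lifetime: int, now: int) -> Credential:
    """Issue a short-lived proxy from `parent` (toy model: nothing is signed)."""
    expiry = min(now + lifetime, parent.not_after)  # a proxy never outlives its issuer
    return Credential(subject=parent.subject + "/CN=proxy",
                      issuer=parent.subject,
                      not_after=expiry)


def chain_is_valid(chain: List[Credential], now: int) -> bool:
    """Check issuer links and lifetimes from the user identity down to the last proxy."""
    if not chain or chain[0].issuer is not None:
        return False  # the chain must start at a user identity
    for i, cred in enumerate(chain):
        if cred.not_after <= now:
            return False  # expired link
        if i > 0:
            parent = chain[i - 1]
            if cred.issuer != parent.subject or cred.not_after > parent.not_after:
                return False  # wrong issuer, or proxy outlives its issuer
    return True


# Hypothetical user identity and two levels of delegation.
user = Credential("/O=Grid/CN=Example User", None, not_after=1_000_000)
proxy = delegate(user, lifetime=43_200, now=0)        # sign on once
job_proxy = delegate(proxy, lifetime=3_600, now=100)  # delegated to a remote job

print(chain_is_valid([user, proxy, job_proxy], now=200))
print(chain_is_valid([user, proxy, job_proxy], now=50_000))
```

The first check succeeds while every link is unexpired; the second fails because the first proxy has expired, which is why a multi-site job needs its delegated credential to remain valid for the whole run.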
SUNY Geneseo
Debian Linux cluster
Condor Execute/Submit
Services used, tested and evaluated:
GridFTP, RFT (Reliable File Transfer)
Delegation, authentication, authorization
Credential management
Grid Security Infrastructure (GSI)
Various Condor submits
Globus services
Condor Central Manager (Scheduler)
[Figure: the Central Manager exchanging "Job Request" and "ClassAd/Results" messages with a Submit/Execute machine and a Globus machine]
The Condor Central Manager (scheduler) submits jobs either to a Condor Submit/Execute machine or to a Globus machine.
Each machine "advertises" its resources to the Central Manager via a ClassAd.
The Central Manager matches resources with the requirements of each submitted job.
The Central Manager sends the executable to a remote resource that matches the requirements.
Once the job is completed, the Execute machine reports back to the Central Manager.
The Central Manager reports the final results.
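The advertise-and-match cycle described above can be sketched in a few lines. This is a simplified illustration, not Condor's matchmaker: real ClassAd matching is two-sided (machines also state requirements) and ranks candidates, whereas this sketch returns the first idle machine that satisfies the job. The `match` function, the machine names, and the sample ads are hypothetical; the attribute names (`OpSys`, `Arch`, `Memory`, `State`) follow common machine ClassAd attributes.

```python
from typing import Callable, Dict, List, Optional

MachineAd = Dict[str, object]


def match(job_requirements: Callable[[MachineAd], bool],
          machine_ads: List[MachineAd]) -> Optional[MachineAd]:
    """Return the first unclaimed machine whose ad satisfies the job, or None."""
    for ad in machine_ads:
        if ad.get("State") == "Unclaimed" and job_requirements(ad):
            return ad
    return None


# Ads as each machine might "advertise" them to the Central Manager (sample data).
ads: List[MachineAd] = [
    {"Name": "node1", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 512, "State": "Claimed"},
    {"Name": "node2", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 1024, "State": "Unclaimed"},
    {"Name": "node3", "OpSys": "WINDOWS", "Arch": "INTEL", "Memory": 2048, "State": "Unclaimed"},
]


def needs_linux_1gb(ad: MachineAd) -> bool:
    """Job requirement: a Linux machine with at least 1024 MB of memory."""
    return ad["OpSys"] == "LINUX" and ad["Memory"] >= 1024


chosen = match(needs_linux_1gb, ads)
print(chosen["Name"] if chosen else "no match")
```

Expressing the requirement as a predicate over the ad keeps the matcher independent of any particular attribute, which mirrors how a job's requirements are evaluated against whatever each machine chooses to advertise.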
Various Jobs Implemented
Condor jobs (universes): Vanilla, Standard, Java, Parallel, Grid (Globus)
Globus jobs:
Forwarded a job to a Condor machine with a scheduler.
Submitted a job from a Condor scheduler to a Globus machine (Globus job).
Forwarded jobs to other Globus machines.
Interoperability Experiments
Globus, Condor and Condor-G
Condor-G Interface
Job Examples
Condor to Globus Job Submit
Globus to Condor Job Submit
Test Scripts
Swift Workflow
Some More Test Scripts
Globus, Condor and Condor-G
[Figure: components labeled Linux Cluster, Condor Workstation Pool, Globus Services, Condor Scheduler]
Condor-G manages jobs through the resource manager of the Globus Toolkit. Results of a job passed to the Globus Toolkit are returned via the Condor-G interface.
condor_startd advertises the machine's resources and executes the job.
condor_starter spawns the remote job.
condor_shadow runs on the submit machine and acts as the resource manager for the job.
condor_master is responsible for keeping all the rest of the Condor daemons running.
condor_schedd manages the job queue and submits its jobs to remote resources.
condor_negotiator is responsible for the matchmaking.
Condor-G Interface
[Figure: components labeled Linux Cluster, Globus Services, Condor Workstation Pool, Condor-G interface]
Condor-G uses the Globus resource manager to start a job on the remote machine. It also manages the job while it runs on the remote resource. Condor-G waits for the job to complete and then returns the results.
Job Examples

Condor Job and Globus Script

======================
== Condor to Globus ==
test.submit
======================
universe = grid
executable = myscript.sh
arguments = TestJob 10
JobManager_type = Condor
grid_type = gt4
globusscheduler = es/ ManagedJobFactoryService/
log = test.log
output = test.output
error = test.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue

myscript.sh:

#!/bin/sh
# Report where the job ran, then sleep for the requested duration.
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep "$2"
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"

Condor Job and MPI Program

##########################
# Submit description file
# for /bin/hostname
# (Parallel)
##########################
universe = parallel
executable = /bin/hostname
machine_count = 2
log = parallellogfile
output = outfileMPI.$(NODE)
error = errfileMPI.$(NODE)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue

MPI Program

#include "mpi.h"
#include <stdio.h>

/* Each process prints its rank and the total number of processes. */
int main( int argc, char* argv[] )
{
    int rank, size;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "I am %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}
Condor to Globus Job Submit
[Figure: components labeled Condor (Scheduler), Condor-G, GASS Server, Gate Keeper, Job Manager, Globus Toolkit, Job]
1.) The Central Manager submits a grid job.
2.) The job passes through Condor-G to the Globus gatekeeper.
3.) The gatekeeper verifies security.
4.) The job is forwarded to the job manager.
5.) The job is processed and the result is returned to the Central Manager.
Globus to Condor Job Submission
[Figure: GRAM job submission between a Local Machine and a Remote Machine. Components labeled: GRAM Client, GASS Server, GRAM Gatekeeper, GRAM Job Manager, Batch System, Condor, GASS Client. Arrows labeled: GRAM Job Request, Creation, Job Request, Data, Callback, Grid-Proxy]
Sample Test Scripts
Perl scripts were created to test most of the functionality of the BITS Regional Grid.

Job submit from Globus to Condor:
print " \n------> Submitting a Job to Condor on Stengel < \n";
system "globusrun-ws -submit -Ft Condor -S -c /bin/date";

Job submit from Condor to Globus:
print "-----> Submitting a Condor Globus Job < \n";
system "condor_submit /home/wells/testjobs/condorjobs/globussubmits/submitGFork";
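The test scripts above print a banner and run a command but, as shown, do not inspect the command's exit status. A minimal sketch of the same pattern with a pass/fail check follows, written in Python rather than the Perl used in the thesis. The `run_check` helper is hypothetical; the `globusrun-ws` command line is copied from the slide, and the demonstration runs only the local Python interpreter so that it works on a machine without Globus or Condor installed.

```python
import shlex
import subprocess
import sys
from typing import List, Tuple


def run_check(label: str, command: List[str]) -> Tuple[str, bool]:
    """Run one test command and report whether it exited with status 0."""
    print(f"------> {label}")
    try:
        completed = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return label, False  # the tool is not installed on this machine
    return label, completed.returncode == 0


# Command line from the slide, split into an argument list (no shell involved).
GLOBUS_TO_CONDOR = shlex.split("globusrun-ws -submit -Ft Condor -S -c /bin/date")

# Stand-in commands so the sketch runs anywhere; on a Grid node, a test such as
# ("Submitting a Job to Condor", GLOBUS_TO_CONDOR) would be used instead.
demo_tests = [
    ("exit status 0", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("exit status 3", [sys.executable, "-c", "raise SystemExit(3)"]),
]

results = [run_check(label, command) for label, command in demo_tests]
for label, passed in results:
    print(f"{label}: {'PASS' if passed else 'FAIL'}")
```

Passing the command as an argument list avoids shell quoting problems, and collecting the results makes it easy to summarize a whole run of Grid tests at the end.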
Swift Workflow
Swift is a data-oriented, coarse-grained scripting language that supports dataset typing and mapping, dataset iteration, conditional branching, and sub-workflow composition.
Swift programs, also known as workflows, are written in a language called SwiftScript.
Swift handles the execution of these programs on remote sites.
Sample Test Scripts (cont.)
Swift job submit to SUNYIY3 (Geneseo):
print "\n \n";
system "swift sites.file /home/wells/testjobs/swiftjobs/sites3.xml /home/wells/testjobs/swiftjobs/first.swift";
Results
Condor.pm is malformed for job submissions from Globus to Condor: should_transfer_files = YES and when_to_transfer_output = ON_EXIT must be added to the script.
The Globus Toolkit uses -S where -s was used previously.
Mpiexe.py and mpdlib.py were modified so that WS-GRAM was able to send a distributed job to MPICH2. Thanks to Dr. Ralph Butler of Middle Tennessee State University.
Another application layer can easily be added to the Globus Toolkit.
Applications are changing and maturing faster than the documentation.
Mailing lists and groups are not always helpful and do not always respond to questions.
Documentation on the MPI-2 and Globus Toolkit connection is scarce and outdated.
Documentation on the Condor and Globus interface is outdated. This was resolved by installing Condor first and then Globus with the Condor scheduler.
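The Condor.pm finding amounts to making sure that the submit description handed to Condor contains the two file-transfer commands. The sketch below illustrates that idea on a submit description held as text. It is not the actual change made to Condor.pm (which is a Perl script), and `ensure_transfer_settings` is a hypothetical helper; it assumes the missing commands should be placed before the first `queue` statement.

```python
REQUIRED = {
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
}


def ensure_transfer_settings(submit_text: str) -> str:
    """Add the file-transfer commands to a submit description if they are absent."""
    lines = submit_text.splitlines()

    # Collect the command names already present (submit commands are case-insensitive).
    present = set()
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        present.add(stripped.split("=", 1)[0].strip().lower())

    missing = [f"{key} = {value}" for key, value in REQUIRED.items() if key not in present]
    if not missing:
        return submit_text  # nothing to add

    # Insert before the first queue statement so the settings apply to the queued job.
    for index, line in enumerate(lines):
        if line.strip().lower().startswith("queue"):
            return "\n".join(lines[:index] + missing + lines[index:]) + "\n"
    return "\n".join(lines + missing) + "\n"


original = "universe = vanilla\nexecutable = /bin/date\noutput = date.out\nqueue\n"
patched = ensure_transfer_settings(original)
print(patched)
```

Applying the function to its own output changes nothing, so it can be run safely on descriptions that already carry the settings.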
Conclusion
1. It is necessary to modify the Condor.pm script to allow the Globus Toolkit to submit jobs to the Condor scheduler.
2. It is necessary to correct Mpiexe.py and mpdlib.py for the Globus Toolkit to submit a distributed job to MPICH2.
3. Investigation found that -S is now used to submit a job to Condor, versus the -s used previously.
4. Another application layer can easily be added to the Globus Toolkit without affecting interoperability with the Condor scheduler.
5. Documentation on the MPI-2 and Globus Toolkit connection is scarce and outdated.
6. Applications are changing and maturing faster than the documentation.
References
The Globus Security Team. Globus Toolkit Version 4 Grid Security Infrastructure: A Standards Perspective. Version 4, updated September 12. Retrieved September 26, 2007 from Overview.pdf/
Tanenbaum, A. (2003). Computer Networks, Fourth Edition. New Jersey: Prentice Hall PTR.
Condor Users Manual, Version 6.8 (2007). Retrieved September 24, 2007.
Globus Toolkit Administration Manual (2007). Retrieved September 24, 2007.
Swift Users Guide (Change Revision 1700). Retrieved February 16, 2008.
Swift – Home (2007). Retrieved February 16, 2008.
Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster, Gregor von Laszewski, Ioan Raicu, Tiberiu Stef-Praun, Mike Wilde (2007). Swift: Fast, Reliable, Loosely Coupled Parallel Computation. Retrieved March 2, 2008.
Mausolf, J. (2005). Grid In Action: Implementation SOA and Web Services In Grid. (2005, August 09). Retrieved September 24, 2007.
Foster, I. (2002). What is a Grid? A Three Point Checklist. Argonne National Laboratory & University of Chicago. Retrieved September 2, 2007.
Globus Alliance. Overview of the Grid Security Infrastructure. Globus Toolkit. Retrieved May 6, 2008.
Noel, C. (2007). What is a Grid? CETIC's Tentative Definition. Retrieved September 6, 2007.