FESR Consorzio COMETA - Progetto PI2S2

Using MPI to Run Parallel Jobs on the Grid

Marcello Iacono Manno
Consorzio COMETA

TUTORIAL GRID PER I LABORATORI NAZIONALI DEL SUD
Catania, 26 February 2008
Outline

– Overview
– Requirements & Settings
– How to create an MPI job
– How to submit an MPI job to the Grid
Overview

Currently, parallel applications use "special" HW/SW, but parallel applications are "normal" on a Grid, and many are trivially parallelizable.
The Grid middleware already offers several parallel job types (DAG, collection).
A common solution for non-trivial parallelism is the Message Passing Interface (MPI):
– based on send() and receive() primitives
– a "master" node starts the "slave" processes by establishing SSH sessions
– all processes can share a common workspace and/or exchange data
(a minimal C sketch of the send/receive model follows)
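The following minimal MPI program is not part of the original slides; it is only a sketch of the send()/receive() model described above: rank 0 acts as the "master" and sends one integer to each "slave" rank, using the standard MPI C API.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, dest, work;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        if (rank == 0) {
            /* "master" process: send one work item to every "slave" */
            for (dest = 1; dest < size; dest++) {
                work = 100 + dest;
                MPI_Send(&work, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            }
        } else {
            /* "slave" process: receive the work item sent by rank 0 */
            MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("Process %d of %d received %d\n", rank, size, work);
        }

        MPI_Finalize();
        return 0;
    }

Such a program would be compiled with mpicc (e.g. "mpicc -o MPIparallel_exec mpi_demo.c"; file names here are hypothetical) and shipped to the CE through the InputSandbox, as in the mpi.jdl example later on.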
MPI & Grid

There are several MPI implementations, but only two of them are currently supported by the Grid middleware:
– MPICH
– MPICH2
Both "old" GigaBit Ethernet and "new" low-latency InfiniBand networks are supported:
– the COMETA infrastructure will run MPI jobs on either GigaBit (MPICH, MPICH2) or InfiniBand (MVAPICH, MVAPICH2)
Currently, MPI parallel jobs can run inside a single Computing Element (CE) only:
– several projects are involved in studies on the possibility of executing parallel jobs on Worker Nodes (WNs) belonging to different CEs
JDL (1/3)

From the user's point of view, MPI jobs are specified by:
– setting the JDL JobType attribute to MPICH, MPICH2, MVAPICH or MVAPICH2
– specifying the NodeNumber attribute as well

JobType = "MPICH";
NodeNumber = 2;

The NodeNumber attribute defines the required number of CPU cores (PEs).
Matchmaking: the Resource Broker (RB) chooses a CE (if any!) with enough free Processing Elements (PE = CPU cores), i.e. free PE# ≥ NodeNumber (otherwise the job waits).
JDL (2/3)

When these two attributes are included in a JDL script, the following expression is automatically added to the JDL Requirements expression, in order to find out the best resource where the job can be executed:

(other.GlueCEInfoTotalCPUs >= NodeNumber) &&
Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment)
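As an illustration (not from the original slides), a user-supplied Requirements expression is conceptually combined with the automatic one, so a JDL like the mpi.jdl shown later is matched roughly as follows:

    // user-written requirement in the JDL
    Requirements = other.GlueCEInfoLRMSType == "LSF";

    // what effectively reaches the matchmaker:
    // (other.GlueCEInfoTotalCPUs >= NodeNumber) &&
    // Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment) &&
    // (other.GlueCEInfoLRMSType == "LSF")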
JDL (3/3)

Executable specifies the MPI executable
NodeNumber specifies the number of cores
Arguments specifies the WN command line
– Executable + Arguments form the command line on the WN
mpi.pre.sh is a special script file that is sourced before launching the MPI executable
– warning: it runs only on the master node
– the actual mpirun command is issued by the middleware (… what if a proprietary script/bin?)
mpi.post.sh is a special script file that is sourced after the MPI executable terminates
– warning: it runs only on the master node
(a sketch of both scripts follows)
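The contents of mpi.pre.sh and mpi.post.sh are application-dependent; the slides only name the files. A purely illustrative sketch, assuming the middleware sources them on the master node as described above:

    #!/bin/sh
    # mpi.pre.sh - sourced on the master node BEFORE the middleware issues mpirun
    echo "preprocessing script"
    # e.g. set environment variables needed on the master node only
    export MYAPP_WORKDIR="$PWD"           # hypothetical variable
    # e.g. unpack input data shipped in the InputSandbox
    # tar xzf input-data.tar.gz           # hypothetical file

    #!/bin/sh
    # mpi.post.sh - sourced on the master node AFTER the MPI executable terminates
    echo "postprocessing script"
    # e.g. pack the results so they can be listed in the OutputSandbox
    # tar czf results.tar.gz output-dir/  # hypothetical file

(The "preprocessing script" / "postprocessing script" lines echoed here are the ones visible in the test.out of the CPI test below.)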
Requirements (1/2)

In order to ensure that an MPI job can run, the following requirements MUST BE satisfied:
– the MPICH/MPICH2/MVAPICH/MVAPICH2 software must be installed, and reachable through the PATH environment variable, on all the WNs of the CE (a quick check is sketched below)
– some MPI applications require a file system shared among the WNs: no shared area is currently available to write user data
  – the application may access the area of the master node (this requires modifications to the application)
  – middleware solutions are also possible (as soon as required/designed/tested/deployed)
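A hypothetical check (not from the slides) that could be run on a WN to verify that the MPI tools are reachable through PATH:

    #!/bin/sh
    # sketch: verify that the MPI commands resolve through $PATH on this WN
    for cmd in mpirun mpicc; do
        if command -v "$cmd" > /dev/null 2>&1; then
            echo "$cmd found at $(command -v "$cmd")"
        else
            echo "$cmd NOT found in PATH" >&2
        fi
    done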
Requirements (2/2)

The job wrapper copies all the files listed in the InputSandbox onto ALL of the "slave" nodes
Host-based SSH authentication MUST BE correctly configured among all the WNs
If some environment variables are needed ONLY on the "master" node, they can be set by mpi.pre.sh
If some environment variables are needed ON ALL THE NODES, a static installation is currently required (a middleware extension is under consideration)
mpi.jdl

[
  Type = "Job";
  JobType = "MPICH";
  Executable = "MPIparallel_exec";
  NodeNumber = 2;
  Arguments = "arg1 arg2 arg3";
  StdOutput = "test.out";
  StdError = "test.err";
  InputSandbox = {"mpi.pre.sh", "mpi.post.sh", "MPIparallel_exec"};
  OutputSandbox = {"test.err", "test.out", "executable.out"};
  Requirements = other.GlueCEInfoLRMSType == "PBS" ||
                 other.GlueCEInfoLRMSType == "LSF";
]

Notes: the Requirements expression restricts the Local Resource Manager (LRMS) to PBS/LSF only; mpi.pre.sh and mpi.post.sh are the pre- and post-processing scripts; MPIparallel_exec is the MPI executable.
GigaBit vs InfiniBand

The advantage of using a low-latency network becomes more evident as the number of nodes grows.
CPI Test (1/4)

mpi-0.13]$ edg-job-submit mpi.jdl

Selected Virtual Organisation name (from proxy certificate extension): cometa
Connecting to host infn-rb-01.ct.pi2s2.it, port 7772
Logging to host infn-rb-01.ct.trigrid.it, port 9002

*********************************************************************************
                             JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status.
Your job identifier (edg_jobId) is:
-
*********************************************************************************
CPI Test (2/4)

mpi-0.13]$ edg-job-status 01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : 01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw
Current Status:    Done (Success)
Exit code:         0
Status Reason:     Job terminated successfully
Destination:       infn-ce-01.ct.pi2s2.it:2119/jobmanager-lcglsf-short
reached on:        Sun Jul 1 15:08:
*************************************************************
CPI Test (3/4)

mpi-0.13]$ edg-job-get-output --dir /home/marcello/JobOutput/ 01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw

Retrieving files from host: infn-rb-01.ct.pi2s2.it ( for 01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw )

*********************************************************************************
                          JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
-
have been successfully retrieved and stored in the directory:
/home/marcello/JobOutput/marcello_vYGU1UUfRnSktGODcwEjMw
*********************************************************************************
CPI Test (4/4)

mpi-0.13]$ cat /home/marcello/JobOutput/marcello_vYGU1UUfRnSktGODcwEjMw/test.out
preprocessing script infn-wn-01.ct.pi2s2.it
Process 0 of 4 on infn-wn-01.ct.pi2s2.it
pi is approximately , Error is
wall clock time =
Process 1 of 4 on infn-wn-01.ct.pi2s2.it
Process 3 of 4 on infn-wn-02.ct.pi2s2.it
Process 2 of 4 on infn-wn-02.ct.pi2s2.it

TID  HOST_NAME   COMMAND_LINE      STATUS   TERMINATION_TIME
==== ==========  ================  =======  ================
0001 infn-wn-01  /opt/lsf/6.1/lin  Done     07/01/ :04:
0002 infn-wn-01  /opt/lsf/6.1/lin  Done     07/01/ :04:
0003 infn-wn-02  /opt/lsf/6.1/lin  Done     07/01/ :04:
0004 infn-wn-02  /opt/lsf/6.1/lin  Done     07/01/ :04:23

P4 procgroup file is /home/cometa005/.lsf_6826_genmpi_pifile.
postprocessing script temporary
mpi-0.13]$
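The CPI test runs a pi-computation program; a condensed C sketch in the spirit of the cpi example distributed with MPICH (not the verbatim source used above) is:

    #include <stdio.h>
    #include <math.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int n = 10000, rank, size, namelen, i;
        double PI25DT = 3.141592653589793238462643;
        double h, sum, x, mypi, pi, startwtime = 0.0, endwtime;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &namelen);
        printf("Process %d of %d on %s\n", rank, size, name);

        if (rank == 0)
            startwtime = MPI_Wtime();

        /* midpoint-rule integration of 4/(1+x^2) over [0,1], whose exact
           value is pi; the intervals are distributed round-robin over ranks */
        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = rank + 1; i <= n; i += size) {
            x = h * ((double) i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* collect the partial sums on the master */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            endwtime = MPI_Wtime();
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
            printf("wall clock time = %f\n", endwtime - startwtime);
        }

        MPI_Finalize();
        return 0;
    }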
MPI on the web
Catania, Tutorial Grid per i Laboratori Nazionali del Sud, 26 Febbraio Questions…