Testing the EGI-DRIHM TestBed
D. Cesini
Preliminary tests
- Authentication
  - CE: HelloWorld JDL submission
  - SE: lcg-rep of the WRF input data file
      lcg-rep -v -d SE srm://darkstorm.cnaf.infn.it/drihm.eu/generated/2013-07-18/file05cf726f-1894-4f08-8531-c516d4144403
    (using LFC=lfc.ipb.ac.rs, lfn:/grid/drihm.eu/cesini/genova.tgz)
  - Repeated twice, once with the certificates released by each of the two replica VOMS servers
- MPI && MPI-START published
    Requirements = (other.GlueCEStateStatus == "Production") && Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment) && Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)
- Published total CPUs
  - GlueCEInfoTotalCPUs, CPUs in SubCluster
- Published OS version
  - GlueHostOperatingSystemRelease
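The MPI requirement above is a conjunction of three conditions on what a CE publishes. As a minimal sketch of the same check in Python, assume each CE is represented as a dictionary keyed by the GLUE attribute names; this record layout, the function name and the sample records are illustrative assumptions, not part of any Grid tool.

```python
# Key names mirror the GLUE attributes used in the Requirements expression.
RTE_KEY = "GlueHostApplicationSoftwareRunTimeEnvironment"


def matches_mpi_requirements(ce):
    """Return True if a CE record satisfies the three conditions:
    status is Production, and both MPI-START and OPENMPI are published."""
    tags = ce.get(RTE_KEY, [])
    return (
        ce.get("GlueCEStateStatus") == "Production"
        and "MPI-START" in tags
        and "OPENMPI" in tags
    )


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    good = {"GlueCEStateStatus": "Production", RTE_KEY: ["MPI-START", "OPENMPI"]}
    missing_tag = {"GlueCEStateStatus": "Production", RTE_KEY: ["OPENMPI"]}
    print(matches_mpi_requirements(good))
    print(matches_mpi_requirements(missing_tag))
```

A CE that publishes only one of the two tags is rejected, which is the behaviour the `&&` between the two `Member(...)` clauses expresses.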
WRF test
- WRF (v3.4.1) compiled on SL6 using OPENMPI (v1.6.4) and the NetCDF libraries (v4.2.1.1)
- Input data prepared by Antonio Parodi for the Genoa flooding case of 4 Nov 2011
  - Data available for a run that starts on 4-11-2011 00:00 and ends on 5-11-2011 00:00
  - Two nested domains: one coarse and one fine integration grid
- Only one simulated hour was run
- Only the coarse grid was used (no nesting)
- Executable, input data, configuration files (namelist.input) and NetCDF libraries uploaded to the Grid in a world-readable tgz file
  - lfn:/grid/drihm.eu/cesini/genova.tgz
- CPUNumber = 40 (because Antonio obtained reference timings at LRZ-LMU for 40, 80 and 120 processors)
- No SMPGranularity required
- Submitted only where the preliminary tests were OK
WRF JDL

  CPUNumber = 40;
  #SMPGranularity = 8;
  Executable = "/usr/bin/mpi-start";
  Arguments = "-t openmpi -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./netcdf-lib/ -d MPI_START_TEMP_DIR=\"$HOME/\" -vvv ./wrf.exe";
  StdOutput = "std.out";
  StdError = "std.err";
  InputSandbox = {"wrf-prologue.sh"};
  OutputSandbox = {"std.err", "std.out", "prologue.log", "rsl.out.0000", "rsl.error.0000"};
  Prologue = "wrf-prologue.sh";
  Requirements = (
    (other.GlueCEStateStatus == "Production") &&
    Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment) &&
    Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment) &&
    (other.GlueCEUniqueID == "cream-02.cnaf.infn.it:8443/cream-pbs-prod-sl6")
  );
  RetryCount = 0;
  ShallowRetryCount = -1;
  MyProxyServer = "myproxy.cnaf.infn.it";
  FuzzyRank = true;

$ cat wrf-prologue.sh

  export LFC_HOST=lfc.ipb.ac.rs
  lcg-cp -v srm://darkstorm.cnaf.infn.it/drihm.eu/generated/2013-07-18/file05cf726f-1894-4f08-8531-c516d4144403 file:genova.tgz
  tar -xvzf genova.tgz >> prologue.log 2>&1
  cp genova/* .
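Every clause of the Requirements expression has to be joined by `&&`, and it is easy to drop one when editing the JDL by hand. As a hedged sketch, the expression could be assembled programmatically so that the clauses are always joined consistently. The helper names `build_requirements` and `render_header` are hypothetical; only the attribute names and values come from the JDL above.

```python
RTE = "other.GlueHostApplicationSoftwareRunTimeEnvironment"


def build_requirements(tags=("MPI-START", "OPENMPI"), ce_id=None):
    """Build the JDL Requirements line, joining every clause with &&."""
    clauses = ['(other.GlueCEStateStatus == "Production")']
    clauses += ['Member("%s", %s)' % (tag, RTE) for tag in tags]
    if ce_id is not None:
        # Optional pin to a single CE, as done in the WRF JDL.
        clauses.append('(other.GlueCEUniqueID == "%s")' % ce_id)
    return "Requirements = ( " + " && ".join(clauses) + " );"


def render_header(cpu_number, ce_id=None):
    """Render only the CPUNumber and Requirements lines of a JDL."""
    return "\n".join(
        ["CPUNumber = %d;" % cpu_number, build_requirements(ce_id=ce_id)]
    )


if __name__ == "__main__":
    # 40, 80 and 120 are the processor counts with reference timings.
    for n in (40, 80, 120):
        print(render_header(n, "cream-02.cnaf.infn.it:8443/cream-pbs-prod-sl6"))
        print()
```

This renders only two attributes; the remaining JDL lines (Executable, Arguments, sandboxes and so on) would be appended unchanged.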
Available Resources (CE)

[cesini2@igi-ui ~]$ lcg-infosites --vo drihm.eu ce

#CPU  Free  Total Jobs  Running  Waiting  ComputingElement
----------------------------------------------------------------
408   360   0           0        0        ce.ceta-ciemat.es:8443/cream-sge-drihm
1560  1520  1           1        0        ce.hpgcc.finki.ukim.mk:8443/cream-pbs-drihm
624   302   1           1        0        ce64.ipb.ac.rs:8443/cream-pbs-drihm
358   58    0           0        0        cream-02.cnaf.infn.it:8443/cream-pbs-prod-sl6
180   103   0           0        0        cream-ce01.ariagni.hellasgrid.gr:8443/cream-pbs-drihm
118   21    0           0        0        cream-ce01.marie.hellasgrid.gr:8443/cream-pbs-drihm
4     4     0           0        0        cream-ce02.marie.hellasgrid.gr:8443/cream-pbs-drihm
104   52    0           0        0        cream.afroditi.hellasgrid.gr:8443/cream-pbs-drihm
624   339   1           0        1        cream.ipb.ac.rs:8443/cream-pbs-drihm
196   8     0           0        0        cream01.athena.hellasgrid.gr:8443/cream-pbs-drihm
398   0     2           0        2        cream01.grid.uoi.gr:8443/cream-pbs-drihm
104   48    0           0        0        cream01.kallisto.hellasgrid.gr:8443/cream-pbs-drihm
392   0     1           0        1        cream02.athena.hellasgrid.gr:8443/cream-pbs-drihm
224   123   0           0        444444   cream1.grid.cesnet.cz:8443/cream-pbs-drihm
224   123   0           0        444444   cream2.grid.cesnet.cz:8443/cream-pbs-drihm
3880  0     1213        963      250      dissel.nikhef.nl:2119/jobmanager-pbs-medium
64    64    0           0        0        emi-ce01.scope.unina.it:8443/cream-pbs-hpc
3880  1     0           0        0        gazon.nikhef.nl:8443/cream-pbs-flex
3880  1     1213        963      250      gazon.nikhef.nl:8443/cream-pbs-medium
3880  2     0           0        0        juk.nikhef.nl:8443/cream-pbs-flex
3880  2     1213        963      250      juk.nikhef.nl:8443/cream-pbs-medium
3880  6     0           0        0        klomp.nikhef.nl:8443/cream-pbs-flex
3880  6     1213        963      250      klomp.nikhef.nl:8443/cream-pbs-medium
55    0     1           0        1        snf-10952.vm.okeanos.grnet.gr:8443/cream-pbs-drihm

The GT5 resource at LRZ-LMU was not publishing in the IS on 18/07 as an available resource for drihm.eu. I had no time to investigate with the sites.
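Since the WRF job asks for CPUNumber = 40, a first filter on this listing is the number of free CPUs each CE publishes. The following Python sketch parses rows in the column order shown above (CPU, Free, Total Jobs, Running, Waiting, ComputingElement) and keeps CEs with enough free CPUs. The function names are hypothetical, and the sample is a five-row excerpt of the listing. Published free CPUs are only a hint: a CE may still be unable to allocate that many cores to a single job.

```python
SAMPLE = """\
358 58 0 0 0 cream-02.cnaf.infn.it:8443/cream-pbs-prod-sl6
4 4 0 0 0 cream-ce02.marie.hellasgrid.gr:8443/cream-pbs-drihm
398 0 2 0 2 cream01.grid.uoi.gr:8443/cream-pbs-drihm
224 123 0 0 444444 cream1.grid.cesnet.cz:8443/cream-pbs-drihm
64 64 0 0 0 emi-ce01.scope.unina.it:8443/cream-pbs-hpc
"""


def parse_ce_rows(text):
    """Parse data rows; header and separator lines are skipped because
    they do not consist of five integers followed by a CE identifier."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6 or not all(p.isdigit() for p in parts[:5]):
            continue
        cpu, free, total_jobs, running, waiting = (int(p) for p in parts[:5])
        rows.append({
            "cpu": cpu, "free": free, "total_jobs": total_jobs,
            "running": running, "waiting": waiting, "ce": parts[5],
        })
    return rows


def ces_with_free_cpus(rows, needed=40):
    """Return the identifiers of CEs publishing at least `needed` free CPUs."""
    return [r["ce"] for r in rows if r["free"] >= needed]


if __name__ == "__main__":
    for ce in ces_with_free_cpus(parse_ce_rows(SAMPLE)):
        print(ce)
```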
Available Resources (SE)

[cesini2@igi-ui ~]$ lcg-infosites --vo drihm.eu se

Avail Space(kB)  Used Space(kB)  Type  SE
------------------------------------------
149210299        150789700       SRM   darkstorm.cnaf.infn.it
8759271843       21543304907     SRM   dpm.ipb.ac.rs
175109859        96715642        SRM   se.hpgcc.finki.ukim.mk
1004763730       6991387962      SRM   se01.afroditi.hellasgrid.gr
2470142232       749367310       SRM   se01.ariagni.hellasgrid.gr
8589864576       70016           SRM   se01.athena.hellasgrid.gr
242994986        696998531       SRM   se01.grid.uoi.gr
2307901070       2976565813      SRM   se01.kallisto.hellasgrid.gr
404758881        239343606       SRM   se02.marie.hellasgrid.gr
8794743316       812834          SRM   tbn18.nikhef.nl
Results
- Authentication
  - 4 sites failed with the proxies from both VOMS servers, on CE and SE
  - 1 site failed with the proxy from one VOMS server and was OK with the other
  - 1 site was OK on the CE but failed on the SE
  - 9 sites OK
  - 1 CE at NIKHEF is a GRAM5-based CE, and authentication worked fine
- MPI && MPI-START published
  - 3 sites do not publish OPENMPI and MPI-START in GlueHostApplicationSoftwareRunTimeEnvironment on any of their CEs
  - 1 site publishes OPENMPI and MPI-START on only some of its CEs
  - 10 sites publish both tags on all CEs
- Published total CPUs
  - 1 site has one CE publishing just 4 CPUs
- Published OS version
  - 6 sites publish SL6.x, 8 sites publish SL5.x
- The WRF test could be run at the 3 sites that passed all the preliminary tests: CESNET (prague_cesnet_lcg2), BOLOGNA (igi-bologna) and NAPLES (UNINA-EGEE)
  - However, it seems that at CESNET 40 cores cannot be allocated to a single job, so the job was submitted using 16 cores
Performances
Time to simulate 1 second in Domain1 (no nesting) during the first simulated hour. 40 processors used on every system.

System                                              AVG (s)  MIN (s)  MAX (s)
SUPERMIC@LRZ-LMU                                    0.68     0.66     7.3
IGI-BOLOGNA (MPI, 40 cores in 2 nodes, Ethernet)    1.90     1.72     7.4
UNINA-EGEE (MPI, 40 cores in 8 nodes, Infiniband)   1.98     1.84     8.4

- Writing operations
- 0.30 s using 80 processors
- 0.20 s using 120 processors
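The AVG figures above can be turned into two derived numbers: the slowdown of each Grid site relative to SUPERMIC, and a rough wall-clock estimate for the one simulated hour. The sketch below takes the slide's metric at face value (wall-clock seconds per simulated second) and assumes the average applies uniformly over the hour, which ignores the MAX outliers. The function names are hypothetical.

```python
# AVG values (seconds) copied from the performance figures above.
AVG_SECONDS = {
    "SUPERMIC@LRZ-LMU": 0.68,
    "IGI-BOLOGNA": 1.90,
    "UNINA-EGEE": 1.98,
}


def slowdown_vs(reference, timings):
    """Ratio of each system's average time to the reference system's."""
    ref = timings[reference]
    return {name: round(t / ref, 2) for name, t in timings.items()}


def wall_clock_seconds(simulated_seconds, seconds_per_simulated_second):
    """Estimated wall-clock time, assuming a constant average rate."""
    return simulated_seconds * seconds_per_simulated_second


if __name__ == "__main__":
    for name, ratio in slowdown_vs("SUPERMIC@LRZ-LMU", AVG_SECONDS).items():
        estimate = wall_clock_seconds(3600, AVG_SECONDS[name])
        print("%s: %.2fx, about %.0f s per simulated hour" % (name, ratio, estimate))
```

Under these assumptions, both Grid sites come out a little under three times slower than SUPERMIC on average, at about 2.8x for IGI-BOLOGNA and 2.9x for UNINA-EGEE.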