Overview of applications view of the job management middleware

Slides:



Advertisements
Similar presentations
DataTAG WP4 Meeting CNAF Jan 14, 2003 Interfacing AliEn and EDG 1/13 Stefano Bagnasco, INFN Torino Interfacing AliEn to EDG Stefano Bagnasco, INFN Torino.
Advertisements

Andrew McNab - Manchester HEP - 17 September 2002 Putting Existing Farms on the Testbed Manchester DZero/Atlas and BaBar farms are available via the Testbed.
Workload management Owen Maroney, Imperial College London (with a little help from David Colling)
FP7-INFRA Enabling Grids for E-sciencE EGEE Induction Grid training for users, Institute of Physics Belgrade, Serbia Sep. 19, 2008.
Job Submission The European DataGrid Project Team
INFSO-RI Enabling Grids for E-sciencE EGEE Middleware The Resource Broker EGEE project members.
5 November 2001F Harris GridPP Edinburgh 1 WP8 status for validating Testbed1 and middleware F Harris(LHCb/Oxford)
EGEE is a project funded by the European Union under contract IST Workload management system testing : - what exists - further testing planned.
INFSO-RI Enabling Grids for E-sciencE Logging and Bookkeeping and Job Provenance Services Ludek Matyska (CESNET) on behalf of the.
:: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: ::::: :: GridKA School 2009 MPI on Grids 1 MPI On Grids September 3 rd, GridKA School 2009.
DataGrid WP1 Massimo Sgaravatto INFN Padova. WP1 (Grid Workload Management) Objective of the first DataGrid workpackage is (according to the project "Technical.
LCG Middleware Testing in 2005 and Future Plans E.Slabospitskaya, IHEP, Russia CERN-Russia Joint Working Group on LHC Computing March, 6, 2006.
- Distributed Analysis (07may02 - USA Grid SW BNL) Distributed Processing Craig E. Tull HCG/NERSC/LBNL (US) ATLAS Grid Software.
CMS Stress Test Report Marco Verlato (INFN-Padova) INFN-GRID Testbed Meeting 17 Gennaio 2003.
13 May 2004EB/TB Middleware meeting Use of R-GMA in BOSS for CMS Peter Hobson & Henry Nebrensky Brunel University, UK Some slides stolen from various talks.
First attempt for validating/testing Testbed 1 Globus and middleware services WP6 Meeting, December 2001 Flavia Donno, Marco Serra for IT and WPs.
M. Sgaravatto – n° 1 Overview of WP1 Workload Management System in EDG 2.x Massimo Sgaravatto INFN Padova - DataGrid WP1
Getting started DIRAC Project. Outline  DIRAC information system  Documentation sources  DIRAC users and groups  Registration with DIRAC  Getting.
Presenter Name Facility Name UK Testbed Status and EDG Testbed Two. Steve Traylen GridPP 7, Oxford.
WP1 WMS rel. 2.0 Some issues Massimo Sgaravatto INFN Padova.
VO Box Issues Summary of concerns expressed following publication of Jeff’s slides Ian Bird GDB, Bologna, 12 Oct 2005 (not necessarily the opinion of)
Testing the HEPCAL use cases J.J. Blaising, F. Harris, Andrea Sciabà GAG Meeting April,
WP1 WMS release 2: status and open issues Massimo Sgaravatto INFN Padova.
The DataGrid Project NIKHEF, Wetenschappelijke Jaarvergadering, 19 December 2002
WP1 Status and plans Francesco Prelz, Massimo Sgaravatto 4 th EDG Project Conference Paris, March 6 th, 2002.
INFSO-RI Enabling Grids for E-sciencE gLite Test and Certification Effort Nick Thackray CERN.
User Interface UI TP: UI User Interface installation & configuration.
WMS baseline issues in Atlas Miguel Branco Alessandro De Salvo Outline  The Atlas Production System  WMS baseline issues in Atlas.
CERN Certification & Testing LCG Certification & Testing Team (C&T Team) Marco Serra - CERN / INFN Zdenek Sekera - CERN.
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
EGEE is a project funded by the European Union under contract IST LCG open issues Massimo Sgaravatto INFN Padova JRA1 IT-CZ cluster meeting,
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Job Management Claudio Grandi.
Servizi core INFN Grid presso il CNAF: setup attuale
Service Availability Monitoring
LCG and Glite open issues Massimo Sgaravatto INFN Padova
Gri2Win: Porting gLite to run under Windows XP Platform
Grid2Win Porting of gLite middleware to Windows XP platform
Information System testing for LCG-1
Grid Computing: Running your Jobs around the World
The EDG Testbed Deployment Details
Classic Storage Element
U.S. ATLAS Grid Production Experience
Use of Nagios in Central European ROC
WP1 WMS release 2: status and open issues
Practical: The Information Systems
Summary on PPS-pilot activity on CREAM CE
Preview Testbed Massimo Sgaravatto – INFN Padova
GDB 8th March 2006 Flavia Donno IT/GD, CERN
BDII Performance Tests
Comparison of LCG-2 and gLite v1.0
Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007
Accounting at the T1/T2 Sites of the Italian Grid
The CREAM CE: When can the LCG-CE be replaced?
Grid2Win: Porting of gLite middleware to Windows XP platform
Job Submission in the DataGrid Workload Management System
Sergio Fantinel, INFN LNL/PD
Introduction to Grid Technology
Update on gLite WMS tests
LCG-1 Status & Task Forces
Short update on the latest gLite status
Scalability Tests With CMS, Boss and R-GMA
Gri2Win: Porting gLite to run under Windows XP Platform
Stephen Burke, PPARC/RAL Jeff Templon, NIKHEF
Francesco Giacomini – INFN JRA1 All-Hands Nikhef, February 2008
Long term job submission and monitoring uing grid services
Report on GLUE activities 5th EU-DataGRID Conference
The GENIUS portal and the GILDA t-Infrastructure
Site availability Dec. 19 th 2006
Job Submission M. Jouvin (LAL-Orsay)
The LHCb Computing Data Challenge DC06
Presentation transcript:

Overview of applications view of the job management middleware WP1-2/Applications joint session Heidelberg, September 26, 2003 Mario Reale mario.reale@roma1.infn.it

Summary EDG and LCG provided GRID demonstrators EDG 2.0.3 application testbed / EDG 2.0 devel testbed LCG-1 certification testbed and prototypal LCG-1 GRID computing service test code sources and on going application activities test results on workload management system most relevant currently opened bugs outstanding general issues/concerns conclusions

EDG development and application testbeds EDG 2.0 devel testbed ( Jul-Aug 2003, 2.0.0 tagged on Aug 28 ) EDG dev core sites : CERN, RAL, NIKHEF, LAL, CNAF, LYON RB in Lyon, Top R-GMA at RAL Many temporary development tags July-Aug ( “t20030822_1000” ) EDG 2.0.3 application evaluation testbed (mid Sept,…growing) Currently 11 sites ( IFAE,RAL,LYON,NIKHEF,CT,PD,RM,FE,UCL,IC,FZK) RB at NIKHEF, Top R-GMA at RAL

LCG CERN Certification testbed set up (as of Sept 8, 2003 ) LCG-CT 6 CERN-hosted clusters of Linux 7.3 PCs ( 7 different CEs ) Deploying prototypal LCG-1 (target release expected nov 03): VDT 1.1.8-9 (Globus 2.2.4) EDG 2.0 Workload MS (RB..) and Data MS (LRC, RLS..) DataTAG GLUE Schema v 1.1 LCG-specific modifications, adds and fixes to: LRMS ( PBS, Condor, LSF) Globus Gatekeeper MDS – BDII LDAP Various Globus components Reflecting into 45 different commands in total for both end users and administrators Deploying a prototypal LCG-1 tag ( by our test’s time – (beg Sept))

LCG-CT Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 Cluster_6 UI_1 UI_4 RB_a RB_b RB_3 LCG-CT BDII_a BDII_b BDII_3 CertTB (as of Sept 8, 2003 ) MDS_a MDS_b MDS_3_a CE_a CE_2 MDS_3_b CE_4 CE_5 CE_6 CE_b SE_a SE_2 SE_4 WNs WNs CE_3 WNs WNs WNs WNs WNs WNs WN_b1 WN_a1 WNs WNs WNs WNs WNs WNs WNs WNs SE_3 WNs WNs WNs WNs WN_b2 WN_2_a1 WNs WNs WNs WN_a2 WN_4 WN_5 WN WNs WNs WN_2_a2 WNs WN_3_a2 LCFGng Lite install RLS_MySQL Condor LSF WN_3_a1 Proxy RLS_oracle from Zdenek Sekera/ EDG PTB Aug 26 03

Prototypal LCG-1 GRID service Prototypal LCG-1 LCG GRID computing service Currently 7 core sites – (3 different continents): CERN, KFKI, FZK, CNAF, FNAL, SINICA, RAL Truly World-Wide extended GRID service for LHC Computing Deploying LCG1-1_0_0 - Rolled out on Sept 1, 2003. RB at CERN and CNAF, Top-MDS at CERN 4 main application VOs supported (Alice,ATLAS,CMS,LHCb) Opened to LCs since Sept 15, to experiments this week Not yet all experiment-VO s/w installed in all sites

GRID ICE http://tbed0116.cern.ch/gridice/vo/vo.php?VOname=cms

LCG-1 vs EDG-2 DMS ROS is not foreseen in LCG VOMS is not yet deployed (not even in EDG-2) WMS-DGAS Accounting System is not deployed yet (not even in EDG-2) InfoSys R-GMA is not foreseen in LCG-1 ( using MDS-BDII)

WMS commands DMS commands ( InfoSys commands ) SiteAdmin and Fabric edg-brokerinfo edg-lrc-admin edg-service-mysql-setup.pl edg-fetch-crl edg-mkgridmap edg-service-publisher-gen.pl edg-find-resources edg-mkgridpool edg-template-conf.pl edg-fmon-agent edg-replica-manager edg-testbed-test edg-fmon-server edg-replica-manager-configure edg-vo-env edg-gridftp-exists edg-replica-metadata-catalog edg-voms-admin edg-gridftp-ls edg-replica-metadata-catalog-admin edg-voms-admin-local edg-gridftp-mkdir edg-replica-optimization edg-voms-ldap-sync edg-gridftp-rename edg-replica-optimization-admin edg-voms-mkgridmap edg-gridftp-rm edg-rgma edg-voms-proxy-init edg-gridftp-rmdir edg-rgma-config edg-wl-dgas-hlr-resourceInfoClient edg-job-attach edg-rgma-db-setup edg-wl-dgas-hlr-transInfoClient edg-job-cancel edg-rgma-munge-schema edg-wl-dgas-hlr-userInfoClient edg-job-get-chkpt edg-rgma-pulse edg-wl-grid-console-shadow edg-job-get-logging-info edg-rm edg-wl-logev edg-job-get-output edg-rmc edg-job-list-match edg-rmc-admin edg-job-status edg-ros edg-job-submit edg-ros-admin edg-local-replica-catalog edg-ros edg-local-replica-catalog-admin edg-se-webservice edg-lrc ldapsearch –h .. –p … WMS commands DMS commands ( InfoSys commands ) SiteAdmin and Fabric MassStorage-SE Security-VOMS

Test code sources LCs July-Sept 03 LCG / EDG testing: Full general purpose generic-HEP-application test suite from Jean-Jacques Blaising: ( PERL ) Gets info on GRID current status (GRID config file) and creates JDL accordingly Submits jobs, monitor status, retrieve output, report results Various LC testing JDLs and scripts (/bin/bash, PERL ) general and intensive stress tests Official EDG WP6-SC (edg-site-certification) PERL-oo test suite Basic job submission, output retrieve Basic Data Management tests, Filename registration and file copy My Proxy tests Match-Making specific tests (http://marianne.in2p3.fr/datagrid/TestPlan/TESTSTATUS/ EDG_20_TEST_STATUS.html)

First very preliminary results from EDG 2.0.3 Application testbed Basic Job submission, monitoring, cancel, output retrieval works ok 8 streams x 50 medium-sized jobs storm showed a small percentage ( ~3 %) of Current Status: Aborted Status Reason: Cannot plan (a helper failed) or Current Status: Done (Cancelled) Status Reason: Cannot read JobWrapper output, both from Condor and from Maradona.

LCG-CT WMS tests and results (1/2) Tested basic job submission and output retrieve “Hello World” Test submission of 50 jobs with Input and Output sandboxes with and without retryCount=0 Test of 250 jobs with long sleep, no resubmission (MyProxy) 5 job with 2GB Input&Output sandboxes, no resubmission, with and without parallel streams Match-making tests: Three files on one SE, job matches to associated close CE Three file on one SE, add application tag, job matches close CE Access a file from a SE from job with file protocol; compute checksum [ OK ] [ OK ] [ OK ] [ OK ] [ OK ] [ OK ] [ OK ]

LCG-CT WMS tests and results (2/2) Submit JJ-HEP generic application on all available GRID CEs, retrieve output, register files on iteam VO LRC via RLS from the running script on the WN, copy them on close SE. [ Resources previously discovered by query to InfoSys ] Submit a numeric iterative calculation on all available GRID CEs, use BrokerInfo to find CloseSE and mount point, copy files there and register into iteam LRC through RLS [ OK ] [ OK ]

RB Stress Tests by Massive Job Submisison on the LCG-CT RB never crashed ran without problems at a load of 8.0 for several days in a row 20 streams with 100 jobs each ( typical error rate ~ 2 % still present ) RB stress test in a job storm of 50 streams , 20 jobs each : 50% of the streams ran out of connections between UI and RB. (configuration parameter – but machine constraints) Remaining 50% streams finished normal ( 2% error rate) Time between job-submit and return of the command (acceptance by the RB) is 3.5 seconds. ( independent of number of streams ) NOTE: RB interrogates all suitable CE's : wide area delay-killer (interactive work) ?

Preliminary full simulation and reconstruction tests with ALICE on the CERN LCG-CT ( beginning of Sept 03) Aliroot 3.09.06 (including HBT correl.) fully reconstructed events CPU-intensive, RAM-demanding (up to 600MB ,160MB average) ,long lasting jobs ( average 14 hours ) Outcome: > 95 % successful job submission, execution and output retrieval in a lightly loaded GRID environment ~ 95 % success (first estimate) in a highly job-populated testbed with concurrent job submission and execution ( 2 streams of 50 AliRoot jobs and concurrent 5 streams of 200 middle-size jobs) My Proxy renewal successfully exploited [ OK ]

Currently(Sept 19) opened EDG WMS bugs Currently(Sept 19) opened EDG WMS bugs ( also effecting LCG) (no show-stoppers) http://marianne.in2p3.fr/datagrid/bugzilla/ 1103 : considering working (already used) CEs in re-submission if no others available 1105 : mechanism should be provided to restart daemons that aren’t running anymore 1362 : hanging LB clients if LB responsive but other RB component down – should give appropriate error message, not hang 1465 : purger scripts queries the LB without using a proxy (su) 1643 : resubmission even when proxy expired 1716 : in some specific cases the LB doesn’t give the right job status 1792 : edg-job-status jobId-status max retrieval number (>1100..) 1798 : /etc/init.d/edg-wl-ns restart does not work (needs a sleep) 1770 : Jobs queues drain if the replica catalog is down

Currently(Sept 19) opened EDG WMS bugs: 18 Currently(Sept 19) opened EDG WMS bugs: 18 ( also effecting LCG) (no show-stoppers) http://marianne.in2p3.fr/datagrid/bugzilla/ 1824: delay to get information from edgbroker-info after NS and WL daemon restart 1918: tmpwatch erases wm files in /tmp - /var should be used 1932: edg-job-get-logging-info -o --logfile prints log-info to the screen ( same for edg-job-status, list-match ) 1934: use of commas for edg-job-* commands working but not advertised in help menu 1938: Complaining lob messages from edg-wl-interlogd 1125: Option –c : even if –c is useed, still UI seeks default config file ( 1843 ) : edg-find-resources doesn't have a --help option (UI, but not WP1 ) 1848: A replacement for the old getSelectedFile method is needed given Input File users gets URL/filename (BrokerInfo)

General issues and concerns (1/2) Some RGMA timeout wrongly set 1 site managed to block the whole GRID for 1 day since port 8080 of MonBox wasn’t accessible – It should be now fixed in RGMA GOUT empty quite often completely blocked all submissions Cryptic error messages like “A helper failed” Help user to tell expectable error messages (somebody switched off a given WN) from system services critical conditions – Work on going.. A mechanism should be provided for a sys admin to delete all jobs of a given user ( LCG-CT ) ( super user – ref bug 1465)

General issues and concerns (2/2) GUI & JDL editor not fully exploited and tested yet – ( ! ) MPI not fully exploited and tested yet – ( ! ) Check pointing not tested yet - ( ! ) Gang-matching not working at time it was tested ( 1442: should be fixed by now in 2.0.18, but EDG 2.0.3 includes 2.0.15-2 ) Old relevant pending bug : wrong ClassAd ( other.CEId ) could block the RB – ( now fixed in 2.0.18 ) Outbound connectivity for the Worker Nodes – issue for GDB

Conclusions Broker seems a lot more stable, but still has some failures, and has to be tested for scalability : this could be the next relevant issue ( RB polling all CEs…..) Many new features (check pointing, interactive jobs, gang matching, output data, access cost ranking, accounting, …) largely untested and some not integrated Looking forward for VOMS integration and VO-based users-resources matching