Published by Hester Henderson. Modified over 9 years ago.
1
Service Availability Monitor (SAM) tests for ATLAS
- Current Status
- Tests in development
- To Do
Alessandro Di Girolamo, CERN IT/PSS-ED
2
4 Dec 2007, Alessandro Di Girolamo
SAM Critical Tests: Current Status
Now running the standard OPS tests using ATLAS credentials (i.e. the original SAM tests run under the ATLAS VO). The list of sites is taken from GOCDB.
- SE & SRM:
  - put: lcg-cr using the cern-prod LFC, with files in the SAM test directory
  - get: lcg-cp from the site to the SAM UI
  - del: lcg-del, to clean the catalog and the storage
- CE:
  - check the CA RPMs version
  - job submission
  - on a WN: tests the VO swdir (software installation directory)
- LFC: lfc-ls, lfc-mkdir
- FTS: glite-transfer-channel-list, Information System configuration and publication
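The put/get/del cycle of the SE & SRM test can be sketched as three command lines. This is a minimal illustration, not the actual SAM test code: the host name, LFN and local paths are hypothetical placeholders, and the sketch only builds the commands (it does not run them). It assumes the usual lcg_utils options (`--vo`, `-d` for the destination SE, `-l` for the LFN, `-a` to delete all replicas) and that the catalog is selected through the `LFC_HOST` environment variable.

```python
import shlex

def build_se_test_commands(se_host, lfn, local_src, local_dst, vo="atlas"):
    """Return (step, argv) pairs for the put/get/del cycle on one SE.

    All arguments are illustrative; the real SAM tests choose their own
    file names and test directory.
    """
    put = ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn, f"file://{local_src}"]
    get = ["lcg-cp", "--vo", vo, lfn, f"file://{local_dst}"]
    # -a removes every replica and the catalog entry (cleans catalog and storage)
    delete = ["lcg-del", "--vo", vo, "-a", lfn]
    return [("put", put), ("get", get), ("del", delete)]

if __name__ == "__main__":
    # Hypothetical endpoint and LFN, for illustration only.
    steps = build_se_test_commands(
        se_host="se.example.org",
        lfn="lfn:/grid/atlas/sam-test/testfile",
        local_src="/tmp/sam-input.txt",
        local_dst="/tmp/sam-output.txt",
    )
    for name, argv in steps:
        print(f"{name}: LFC_HOST=<catalog host> {shlex.join(argv)}")
```

A test is then considered passed only if all three steps succeed in order, so that a failed put does not leave stale entries for the get and del steps to trip over.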
3
Work in progress
We are developing and testing ATLAS-specific SAM tests in order to:
- monitor the availability of ATLAS critical Site Services
- verify the correct installation and proper functioning of the ATLAS software at each site
SE, SRM and CE endpoint definition: the intersection between GOCDB and TiersOfATLAS (the ATLAS-specific site configuration file, with the Cloud Model).
- different services and endpoints might need to be tested using different VOMS credentials
- ATLAS endpoints and paths must be explicitly tested (i.e. the /dq2 area)
- the LFC of the Cloud (residing in the T1) is used
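The endpoint definition above amounts to a set intersection, grouped by Cloud. The sketch below assumes, purely for illustration, that GOCDB yields a flat list of endpoint host names and that TiersOfATLAS can be read as a mapping from endpoint to Cloud name; the real data formats, host names and Cloud labels are not given in the slides.

```python
from collections import defaultdict

def endpoints_to_test(gocdb_endpoints, toa_endpoint_cloud):
    """Group by Cloud the endpoints known to both GOCDB and TiersOfATLAS.

    gocdb_endpoints: iterable of endpoint host names (assumed format).
    toa_endpoint_cloud: dict mapping endpoint host name -> Cloud name (assumed format).
    """
    common = set(gocdb_endpoints) & set(toa_endpoint_cloud)
    by_cloud = defaultdict(list)
    for endpoint in sorted(common):
        by_cloud[toa_endpoint_cloud[endpoint]].append(endpoint)
    return dict(by_cloud)

if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    gocdb = ["srm.site-a.example", "srm.site-b.example", "srm.site-c.example"]
    toa = {
        "srm.site-a.example": "CLOUD1",
        "srm.site-b.example": "CLOUD2",
        "srm.site-d.example": "CLOUD1",
    }
    print(endpoints_to_test(gocdb, toa))
```

Grouping by Cloud keeps the link to the Cloud LFC: every endpoint in a group would be tested against the catalog hosted at that Cloud's T1.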
4
Development: Tests and Alarms
SE & SRM (run centrally from the SAM UI):
- put: lcg-cr with the Cloud LFC, with and without using BDII information
- get: lcg-cp
CE (job submitted to each ATLAS CE):
- keep running a large part of the OPS suite
- for ATLAS Tier1 and Tier2:
  - check the presence of the required version of the ATLAS software
  - compile and execute a real analysis job based on a sample dataset
  - test put/get to local storage via native protocols (dccp, rfcp …)
Alarm system:
- SE / SRM / CE tests failing: site contact persons will be alerted via the SAM Alarm System (mail and/or SMS)
- Grid Services (FTS, LFC etc.) tests failing: alarms go to the Service responsible and to the ATLAS dedicated services (DDM, etc.) that use those services
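The alarm routing described on this slide can be sketched as a small dispatch function. This is an illustration only: it reads the last item as "alarms go to the Service responsible and to the ATLAS dedicated services that use the failing service", and all function, parameter and contact names are hypothetical rather than part of the SAM Alarm System.

```python
SITE_SERVICES = {"SE", "SRM", "CE"}   # failures are reported to the site
GRID_SERVICES = {"FTS", "LFC"}        # the slide says "FTS, LFC etc."

def alarm_recipients(service_type, site_contacts, service_responsible, dependent_atlas_services):
    """Return who should be alerted when a test on service_type fails."""
    if service_type in SITE_SERVICES:
        return list(site_contacts)
    if service_type in GRID_SERVICES:
        # Service responsible first, then the ATLAS services that rely on it.
        return [service_responsible] + list(dependent_atlas_services)
    return []  # unknown service type: no routing rule on the slide

if __name__ == "__main__":
    # Hypothetical contacts, for illustration only.
    print(alarm_recipients("SRM", ["site-admin@example.org"], "fts-support@example.org", ["DDM"]))
    print(alarm_recipients("FTS", ["site-admin@example.org"], "fts-support@example.org", ["DDM"]))
```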
5
Reliability & Availability results
SAM Critical Tests not reliable for:
- France: BDII configuration (the ATLAS endpoint should be explicitly listed)
- NDGF/BNL: different service setup
SAM Critical Tests failures in the last months:
- FZK: real SRM failures; problems under investigation with the site responsible
- SARA: (mainly) unscheduled network problems
6
To Do
- The new ATLAS-specific tests (now running in pre-production) will be more realistic for the Experiment
- Improve the completeness of the monitoring information
  - Information across TiersOfATLAS, GOCDB and BDII
  - ATLAS Cloud topology view
- Integration with Ganga Robot and other ATLAS tools
- Integration with the ATLAS dashboard
7
Backup slides …
8
SAM ATLAS SE (SRM) tests
All SRM endpoints (v1 and v2) can be considered as SEs: SE tests are sent to the list of SRM endpoints resulting from the intersection of ToA (TiersOfATLAS) & GOCDB.
10
SAM results on Gridmap
Thanks to CERN openlab / EDS
- Topology: possibility to include the ATLAS Cloud view
- Possibility to change the metrics for the site size
The collaboration with the Gridmap developers has already started.
11
Other SAM tests
Many more non-critical tests are running.
12
Site Availability: T0/T1
Site Services Availability (Site Services X = CE, SE, SRM):
- Down: if all services of type X of a site are Down
- Ok: if all services of type X are Ok
- Degraded: if some services of type X are Ok and others are Down
Site BDII: Ok or Down, taken from the status of the site BDII instance
Site Availability: the AND of each single Site Services Availability
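The availability rules above can be sketched as follows. The per-type Ok / Degraded / Down rule is taken directly from the slide. Two points are assumptions of this sketch, because the slide does not spell them out: the site BDII status is included in the AND, and a Degraded service type counts as available unless the caller says otherwise. All names are illustrative.

```python
from enum import Enum

class Status(Enum):
    OK = "Ok"
    DEGRADED = "Degraded"
    DOWN = "Down"

def service_type_status(instances_ok):
    """Status of one service type X from the per-instance results (list of bool)."""
    if not instances_ok:
        raise ValueError("at least one service instance is required")
    if all(instances_ok):
        return Status.OK
    if not any(instances_ok):
        return Status.DOWN
    return Status.DEGRADED

def site_available(instances_by_type, bdii_ok, degraded_is_available=True):
    """AND of the per-type availabilities and of the site BDII (assumption, see above)."""
    def is_up(status):
        if status is Status.DEGRADED:
            return degraded_is_available  # assumption: configurable, not from the slide
        return status is Status.OK

    statuses = [service_type_status(v) for v in instances_by_type.values()]
    return bdii_ok and all(is_up(s) for s in statuses)

if __name__ == "__main__":
    # Hypothetical site with two CEs (one failing), one SE and one SRM.
    site = {"CE": [True, False], "SE": [True], "SRM": [True]}
    print(service_type_status(site["CE"]))                        # Status.DEGRADED
    print(site_available(site, bdii_ok=True))                     # True
    print(site_available(site, bdii_ok=True, degraded_is_available=False))  # False
```

Making the treatment of Degraded a parameter keeps the sketch honest about the ambiguity: a strict reading of "AND" would set it to False.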
13
Site Availability: one example
14
Storage Space Monitor via SAM
A specific SAM test could be sent to the VOBOXes to check storage disk space, as already done for the IT cloud.