1
Site availability, Dec. 19th 2006
Re-uses material from Piotr Nyczyk, CERN IT/GD. See the presentation at the last GDB.
2
Service Availability Monitoring
SFT has been phased out
Service Availability Monitoring framework (SAM):
  Monitors all grid services, not only CEs
  Is used in the validation process of sites and services
  Allows calculation of availability metrics
SAM wiki :
SAM portal :
Service and site status are recorded (several snapshots per day)
Daily, weekly and monthly availability is calculated by integrating (averaging) the status over the given period
Availability metrics tools are in development
Official evaluation of T0 and T1 site availability :
T2 site availability will come out soon
F. Chollet/
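The averaging of status snapshots described on this slide can be sketched as follows. This is a minimal illustration, not the SAM implementation: it assumes evenly spaced boolean snapshots, and the function name `availability` and the sample data are hypothetical.

```python
from typing import Sequence


def availability(snapshots: Sequence[bool]) -> float:
    """Return the fraction of status snapshots in which the service was up.

    Assumption: snapshots are evenly spaced, so a plain mean approximates
    the integral of the status over the period.
    """
    if not snapshots:
        raise ValueError("no snapshots recorded for this period")
    return sum(1 for up in snapshots if up) / len(snapshots)


# Hypothetical data: four snapshots per day over a three-day period.
days = [
    [True, True, True, True],
    [True, False, False, True],
    [True, True, False, True],
]

# Daily availability: one value per day.
daily = [availability(day) for day in days]

# Availability over the whole period: average over all snapshots in it.
period = availability([snapshot for day in days for snapshot in day])
```

The same function serves for weekly or monthly figures by passing all snapshots that fall in the longer period.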
3
Existing tests
CE, gCE
  job submission - UI->RB->CE->WN chain
  version of CA certificates installed (on WN!)
  version of middleware software (on WN!)
  broker info - checking the edg-brokerinfo command
  UNIX shell environment consistency (BASH vs. CSH)
  replica management tests - using lcg-utils, with the default SE defined on the WN and a selected “central” SE (third-party replication)
  accessibility of the experiments' software directory - environment variable, directory existence
  accessibility of VO tag management tools
  other tests: R-GMA client check, APEL accounting records
4
Existing tests
SE, SRM
  storing a file from the UI - using the lcg-cr command with LFC registration
  getting the file back to the UI - using the lcg-cp command
  removing the file - using the lcg-del command with LFC de-registration
LFC
  directory listing - using the lfc-ls command on /grid
  creating a file entry in the /grid/<VO> area
FTS
  checking if FTS is published correctly in the BDII
  channel listing - using the glite-transfer-channel-list command with the ChannelManagement service
  transfer test (in development):
Standalone tests
  GSTAT
  RB
VO-specific tests as well
JobWrapper tests: in development… in discussion
  SAM tests are not reaching all WNs
  Simplified set of tests executed on WNs by a wrapper with each grid job
  Core scripts installed on CEs and WNs (will become part of the release)
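The SE/SRM sequence on this slide (store a file, fetch it back, remove it) could be sketched as below. This is an illustration only, not the SAM sensor code. The helper names, the host `se.example.org` and the LFN are hypothetical, and the option choices (`--vo`, `-d`, `-l` for `lcg-cr`, `-a` for `lcg-del`) and the skip-after-failed-put behaviour are assumptions.

```python
from typing import Callable, Dict, List, Tuple

# A runner takes an argument vector and returns the command's exit code.
Runner = Callable[[List[str]], int]
Step = Tuple[str, List[str]]


def se_test_steps(vo: str, se_host: str, lfn: str, src: str, dst: str) -> List[Step]:
    """Build the put / get / delete commands (option choices are assumptions)."""
    return [
        # Store the file on the SE and register it in the LFC.
        ("put", ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn, f"file:{src}"]),
        # Copy the file back to the UI.
        ("get", ["lcg-cp", "--vo", vo, lfn, f"file:{dst}"]),
        # Remove all replicas and de-register the entry from the LFC.
        ("del", ["lcg-del", "--vo", vo, "-a", lfn]),
    ]


def run_se_tests(run: Runner, steps: List[Step]) -> Dict[str, str]:
    """Run the steps in order; if the put fails, later steps are skipped."""
    results: Dict[str, str] = {}
    blocked = False
    for name, argv in steps:
        if blocked:
            results[name] = "skipped"
            continue
        results[name] = "ok" if run(argv) == 0 else "error"
        if name == "put" and results[name] == "error":
            blocked = True  # nothing to fetch or delete without a stored file
    return results


steps = se_test_steps("ops", "se.example.org", "lfn:/grid/ops/sam-test-file",
                      "/tmp/sam-in", "/tmp/sam-out")
```

On a UI with lcg-utils installed, `run` could be `subprocess.call`; passing a stub instead makes it possible to try the sequencing logic without grid access.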
5
CE sensor tests: France Region, VO OPS
6
SE, SRM sensor tests: France Region, VO OPS
See the results of the put test in order to catch the problem
7
Availability metrics - algorithm
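Status of node N:
$$\mathrm{Status}(N) = \bigwedge_{t \,\in\, \mathrm{CriticalTests}} \mathrm{TestResult}(N, t)$$
Status of site S:
$$\mathrm{Status}(S) = (CE_1 \lor CE_2 \lor \dots \lor CE_n) \land (SRM_1 \lor SRM_2 \lor \dots \lor SRM_n) \land \mathrm{siteBDII}$$
Status of central service C:
$$\mathrm{Status}(C) = \bigvee_{N \,\in\, \mathrm{instances}(C)} \mathrm{Status}(N)$$
where ∧ = boolean AND and ∨ = boolean OR.
Everything is calculated for each VO that defined critical tests in FCR.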
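The boolean status algorithm on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SAM code: the function names and test names are hypothetical, a missing test result is treated as a failure, and the site rule reads the AND/OR diagram as "at least one CE, at least one SRM, and the site BDII must be up".

```python
from typing import Dict, Iterable, Set


def node_status(results: Dict[str, bool], critical_tests: Set[str]) -> bool:
    """AND over all critical tests; a missing result counts as a failure."""
    return all(results.get(test, False) for test in critical_tests)


def central_service_status(node_statuses: Iterable[bool]) -> bool:
    """OR over all instances of a central service."""
    return any(node_statuses)


def site_status(ce: Iterable[bool], srm: Iterable[bool], site_bdii: bool) -> bool:
    """AND across service types, OR across the instances of each type."""
    return any(ce) and any(srm) and site_bdii


# Hypothetical critical tests chosen by one VO.
critical = {"job-submit", "ca-version"}

# "rgma" fails on the first CE but is not critical, so the node is still up.
ce1 = node_status({"job-submit": True, "ca-version": True, "rgma": False}, critical)
# The second CE fails a critical test, so it is down.
ce2 = node_status({"job-submit": False, "ca-version": True}, critical)

site_up = site_status([ce1, ce2], [True], True)
site_down = site_status([ce1, ce2], [False, False], True)
```

Because the critical test set is defined per VO, the same raw results can yield different statuses for different VOs. With an empty critical set, `all()` returns `True`, so a node with no critical tests is always reported as up in this sketch.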
8
SAM Deployment
[Architecture diagram; labels recovered from the figure:]
  Sensors: standalone sensors; CE, gCE, SRM, FTS sensors
  SAM Submission, SAM Admin, SAM Server (Oracle DB)
  SFT Submission, SFT Admin, SFT Server (MySQL DB)
  Test definition, sites and nodes information, test results
  FCR (Freedom of Choice for Resources): selection of VO critical tests
  Metrics (GridView, XSQL)
  SAM Portal, SFT Portal
  CIC ROC report, CIC on Duty Dashboard
  RB monitor, GStat
9
Site Availability & Operation Metrics: EGEE-LCG joint effort
LCG metrics available from the SAM portal:
  SAM metrics calculation
  Official evaluation of T0 and T1 site availability :
EGEE operations metrics available from the CIC portal (ROC Management section)
  Metrics computed from SAM results and scheduled downtime
  Taking into account RC (Resource Report) reports as well
10
France Region availability from EGEE CIC portal
The tool is still improving, but site availability is available as well
Period: 01/09/06 – 30/11/06
11
Availability metrics - data export
12
Availability metrics for Tier-2 sites
LCG metrics
Very first attempt to export data
Not official for Tier-2 sites
13
To be continued…
Open issues, but sites must be aware of availability metrics…
All sensors have to be reviewed and fixed:
  check whether tests reflect real usage
  avoid dependencies on central services and third-party services if possible
  increase reliability of results (resistant to any failures not related to site configuration)
  increase test verbosity (make it easier to find the real problem - site debugging)
Missing sensors/tests have to be written
JobWrapper tests
Way of reviewing and fixing availability metrics
Metric calculation for aggregates of sites