Service Level Metrics - Monitoring and Reporting


1 Service Level Metrics - Monitoring and Reporting
SC4 Workshop, Mumbai: H. Renshall

2 Talk Overview
- Objectives
- Current metrics and uses
- Work in progress
- Visualisation examples
- Timescale
- Discussion points
A synthesis of the work of teams at CERN/IT/GD, BARC, IN2P3 and ASGC; I have plundered slides from other talks, papers and screen dumps.

3 Program of work objectives
- Provide high-level (managerial) views of the current status and history of the Tier 0 and Tier 1 grid services.
- Keep this simple and provide a first result quickly (end of 1Q 2006).
- Plan the views to be qualitative and quantitative so they can be compared with the LCG site Memoranda of Understanding service targets for each site.
- Satisfy the recent LHCC LCG review observation: "A set of Baseline Services provided by the LCG Project to the experiments has been developed since the previous comprehensive review. A measure of evaluating the performance and reliability of these services is being implemented."

4 The MoU Service Targets
- These define the (high-level) services that must be provided by the different Tiers.
- They also define average availability targets and intervention / resolution times for downtime and degradation.
- These differ from Tier to Tier (less stringent as N increases) but refer to 'compound services', such as "acceptance of raw data from the Tier 0 during accelerator operation".
- They therefore depend on the availability of specific components: managed storage, reliable file transfer service, database services, ...
- An objective is to measure the services we deliver against the MoU targets:
  - data transfer rates
  - service availability and time to resolve problems
  - resources available at a site (as well as measured usages)
- Resources specified in the MoU are CPU capacity (in KSi2K), TBytes of disk storage, TBytes of tape storage and nominal WAN data rates (MBytes/sec).

5 MoU Minimum Levels of Service
Maximum delay in responding to operational problems (service stoppage / degradation by more than 50% / degradation by more than 20%), and average availability measured on an annual basis (during accelerator operation / at all other times):

- Acceptance of data from the Tier-0 Centre during accelerator operation: 12 hours / 12 hours / 24 hours; availability 99% / n/a
- Networking service to the Tier-0 Centre during accelerator operation: 12 hours / 24 hours / 48 hours; availability 98% / n/a
- Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres outside accelerator operation: 24 hours / 48 hours / 48 hours; availability n/a / 98%
- All other services, prime service hours: 2 hours / 2 hours / 4 hours; availability 98% / 98%
- All other services, outside prime service hours: 24 hours / 48 hours / 48 hours; availability 97% / 97%

Some of these imply weekend/overnight staff presence, or at least availability.

6 Tier 0 Resource plan from MoU
Similar tables exist for all Tier 1 sites.

7 Nominal MoU pp running data rates CERN to Tier 1 per VO
Centre / ALICE / ATLAS / CMS / LHCb / Rate into T1 (pp) MB/s:
- ASGC, Taipei: -, 8%, 10%; 100 MB/s
- CNAF, Italy: 7%, 13%, 11%; 200 MB/s
- PIC, Spain: 5%, 6.5%
- IN2P3, Lyon: 9%, 27%
- GridKA, Germany: 20%
- RAL, UK: 3%, 15%; 150 MB/s
- BNL, USA: 22%
- FNAL, USA: 28%
- TRIUMF, Canada: 4%; 50 MB/s
- NIKHEF/SARA, NL: 23%
- Nordic Data Grid Facility: 6%
- Total: 1,600 MB/s
These rates must be sustained to tape 24 hours a day, 100 days a year. Extra capacity is required to cater for backlogs / peaks.

8 WLCG Component Service Level Definitions

Class  Description  Downtime  Reduced   Degraded  Availability
C      Critical     1 hour    1 hour    4 hours   99%
H      High         4 hours   6 hours   6 hours   99%
M      Medium       6 hours   6 hours   12 hours  99%
L      Low          12 hours  24 hours  48 hours  98%
U      Unmanaged    None      None      None      None

- Reduced defines the time between the start of the problem and the restoration of a reduced-capacity service (i.e. > 50%).
- Degraded defines the time between the start of the problem and the restoration of a degraded-capacity service (i.e. > 80%).
- Downtime defines the time between the start of a problem and the restoration of service at minimal capacity (i.e. basic function but capacity < 50%).
- Availability defines the sum of the time that the service is up compared with the total time during the calendar period for the service. Site-wide failures are not considered as part of the availability calculations. 99% means a service can be down up to 3.6 days a year in total; 98% means up to a week in total (see the sketch below).
- None means the service is running unattended.
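
As a quick sanity check on the availability figures above, the toy calculation below (illustrative Python, not part of any WLCG tool) converts an annual availability target into an allowed-downtime budget; the 99% and 98% targets come out at roughly 3.7 and 7.3 days per year.

```python
# Convert an annual availability target into an allowed-downtime budget.
# Illustrative only: the real MoU calculation is over a calendar period and
# excludes site-wide failures, which this toy version ignores.

DAYS_PER_YEAR = 365

def downtime_budget_days(availability: float) -> float:
    """Maximum total downtime (in days per year) compatible with the target."""
    return (1.0 - availability) * DAYS_PER_YEAR

if __name__ == "__main__":
    for target in (0.99, 0.98, 0.97):
        print(f"{target:.0%} availability -> {downtime_budget_days(target):.1f} days/year")
```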

9 Tier 0 Services

Service / VOs / Class:
- SRM 2.1: All VOs, class C
- LFC global copy: LHCb
- LFC local copy: ALICE, ATLAS, class H
- FTS: ALICE, ATLAS, LHCb, (CMS)
- CE
- RB
- Global BDII
- Site BDII
- Myproxy
- VOMS: H/C
- R-GMA

10 Required Tier 1 Services

Service / VOs / Class:
- SRM 2.1: All VOs, class H/M
- LFC: ALICE, ATLAS
- FTS: ALICE, ATLAS, LHCb, (CMS)
- CE
- Site BDII
- R-GMA
Many also run e.g. an RB etc.

11 WLCG Service Monitoring framework
- Service Availability Monitoring Environment (SAME): a uniform platform for monitoring all core services, based on SFT (Site Functional Tests) experience. Other data sources can be used.
- Three main end users (and use cases):
  - project management: overall metrics
  - operators (local and Core Infrastructure Centres): alarms, detailed information for debugging, problem tracking
  - VO administrators: VO-specific SFT tests, VO resource usage
- A lot of work has already been done:
  - SFT and GStat are monitoring CEs and Site BDIIs and will feed data into SAME
  - GRIDVIEW is monitoring data transfer rates and job statistics
  - a SAME data schema has been established; it will share a common Oracle database with GRIDVIEW
  - basic displays are in place (SFT reports, CIC-on-duty dashboard, GStat), though with inconsistent presentation and points of access
  - the basic framework for metric visualisation in GRIDVIEW is ready

12 Current Site Functional Tests (SFT)
- Submits a short-lived batch job to all sites to test various aspects of functionality: the site CE uses the local WMS to schedule it on an available local worker node.
- Security tests: CA certificates, CRLs, ... (work in progress)
- Data management tests: basic operations using lcg-utils on the default Storage Element (SE) and on a chosen "central" SE (usually at CERN; 3rd-party replication tests)
- Basic environment tests: software version, BrokerInfo, CSH scripts
- VO environment tests: tag management, software installation directory, plus VO-specific job submission and tests (maintained by the VOs; example: the LHCb DIRAC environment test)
- Job submission - CE availability: can I submit a job and retrieve the output?
- What is not covered? Only CEs and batch farms are really tested; other services (SEs, RB, top-level BDII, RLS/LFC, R-GMA registry/schema) are tested only indirectly by using default servers, whereas sites may have several.
- Maintained by the Operations Team at CERN (plus several external developers) and run as a cron job on CERN LXPLUS (using an AFS-based LCG UI). Jobs are submitted at least every 3 hours (plus on-demand resubmissions).
- Tests have a criticality attribute. These can be defined by VOs for their tests using the FCR (Freedom of Choice for Resources) tool. The production SFT levels are set by the dteam VO. Attributes are stored in the SAME database.
- Results displayed at

13 Description of current SFT tests (1 of 2)
If a test is defined as critical, a failure causes the site to be marked as bad on the SFT report page; if the test is not critical, only a warning message is displayed (see the sketch after this list).
- Job submission (sft-job, critical): the result of test job submission and output retrieval. Succeeds only if the job finished successfully and the output was retrieved.
- WN hostname (sft-wn, not critical): host name of the Worker Node where the test job was executed. For information only.
- Software version (sft-softver, critical): detect the version of software actually installed on the WN. The lcg-version command is used; if that command is not available (very old versions of LCG) the test script checks only the version number of the GFAL-client RPM.
- CA RPMs version (sft-caver, critical): check the version of the Certificate Authority RPMs installed on the WN and compare them with the reference ones. If the RPM check fails for any reason (other installation method), fall back to a physical file test (MD5 checksum comparison of all CA certificates against the reference list).
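
The marking rule above is essentially a filter over the per-test results; a minimal sketch of how it could be applied (hypothetical data structures, not the actual SFT code) is:

```python
# Apply the marking rule described above: a site is marked "bad" if any test
# flagged as critical fails; failures of non-critical tests only produce
# warnings. The test results below are illustrative.

def site_status(results: dict[str, bool], critical: set[str]) -> tuple[str, list[str]]:
    """results maps test name -> passed?; returns (status, failed test names)."""
    failed = [name for name, ok in results.items() if not ok]
    if any(name in critical for name in failed):
        return "bad", failed
    return "ok", failed  # failed non-critical tests are reported only as warnings

if __name__ == "__main__":
    critical = {"sft-job", "sft-softver", "sft-caver"}          # as listed on this slide
    results = {"sft-job": True, "sft-wn": True,
               "sft-softver": True, "sft-rgma": False}          # sft-rgma is non-critical
    print(site_status(results, critical))                       # -> ('ok', ['sft-rgma'])
```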

14 Description of current SFT tests (2 of 2)
- BrokerInfo (sft-brokerinfo, critical): try to execute edg-brokerinfo -v getCE.
- R-GMA client (Relational Grid Monitoring Architecture) (sft-rgma, not critical): test the R-GMA client software configuration by executing the rgma-client-check utility script.
- CSH test (sft-csh, critical): try to create and execute a very simple CSH script which dumps the environment to a file. Fails if the CSH script is unable to execute and the dump file is missing.
- Apel test (Accounting log parser and publisher) (sft-apel, not critical): check whether Apel is publishing accounting data for the site by using the command: rgma -c "select count(*) from LcgRecords" (see the sketch below).
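
As a rough illustration of how the non-critical Apel check could be wrapped by a monitoring script, the sketch below shells out to the rgma command quoted above and treats a non-zero exit code, a timeout or empty output as failure. It assumes an rgma client on the PATH and is not the actual sft-apel implementation.

```python
# Hedged sketch: wrap the Apel accounting check around the "rgma -c ..." query
# quoted on this slide. Assumes an rgma command-line client is installed and
# configured; the real sft-apel test may parse its output differently.
import subprocess

def check_apel_publishing() -> bool:
    cmd = ["rgma", "-c", "select count(*) from LcgRecords"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired):
        return False            # client missing, not configured, or hanging
    return proc.returncode == 0 and bool(proc.stdout.strip())

if __name__ == "__main__":
    print("sft-apel:", "OK" if check_apel_publishing() else "WARNING")  # non-critical
```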

15 SFT tests can be aggregates
Replication management using LCG tools (sft-lcg-rm, critical): a super-test that succeeds only if all of the following tests succeed (see the sketch after this list).
- GFAL infosys: check that the LCG_GFAL_INFOSYS variable is set and that the top-level BDII can be reached and queried.
- lcg-cr to default SE: copy and register a short text file to the default SE using the lcg-cr command.
- lcg-cp default SE -> WN: copy the file registered in test 8-2 back to the WN using the lcg-cp command.
- lcg-rep default SE -> central SE: replicate the file registered in test 8-2 to the chosen "central" SE using the lcg-rep command.
- 3rd-party lcg-cr -> central SE: copy and register a short text file to the chosen "central" SE using the lcg-cr command.
- 3rd-party lcg-cp central SE -> WN: copy the file registered in test 8-5 to the WN using the lcg-cp command.
- 3rd-party lcg-rep central SE -> default SE: replicate the file registered in test 8-5 to the default SE using the lcg-rep command.
- lcg-del from default SE: delete replicas of all the files registered in the previous tests using the lcg-del command.
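
The super-test semantics above (succeed only if every step succeeds) amount to short-circuit evaluation over the chain of lcg-* operations. A schematic sketch follows, with the individual steps stubbed out as booleans rather than calling the real lcg-utils commands:

```python
# Schematic sketch of an sft-lcg-rm style super-test: run the replica
# management steps in order and fail as soon as one step fails. The lambdas
# below are stand-ins for the real lcg-cr/lcg-cp/lcg-rep/lcg-del calls.
from typing import Callable

Step = tuple[str, Callable[[], bool]]

def run_super_test(steps: list[Step]) -> tuple[bool, list[tuple[str, bool]]]:
    log: list[tuple[str, bool]] = []
    for name, step in steps:
        ok = step()
        log.append((name, ok))
        if not ok:
            return False, log      # the super-test fails on the first failed step
    return True, log

if __name__ == "__main__":
    steps: list[Step] = [
        ("GFAL infosys", lambda: True),
        ("lcg-cr to default SE", lambda: True),
        ("lcg-cp default SE -> WN", lambda: True),
        ("lcg-rep default SE -> central SE", lambda: False),   # simulate a failure
        ("lcg-del from default SE", lambda: True),
    ]
    ok, log = run_super_test(steps)
    print("sft-lcg-rm:", "OK" if ok else "FAILED", log)
```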

16 SFT results display
A CIC operational view; not intended to be a management overview.

17 GRIDVIEW: a visualisation tool for LCG
- An umbrella tool for visualisation which can provide a high-level view of various grid resources and functional aspects of the LCG.
- Displays a dashboard for the entire grid and provides a summary of the various metrics monitored by different tools at various sites.
- To be used to notify fault situations to grid operations and user-defined events to VOs, by site and network administrators to view metrics for their sites, and by VO administrators to see resource availability and usage for their VOs.
- GRIDVIEW is currently monitoring gridftp data transfer rates (used extensively during SC3) and job statistics.
- The basic framework for metric visualisation, representing grid sites on a world map, is ready.
- We propose to extend GRIDVIEW to satisfy our service metrics requirements: start with simple service status displays of the services required at each Tier 0 and Tier 1 site, then extend to service quality metrics, including availability and down times, and to quantitative metrics that allow comparison with the LCG site MoU values.

18 Block schematic of Gridview architecture (courtesy of BARC)
Components: LCG sites and monitoring tools -> R-GMA -> archiver module (R-GMA consumer) -> Oracle database at CERN -> data analysis & summarisation -> presentation module -> GUI & visualisation.

19 GRIDVIEW gridftp display: data rates from CERN to Tier 1 sites

20 GRIDVIEW visualisation: CPU status (red = busy, green = free)
Simulated data.

21 GRIDVIEW visualisation: fault status (green = OK, red = faulty; the height of a cylinder indicates the size of the site)
Simulated data.

22 Work in progress
- Implement in Oracle at CERN the SAME database schema for storage of SFT results, recently agreed between CERN and IN2P3.
- Obtain raw test results as below for the required services on a site, with different service-dependent frequencies (cron job, configurable):
  - each result is identified by: site, service node, VO, test name
  - each result gives: status, summary (values), details (log)
  (see the sketch after this list)
- Tests under development or to be integrated in SAME:
  - SRM: not clearly defined yet, but should be OK/FAILED for the most typical operations
  - LFC: OK/FAILED for several tests such as (un)registering a file and creating/removing/listing a directory; probably also measurements of the time to perform the operation
  - FTS: not clearly defined yet. The existing gridftp rates from GRIDVIEW will also be used in quantity/quality measures.
  - CE: OK/FAILED for a number of tests in SFT (job submission, lcg-utils, R-GMA, software version, experiment-specific tests)
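
The raw-result format listed above (identified by site, service node, VO and test name; carrying status, summary and details) maps naturally onto a small record type. A hypothetical sketch, not the actual SAME schema, is:

```python
# Hypothetical record type for one raw SAME test result, following the fields
# listed on this slide plus a timestamp. All field and example names are
# illustrative, not the real schema or real hosts.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TestResult:
    site: str
    service_node: str
    vo: str
    test_name: str
    status: str          # e.g. "OK" or "FAILED"
    summary: str         # short values, e.g. a measured operation time
    details: str         # full log output
    timestamp: datetime

if __name__ == "__main__":
    r = TestResult("CERN-PROD", "lfc.example.org", "atlas", "lfc-mkdir",
                   "OK", "0.4s", "directory created, listed and removed",
                   datetime.now(timezone.utc))
    print(r)
```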

23 Other SFT tests under development or to be integrated
- RB:
  - "active" sensor (probe): OK/FAILED for a simple test job, plus the time to get from the "Submitted" to the "Scheduled" state (performance)
  - "passive" sensor: a summary from the real-jobs monitoring being developed in GridView. A quality metric can be produced from the job success rate, e.g. above 95% = normal, down to 80% = degraded, down to 50% = affected, down to 20% = severely affected, below 20% = failed (see the sketch after this list).
- Top-level BDII: OK/FAILED (try to connect and make a query); response time to execute a full query, to generate a quality metric.
- Site BDII: as for the top-level BDII; additionally OK/FAILED for a number of sanity checks.
- MyProxy: OK/FAILED for several simple tests (store proxy, renew proxy); possibly response-time measurements, to generate a quality metric.
- VOMS: OK/FAILED for voms-proxy-init and maybe a few other operations.
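
The job-success-rate bands quoted for the passive RB sensor translate directly into a threshold lookup. A sketch with the band boundaries copied from the slide (boundary handling, >= versus >, is a detail the slide leaves open):

```python
# Map an RB job success rate onto the quality bands quoted above:
# above 95% = normal, down to 80% = degraded, down to 50% = affected,
# down to 20% = severely affected, below 20% = failed.
def rb_quality(success_rate: float) -> str:
    bands = [(0.95, "normal"), (0.80, "degraded"),
             (0.50, "affected"), (0.20, "severely affected")]
    for threshold, label in bands:
        if success_rate >= threshold:
            return label
    return "failed"

if __name__ == "__main__":
    for rate in (0.99, 0.85, 0.60, 0.30, 0.10):
        print(f"{rate:.0%} job success -> {rb_quality(rate)}")
```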

24 SAME database schema: test definition table

25 SAME database schema: test data table
The data table is where all the results produced by tests and sensors are stored. It will also contain the results of the summary tests generated by the GRIDVIEW summarisation module (an illustrative sketch follows).
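
To make the idea concrete, here is a toy version of such a data table using an in-memory SQLite database; the production schema is in Oracle and its real column names and types are not reproduced here, the columns below simply follow the raw-result fields from slide 22.

```python
# Illustrative only: a toy "test data" table holding raw and summary results.
# The real SAME schema lives in Oracle; column names/types here are guesses
# modelled on the fields listed in slide 22.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test_data (
        site         TEXT,
        service_node TEXT,
        vo           TEXT,
        test_name    TEXT,
        status       TEXT,     -- OK / FAILED / UNKNOWN
        summary      TEXT,
        details      TEXT,
        is_summary   INTEGER,  -- 1 if produced by the summarisation module
        timestamp    TEXT
    )
""")
conn.execute(
    "INSERT INTO test_data VALUES (?,?,?,?,?,?,?,?,?)",
    ("CERN-PROD", "ce.example.org", "dteam", "sft-job",
     "OK", "", "job completed and output retrieved", 0, "2006-02-13T12:00:00Z"),
)
print(conn.execute("SELECT site, test_name, status FROM test_data").fetchall())
```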

26 Responsibilities for new tests
Extend the SFT tests to include all baseline Tier 0 and Tier 1 service components and store the results in the SAME Oracle database. Coordination of sensors: Piotr Nyczyk (CERN).

Service / Responsible / Class / Comments:
- SRM 2.1: Dave Kant, class C; monitoring of Storage Elements t.b.d.
- LFC: James Casey, class C/H; will be per VO present at a given site
- FTS: FTS support; gridftp data rates will be in the SAME framework
- CE: Piotr Nyczyk; monitored by SFT today
- RB: job monitor exists (few modifications needed)
- Top-level BDII: Min-Hong Tsai; can be integrated with GStat
- Site BDII: class H; monitored by GStat today, data will go into the SAME framework
- Myproxy: Maarten Litmaath
- VOMS: Valerio Venturi
- R-GMA: Laurence Field, class M

27 Proposed extensions to GRIDVIEW
- Enhance the GRIDVIEW summarisation module to aggregate individual SFT results into service results.
- Start with a simple logical AND of the success/failure of all the critical tests of a given service to define the overall success/failure of that service. Store the status (success / fail / not known) back in the database with a timestamp; "not known" allows for no result from some tests (see the sketch after this list).
- Also store quantitative results, from SFT tests and other sources, of the MoU-defined resources and other interesting metrics:
  - CPU capacity
  - disk storage under SEs
  - batch job numbers in the site queue, running, and success rate (per VO)
  - down time of services failing SFT tests (including scheduled downtimes)
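
Read this way, the proposed summarisation step is a three-valued AND over the critical test results of a service: success, fail, or not known when some results are missing. A sketch under that reading (names chosen for illustration):

```python
# Sketch of the proposed summarisation rule: a service is a success only if
# all of its critical tests succeeded, a failure if any critical test failed,
# and "not known" if some critical results are missing. The timestamp mirrors
# the "store status back with a timestamp" point above.
from datetime import datetime, timezone
from typing import Optional

def service_status(critical_results: dict[str, Optional[bool]]) -> tuple[str, str]:
    values = list(critical_results.values())
    if any(v is False for v in values):
        status = "fail"
    elif not values or any(v is None for v in values):
        status = "not known"        # some critical tests produced no result
    else:
        status = "success"
    return status, datetime.now(timezone.utc).isoformat()

if __name__ == "__main__":
    print(service_status({"sft-job": True, "sft-lcg-rm": True}))    # success
    print(service_status({"sft-job": True, "sft-lcg-rm": None}))    # not known
    print(service_status({"sft-job": False, "sft-lcg-rm": True}))   # fail
```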

28 Further extensions to GRIDVIEW
- Allow for dependencies:
  - network failures to a site or within a site
  - component masking: a failed component causes others to appear to fail their tests
  - failure of part of the SFT service itself (based on 3 servers at CERN)
- Generate quality rating metrics:
  - use the percentage of the nominal MoU value for CPU capacity available, disk storage under SEs, average downtime per service per degradation level, and averaged annual availability per service in a date range to calculate quality levels
  - have 4 to 5 quality levels for simple dashboard visualisation:
    - above 80% of the MoU target = acceptable, display green
    - 50% to 80% of the MoU target = degraded, display yellow
    - 20% to 50% of the MoU target = affected, display orange
    - 0% to 20% of the MoU target = failing, display red
  - aggregate to a global site quality: take the lowest-value quality level, or perhaps an importance-weighted average (see the sketch after this list)
  - use the last few months' average availability rather than the whole year
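
The quality bands and the "take the lowest" site aggregation proposed above can be sketched as follows (band boundaries as on the slide; the importance-weighted alternative is left out):

```python
# Sketch of the proposed quality rating: express each measured quantity as a
# fraction of its MoU target, bucket it into the four bands from the slide,
# and aggregate to a site rating by taking the worst (lowest) fraction.
LEVELS = [(0.80, "acceptable (green)"), (0.50, "degraded (yellow)"),
          (0.20, "affected (orange)"), (0.00, "failing (red)")]

def quality_level(fraction_of_mou: float) -> str:
    for threshold, label in LEVELS:
        if fraction_of_mou >= threshold:
            return label
    return "failing (red)"

def site_quality(service_fractions: dict[str, float]) -> str:
    return quality_level(min(service_fractions.values()))   # lowest-level variant

if __name__ == "__main__":
    fractions = {"cpu capacity": 0.92, "disk under SEs": 0.64, "srm availability": 0.55}
    for name, frac in fractions.items():
        print(f"{name}: {frac:.0%} of MoU -> {quality_level(frac)}")
    print("site overall:", site_quality(fractions))
```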

29 Extend GRIDVIEW visualisation
- Add visualisation of a per-site service status dashboard:
  - a screen for each site
  - a simple red/(orange/yellow/)green button per service (LFC per VO)
  - click on a service button to display the service status history
  - links to full SFT reports
  - links to other GRIDVIEW displays (including VO breakdowns): gridftp data rates from CERN, advanced job monitoring
- Display of quality metrics, instantaneous and with drill-down to history with time-period selection:
  - percentages of MoU targets
  - mean time to repair for each service
  - average availability per service
- A screen showing global site quality for all Tier 0 and Tier 1 sites, instantaneous and with drill-down to history with time-period selection.

30 Prototype CPU availability display
- These metrics show the availability of resources (currently only CPUs) and of the sites which provide them. Availability means that a site which provides any resources passes all critical tests. The current list of critical tests includes all critical tests from SFT plus several essential tests from GStat.
- The metrics are presented as cumulative bar charts for averaged metrics and as cumulative line plots for the history, for the whole grid and for individual regions.
- Plots show in red the number of CPUs or sites that were "unavailable", integrated over a given interval. For CPUs, the total number of CPUs published as provided by a site is counted if the site was failing any critical test, and 0 if no critical test failed.
- Plots show in green the number of CPUs or sites that were "available", integrated over a given interval. For CPUs, the total number of CPUs published as provided by a site is counted if the site was passing all critical tests, and 0 if any critical test failed.
- The daily value of the number of sites available is integrated over the day, and a site with some downtime contributes the corresponding fractional availability.
- To convert to KSi2K we can use the worker-node rating published by each CE (the grid middleware assumes homogeneity under a CE, so we need an average). Currently these values do not seem to be universally reliable; they could be compared with the post-job accounting data collected at RAL (assuming those are better calculated). A toy sketch of the daily integration and the KSi2K conversion follows this list.
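
As a toy illustration of the daily integration and the KSi2K conversion described above, one might compute the following (all site names, CPU counts and ratings are invented):

```python
# Toy sketch of the prototype CPU availability metric: a site with some
# downtime during a day contributes a fractional availability, and published
# CPU counts are converted to KSi2K using an average per-CE worker node
# rating (the middleware assumes homogeneity under a CE).
def daily_site_availability(hours_passing_critical_tests: float) -> float:
    return hours_passing_critical_tests / 24.0

def ksi2k_capacity(cpu_count: int, avg_rating_ksi2k: float) -> float:
    return cpu_count * avg_rating_ksi2k

if __name__ == "__main__":
    # site -> (hours passing all critical tests, published CPUs, avg KSi2K per CPU)
    sites = {"SITE-A": (24.0, 800, 1.2),
             "SITE-B": (18.0, 300, 1.0)}    # 6 hours failing a critical test
    for name, (up_hours, cpus, rating) in sites.items():
        print(f"{name}: availability {daily_site_availability(up_hours):.2f}, "
              f"capacity {ksi2k_capacity(cpus, rating):.0f} KSi2K")
    print("sites available today (fractional sum):",
          sum(daily_site_availability(h) for h, _, _ in sites.values()))
```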

31 Prototype sites CPU availability metric
Work in progress.

32 CERN FIO Group Service View project
- A project to provide a web-based tool that dynamically shows availability, basic information and/or statistics about IT services, as well as the dependencies between them.
- Targeted at local CERN services: CASTOR, Remedy (trouble tickets), AFS, backup and restore, production clusters.
- Service aggregation to understand dependencies, e.g. CVS depends on AFS, or a failed component makes other tests appear to fail.
- Will have quality thresholds; at least four levels are envisaged (fully available, degraded, affected, unavailable), e.g. 60-80% of batch nodes available is interpreted as a degraded batch service.
- Will include history displays and will calculate service availabilities and down times.
- Simple displays of service level, e.g. red, orange, yellow, green.
- GRIDVIEW summarisation could become an information provider.
- The projects are following each other's work, looking for possible synergies.

33 CERN FIO Group Service level views planning

34 Example visualisation from CERN AIS – multiple tests per service

35 Example visualisation from CERN windows services

36 Spider plot visualisation
Normalised quasi-independent metrics: if all values reached their targets the plot would be a circle. Colour the plot by the quality level of the worst service.
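
One way to read "normalised quasi-independent metrics" is that each measured value is divided by its target and capped at one, so that a site meeting every target traces out the unit circle. As a formula (an interpretation, not something the slide states explicitly):

$$ r_i = \min\!\left(\frac{m_i}{t_i},\, 1\right), \qquad i = 1, \dots, N $$

where $m_i$ is the measured value of metric $i$ and $t_i$ its target; the plot is a circle exactly when every $r_i$ equals 1, and the whole plot is coloured by the quality level of the worst service.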

37 Spider plot including some history

38 OSG visualisation: Global health spider plot
From on 3 Feb 2006

39 OSG Visualisation: Site resources

40 Timescale
- Updated framework and schema (SAME): available now for testing the integration of data sources.
- Integration of new SFT tests: after CHEP06, so from end February as they become available.
- Displays:
  - GridView: we would like to produce a first version of the status dashboards in parallel with SFT test integration. Keep it lightweight in view of the FIO work. Planning a top-level Tier 0 screen and one for all Tier 1 sites, with drill-down to individual Tier 1 sites and their individual SFT test histories.
  - FIO tool: expected at end March; investigate its use as a back end.
- Goal: have a first version of the dashboard displays available in production by the end of March.

41 Some discussion points
- Can we improve the consistency of the various grid monitoring interfaces? There is the GRIDVIEW access point and the Grid Operations Centre access point, which includes links to GStat, the SFT reports and GridIce. Who should use which monitoring displays, and for what?
- Should we try to join the CERN FIO service view framework?
- Should scheduled downtime be counted as non-available time?
- How many levels of quality of service do we need?
- What threshold values should define the quality of a service, e.g. for:
  - the time for a BDII to respond to an LDAP query
  - the time for an RB to matchmake (find a suitable CE)
  - the time for a simple batch job to complete
  - the time to register a new LFC entry
  - what fraction of MoU targets is 'acceptable'?
- How should a null result for a test be handled properly: reuse the last value?
- How can the calculation of the average KSi2K rating under a CE be automated?

