Computation of Service Availability Metrics in Gridview
Digamber Sonvane, Rajesh Kalmady, Phool Chand, Kislay Bhatt, Kumar Vaibhav
Computer Division, BARC, India


Computation of Service Availability Metrics in Gridview
PMB Meeting, 18 Sep

Computation of Availability Metrics

Gridview computes Service Availability Metrics per VO using SAM test results.

Computed metrics include:
- Service Status
- Availability
- Reliability

The above metrics are computed:
- per Service Instance
- per Service (e.g. CE) for a site
- per Site
- as an aggregate of all Tier-1/0 sites
- for Central Services
- over various periodicities: Hourly, Daily, Weekly and Monthly

Metric Definitions

- Status of a service instance, service or site is its state at a given point in time, e.g. UP, DOWN, SCHEDULED DOWN or UNKNOWN.
- Availability of a service instance, service or site over a given period is the fraction of time it was UP during that period.
- Reliability of a service instance, service or site over a given period is the fraction of time it was UP (the Availability) divided by the scheduled availability (1 - Scheduled Downtime - Unknown Interval) for that period:
    Reliability = Availability / (1 - Scheduled Downtime - Unknown)
    Reliability = Availability / (Availability + Unscheduled Downtime)
- This Reliability definition was approved in the LCG MB Meeting of 13 Feb 2007.
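The reliability formula above can be sketched directly in code. This is a minimal illustration (the function name is our own, not part of Gridview); all inputs are fractions of the reporting period:

```python
def reliability(availability: float, sched_down: float, unknown: float) -> float:
    """Reliability = Availability / (1 - Scheduled Downtime - Unknown).

    All arguments are fractions of the period in [0, 1]. The denominator is
    the scheduled availability; the expression is equivalent to
    Availability / (Availability + Unscheduled Downtime).
    """
    scheduled_availability = 1.0 - sched_down - unknown
    if scheduled_availability <= 0.0:
        # The whole period was scheduled down or unknown: reliability is
        # undefined, so report 0 by convention here.
        return 0.0
    return availability / scheduled_availability
```

Note that scheduled downtime and unknown intervals leave the denominator, so a service that was UP whenever it was scheduled to be up scores 100% reliability regardless of its raw availability.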

Metric Computation Example

Consider the status over one day:
- UP: 12 hrs
- Scheduled Down: 6 hrs
- Unknown: 6 hrs

The Availability graph (1st bar in the graph) would show:
- Availability (green): 50%
- Scheduled Down (yellow): 25%
- Unknown (grey): 25%

The Reliability graph (1st bar in the graph) would show 100%:
Reliability = Availability (green) / (Availability (green) + Unscheduled Downtime (red))

Reliability is not affected by Scheduled Downtime or the Unknown Interval.

[Sample Availability and Reliability graphs shown on slide]
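The worked example above can be checked with a small script (helper name and status labels are illustrative assumptions, not Gridview code):

```python
def day_fractions(segments):
    """segments: list of (status, hours) tuples covering the day.
    Returns the fraction of the day spent in each status."""
    total = sum(hours for _, hours in segments)
    fractions = {}
    for status, hours in segments:
        fractions[status] = fractions.get(status, 0.0) + hours / total
    return fractions

# The slide's example: 12h UP, 6h scheduled down, 6h unknown.
f = day_fractions([("UP", 12), ("SCHED_DOWN", 6), ("UNKNOWN", 6)])
availability = f.get("UP", 0.0)          # 50%
scheduled_availability = 1.0 - f.get("SCHED_DOWN", 0.0) - f.get("UNKNOWN", 0.0)
rel = availability / scheduled_availability   # 0.5 / 0.5 = 100%
```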

New Algorithm

Old (current) algorithm:
- Computes service status on a discrete time scale with a precision of one hour
- Test results are sampled at the end of each hour
- Service status is based only on the latest results for an hour
- Availability for an hour is always 0 or 1

Drawbacks of the discrete time scale:
- A test may pass or fail several times in an hour
- It is not possible to represent several events within an hour
- Precise information about the exact time of an event is lost

New algorithm:
- Computes service status on a continuous time scale
- Service status is based on all test results
- Computes availability metrics with higher accuracy
- Conforms to Recommendation 42 of the EGEE-II Review Report on introducing measures of robustness and reliability
- Computes reliability metrics as approved by the LCG MB on 13 Feb 2007
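The continuous-time idea can be sketched as integrating UP time over exact event timestamps instead of sampling once per hour. This is a simplified illustration (our own function, timestamps in hours, only UP/DOWN events within the period), not the actual Gridview implementation:

```python
def availability_continuous(events, period_start, period_end):
    """events: time-ordered list of (timestamp, 'UP' | 'DOWN') status changes
    within [period_start, period_end]. Returns the fraction of the period
    spent UP, integrated over the exact intervals between events."""
    up_time = 0.0
    status = None            # status before the first event is unknown
    prev = period_start
    for t, new_status in events:
        if status == "UP":
            up_time += min(t, period_end) - prev
        prev, status = t, new_status
    if status == "UP":       # account for the tail of the period
        up_time += period_end - prev
    return up_time / (period_end - period_start)
```

With hourly sampling, a test that failed and recovered inside one hour would be invisible; here every event contributes its exact duration.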

Major Differences

Major differences between the old (current) and new algorithms:

- Service status computation on a continuous time scale
- Consideration of Scheduled Downtime (SD):
  - A service may pass tests even during SD, leading to an inaccurate reliability value
  - The new algorithm ignores test results during SD and marks the status as SD
- Validity of test results:
  - 24 hours for all tests in the old algorithm
  - Defined by the VO separately for each test in the new algorithm
  - Earlier results are invalidated after a scheduled interruption
- Handling of UNKNOWN status
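The test-result validity rule can be sketched as follows. The function name and signature are illustrative assumptions, not actual Gridview code:

```python
from datetime import datetime, timedelta

def is_result_valid(result_time, now, validity, last_sched_down_end=None):
    """A test result counts towards the status only if it is younger than the
    validity window (24h for all tests in the old algorithm, VO-defined per
    test in the new one) and was produced after the last scheduled downtime
    ended."""
    if now - result_time > validity:
        return False  # expired: older than the validity window
    if last_sched_down_end is not None and result_time < last_sched_down_end:
        return False  # invalidated by the scheduled interruption
    return True
```

A result that fails either check is treated as missing, which feeds into the UNKNOWN handling described on the next slide.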

Handling of UNKNOWN Status

Old (current) algorithm:
- Computations are based merely on the available parameters
- Parameters whose values are not known are ignored

New algorithm:
- Takes all essential parameters into consideration
- Status is explicitly marked UNKNOWN when the values of some essential parameters are not known

This affects:
- Service instance availability
- Site availability

Computation of Service Instance Status

The VO marks critical tests for each service; a service should be considered UP only if all critical tests pass.

Old (current) algorithm:
- Marks the status as UP even if the result of only one of many critical tests is available and UP
- Ignores the status of critical tests whose results are not known

New algorithm:
- Marks the status as UNKNOWN if any critical test is UNKNOWN, even when the status of all known critical tests is UP
- When any of the known results is DOWN, the status is marked DOWN regardless
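The new algorithm's rule amounts to ANDing the critical test results with DOWN taking precedence over UNKNOWN. A minimal sketch (our own function, with string status labels as an assumption):

```python
def service_instance_status(critical_results):
    """critical_results: one 'UP' | 'DOWN' | 'UNKNOWN' per critical test
    defined by the VO. Implements the new algorithm's rule: any failed
    critical test makes the instance DOWN; otherwise any missing result
    prevents declaring it UP."""
    if any(r == "DOWN" for r in critical_results):
        return "DOWN"      # a known failure wins over everything else
    if any(r == "UNKNOWN" for r in critical_results):
        return "UNKNOWN"   # cannot claim UP without all critical results
    return "UP"            # all critical tests passed
```

The old algorithm would instead drop the UNKNOWN entries and report UP from whatever results happened to be available.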

Computation of Service Instance Status: Differences Between the Old (Current) and New Algorithms

[Timeline figure comparing the old and new algorithms over tests 1-4; legend: Up, Down, Unknown, Scheduled Down]

- All results available and all tests passed: no difference between the old and new algorithms.
- Two tests failed, but the old algorithm shows the service UP because only the final tests in the hour are considered.
- One test failed at the end of the hour, so the old algorithm shows the service DOWN because only the final tests in the hour are considered.
- The service instance is marked as scheduled shutdown, and the result of test 4 is unknown. The old algorithm ignores this, whereas the new algorithm marks the service instance UNKNOWN.
- Status UP because tests pass even during scheduled downtime results in an inaccurate reliability value.
- In the new algorithm, all earlier test results are invalidated and treated as unknown at the end of a scheduled shutdown. A test result is considered invalid after a VO-specific timeout.
- Old algorithm: coarse granularity, with status changes only at hour boundaries. New algorithm: finest granularity, with status able to change at any moment.

In the new algorithm, the status is based on all the test results and is a more accurate representation of the true state of the service.

Discovery of Service Instances

Old (current) algorithm:
- Identifies service instances based on available test results
- Computes the status of only those services for which at least one critical test is defined and at least one critical test result is available
- Leads to unexpected behaviour, such as a service instance suddenly disappearing from Gridview's monitoring page

New algorithm:
- Tracks service instances based on their registration in the Information System, either in BDII or GOCDB
- Computes the status of all service instances supporting the given VO, irrespective of whether critical tests are defined or results are available

Computation of Site Status

The status of a service (e.g. CE) is computed by ORing the statuses of all its instances. The status of the site is computed by ANDing the statuses of all registered site services; a site should be UP only if all registered site services are UP.

Old (current) algorithm:
- Marks the site UP even if the status of only one of many registered services is known and UP
- Ignores services whose status is not known

New algorithm:
- Marks the site UNKNOWN if any service is UNKNOWN, even when the status of all known services is UP
- When any of the known services is DOWN, the status is marked DOWN regardless

Service & Site Status Calculation (new algorithm)

[Diagram: Test Results -> test status per (test, si, vo) -> Service Instance Status -> Service Status -> Site Status]

Definitions:
- Service = a service type (e.g. CE, SE, sBDII, ...)
- Service instance (si) = a (service, node) combination
- Only the tests critical for a VO are considered

Service instance status (aggregate test statuses per (si, vo), ANDing):
- Service marked as scheduled down (SD) -> SD
- All test statuses OK -> UP
- At least one test status down (failed) -> DOWN
- No test status down and at least one unknown -> UNKNOWN

Service status (aggregate instance statuses per (site, service, vo), ORing):
- At least one service instance UP -> UP
- No instance UP and at least one SD -> SD
- No instance UP or SD and at least one DOWN -> DOWN
- All instances UNKNOWN -> UNKNOWN

Site status (aggregate site service statuses per (site, vo), ANDing):
- All service statuses UP -> UP
- At least one service status DOWN -> DOWN
- No service DOWN and at least one SD -> SD
- No service DOWN or SD and at least one UNKNOWN -> UNKNOWN
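The ORing and ANDing rules on this slide can be written as two small precedence functions. This is a sketch with our own names and string labels, not Gridview's implementation:

```python
def or_status(instance_statuses):
    """ORing instance statuses into a service status: one UP instance is
    enough for the service to be UP; then SD, then DOWN; UNKNOWN only when
    all instances are UNKNOWN."""
    if "UP" in instance_statuses:
        return "UP"
    if "SCHED_DOWN" in instance_statuses:
        return "SCHED_DOWN"
    if "DOWN" in instance_statuses:
        return "DOWN"
    return "UNKNOWN"

def and_status(service_statuses):
    """ANDing service statuses into a site status: every service must be UP
    for the site to be UP; a single DOWN service wins, then SD, then
    UNKNOWN."""
    if "DOWN" in service_statuses:
        return "DOWN"
    if "SCHED_DOWN" in service_statuses:
        return "SCHED_DOWN"
    if "UNKNOWN" in service_statuses:
        return "UNKNOWN"
    return "UP"
```

Note the inverted precedence: for OR, UP dominates (redundant instances back each other up); for AND, DOWN dominates (one broken service degrades the site).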

Availability and Reliability Computation

- Hourly Availability, Scheduled Downtime and Unknown Interval for a service instance, service and site are computed from the respective status information.
- Hourly Reliability is computed from the Availability, Scheduled Downtime and Unknown Interval for that hour.
- Daily, Weekly and Monthly Availability, Scheduled Downtime and Unknown Interval are computed by averaging the hourly figures.
- Daily, Weekly and Monthly Reliability figures are computed directly from the Availability, Scheduled Downtime and Unknown Interval figures over the corresponding periods.
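The last two points matter: longer-period reliability is derived from the averaged availability figures, not by averaging hourly reliabilities. A minimal sketch under that reading (function name is ours):

```python
def period_metrics(hourly):
    """hourly: list of (availability, sched_down, unknown) fractions, one
    tuple per hour. Returns (availability, reliability) for the whole period:
    the three components are averaged over the hours, and reliability is then
    computed from those aggregates."""
    n = len(hourly)
    avail = sum(a for a, _, _ in hourly) / n
    sched = sum(s for _, s, _ in hourly) / n
    unknown = sum(u for _, _, u in hourly) / n
    scheduled_availability = 1.0 - sched - unknown
    rel = avail / scheduled_availability if scheduled_availability > 0 else 0.0
    return avail, rel
```

Averaging hourly reliabilities instead would over-weight hours with little scheduled-up time, which is presumably why the figures are recomputed from the aggregates.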

Conclusions

Old (current) algorithm:
- Coarse-grained representation of service status
- Arrives at conclusions based on incomplete data
- Can be misleading, giving the impression that everything is OK when the facts are not known
- May adversely affect the actual availability of the service, as it might hide some service breakdowns rather than generating alerts

New algorithm:
- Resolves the ambiguities present in the old (current) algorithm
- Reports the true status of the service, stating it as UNKNOWN whenever there is inadequate data to establish the state of the service
- Fine-grained model; computes service availability metrics with more accuracy
- Implementation is ready; we are awaiting the management's nod for deployment

Thank You

Your comments and suggestions, please.