Published by Abigail Stewart. Modified over 8 years ago.
1
FIO Fault Tolerance Workshop Hugo Caçote @ 08/01/2004
Analysis of WP4 Fault Tolerance Framework
2
Fault Tolerance Framework Objectives
“Provide functionality to improve reliability and reduce operating cost by providing means to perform as many repair and maintenance functions as possible automatically…”
3
Fault Tolerance Framework General Architecture
[Diagram: node_A@cern and node_B@cern. Components: fmon server, fault tolerance, MSA, sensors, actuators. Links: SOAP, SOAP, store metrics, subscribe metrics]
4
Fault Tolerance Framework Architecture
[Diagram: rule_A.xml and rule_B.xml; rulescan; edg-ftd; actuator_A and actuator_B. Labels: subscribe to metrics; launches actuators; logs results; checks syntax; parses rules; rules translated to reverse polish notation]
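The translation of rules to reverse polish notation can be illustrated with the shunting-yard algorithm. This is a minimal sketch, not the framework's actual code: the slides do not say how the translation is implemented, and the function name `to_rpn`, the tokenizer and the precedence table are all assumptions made for the example. Only binary operators are handled (the unary `not` and the `string==`/`string!=` operators are left out), and `gt` is used for the comparison because it appears in the slides' operator list.

```python
import re

# Assumed precedence (higher binds tighter); the slides do not specify one.
PRECEDENCE = {
    "or": 1, "||": 1,
    "and": 2, "&&": 2,
    "==": 3, "!=": 3, "eq": 3, "neq": 3,
    "gt": 4, "gte": 4, "lt": 4, "lte": 4,
    "+": 5, "-": 5,
    "*": 6, "/": 6,
}

# Hypothetical tokenizer: symbolic operators, parentheses,
# identifiers such as node_A.8002 (also matches word operators), numbers.
TOKEN = re.compile(r"\|\||&&|==|!=|[()+\-*/]|[A-Za-z_][\w.]*|\d+(?:\.\d+)?")


def to_rpn(expr):
    """Convert an infix rule expression to a list of RPN tokens."""
    output, stack = [], []
    for tok in TOKEN.findall(expr):
        if tok in PRECEDENCE:
            # Pop operators of higher or equal precedence (left-associative).
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # operand: metric identifier or literal
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


print(to_rpn("node_A.8002 gt (node_A.8002/3 + node_B.9001)"))
# ['node_A.8002', 'node_A.8002', '3', '/', 'node_B.9001', '+', 'gt']
```

An RPN form needs no parentheses and can be evaluated with a single stack, which is one plausible reason for translating rules once at load time.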
5
Fault Tolerance Framework Rule
XML file used to define an FT action. Values in the example rule: ps /bin/ lxdev04 9501 eq 1
Annotations on the rule:
- shell for the actuator
- actuator name
- argument for the actuator
- actuator path
- node identifier
- metric identifier
- mathematical operator: ==, !=, +, -, /, *, (, ), gte, gt, lt, lte, eq, neq, not, and, &&, or, ||, string==, string!=
- values used to compare with the values returned by the monitoring system
- metric subscription
- specifies the hierarchy of actuators to be launched to escalate the fix of a problem
Rule expressions:
- sample values of integer, string and boolean types
- apply to the latest sampled values
- the result must always be a boolean
- run depending on the frequency of sampling
- example of expression: node_A.8002 > (node_A.8002/3 + node_B.9001)
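How an expression might be applied to the latest sampled values, and why the result has to be a boolean, can be sketched as follows. This is an illustrative stack evaluator over a rule already in reverse polish form. It is not the edg-ftd implementation: the function name `evaluate_rpn`, the `latest` dictionary and the sample metric values are hypothetical, and the unary `not` operator is omitted for brevity.

```python
import operator

# Binary operators from the slide's list, mapped to Python callables.
BINARY_OPS = {
    "+": operator.add, "-": operator.sub,
    "*": operator.mul, "/": operator.truediv,
    "==": operator.eq, "eq": operator.eq,
    "!=": operator.ne, "neq": operator.ne,
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
    "and": lambda a, b: bool(a) and bool(b),
    "&&": lambda a, b: bool(a) and bool(b),
    "or": lambda a, b: bool(a) or bool(b),
    "||": lambda a, b: bool(a) or bool(b),
    "string==": lambda a, b: str(a) == str(b),
    "string!=": lambda a, b: str(a) != str(b),
}


def _literal(tok):
    """Interpret a token as a number if possible, else as a string."""
    try:
        return float(tok)
    except ValueError:
        return tok


def evaluate_rpn(tokens, latest):
    """Evaluate RPN tokens against `latest`, a dict of metric id -> value.

    Raises ValueError if the rule is malformed or does not reduce to a
    single boolean, mirroring the "result must be a boolean" constraint.
    """
    stack = []
    for tok in tokens:
        if tok in BINARY_OPS:
            if len(stack) < 2:
                raise ValueError("operator %r is missing an operand" % tok)
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[tok](left, right))
        elif tok in latest:
            stack.append(latest[tok])  # latest sampled value of the metric
        else:
            stack.append(_literal(tok))
    if len(stack) != 1 or not isinstance(stack[0], bool):
        raise ValueError("rule must reduce to a single boolean")
    return stack[0]


# RPN form of: node_A.8002 gt (node_A.8002/3 + node_B.9001)
RULE = ["node_A.8002", "node_A.8002", "3", "/", "node_B.9001", "+", "gt"]

# Hypothetical sampled values, for illustration only.
print(evaluate_rpn(RULE, {"node_A.8002": 90.0, "node_B.9001": 10.0}))  # True
print(evaluate_rpn(RULE, {"node_A.8002": 12.0, "node_B.9001": 10.0}))  # False
```

A purely arithmetic rule such as `["node_A.8002", "3", "/"]` is rejected with a `ValueError`, since its result is a number and would give no clear signal on whether to launch an actuator.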
6
Fault Tolerance Framework Tests
startup script
syntax error checking in the rule file
reconnection of edg-ftd when stopping/restarting the monitoring server
starting edg-ftd with the monitoring server down
launching of actuators
- set of rules provided by the package testing the operators
- actuator responsible for cleaning the /tmp partition and the /pool partition
- restart of the crond daemon
- rule with a large number of operands
- rule combining metrics from different nodes
- rule combining metrics with different frequencies
insertion of the actuator's return value as a metric in the monitoring server
large number of rules
7
Fault Tolerance Framework Analysis
Expected functionality was added progressively over the last months
Frequent upgrades revealed new problems (increase of stack size, startup script errors, gcc version, …)
A lot of work went into testing the new features as they were introduced
No version of the fault tolerance framework was tested over a long period of time
Knowledge was acquired on the functionality of the framework
No knowledge of the framework internals
Code reuse is difficult