EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance
22 September 2003
DataGrid is a project funded by the European Union
8 July 2003 – Towards automation of computing fabrics

Talk Outline
- WP4 objective and partners
- Automated management of large clusters
- Fabric Monitoring
- Fabric Fault Tolerance
WP4 objective and partners
“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”
[Diagram labels: User job management (Grid and local); Automated management of large clusters]
- 6 partners: CERN, NIKHEF, ZIB, KIP, PPARC, INFN
- The development work is divided into 6 subtasks: Configuration Mgt, Installation Mgt, Monitoring, Fault Tolerance, Resource Mgt, Gridification
Architecture logical overview
[Diagram labels. Users: Grid User, Local User. Farms: Farm A (LSF), Farm B (PBS). Fabric mgt subsystems: Installation & Node Mgmt, Configuration Management, Monitoring & Fault Tolerance, Fabric Gridification, Resource Management. Other services: Grid Info Services, Resource Broker, Data Mgmt, Grid Data Storage (Mass storage, Disk pools).]
Architecture logical overview: Fabric Gridification
[Architecture diagram repeated.]
- Interface between Grid-wide services and local fabric
- Provides local authentication, authorization and mapping of grid credentials
Architecture logical overview: Resource Management
[Architecture diagram repeated; Data Mgmt is labelled WP2.]
- Provides transparent access (both job and admin) to different cluster batch systems
- Enhanced capabilities (extended scheduling policies, advanced reservation, local accounting)
Architecture logical overview: Installation & Node Mgmt
[Architecture diagram repeated; Grid Data Storage is labelled WP5.]
- Provides the tools to install and manage all software running on the fabric nodes
- Agent to install, upgrade, remove and configure software packages on the nodes
- Bootstrap services and software repositories
Architecture logical overview: Configuration Management
[Architecture diagram repeated.]
- Provides central storage and management of all fabric configuration information
- Central DB and set of protocols and APIs to store and retrieve information
Architecture logical overview: Monitoring & Fault Tolerance
[Architecture diagram repeated; legend: WP4 subsystems, other WPs.]
- Provides the tools for gathering monitoring information on fabric nodes
- Central measurement repository stores all monitoring information
- Fault tolerance correlation engines detect failures and trigger recovery actions
Fabric Monitoring architecture
[Diagram labels. Monitored nodes: Sensors, Monitoring Sensor Agent (MSA), Cache, Local Consumer. Central: Measurement Repository (MR), Database, Global Consumer, further Consumers.]
Monitored nodes
- Sensors measure metrics, mainly locally: CPU utilization, network throughput, daemon status, etc.
- MSA (Monitoring Sensor Agent):
  - collects sensor data, stores it in a local cache and sends it to the central repository
  - triggers measurements according to the configured schedule
- Local cache: gives local consumers fast access to monitoring information. Data can still be collected while the node is temporarily isolated from the central repository.
- Outgoing data consist of a value associated with a metric identifier (what), a timestamp (when), a target identifier (where), and an agent identifier (who).
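The sample layout and the role of the local cache described on this slide can be illustrated with a minimal Python sketch. It is not the actual MSA code: the names `Sample`, `LocalCache`, `store`, `latest` and `flush` are hypothetical, and the use of `ConnectionError` to signal an unreachable repository is an assumption made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One measurement as sent by the agent (field names are assumptions)."""
    metric_id: str    # what was measured
    timestamp: float  # when, in seconds since the epoch
    target_id: str    # where: the monitored node or device
    agent_id: str     # who: the agent that produced the sample
    value: float


class LocalCache:
    """Bounded in-memory cache that also queues samples not yet sent."""

    def __init__(self, max_samples=1000):
        self._recent = deque(maxlen=max_samples)  # for local consumers
        self._unsent = deque()                    # waiting for the repository

    def store(self, sample):
        self._recent.append(sample)
        self._unsent.append(sample)

    def latest(self, metric_id):
        """Return the newest cached sample for a metric, or None."""
        for sample in reversed(self._recent):
            if sample.metric_id == metric_id:
                return sample
        return None

    def flush(self, send):
        """Send queued samples in order; stop at the first failure.

        `send` is any callable that raises ConnectionError when the
        repository is unreachable. Returns the number of samples sent.
        """
        sent = 0
        while self._unsent:
            try:
                send(self._unsent[0])
            except ConnectionError:
                break  # node is isolated: keep the rest for a later attempt
            self._unsent.popleft()
            sent += 1
        return sent
```

In this sketch, samples keep accumulating while `flush` fails, and a later call sends them in their original order once the repository is reachable again.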
Measurement Repository
- Receives samples from all the nodes
- Stores data:
  - Plain text DB
  - Oracle DB
  - MySQL DB (being implemented)
- Forwards data to subscribers
- Answers queries
- Consumers can subscribe to metrics and receive a notification when new samples are available
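The store, subscribe and query behaviour listed on this slide can be sketched in a few lines of Python. This is an in-memory illustration only, not the repository's real interface: the class `MeasurementRepository`, its methods `subscribe`, `receive` and `query`, and the dictionary-based sample format are all assumptions introduced for the example.

```python
class MeasurementRepository:
    """In-memory stand-in for the central repository (hypothetical API)."""

    def __init__(self):
        self._samples = {}      # metric_id -> list of stored samples
        self._subscribers = {}  # metric_id -> list of callbacks

    def subscribe(self, metric_id, callback):
        """Register a consumer callback for new samples of one metric."""
        self._subscribers.setdefault(metric_id, []).append(callback)

    def receive(self, sample):
        """Store a sample from a node, then notify the metric's subscribers."""
        metric_id = sample["metric_id"]
        self._samples.setdefault(metric_id, []).append(sample)
        for callback in self._subscribers.get(metric_id, []):
            callback(sample)

    def query(self, metric_id, since=0.0):
        """Return stored samples for a metric with timestamp >= since."""
        return [s for s in self._samples.get(metric_id, [])
                if s["timestamp"] >= since]
```

A consumer that needs history uses `query`, while one that reacts to new data registers a callback with `subscribe`; a real backend would replace the dictionaries with one of the databases listed above.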
Fault Tolerance Architecture
[Diagram labels: Local Node; Monitoring (Sensors, MSA, Cache); Fault Tolerance daemon (FTd); Decision Unit (DU); Rules; Actuator agent; Actuators.]
Fault tolerance framework: status
- Main features
  - Rule: defines the exception condition (using monitoring metrics) and its association with recovery actions (actuators)
  - Web-based rule editor
  - Central rule repository (MySQL)
  - Local FTd (fault tolerance daemon) that
    - automatically subscribes to the monitoring metrics specified by the rules
    - launches the associated actuators when the correlation evaluates to an exception
    - reports back to the monitoring system the recovery actions taken and their status
- Not yet deployed in a production environment
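The rule-to-actuator loop described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the FTd implementation: `Rule`, `FaultToleranceDaemon`, `subscribed_metrics` and `on_sample` are hypothetical names, rules are modelled as plain Python callables rather than entries from the rule repository, and an actuator is assumed to return True on success.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Exception condition over metrics plus its recovery action."""
    name: str
    metrics: List[str]                              # metrics the rule needs
    condition: Callable[[Dict[str, float]], bool]   # True means exception
    actuator: Callable[[], bool]                    # returns True on success


class FaultToleranceDaemon:
    """Evaluates rules on incoming samples and launches actuators."""

    def __init__(self, rules, report):
        self._rules = rules
        self._report = report  # callable(rule_name, succeeded)
        self._latest = {}      # metric_id -> most recent value

    def subscribed_metrics(self):
        """Metrics to subscribe to, derived from the rules themselves."""
        return sorted({m for rule in self._rules for m in rule.metrics})

    def on_sample(self, metric_id, value):
        """Record a new value and re-evaluate the rules that use it."""
        self._latest[metric_id] = value
        for rule in self._rules:
            if metric_id not in rule.metrics:
                continue
            if not all(m in self._latest for m in rule.metrics):
                continue  # not enough data yet to evaluate this rule
            values = {m: self._latest[m] for m in rule.metrics}
            if rule.condition(values):
                succeeded = rule.actuator()
                # Report the recovery action and its status back.
                self._report(rule.name, succeeded)
```

For example, a rule named `restart_httpd` could watch a hypothetical `httpd_running` metric, treat the value 0 as the exception, and use an actuator that restarts the daemon; the `report` callable stands in for the channel back to the monitoring system.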