Towards automation of computing fabrics using tools from the fabric management workpackage of the EU DataGrid project Maite Barroso Lopez (WP4) Maite.Barroso.Lopez@cern.ch.


Towards automation of computing fabrics using tools from the fabric management workpackage of the EU DataGrid project Maite Barroso Lopez (WP4) Maite.Barroso.Lopez@cern.ch https://edms.cern.ch/document/376367/1

Talk Outline
Introduction to EU DataGrid workpackage 4
Automated management of large clusters
Components design and development status
Summary and outlook

Authors
Olof Bärring, Maite Barroso Lopez, German Cancio, Sylvain Chapeland, Lionel Cons, Piotr Poznański, Philippe Defert, Jan Iven, Thorsten Kleinwort, Bernd Panzer-Steindel, Jaroslaw Polok, Catherine Rafflin, Alan Silverman, Tim Smith, Jan Van Eldik - CERN
Massimo Biasotto, Andrea Chierici, Luca Dellagnello, Gaetano Maron, Michele Michelotto, Cristine Aiftimiei, Marco Serra, Enrico Ferro - INFN
Thomas Röblitz, Florian Schintke - ZIB
Lord Hess, Volker Lindenstruth, Frank Pister, Timm Morten Steinbeck - EVG UNI HEI
David Groep, Martijn Steenbakkers - NIKHEF/FOM
Paul Anderson, Tim Colles, Alexander Holt, Alastair Scobie, Michael George - PPARC

WP4 objective and partners
“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”
Two main areas: user job management (Grid and local) and automated management of large clusters.
6 partners: CERN, NIKHEF, ZIB, KIP, PPARC, INFN. ~14 FTEs (6 funded by the EU).
The development work is divided into 6 subtasks: configuration management, installation management, monitoring, fault tolerance, resource management and gridification.

Automated management of large clusters
(Diagram: the automation workflow across the WP4 subsystems.) Monitoring detects a node malfunction and informs the fault tolerance subsystem, which triggers a repair (e.g. restart, reboot, reconfigure, ...) through installation & node management; configuration templates are updated if necessary. The resource management subsystem removes the node from the batch queue (Farm A running LSF, Farm B running PBS) once it is ready for intervention, i.e. has no running jobs, and puts it back in the queue when the node is detected OK again. All actions taken and their status are reported back to monitoring.
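The following is a minimal, self-contained Python sketch of the control loop described on this slide, purely for illustration: every name (Alarm, automation_pass, the "daemon_down" metric) is hypothetical, and the real subsystems communicate through their own protocols and APIs rather than in-process calls.

```python
# Illustrative sketch only: the monitor -> correlate -> repair -> report cycle
# from the slide above. All names and types here are invented for the example.
from dataclasses import dataclass

@dataclass
class Alarm:
    node: str
    metric: str          # e.g. "daemon_down"

@dataclass
class RepairResult:
    ok: bool
    description: str

def correlate(alarm: Alarm):
    """Stand-in for the fault tolerance rule engine (see later slides)."""
    if alarm.metric == "daemon_down":
        return lambda: RepairResult(True, f"restarted daemon on {alarm.node}")
    return None          # no local rule matched; would be escalated globally

def automation_pass(alarms, batch_queue, report):
    for alarm in alarms:
        repair = correlate(alarm)                 # inform fault tolerance
        if repair is None:
            report(alarm.node, "no rule matched, escalated")
            continue
        batch_queue.discard(alarm.node)           # node ready for intervention
        result = repair()                         # restart, reboot, reconfigure, ...
        if result.ok:
            batch_queue.add(alarm.node)           # put node back in the queue
        report(alarm.node, result.description)    # inform monitoring of all actions

# Example run with a two-node batch queue
queue = {"lxbatch001", "lxbatch002"}
automation_pass([Alarm("lxbatch001", "daemon_down")], queue,
                report=lambda node, msg: print(node, "->", msg))
```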

Monitoring subsystem: design
Monitoring Sensor Agent (MSA): calls plug-in sensors to sample configured metrics, stores all collected data in a local disk buffer, and sends the collected data to the global repository.
Plug-in sensors: programs or scripts that implement a simple sensor-agent ASCII text protocol. A C++ interface class is provided on top of the text protocol to facilitate the implementation of new sensors.
Local cache: ensures data is collected even when the node cannot connect to the network, and allows for node autonomy for local repairs (local consumers read from the cache).
Transport: pluggable. Two proprietary protocols, over UDP and TCP, are currently supported; only the latter can guarantee delivery.
Measurement Repository (MR): the data is stored in a database (a proprietary flat-file database or Oracle; a MySQL interface is being developed). A memory cache guarantees fast access to the most recent data, which is normally what is used for fault tolerance correlations.
Repository API: SOAP RPC offering queries of history data and subscription to new data, used by both local and global consumers.
(Diagram: on each monitored node, sensors feed the MSA, which writes to the local cache for local consumers and forwards data to the Measurement Repository and its database for global consumers.)
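As an illustration of the plug-in sensor idea, the sketch below shows a trivial sensor that samples the 1-minute load average and emits one line per sample on stdout. The line format ("metric timestamp value") and the metric name are invented here; a real EDG sensor would follow the MSA's ASCII text protocol or use the provided C++ interface class.

```python
# Hypothetical plug-in sensor sketch; the output format below is NOT the real
# MSA text protocol, it only illustrates the "sample a metric, emit a line" idea.
import os
import time

def sample_load_average() -> float:
    """Sample one local metric; here the 1-minute load average."""
    return os.getloadavg()[0]

def main(metric_name: str = "loadavg1", interval_s: int = 30) -> None:
    # The agent would start the sensor and read its stdout, one sample per line.
    while True:
        value = sample_load_average()
        print(f"{metric_name} {int(time.time())} {value:.2f}", flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    main()
```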

Monitoring subsystem: status
Local nodes: the Monitoring Sensor Agent (MSA) and the UDP-based proprietary protocol are ready and have been used on CERN production clusters for more than a year. The TCP-based proprietary protocol exists as a prototype; extensive testing is needed before it is ready for production use.
Central services: the repository server exists with both flat-file and Oracle databases. Support for MySQL is planned for this summer. The alarm display is still in an early prototype phase.
Repository API for local and global consumers: a C library implementation of the API (the same for local and global consumers); bindings for other languages can probably be generated directly from the WSDL.
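To make the two repository access patterns mentioned above (queries of history data and subscription to new data) concrete, here is a small in-memory Python sketch. The class and method names are invented; the real API is a C library with a SOAP-RPC transport, not this toy implementation.

```python
# Toy illustration of the query/subscribe access patterns of the repository API.
from typing import Callable, Dict, Iterable, List, Tuple

Sample = Tuple[float, float]   # (timestamp, value)

class RepositoryClient:
    def __init__(self) -> None:
        self._samples: Dict[str, List[Sample]] = {}
        self._subscribers: Dict[str, List[Callable[[Sample], None]]] = {}

    def publish(self, node: str, metric: str, sample: Sample) -> None:
        key = f"{node}/{metric}"
        self._samples.setdefault(key, []).append(sample)
        for callback in self._subscribers.get(key, []):
            callback(sample)                      # push new data to subscribers

    def query(self, node: str, metric: str, since: float) -> Iterable[Sample]:
        """Query history data for one metric on one node."""
        return [s for s in self._samples.get(f"{node}/{metric}", []) if s[0] >= since]

    def subscribe(self, node: str, metric: str,
                  callback: Callable[[Sample], None]) -> None:
        """Register to be called whenever a new sample for this metric arrives."""
        self._subscribers.setdefault(f"{node}/{metric}", []).append(callback)

# Example: a fault-tolerance-style consumer subscribing to new data
repo = RepositoryClient()
repo.subscribe("lxbatch001", "loadavg1", lambda s: print("new sample:", s))
repo.publish("lxbatch001", "loadavg1", (1052060000.0, 3.2))
print("history:", list(repo.query("lxbatch001", "loadavg1", since=0)))
```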

Fault tolerance subsystem: design
The rule, heart of the fault tolerance system: a rule defines the exception conditions (problem space) for the input monitoring data, and the association between exception conditions and recovery actions (actuators). Rules allow escalating responses to be defined ("if that didn't work, try this") based on metrics obtained from monitoring. The data can be processed using a wide range of mathematical and boolean expressions for correlating values. A rule is not restricted to metrics that are collected locally.
Input data: the rules define the required input data. The Fault Tolerance daemon (FTd) uses the monitoring repository API to access (query or subscribe to) monitoring measurements.
Decision unit: correlates the input data and checks for the exception conditions defined by the rules.
Launching of actuators: if an exception is detected, the actuator agent starts the recovery actuator specified by the rule.
Feedback to monitoring: the actuator agent reports the result of the correlation and of the repair action (if any) to the monitoring system. This feedback is important for tracing all exceptions and repairs.
Escalation: if no rule matches or the repair is not possible, the problem is escalated to the global fault tolerance service.
(Diagram: on the local node, sensors feed the MSA; the FTd's decision unit reads the rules and the monitoring cache and drives the actuator agent and its actuators, escalating to the global FT server when needed.)
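The sketch below illustrates the rule idea from this slide: an exception condition over monitoring metrics, a list of escalating recovery actuators tried in order, and escalation when no local repair works. All names, the metric and the actuators are hypothetical; the real FTd reads its rules from the central repository and gets its metrics through the monitoring repository API.

```python
# Hypothetical sketch of a fault tolerance rule with escalating actuators.
from typing import Callable, Dict, List

Metrics = Dict[str, float]

class Rule:
    def __init__(self, name: str,
                 condition: Callable[[Metrics], bool],
                 actuators: List[Callable[[], bool]]):
        self.name = name
        self.condition = condition        # exception condition on the input metrics
        self.actuators = actuators        # escalating responses, tried in order

    def handle(self, metrics: Metrics, report: Callable[[str], None]) -> bool:
        if not self.condition(metrics):
            return True                   # no exception detected, nothing to do
        for actuator in self.actuators:   # "if that didn't work, try this"
            ok = actuator()
            report(f"{self.name}: actuator {actuator.__name__} -> {ok}")
            if ok:
                return True
        report(f"{self.name}: local repair failed, escalating to global FT service")
        return False

# Example: swap nearly full -> first try cleaning /tmp, then restart a daemon
def clean_tmp() -> bool: return False            # pretend this did not help
def restart_daemon() -> bool: return True

rule = Rule("swap_full",
            condition=lambda m: m.get("swap_used_pct", 0) > 90,
            actuators=[clean_tmp, restart_daemon])
rule.handle({"swap_used_pct": 95}, report=print)
```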

Fault tolerance subsystem: status
Not yet ready for production deployment. A prototype was demonstrated working together with the fabric monitoring system at the EU review in February 2003, comprising:
a web-based rule editor;
a central rule repository (MySQL);
a local FTd (fault tolerance daemon) that automatically subscribes to the monitoring metrics specified by the rules, launches the associated actuators when the correlation evaluates to an exception, and reports back to the monitoring system the recovery actions taken and their status.
Global correlations are not yet supported.

Configuration Management subsystem: design
Configuration Data Base (CDB): the configuration information store. The information is updated in transactions; it is validated and versioned. Pan templates carrying the configuration information are input into the CDB via a GUI and a CLI, and are compiled into XML profiles.
Server Modules: provide different access patterns (SQL/LDAP/HTTP) to the configuration information.
HTTP + notifications: nodes are notified about changes of their configuration and fetch the XML profiles via HTTP.
Node side (CCM): the configuration information is stored in the local cache and accessed via the NVA-API (e.g. by the installation subsystem).
(Diagram: pan templates enter the CDB through the GUI and CLI; the CDB feeds the server modules and publishes XML profiles that each node caches in its CCM and reads through the NVA-API.)
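Purely as an illustration of the node-side access pattern (a locally cached XML profile read through path-style lookups), here is a short Python sketch. The XML document and the CachedProfile class are invented for the example and do not reproduce the real pan-generated profile schema or the NVA-API; in production the profile is fetched over HTTP after a change notification.

```python
# Invented example of path-style lookups on a locally cached XML profile.
import xml.etree.ElementTree as ET

EXAMPLE_PROFILE = """
<profile>
  <system>
    <network><hostname>lxplus001</hostname></network>
    <cluster><name>lxplus</name></cluster>
  </system>
</profile>
"""

class CachedProfile:
    """Read-only, path-style view of a cached node profile (illustration only)."""

    def __init__(self, xml_text: str) -> None:
        self._root = ET.fromstring(xml_text)   # in reality loaded from the local cache

    def get(self, path: str) -> str:
        element = self._root.find(path.lstrip("/"))
        if element is None or element.text is None:
            raise KeyError(path)
        return element.text

profile = CachedProfile(EXAMPLE_PROFILE)
print(profile.get("/system/network/hostname"))   # -> lxplus001
```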

Configuration Management subsystem: status
The system is implemented (except for the CLI and the Server Modules), with most of the components in a 1.0 production version. There is a pilot deployment of the complete system on CERN production clusters, using the “panguin” GUI (screenshot on the next slide).
In parallel: the system is being consolidated, issues of scalability and security are being studied and addressed, and Server Modules are under development (SQL).
More information: http://cern.ch/hep-proj-grid-config/

panguin GUI for managing/editing PAN templates (Courtesy: Martin Murth, CERN-IT/FIO)

XML profile generated by PAN for a typical node (lxplus001)

Installation Management: design
Software Package Management Agent (SPMA): manages the installed packages. Runs on Linux (RPM) or Solaris (PKG). SPMA configuration is done via an NCM component. It can use a local cache for pre-fetching packages (for simultaneous upgrades of large farms).
Software Repository (SWRep): packages (in RPM or PKG format) can be uploaded into multiple software repositories, with ACLs controlling access. Client access uses HTTP, NFS/AFS or FTP.
Node Configuration Manager (NCM): configuration management on the node is done by NCM components. Each component is responsible for configuring a service (network, NFS, sendmail, PBS). Components are notified by the Cdispd whenever there is a change in their configuration.
Automated Installation Infrastructure (AII): DHCP and Kickstart (or JumpStart) configurations are re-generated according to the CDB contents. PXE can be set by the operator to reboot or reinstall a node.
(Diagram: client nodes fetch packages from the SWRep servers over HTTP, NFS or FTP; the CDB drives the KS/JS generator, DHCP and PXE handling of the AII, while on the node the CCM notifies the Cdispd, which triggers the NCM components and the SPMA.)
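To give a feel for the package-management side, the sketch below shows the reconciliation step an SPMA-like agent has to perform: compare the desired package list (coming from the CDB via an NCM component) with what is installed and derive the operations to run. The function, the data layout and the example packages are all invented; the real SPMA drives RPM or PKG directly and honours its own configuration (SPMA.cfg).

```python
# Illustrative package reconciliation: desired vs installed -> operations.
from typing import Dict, List, Tuple

Packages = Dict[str, str]   # package name -> version

def plan_operations(desired: Packages, installed: Packages) -> List[Tuple[str, str]]:
    ops: List[Tuple[str, str]] = []
    for name, version in desired.items():
        if name not in installed:
            ops.append(("install", f"{name}-{version}"))
        elif installed[name] != version:
            ops.append(("upgrade", f"{name}-{version}"))
    for name in installed:
        if name not in desired:
            ops.append(("remove", name))        # unmanaged packages are dropped
    return ops

# Example: one upgrade, one install and one removal
desired = {"openssh": "3.6.1", "lsf-client": "5.1"}
installed = {"openssh": "3.5", "telnet-server": "0.17"}
for operation in plan_operations(desired, installed):
    print(operation)
```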

Installation subsystem: status
Software Repository and SPMA: a first pilot is being deployed in the CERN Computer Centre for the central CERN production (batch and interactive) services.
Node Configuration Manager (NCM): in the design/development phase; an implementation will be available in Q2 2003.
Automated Installation Infrastructure (AII): a Linux implementation is expected for Q2 2003.

Summary & Future Work
Experience and feedback with existing tools and prototypes helped to gather requirements and early feedback from users. A first implementation is now ready for all the subsystems; some of them are already deployed at CERN and/or on the EDG testbed, and the rest will come during this year. There is close collaboration with CERN service managers and LCG.
What is still missing:
in general: scalability, security, GUIs;
integration between the different fabric subsystems to build a consistent set of fabric management tools;
going from prototype to production quality.
Thanks to the EU and our national funding agencies for their support of this work.