1 The new Fabric Management Tools in Production at CERN
Thorsten Kleinwort for CERN IT/FIO
HEPiX Autumn 2003, TRIUMF, Vancouver
Monday, October 20, 2003


20 October 2003, Thorsten Kleinwort, IT/FIO/FS
2 Contents
- Introduction to CERN’s Fabric Management: Concepts
- Framework for CERN’s Fabric Management: Tools
  - Configuration Mgmt
  - Software Mgmt
  - State Mgmt
  - Monitoring

3 Concepts: The Node
The Node is the manageable unit:
- Autonomous:
  - Local configuration files
  - Programs work locally
  - No external dependencies
  - No remote management scripts
- Adheres to LSB (Linux Standard Base):
  - Init scripts in /etc/init.d/ start daemons
  - Logfile directory /var/log, logrotate
  - Config directory /etc
  - (System) programs in /(s)bin/, /usr/(s)bin

4 Concepts: Node -> Cluster
- Nodes with the same functionality -> cluster (but not necessarily the same HW)
- Management tools enforce uniform setup
- Cluster size varies:
  - LXBATCH: > 1000 nodes
  - LXPLUS: ~ 70 nodes
  - LXMASTER (batch master): 2 nodes
- Critical servers replaced by service clusters with redundant nodes

5 Concepts: Principles
- Software installs/updates through RPM
- Configuration through one tool
- Configuration information through one interface
- Configuration information stored centrally
- Installation, configuration and maintenance automated, but steerable
- Reproducibility

6 Framework
[Diagram labels: node; Mon Agent, Cfg Agent, SW Agent; Monitoring Manager, Config Manager, SW Manager; Config Cache, SW Cache; Hardware Manager, State Manager]

7 Framework
[Diagram labels: node; SW Agent, Cfg Agent, Mon Agent; CDB, Monitoring Manager, SW Manager; CCM, SW Cache; Hardware Manager, State Manager]

8 Configuration (CDB & CCM)
CDB (Configuration Data Base):
- Development of EU Data Grid (WP4)
- CDB is the configuration data base
- Now ~ 1500 nodes, ~ 15 clusters
- ~ 3200 configuration templates to describe the nodes
- Creates one (XML) profile per node
- All information that is needed to install & run the nodes is now included
- Currently 2 Linux versions: RH 7.3 & ES 2.1
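The step from layered configuration templates to one XML profile per node can be illustrated with a small sketch. This is not CDB's template language or its real profile schema: the dictionary-based templates, the element names, the node name `lxb0001` and the helper functions `merge_templates` and `build_profile` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def merge_templates(*templates):
    """Merge nested dict templates in order; later templates override earlier ones."""
    merged = {}
    for template in templates:
        for key, value in template.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are sub-trees: merge them recursively.
                merged[key] = merge_templates(merged[key], value)
            else:
                merged[key] = value
    return merged


def _append(parent, config):
    """Turn a nested dict into child elements of `parent` (keys sorted for stable output)."""
    for key in sorted(config):
        child = ET.SubElement(parent, key)
        value = config[key]
        if isinstance(value, dict):
            _append(child, value)
        else:
            child.text = str(value)


def build_profile(node_name, config):
    """Serialise the merged configuration of one node as a single XML document."""
    root = ET.Element("profile", name=node_name)
    _append(root, config)
    return ET.tostring(root, encoding="unicode")


# Hypothetical templates: a site-wide base, a cluster template, a node template.
base = {"os": {"name": "RH", "version": "7.3"}, "cluster": "none"}
lxbatch = {"cluster": "lxbatch", "software": {"lsf": "5.1"}}
node = {"hardware": {"cpus": 2}}

profile = build_profile("lxb0001", merge_templates(base, lxbatch, node))
print(profile)
```

The point of the layering is that a value set in a cluster template applies to every node of the cluster, while a node template only carries what is specific to that machine.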

9 CDB (cont’d)
- Additional information to be added (merged from other sources):
  - State information (-> SMS)
  - Monitoring information (-> MSA)
  - Vendor/contract/purchase information: need for encryption to store secure data
- New, high-level interfaces are provided:
  - “Add/Rename Node”
  - Change node state

10 CDB (cont’d)
- Local caching on the node: CCM (Configuration Cache Manager):
  - In test phase, deployed on a few nodes
  - Runs a local daemon, which is notified when the node’s configuration information is modified
  - Avoids peaks on CDB web servers
- Besides XML profiles, new SQL interface:
  - Allows SQL queries on CDB
  - Needed for a cross-machine view (e.g. “give me all nodes that belong to cluster X”)
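The cross-machine query mentioned above ("all nodes that belong to cluster X") can be sketched as follows. The slides do not show the CDB SQL schema, so the `node` table, its columns and the node names are assumptions, and an in-memory SQLite database stands in for the real back end.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical node-to-cluster table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (name TEXT PRIMARY KEY, cluster_name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO node (name, cluster_name) VALUES (?, ?)",
        [
            ("lxplus001", "lxplus"),
            ("lxplus002", "lxplus"),
            ("lxb0001", "lxbatch"),
            ("lxmaster01", "lxmaster"),
        ],
    )
    return conn


def nodes_in_cluster(conn, cluster):
    """Return the names of all nodes in one cluster, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM node WHERE cluster_name = ? ORDER BY name",
        (cluster,),
    )
    return [name for (name,) in rows]


conn = build_demo_db()
print(nodes_in_cluster(conn, "lxplus"))
```

A per-node XML profile answers "what is the configuration of this node"; a relational view of the same data answers questions that span many nodes with a single query.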

11 Framework
[Diagram labels: node; SPMA, Cfg Agent, Mon Agent; CDB, Monitoring Manager, SWRep; CCM, SWRep Cache; Hardware Manager, State Manager]

12 Software distribution (SPMA & SWRep)
SPMA (Software Package Management Agent):
- Development of EU Data Grid (WP4)
- The tool to install all software on the nodes
- Uses RPM for SW distribution on Linux
- A version for the Solaris PKG package manager exists
- We install between 700 and 1000 RPMs per node
- Based on RPMT (enhancement of RPM)
- Crucial part of the framework

13 SPMA (cont’d)
- SPMA runs on every node (on demand)
- Can manage either a subset or all packages:
  - We manage all packages on all clusters but one, which is for development
  - Missing packages are added and unknown packages are removed
- Package list created from CDB, but SPMA is independent of CDB
- SPMA allows rolling back versions
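The "missing packages are added, unknown packages are removed" behaviour amounts to comparing a desired package list with what is installed. The sketch below shows that comparison only; it is not SPMA's code, and the function name, the operation names and the simplification to one version per package name are assumptions.

```python
def plan_package_operations(desired, installed, manage_all=True):
    """Compute the operations that bring `installed` in line with `desired`.

    Both arguments map package name -> version (a simplification: one
    version per package). With manage_all=False only the listed packages
    are touched and unknown packages are left alone.
    """
    operations = []
    for name in sorted(desired):
        wanted = desired[name]
        current = installed.get(name)
        if current is None:
            operations.append(("install", name, wanted))
        elif current != wanted:
            # Covers both upgrades and rollbacks: the desired list decides.
            operations.append(("replace", name, wanted))
    if manage_all:
        for name in sorted(set(installed) - set(desired)):
            operations.append(("remove", name, installed[name]))
    return operations


desired = {"lsf": "5.1", "openssh": "3.6"}
installed = {"lsf": "4.2", "games": "1.0"}
for operation in plan_package_operations(desired, installed):
    print(operation)
```

Because the desired list alone determines the result, rolling back is the same operation as upgrading: put the older version back into the list and run again.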

14 SPMA & SWRep
SWRep (Software Repository):
- Client-server tool suite for storage of software packages
- Universal: Linux RPM / Solaris PKG
- Multiple versions: RH 7.3, RH ES 2.1, RH 10
- Management interface: ACL mechanism to add packages
- Package list automatically kept up to date in CDB

15 SPMA & SWRep (cont’d)
Addresses scalability:
- HTTP as SW distribution protocol
- Load-balanced server cluster
- SPMA run is randomly time-delayed within 10 minutes
- Pre-caching of SW packages on the node possible
Currently installed on 1500 nodes
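The random delay within 10 minutes spreads the requests of many nodes over a time window instead of letting them hit the servers at the same moment. A minimal sketch of the idea follows; the uniform distribution, the function names and the one-minute buckets are assumptions, since the slides only state the 10-minute window.

```python
import random

MAX_DELAY_SECONDS = 10 * 60  # the 10-minute window from the slide


def splay_delay(rng=random):
    """Pick a random start delay in seconds, uniform over the window (assumed)."""
    return rng.uniform(0, MAX_DELAY_SECONDS)


def bucket_counts(num_nodes, bucket_seconds=60, seed=0):
    """Simulate num_nodes picking a delay; count how many start in each bucket."""
    rng = random.Random(seed)
    counts = [0] * (MAX_DELAY_SECONDS // bucket_seconds)
    for _ in range(num_nodes):
        delay = splay_delay(rng)
        # Clamp so a delay exactly at the upper bound lands in the last bucket.
        index = min(int(delay // bucket_seconds), len(counts) - 1)
        counts[index] += 1
    return counts


# How 1500 nodes would spread over ten one-minute buckets.
print(bucket_counts(1500))
```

On a real node the agent would sleep for `splay_delay()` seconds before contacting the repository; the simulation only shows that the per-minute load stays at a fraction of the total node count.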

16 Framework
[Diagram labels: node; SPMA, NCM, Mon Agent; CDB, Monitoring Manager, SWRep; CCM, SWRep Cache; Hardware Manager, State Manager]

17 Configuration Tool (NCM)
NCM (Node Configuration Manager):
- Local configuration tool
- EU Data Grid (WP4) development
- First components have been (re-)written and are tested on production nodes
- Uses CDB for configuration information
- Has its first public release
- We have to transform all our SUE features into NCM components (~ 50)
- Plan is to do this while migrating to the next Linux release

18 Framework
[Diagram labels: node; SPMA, NCM, MSA; CDB, OraMon, SWRep; CCM, SWRep Cache; Hardware Manager, State Manager]

19 Monitoring (MSA & OraMon)
LEMON (LHC Era Monitoring):
- EU Data Grid (WP4) development
- Client (MSA):
  - ~ 100 metrics are measured
  - Deployed on > 1500 nodes (more than currently managed by CDB)
  - Configuration to be put into CDB
- Server (OraMon):
  - ORACLE database as back end
  - Stores current values as well as history
  - User API (in C, PERL, PHP, TCL) in test phase

20 Framework
[Diagram labels: node; SPMA, NCM, MSA; CDB, OraMon, SWRep; HMS, SMS; CCM, SWRep Cache]

21 State Management (SMS & HMS)
LEAF (LHC Era Automated Fabric):
- HMS (Hardware Management System) controls & tracks:
  - Node installation
  - Node move & reinstall (rename)
  - Node retirement
  - Node repairs (vendor calls)
- Remedy workflow application
- Will interface to CDB

22 HMS & SMS
SMS (State Management System):
- Allows setting node states (in CDB)
- Validates state transitions
- Handles new machine arrivals (~ 400 in Nov)
- Uses SOAP to interface to CDB
- Working prototype
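Validating a state transition means checking the requested change against a table of allowed moves before the new state is written. The sketch below shows that check; the state names and the transition table are hypothetical, since the slides do not list the states SMS uses, and the SOAP interface to CDB is left out.

```python
# Hypothetical node states and the moves allowed between them.
ALLOWED_TRANSITIONS = {
    "new": {"installing"},
    "installing": {"production", "maintenance"},
    "production": {"maintenance", "retired"},
    "maintenance": {"production", "installing", "retired"},
    "retired": set(),
}


class InvalidTransition(ValueError):
    """Raised when a requested state change is not in the transition table."""


def change_state(current, target):
    """Return the new state if current -> target is allowed, else raise."""
    if current not in ALLOWED_TRANSITIONS:
        raise KeyError(f"unknown state: {current!r}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


state = "new"
for target in ("installing", "production", "maintenance"):
    state = change_state(state, target)
    print(state)
```

Keeping the rules in one table means a request such as moving a retired node straight back into production is rejected centrally, rather than relying on each operator or script to know the rules.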

23 Tools
[Diagram: Tools = QUATTOR + LEMON + LEAF. Labels: node; SPMA, NCM, MSA; CDB, OraMon, SWRep; CCM, SWRep Cache; HMS, SMS]

24 Tools: Examples
- Batch system LSF: upgrade 4.2 -> 5.1 on > 1000 nodes within 15 min, without stopping batch (with pre-caching)
- Kernel upgrade: SPMA can handle multiple versions of the same package, which allows separating the installation of a new kernel from the reboot into it
- Security upgrades: all security upgrades are done by SPMA (~ once a week):
  - SSH security upgrade
  - KDE upgrade (~ 400 MB per node)

25 References
- EU Data Grid
- EDG WP4
- QUATTOR web page
- LEMON web page
- LEAF web page
- CERN IT/FIO