The New Fabric Management Tools in Production at CERN
Thorsten Kleinwort, for CERN IT/FIO
HEPiX Autumn 2003, TRIUMF, Vancouver
Monday, October 20, 2003

20 October 2003, Thorsten Kleinwort, IT/FIO/FS

Contents
- Introduction to CERN’s Fabric Management: Concepts
- Framework for CERN’s Fabric Management: Tools
  - Configuration Mgmt
  - Software Mgmt
  - State Mgmt
  - Monitoring

Concepts: The Node
The node is the manageable unit:
- Autonomous:
  - Local configuration files
  - Programs work locally
  - No external dependencies
  - No remote management scripts
- Adheres to LSB (Linux Standard Base):
  - Init scripts in /etc/init.d/, start daemons
  - Logfile directory /var/log, logrotate
  - Config directory /etc
  - (System) programs in /(s)bin/, /usr/(s)bin

Concepts: Node -> Cluster
- Same functionality of nodes -> cluster (but not necessarily same HW)
- Management tools enforce uniform setup
- Cluster size varies:
  - LXBATCH: > 1000 nodes
  - LXPLUS: ~ 70 nodes
  - LXMASTER (batch master): 2 nodes
- Critical servers replaced by service clusters with redundant nodes

Concepts: Principles
- Software installs/updates through RPM
- Configuration through one tool
- Configuration information through one interface
- Configuration information stored centrally
- Installation, configuration and maintenance automated, but steerable
- Reproducibility

Framework
[Figure: framework diagram. Components shown: node, Mon Agent, Monitoring Manager, Cfg Agent, Config Manager, Config Cache, SW Agent, SW Manager, SW Cache, Hardware Manager, State Manager.]

Framework
[Figure: framework diagram. Components shown: node, SW Agent, Cfg Agent, Mon Agent, CDB, Monitoring Manager, SW Manager, Hardware Manager, State Manager, CCM, SW Cache.]

Configuration (CDB & CCM)
CDB (Configuration Data Base):
- Development of EU Data Grid (WP4)
- CDB is the configuration data base
- Now ~ 1500 nodes, ~ 15 clusters
- ~ 3200 configuration templates to describe the nodes
- Creates one (XML) profile per node
- All information that is needed to install & run the nodes now included
- Currently 2 Linux versions: RH 7.3 & ES 2.1
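
The step from layered templates to one XML profile per node can be pictured with a small sketch. It is not the CDB template language or its real profile schema: the dictionary layers, the `merge_templates` and `to_xml` helpers, the element names and the node name `node001` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def _merge_into(target, source):
    """Recursively copy `source` into `target`; later values win."""
    for key, value in source.items():
        if isinstance(value, dict):
            child = target.get(key)
            if not isinstance(child, dict):
                child = target[key] = {}
            _merge_into(child, value)
        else:
            target[key] = value


def merge_templates(*templates):
    """Merge template layers in order (e.g. site, cluster, node)."""
    profile = {}
    for template in templates:
        _merge_into(profile, template)
    return profile


def _append(parent, mapping):
    """Turn a nested dict into child XML elements, in sorted key order."""
    for key in sorted(mapping):
        child = ET.SubElement(parent, key)
        value = mapping[key]
        if isinstance(value, dict):
            _append(child, value)
        else:
            child.text = str(value)


def to_xml(node_name, profile):
    """Serialize one merged profile as a single XML document string."""
    root = ET.Element("profile", {"node": node_name})
    _append(root, profile)
    return ET.tostring(root, encoding="unicode")


# Hypothetical layers: a site default, a cluster layer, a node override.
site = {"os": {"name": "redhat", "version": "7.3"}}
cluster = {"cluster": "lxbatch"}
node = {"os": {"version": "es2.1"}}

profile = merge_templates(site, cluster, node)
print(to_xml("node001", profile))
```

The point of the layering is that a value set once at a general level applies to every node unless a more specific layer overrides it, which is one way a few thousand templates can describe many nodes reproducibly.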

CDB (cont’d)
- Additional information to be added (merged from other sources):
  - State information (-> SMS)
  - Monitoring information (-> MSA)
  - Vendor/contract/purchase information: need for encryption to store secure data
- New, high-level interfaces are provided:
  - “Add/Rename Node”
  - Change node state

CDB (cont’d)
- Local caching on the node: CCM (Configuration Cache Manager):
  - In test phase, deployed on a few nodes
  - Runs a local daemon, which is notified when the node’s configuration information is modified
  - Avoids peaks on CDB web servers
- Besides XML profiles, new SQL interface:
  - Allows SQL queries on CDB
  - Needed for cross-machine views (e.g. give me all nodes that belong to cluster X)
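
A cross-machine query of the kind mentioned above might look like the following sketch. The real CDB SQL schema is not described in these slides, so the `nodes` table, its `name` and `cluster_name` columns, the node names and the in-memory SQLite database are all stand-ins chosen for illustration.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database with a hypothetical node table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (name TEXT PRIMARY KEY, cluster_name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO nodes (name, cluster_name) VALUES (?, ?)",
        [
            ("lxplus001", "lxplus"),
            ("lxplus002", "lxplus"),
            ("lxbatch001", "lxbatch"),
            ("lxmaster01", "lxmaster"),
        ],
    )
    return conn


def nodes_in_cluster(conn, cluster):
    """Return the names of all nodes that belong to `cluster`, sorted."""
    rows = conn.execute(
        "SELECT name FROM nodes WHERE cluster_name = ? ORDER BY name",
        (cluster,),
    )
    return [name for (name,) in rows]


conn = demo_connection()
print(nodes_in_cluster(conn, "lxplus"))
```

A per-node XML profile answers "what should this node look like", while a relational view answers questions across nodes, which is why the two interfaces complement each other.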

Framework
[Figure: framework diagram. Components shown: node, SPMA, Cfg Agent, Mon Agent, CDB, Monitoring Manager, SWRep, Hardware Manager, State Manager, CCM, SWRep Cache.]

Software distribution (SPMA & SWRep)
SPMA (Software Package Management Agent):
- Development of EU Data Grid (WP4)
- The tool to install all software on the nodes
- Uses RPM for SW distribution on Linux
- Version for the Solaris PKG package manager exists
- We install between 700 and 1000 RPMs per node
- Based on RPMT (enhancement of RPM)
- Crucial part of the framework

SPMA (cont’d)
- SPMA runs on every node (on demand)
- Can manage either a subset or all packages:
  - We manage all packages on all clusters but one, which is for development
  - Missing packages are added and unknown packages are removed
- Package list created from CDB, but SPMA is independent of CDB
- SPMA allows versions to be rolled back
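
The "add what is missing, remove what is unknown" behaviour amounts to comparing a desired package list with what is installed. The sketch below shows that comparison only; it is not SPMA code. The function name `plan_transaction`, the `manage_all` flag, the package names and the version strings are hypothetical, and it simplifies to one version per package. Leaving unknown packages alone in subset mode is also an assumption.

```python
def plan_transaction(desired, installed, manage_all=True):
    """Compare desired and installed packages (dicts of name -> version).

    Returns (to_install, to_change, to_remove):
      to_install: packages in the desired list that are not installed
      to_change:  packages whose installed version differs, as (old, new);
                  this covers both upgrades and rollbacks
      to_remove:  installed packages absent from the desired list, only
                  when all packages are managed (assumed behaviour)
    """
    to_install = {n: v for n, v in desired.items() if n not in installed}
    to_change = {
        n: (installed[n], v)
        for n, v in desired.items()
        if n in installed and installed[n] != v
    }
    to_remove = sorted(set(installed) - set(desired)) if manage_all else []
    return to_install, to_change, to_remove


# Hypothetical package lists for one node.
desired = {"openssh": "3.7.1", "lsf": "5.1", "kernel": "2.4.18"}
installed = {"openssh": "3.4", "lsf": "5.1", "games": "1.0"}

print(plan_transaction(desired, installed))
print(plan_transaction(desired, installed, manage_all=False))
```

Because the plan is derived entirely from the desired list, a rollback is just a desired list that names an older version.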

SPMA & SWRep
SWRep (Software Repository):
- Client-server tool suite for storage of software packages
- Universal: Linux RPM / Solaris PKG
- Multiple versions: RH 7.3, RH ES 2.1, RH 10
- Management interface: ACL mechanism to add packages
- Package list automatically kept up to date in CDB

SPMA & SWRep (cont’d)
Addresses scalability:
- HTTP as SW distribution protocol
- Load-balanced server cluster
- SPMA run is randomly time-delayed within 10 minutes
- Pre-caching of SW packages on the node possible
Currently installed on 1500 nodes
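
The effect of the random delay can be illustrated with a short simulation. This is a sketch of the general technique, not of SPMA itself: a uniform distribution over the 10-minute window is an assumption, and `start_delay` and `starts_per_minute` are hypothetical names.

```python
import random
from collections import Counter

WINDOW_SECONDS = 10 * 60  # the 10-minute window from the slide


def start_delay(rng):
    """Pick a delay in seconds, assumed uniform over the window."""
    return rng.uniform(0, WINDOW_SECONDS)


def starts_per_minute(node_count, seed=0):
    """Count how many simulated nodes start in each minute of the window."""
    rng = random.Random(seed)
    minutes = Counter()
    for _ in range(node_count):
        # Clamp so that a delay of exactly 600 s still falls in minute 9.
        minute = min(int(start_delay(rng) // 60), 9)
        minutes[minute] += 1
    return minutes


counts = starts_per_minute(1500)
for minute in sorted(counts):
    print(f"minute {minute}: {counts[minute]} nodes start")
```

Without the delay all nodes would contact the servers at once; with it, the load on the repository servers is spread over the whole window.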

Framework
[Figure: framework diagram. Components shown: node, SPMA, NCM, Mon Agent, CDB, Monitoring Manager, SWRep, Hardware Manager, State Manager, CCM, SWRep Cache.]

Configuration Tool (NCM)
NCM (Node Configuration Manager):
- Local configuration tool
- EU Data Grid (WP4) development
- First components have been (re-)written and are tested on production nodes
- Uses CDB for configuration information
- Has its first public release
- We have to transform all our SUE features into NCM components (~50)
  - Plan is to do this while migrating to next Linux release

Framework
[Figure: framework diagram. Components shown: node, SPMA, NCM, MSA, CDB, OraMon, SWRep, CCM, SWRep Cache, Hardware Manager, State Manager.]

Monitoring (MSA & OraMon)
LEMON (LHC Era Monitoring):
- EU Data Grid (WP4) development
- Client (MSA):
  - ~ 100 metrics are measured
  - Deployed on > 1500 nodes (more than currently managed by CDB)
  - Configuration to be put into CDB
- Server (OraMon):
  - ORACLE database as back end
  - Stores current values as well as history
  - User API (in C, PERL, PHP, TCL) in test phase
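
What "current values as well as history" means for a client can be sketched with an in-memory stand-in. This is not the OraMon schema or its user API: the `MetricStore` class, its method names, the metric name `load_avg` and the node name are hypothetical, and a plain dictionary replaces the ORACLE back end.

```python
from collections import defaultdict


class MetricStore:
    """Keeps every sample per (node, metric) and can return the newest one."""

    def __init__(self):
        # (node, metric) -> list of (timestamp, value) samples
        self._samples = defaultdict(list)

    def record(self, node, metric, timestamp, value):
        """Store one measurement reported by a node."""
        self._samples[(node, metric)].append((timestamp, value))

    def latest(self, node, metric):
        """Return the newest (timestamp, value) sample, or None."""
        samples = self._samples.get((node, metric))
        if not samples:
            return None
        return max(samples, key=lambda sample: sample[0])

    def history(self, node, metric, since=0):
        """Return samples with timestamp >= since, oldest first."""
        samples = self._samples.get((node, metric), [])
        return sorted(s for s in samples if s[0] >= since)


store = MetricStore()
store.record("lxbatch001", "load_avg", 100, 0.5)
store.record("lxbatch001", "load_avg", 160, 1.2)
store.record("lxbatch001", "load_avg", 220, 0.9)

print(store.latest("lxbatch001", "load_avg"))
print(store.history("lxbatch001", "load_avg", since=160))
```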

Framework
[Figure: framework diagram. Components shown: node, SPMA, NCM, MSA, CDB, OraMon, SWRep, HMS, SMS, CCM, SWRep Cache.]

State Management (SMS & HMS)
LEAF (LHC Era Automated Fabric):
- HMS (Hardware Management System) controls & tracks:
  - Node installation
  - Node move & reinstall (rename)
  - Node retirement
  - Node repairs (vendor calls)
- Remedy workflow application
- Will interface to CDB

HMS & SMS
SMS (State Management System):
- Allows node states to be set (in CDB)
- Validates state transitions
- Handles new machine arrivals (~400 in Nov)
- Uses SOAP to interface to CDB
- Working prototype
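
Validating a state transition can be sketched as a lookup in a table of allowed moves. The slides do not list the actual SMS states, so the state names, the `ALLOWED` table, `set_state` and `InvalidTransition` below are purely illustrative.

```python
# Hypothetical states and the transitions allowed out of each one.
ALLOWED = {
    "new": {"standby"},
    "standby": {"production", "maintenance"},
    "production": {"standby", "maintenance"},
    "maintenance": {"standby", "retired"},
    "retired": set(),
}


class InvalidTransition(ValueError):
    """Raised when a requested state change is not permitted."""


def set_state(current, requested):
    """Return the new state if the move is allowed, otherwise raise."""
    if current not in ALLOWED or requested not in ALLOWED:
        raise InvalidTransition(f"unknown state: {current!r} -> {requested!r}")
    if requested not in ALLOWED[current]:
        raise InvalidTransition(f"not allowed: {current!r} -> {requested!r}")
    return requested


state = "new"
for target in ("standby", "production", "maintenance"):
    state = set_state(state, target)
    print("node is now in state", state)
```

Rejecting invalid moves centrally means that a node cannot, for example, go straight back into service from a retired state by accident.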

Tools:
[Figure: framework diagram. Components shown: node, SPMA, NCM, MSA, CDB, OraMon, SWRep, CCM, SWRep Cache, HMS, SMS. Caption: QUATTOR + LEMON + LEAF.]

Tools: Examples
- Batch system LSF: upgrade 4.2 -> 5.1 on > 1000 nodes within 15 min, without stopping batch (with pre-caching)
- Kernel upgrade: SPMA can handle multiple versions of the same package, which allows installation and reboot of the new kernel to be separated in time
- Security upgrades: all security upgrades are done by SPMA (~once a week):
  - SSH security upgrade
  - KDE upgrade (~400 MB per node)