Status of Fabric Management at CERN
LHC Computing Comprehensive Review, 14/11/05
Presented by G. Cancio – IT-FIO
Fabric Management with ELFms
ELFms stands for ‘Extremely Large Fabric management system’
Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
ELFms manages and controls the heterogeneous CERN-CC environment
- Supported OS: Linux (RHES2/3, SLC3 32/64bit) and Solaris 9
- Functionality: batch nodes, disk servers, tape servers, DB, web, …
- Heterogeneous hardware: CPU, memory, HD size, …

Node Configuration Management
http://quattor.org
Quattor
Quattor takes care of the configuration, installation and management of fabric nodes
A Configuration Database holds the ‘desired state’ of all fabric elements:
- Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, …)
- Cluster setup (name and type, batch system, load balancing info, …)
- Site setup
Defined in templates arranged in hierarchies – common properties set only once
Autonomous management agents running on the node for:
- Base installation
- Service (re-)configuration
- Software installation and management
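To illustrate the "templates arranged in hierarchies" idea, here is a minimal sketch in Python. Quattor's real templates are written in its own Pan language and compiled by CDB into XML profiles; the structure, keys and values below are invented for illustration only.

```python
# Illustrative sketch only: shows how site -> cluster -> node templates can be
# merged so that common properties are set once and overridden lower down.
# All names and values are made up; this is not the Pan language used by CDB.

def merge(base: dict, override: dict) -> dict:
    """Recursively merge 'override' on top of 'base' (later levels win)."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# Common properties set only once at the site level ...
site_template = {
    "network": {"domain": "example.org", "dns": ["192.0.2.1"]},
    "packages": {"openssh": "3.6", "lemon-agent": "2.x"},
}

# ... specialised per cluster ...
cluster_template = {
    "cluster": {"name": "lxbatch", "type": "batch"},
    "packages": {"lsf": "5.1"},
}

# ... and finally per node.
node_template = {
    "hardware": {"cpu": 2, "memory_mb": 2048, "disk_gb": 80},
    "network": {"hostname": "lxb0001"},
}

# The node's 'desired state' profile is the merge of all hierarchy levels.
profile = merge(merge(site_template, cluster_template), node_template)
print(profile["network"]["hostname"], profile["packages"]["lsf"])
```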
Architecture
[Architecture diagram] A configuration server hosts CDB (SQL and XML back-ends, accessed via CLI, GUI, scripts and SOAP) and serves XML configuration profiles over HTTP; SW server(s) host a SW Repository serving RPMs over HTTP; an install server provides the base OS via HTTP / PXE through a system installer driven by the Install Manager; on each managed node, the Node Configuration Manager (NCM) runs components (CompA, CompB, CompC) that configure the corresponding services (ServiceA, ServiceB, ServiceC), and the SW Package Manager (SPMA) handles RPMs / PKGs.
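A minimal sketch of the managed-node flow suggested by this architecture: fetch the node's XML profile from the configuration server over HTTP, then let each per-service component act on its own part of the profile. The URL, XML layout and component names are assumptions for illustration; real NCM components are implemented differently (as Perl modules), so this is not the actual code path.

```python
# Sketch of the managed-node flow: fetch the XML profile over HTTP and let
# per-service "components" configure their services. URL, XML layout and
# component names are invented for illustration.
import urllib.request
import xml.etree.ElementTree as ET

PROFILE_URL = "http://config-server.example.org/profiles/lxb0001.xml"  # assumed

def fetch_profile(url: str) -> ET.Element:
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

def configure_ntp(profile: ET.Element) -> None:
    # A component reads only the part of the profile it owns ...
    for server in profile.findall("./services/ntp/server"):
        print("would write", server.text, "to /etc/ntp.conf")

def configure_ssh(profile: ET.Element) -> None:
    port = profile.findtext("./services/ssh/port", default="22")
    print("would set sshd Port", port)

COMPONENTS = {"ntp": configure_ntp, "ssh": configure_ssh}

if __name__ == "__main__":
    profile = fetch_profile(PROFILE_URL)
    for name, component in COMPONENTS.items():
        print("running component", name)
        component(profile)
```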
Quattor at CERN (I)
Quattor is in complete control of the CC Linux boxes (~ 2600 nodes)
Over 100 NCM configuration components developed for full automation of (almost) all Linux services
LCG: components available for a fully automated LCG-2 configuration
EGEE: using Quattor for managing the gLite integration testbeds; gLite configuration mechanism integration underway
Quattor at CERN (II)
Flexible and automated reconfiguration / reallocation of CC resources demonstrated with the ATLAS TDAQ tests:
- Creation of an ATLAS DAQ / HLT test cluster
- Automated configuration of specific system and network parameters according to ATLAS requirements
- Automated installation of ATLAS software in RPM format
- Reallocated a significant fraction of LXBATCH (700 nodes) to the new cluster during June/July
- All resources reinstalled and re-integrated into LXBATCH in 36h
Linux for Control systems (“LinuxFC”): Quattor-based project for managing Linux servers used in LHC Control systems
- Strict requirements on system configuration and software management, e.g. versioning and cloning of configurations, rollback of configuration and software, validation and verification of configurations, remote management capabilities, …
Quattor outside CERN
Sites using Quattor in production: 17
- 13 LCG sites vs. 3 non-LCG sites, including NIKHEF, DESY, CNAF, IN2P3 (LAL/DAPNIA/CPPM), …
- Used for managing grid and/or local services
- Ranging from 4 to 600 nodes; total ~ 1000 nodes, with plans to grow to ~ 2600
Quattor Next Steps
Improvements in the Configuration DB:
- Security: ACL support for CDB configuration templates – control who can access which templates with an ACL-based mechanism
- Support for CDB namespaces (“test” and “production” setups, managing multiple sites)
- Performance improvements (SQL back-end)
Security: deployment of secure XML profile transport (HTTPS)
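As a rough illustration of the template-ACL idea (not the actual CDB mechanism; the namespace layout, users and rules are assumptions), a check like the following could decide whether a user may read or modify templates under a given namespace:

```python
# Hypothetical sketch of ACL-protected template access -- not the real CDB
# implementation. Namespaces, users and rules below are invented examples.
from fnmatch import fnmatch

# Each rule: (user or group, template path pattern, allowed operations)
ACL_RULES = [
    ("batch-admins", "production/cluster/lxbatch/*", {"read", "write"}),
    ("atlas-tdaq",   "test/atlas/*",                 {"read", "write"}),
    ("*",            "production/*",                 {"read"}),  # world-readable
]

def is_allowed(user: str, template: str, operation: str) -> bool:
    """Return True if any rule grants 'operation' on 'template' to 'user'."""
    for who, pattern, ops in ACL_RULES:
        if who in ("*", user) and fnmatch(template, pattern) and operation in ops:
            return True
    return False

print(is_allowed("atlas-tdaq", "test/atlas/hlt_node.tpl", "write"))            # True
print(is_allowed("atlas-tdaq", "production/cluster/lxbatch/a.tpl", "write"))   # False
```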
http://cern.ch/lemon
Lemon
Lemon (LHC Era Monitoring) is a client-server tool suite for monitoring status and performance, comprising:
- a monitoring agent running on each node and sending data to the central repository
- sensors to measure the values of various metrics (managed by the agent); several sensors exist for node performance, process, hw and sw monitoring, database monitoring, security, alarms, and “external” metrics, e.g. power consumption
- a central repository to store the full monitoring history; two implementations, Oracle or flat-file based
- an RRD/Web based display framework
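A toy sketch of the agent/sensor pattern described above, assuming a made-up metric ID, sample format and repository endpoint (the real Lemon agent, sensors and wire protocol differ):

```python
# Toy agent/sensor loop in the spirit of Lemon: a sensor produces metric
# samples, the agent timestamps them and ships them to the repository over
# UDP. Metric IDs, message format and endpoint are invented for illustration.
import os
import socket
import time

REPOSITORY = ("monitoring-repo.example.org", 12409)  # assumed host/port

def load_average_sensor() -> float:
    """Example sensor: 1-minute load average of this node (Unix only)."""
    return os.getloadavg()[0]

SENSORS = {
    20002: load_average_sensor,   # metric id -> sensor callable (id is made up)
}

def agent_loop(hostname: str, interval_s: int = 30) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        now = int(time.time())
        for metric_id, sensor in SENSORS.items():
            value = sensor()
            # One sample per datagram: "<host> <metric> <timestamp> <value>"
            sock.sendto(f"{hostname} {metric_id} {now} {value}".encode(), REPOSITORY)
        time.sleep(interval_s)

if __name__ == "__main__":
    agent_loop(socket.gethostname())
```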
Architecture
[Architecture diagram] Monitoring agents on the nodes run the sensors and send their samples (TCP/UDP) to the central Monitoring Repository, which has SQL and RRDTool/PHP back-ends; correlation engines and the Lemon CLI access the repository (via SOAP), while the RRDTool/PHP pages served by Apache over HTTP give users on their workstations access to the data in a web browser.
Lemon Web interface
Using a web-based status display:
- CC Overview
- Clusters and nodes
- VO’s
- Batch system
- Database (Oracle) Monitoring
Lemon Deployment
CERN Computer Centre: ~ 400 metrics sampled at intervals from 30s to 1d; ~ 1.5 GB of data / day on ~ 2600 nodes
Interfaced to Quattor: monitoring configuration kept in CDB, discrepancy detection
Outside CERN-CC:
- LCG sites (180 sites with 1,100 nodes – used by GridIce)
- AB department at CERN (~100 nodes), CMS online (64 nodes and planning for 400+)
- Others (TU Aachen, S3group/US, BARC India, evaluations by IN2P3, CNAF)
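A back-of-envelope check of the quoted figures, taken at face value (sampling intervals actually vary per metric, so the per-node numbers are only indicative):

```python
# Rough arithmetic on the quoted deployment figures; purely illustrative.
nodes = 2600
metrics_per_node = 400
data_per_day_gb = 1.5

data_per_node_mb = data_per_day_gb * 1024 / nodes
print(f"~{data_per_node_mb:.2f} MB of monitoring data per node per day")

# If every metric were sampled every 30 s (an upper bound -- real intervals
# range up to one day), a node would produce:
samples_per_metric = 24 * 3600 // 30
print(f"up to {samples_per_metric * metrics_per_node:,} samples per node per day")
```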
Lemon Next Steps
Service based views (user perspective):
- Synoptic view of which services are running and how – appropriate for end users and managers
- Needs to be built on top of Quattor and Lemon; will require a separate service definition DB
Alarm system for operators:
- Allow operators to receive, acknowledge, ignore, hide and process alarms received via Lemon
- Alarm reduction facilities
Security:
- SSL (RSA, DSA or X509) based authentication and possibility of encryption of data between agent and server
- Access: XML based secure access to Repository data
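A minimal sketch of what SSL-protected agent-to-server transport could look like, using Python’s standard ssl module; host name, port and certificate paths are placeholders, and the actual Lemon mechanism may differ:

```python
# Sketch of mutually-authenticated TLS between a monitoring agent and the
# repository. Endpoint and certificate paths are assumptions, not the real
# Lemon configuration.
import socket
import ssl

SERVER = ("monitoring-repo.example.org", 12409)   # assumed endpoint (TCP)

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.load_verify_locations("/etc/lemon/ca.pem")           # trust the site CA
context.load_cert_chain("/etc/lemon/agent-cert.pem",         # agent's X509 cert
                        "/etc/lemon/agent-key.pem")          # and private key

with socket.create_connection(SERVER) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=SERVER[0]) as tls:
        # Samples sent over this channel are both authenticated and encrypted.
        tls.sendall(b"lxb0001 20002 1131972300 0.42\n")
```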
http://cern.ch/leaf
LEAF - LHC Era Automated Fabric
LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and Lemon:
HMS (Hardware Management System):
- Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
- HMS implementation is CERN specific (based on Remedy workflows), but concepts and design should be generic
SMS (State Management System):
- Automated handling (and tracking) of high-level configuration steps, e.g. reconfigure and reboot all cluster nodes for a new kernel and/or a physical move
- Heavily used during this year’s ATLAS TDAQ tests
GUI for HMS/SMS being developed: CCTracker
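A toy sketch of the SMS idea: nodes carry a high-level state, and the workflow validates each transition and triggers the corresponding reconfiguration steps. The states, allowed transitions and actions below are assumptions for illustration, not the actual SMS model.

```python
# Toy state-management workflow in the spirit of SMS. States, allowed
# transitions and the actions attached to them are invented for illustration.
ALLOWED_TRANSITIONS = {
    ("production", "standby"):    ["drain_batch_queue"],
    ("standby", "maintenance"):   ["open_vendor_call"],          # hands over to HMS
    ("standby", "reinstall"):     ["update_cdb_profile", "trigger_pxe_reinstall"],
    ("reinstall", "production"):  ["verify_with_lemon", "enable_batch_queue"],
    ("maintenance", "standby"):   [],
}

def request_transition(node: str, current: str, target: str) -> list[str]:
    """Validate a state change and return the actions the workflow would run."""
    actions = ALLOWED_TRANSITIONS.get((current, target))
    if actions is None:
        raise ValueError(f"{node}: transition {current} -> {target} not allowed")
    for action in actions:
        print(f"{node}: running step '{action}'")
    return actions

# Example: take a batch node out of production and reinstall it for a new cluster.
request_transition("lxb0001", "production", "standby")
request_transition("lxb0001", "standby", "reinstall")
```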
CCTracker
Visualize, locate and manage CC objects using high-level workflows:
- Visualize physical location of equipment and its properties
- Initiate and track workflows on hardware and services, e.g. add/remove/retire operations, update properties, kernel and OS upgrades, etc.
Summary
ELFms in smooth operation at CERN and other T1/T2 institutes, for grid and local services
New domains being entered:
- Online DAQ/HLT farms: ATLAS TDAQ/HLT tests
- Accelerator controls: LinuxFC project
Core framework developments finished and matured, but still work to be done:
- CDB extensions
- Displays for Operator Console and Service Status Views
- Security
http://cern.ch/elfms