Deployment work at CERN: installation and configuration tasks
WP4 workshop, Barcelona project conference, 5/03
German Cancio, CERN IT/FIO
WP4 deployment at CERN CC

Agenda
- Why, what, when
- CERN enhancements
- Issues
- Summary
Why?
- The CERN Computer Centre is being scaled up in capacity to handle LHC requirements
  - O(10K) nodes expected in 2006/7
  - Currently ~1.5K
  - Concentrating on Linux on Intel hardware
- The current management software in production was written (many!) years ago, for a smaller number of heterogeneous RISC UNIX nodes
  - No central configuration mechanism (>20 databases holding configuration information!)
  - Insufficient automation (manual steps required for node installation and configuration)
  - Scalability issues, e.g. software distribution
  - Outdated technology, obsoleted by upcoming de-facto standards (e.g. RPM)
- CERN's aim in participating in WP4 was to find solutions to these problems!
What?
- Central Configuration Database
  - CDB set up to host the configuration profiles of the main CERN interactive and batch services (LXPLUS and LXBATCH)
  - Templates created for hardware information (currently: new SEIL nodes)
  - System information including partition layout
  - Software package configuration structured in templates (RH distribution, CERN CC packages, EDG packages)
- Software Distribution
  - SWRep: hosts RPM packages
  - SPMA: replaces legacy software update mechanisms (ASIS, rpmupdate, SUE security)
- Node configuration
  - CCM: access to CDB information via the NVA API
  - NCM: will replace the CERN 'SUE' node management framework
When?
- SPM (SPMA and SWRep)
  - Started a pilot for phasing out the CERN SW distribution systems on batch worker nodes for LCG-1
  - Using HTTP as the package access protocol (scalability)
  - >100 nodes currently running it in production
  - Deployment page:
- CDB
  - Database in production status, holding site-wide, cluster-specific and node-specific configurations for ~350 clients
  - Using the WP4 global schema with site-specific extensions
  - Enhancements: GUI (next slide)
- CCM
  - About to be deployed
- NCM
  - As soon as it becomes available with components
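The HTTP-based package access can be sketched as follows; the server name, repository path and package name are illustrative assumptions, not the actual CERN layout:

```python
# Minimal sketch of HTTP-based package access (hypothetical names):
# because packages are fetched with plain HTTP GETs, any web server or
# cache can serve them, which is what gives SPMA its scalability.
def rpm_url(server: str, package: str) -> str:
    """Build the download URL for a package on a SWRep-style server."""
    return f"http://{server}/swrep/{package}"

# A client would then fetch the package with any HTTP library, e.g.:
#   urllib.request.urlretrieve(rpm_url("swrep.cern.ch", "foo-1.0.rpm"), dest)
```

Since the protocol is plain HTTP, standard caches and mirrors can be added in front of the repository without changing the clients.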
CERN Enhancements (I)
- GUI for CDB
  - PanGUIn: graphical front end for browsing, modifying, validating and committing CDB templates
  - Generic development (no site dependencies): can be fed back into EDG
- Configuration access shortcuts and migration
  - CCM provides a 'low-level' API for traversing the node configuration profile tree
  - CCConfig: 'high-level' API giving access to recurring information chunks (partitions, package lists, network information, cluster name, etc.)
  - CCConfig allows access to CDB information from the current node configuration system (SUE)
- Server clustering solution
  - For CDB (XML profiles) and SWRep (RPMs over HTTP)
  - Replication done with rsync
  - Load balancing done with simple DNS round-robin
  - Currently 3 servers in production (800 MHz, 500 MB RAM, Fast Ethernet)
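As a rough sketch of the clustering scheme (the alias and hostnames are illustrative assumptions), DNS round-robin simply hands successive lookups the next replica's address:

```python
from itertools import cycle

# Hypothetical stand-in for DNS round-robin: each lookup of the single
# service alias returns the next server in the rotation, so client load
# spreads evenly across the rsync-replicated (hence identical) servers.
_rotation = cycle(["swrep01.cern.ch", "swrep02.cern.ch", "swrep03.cern.ch"])

def resolve(alias: str = "swrep.cern.ch") -> str:
    """Simulate one round-robin DNS lookup of the service alias."""
    return next(_rotation)
```

Replication itself is an ordinary rsync of the XML-profile and RPM directories from the master to each replica, so any server in the rotation can answer any request.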
CERN Enhancements (II)
- Release life cycle in CDB
  - A high-level configuration change may work on 99.9% of the nodes but break on some
  - Even though CDB keeps history information and can roll back, it cannot easily keep multiple configurations in parallel
  - Solution: keep 'test', 'new', 'pro' and 'old' templates simultaneously in CDB, with migration scripts that perform the transitions 'test' -> 'new' -> 'pro' -> 'old'
  - Most node configs point to 'pro'; test nodes use 'test', certification nodes use 'new'
  - If a node fails with a new 'pro' configuration, it is pointed to 'old' while the problem is investigated, instead of rolling back the complete CDB
CERN Enhancements (III)
[Diagram: release life cycle in CDB — node object templates (e.g. lxplus001,32,33; the rest of lxplusXXX) point to the 'test', 'new', 'pro' or 'old' templates; the test2new and new2pro migration scripts move template maintainers' changes from 'test' to 'new' to 'pro', with the previous 'pro' kept as 'old'.]
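The life cycle above can be sketched as follows; the stage and script names come from the slides, while the data structure and function are illustrative assumptions:

```python
# Hypothetical sketch of the CDB release life cycle: four template
# stages exist in parallel, and a migration step ("test2new", "new2pro")
# promotes a configuration one stage along the chain.
def migrate(stages: dict, step: str) -> dict:
    src, dst = step.split("2")           # e.g. "new2pro" -> ("new", "pro")
    if dst == "pro":
        stages["old"] = stages["pro"]    # keep the previous 'pro' as 'old'
    stages[dst] = stages[src]
    return stages

cdb = {"test": "cfg-v3", "new": "cfg-v2", "pro": "cfg-v1", "old": "cfg-v0"}
migrate(cdb, "test2new")   # certification nodes now see cfg-v3 via 'new'
migrate(cdb, "new2pro")    # production moves to cfg-v3; cfg-v1 kept as 'old'
```

A node that breaks on the new 'pro' can then simply be repointed to 'old' without touching the rest of the database.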
Issues (I)
- Understand the current system
  - Identify current dependencies (on shared file systems, legacy services)
  - We cannot start over from scratch!
- Migration considerations
  - Production environment -> minimal downtimes!
  - Large computer centre
  - WP4 is not a drop-in, turn-key solution; it needs integration with the existing environment
  - Produce 'glue' procedures and scripts
- Scalability and performance on large clusters
  - SWRep: how many servers do we need for n nodes?
  - CDB: how long does it take to generate O(100), O(1000) configuration profiles? What are the CPU requirements?
Issues (II)
- Careful planning and execution required
  - Find appropriate time slots for phase-ins and phase-outs (one at a time!)
  - Announcements to users when non-transparent changes are required
  - Update operation manuals for site administrators and technicians
  - Problems and failures have a much higher impact than on EDG testbeds!
- Not all R3 functionality is available yet
  - ... some new procedures are harder to follow than the old ones!
- Site-wide deployment (outside the computer centre)
  - Until now, the same tools have been used for farms, servers and desktops
  - Desktop support was kept in mind during the WP4 design, but needs additional work
- WP4 deployment timescale will exceed the lifetime of EDG
  - Support and bugfixes
  - Future developments
  - Ports to new platforms (RH9, IA64, Solaris...)
Summary
- WP4 software is becoming a key part of CERN computer centre management
- CERN CC has very high requirements on stability and reliability
  - WP4 is coping with them! (at least so far...)
- Most valuable feedback for EDG, both in terms of testing and functionality enhancements