1
Status and plans of central CERN Linux facilities
Thorsten Kleinwort, IT/FIO-FS, for the PH/SFT Group
2
Introduction
- 2 years ago: Post-C5 presentation on the migration from RH6 to RH7
- Now: migration from RH7 to SLC3
- Achievements: scalability, tools framework, scope
- Conclusions & outlook

Scale: o(500). Tools: still in the migration phase from the old to the new tools; the new tools were written with the final scale in mind. Scope: mostly LXBATCH and LXPLUS, a well-defined environment for which the new tools were forged.
3
Operating System, Scalability, Tools framework, Scope
4
Operating System
SLC3 is the new default operating system:
- LXPLUS fully migrated (new h/w); small rest on RH7, o(5)
- LXBATCH 95% on SLC3
5
6
Operating System
SLC3 is the new default platform:
- LXPLUS fully migrated (new h/w); small rest on RH7, o(5)
- LXBATCH 95% on SLC3; the rest to be migrated soon (even old h/w)
- Other clusters are now migrated as well: LXGATE, LXBUILD, LXSERV, …
- Still some problems on special clusters with special hardware (disk and tape servers)
7
Operating System
Besides this 'main' OS we have RH ES:
- RH ES 2 as well as RH ES 3, needed for ORACLE
Now also supporting other architectures: ia64 (possibly x86_64)
- Needed for the Service Challenge (CASTORGRID)
No major problems, but:
- Additional work to provide and maintain those
- Minor differences, e.g. no AFS on ES, lilo on ia64

Today we have finished the migration and we are diminishing RH7. In the meantime we had to start supporting the licensed version of Linux, RH ES 2 and now RH ES 3, which added some complexity to the problem; e.g. RH ES does not support AFS. Now we are about to increase the complexity even further by going from the single 32-bit architecture i386 to supporting ia64, and possibly x86_64 as well (CASTORGRID, Service Challenge), on SLC3. No major problems, but additional work, like recompiling the binary RPMs, and minor problems, like different boot loaders on i386 and ia64.
8
Operating System, Scalability, Tools framework, Scope
9
Scalability
- Already reached 1000 nodes with RH7: automated node installation
- Now at 2200 Quattor-managed machines
- Machines arrive in bunches of o(100): installed, stress-tested, moved
- Cluster management is now automated as well:
  - Kernel upgrades on LXBATCH
  - Vault move / renumbering
  - Cluster upgrade to a new version of the OS

At our current scale there are always machines down, broken, being reinstalled, or on vendor call. Machines are now usually bought, moved and handled in big numbers (~100), e.g. for the vault move. (Re-)installations are done in big numbers as well, but single reinstalls still happen. Everything had to be automated to scale: the upgrade of the OS/kernel is fully automated and takes place whenever a machine is drained.
10
Cluster upgrade workflow
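As a rough illustration of the drained-node upgrade loop behind this workflow, here is a minimal Python sketch. The helper command reinstall-node, the batch size, and the checked bjobs message text are assumptions, not the actual Quattor/LSF tooling used at CERN.

```python
# Minimal sketch of a drain-then-upgrade loop: close LSF hosts in
# batches, wait until running jobs have finished, reinstall/upgrade,
# then reopen the hosts for batch work.
import subprocess
import time

def drained(host):
    """Return True once LSF reports no running jobs left on the host.
    The message text checked here is an assumption about bjobs output."""
    out = subprocess.run(["bjobs", "-u", "all", "-r", "-m", host],
                         capture_output=True, text=True)
    return "No unfinished job found" in (out.stdout + out.stderr)

def upgrade_cluster(hosts, batch_size=100):
    """Upgrade a cluster in batches of drained nodes."""
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            subprocess.run(["badmin", "hclose", host])   # stop accepting new jobs
        for host in batch:
            while not drained(host):                     # let running jobs finish
                time.sleep(600)
            subprocess.run(["reinstall-node", host])     # hypothetical install/upgrade trigger
            subprocess.run(["badmin", "hopen", host])    # put the node back into production
```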
11
Scalability
- Batch system LSF: we are up to jobs in ~2500 slots
- So far o.k., except for the AFS copy -> NFS
- The infrastructure has to scale as well: power, cooling, space, network, …
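To make the slot numbers concrete, here is a minimal sketch of summing configured versus used batch slots from LSF's bhosts report; the column layout is the standard bhosts output, but treat the parsing details as an assumption rather than site tooling.

```python
# Minimal sketch: sum configured vs. used batch slots from "bhosts -w"
# (columns: HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV).
# Parsing details are illustrative, not CERN tooling.
import subprocess

def slot_usage():
    out = subprocess.run(["bhosts", "-w"], capture_output=True, text=True, check=True)
    total_max = total_njobs = 0
    for line in out.stdout.splitlines()[1:]:          # skip the header line
        fields = line.split()
        if len(fields) < 5 or fields[3] == "-":       # skip hosts without slot info
            continue
        total_max += int(fields[3])                   # MAX: configured job slots
        total_njobs += int(fields[4])                 # NJOBS: slots currently used
    return total_njobs, total_max

if __name__ == "__main__":
    used, configured = slot_usage()
    print(f"{used} of {configured} job slots in use")
```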
12
Operating System, Scalability, Tools framework, Scope
13
Tools framework
- We adapted the EDG-WP4 tools for our needs
- With RH7 still hybrid with the old tools (SUE, ASIS); now clean on SLC3
- Improved and strengthened them in ELFms:
  - Quattor, with the SPMA and NCM configuration framework
  - CDB, the configuration database, with an SQL interface

With RH7 we had (old) ASIS as well as (new) SPMA; now only SPMA is used for s/w (RPM) installation. With RH7 we had (old) SUE as well as (new) NCM; now NCM is the only configuration tool. With RH7 we managed to bring together, behind a common interface, all (>25) different places where configuration information was stored; now only one database (CDB) is used for the storage. On RH7 we started to deploy the WP4 monitoring (Lemon) on our machines and later migrated its configuration into CDB; now we are starting to implement automatic recovery and fault tolerance. We have already met some scalability problems while going up in scale: the monitoring information in CDB caused a problem, CDB had to be sped up (>2:00 for recompiling the whole lot), and the state management was taken out of CDB and put into CDBSQL. Completely new LEAF tool suite: state management (SMS), to allow traceable states for machines not in production and to allow state transitions; hardware management (HMS), to allow better control of the lifetime of h/w from arrival to retirement, including e.g. vendor calls or moves/reassignments.
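As a toy illustration of the package-convergence idea behind SPMA (compare the packages a node should have with what is actually installed), here is a minimal sketch; the profile file and its format are assumptions, and the real SPMA reads the node's CDB profile and drives the RPM transactions itself.

```python
# Toy illustration of the SPMA idea: diff the desired package list
# against the installed RPMs and report what would have to change for
# the node to converge on its profile.
import subprocess

def installed_packages():
    out = subprocess.run(["rpm", "-qa", "--qf", "%{NAME}\n"],
                         capture_output=True, text=True, check=True)
    return set(out.stdout.split())

def desired_packages(profile="/etc/desired-packages.txt"):   # hypothetical path/format
    with open(profile) as f:
        return {line.strip() for line in f if line.strip() and not line.startswith("#")}

def report():
    have, want = installed_packages(), desired_packages()
    for pkg in sorted(want - have):
        print("missing:", pkg)      # would be installed to converge on the profile
    for pkg in sorted(have - want):
        print("extra:  ", pkg)      # would be removed to converge on the profile

if __name__ == "__main__":
    report()
```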
14
CDB: Web access tool
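Since the slides mention a read-only SQL interface to CDB (CDBSQL), below is a sketch of the kind of query a cluster operator might run against it. The DSN, account, table and column names are purely hypothetical; the real CDBSQL schema is not shown in the slides.

```python
# Hypothetical sketch of querying a read-only SQL view of CDB (CDBSQL)
# for all nodes of a cluster and their OS version. Names are invented
# for illustration only.
import cx_Oracle  # assuming an Oracle back end; any DB-API driver works the same way

def nodes_of_cluster(cluster, dsn="cdbsql-ro"):          # hypothetical DSN
    conn = cx_Oracle.connect("reader", "reader", dsn)    # read-only account (assumed)
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT hostname, os_version FROM node_profiles WHERE cluster = :c",
            c=cluster)                                    # hypothetical table/columns
        return cur.fetchall()
    finally:
        conn.close()

for host, os_version in nodes_of_cluster("lxbatch"):
    print(host, os_version)
```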
15
Tools framework
- We adapted the EDG-WP4 tools for our needs
- With RH7 still hybrid with the old tools (SUE, ASIS); now clean on SLC3
- Improved and strengthened them in ELFms:
  - Quattor, with the SPMA and NCM configuration framework
  - CDB, the configuration database, with an SQL interface
  - Lemon monitoring, including a web interface
16
Lemon Start Page
17
Lemon: e.g. LXBATCH
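The Lemon pages shown above display per-node and per-cluster metrics. As a purely illustrative sketch (not the actual Lemon sensor/agent protocol), the snippet below samples the kind of local metrics such a monitoring agent collects, such as load average and filesystem usage.

```python
# Illustrative sketch of the kind of per-node metrics a monitoring
# agent such as Lemon collects. This is not the Lemon sensor API or
# wire protocol, just a plain sampler printing "metric value" pairs.
import os
import shutil
import time

def sample_metrics():
    load1, load5, load15 = os.getloadavg()            # 1/5/15-minute load averages
    usage = shutil.disk_usage("/")                    # root filesystem usage
    return {
        "loadavg_1min": load1,
        "loadavg_15min": load15,
        "root_fs_used_pct": 100.0 * usage.used / usage.total,
        "sample_time": time.time(),                   # timestamp of the sample
    }

if __name__ == "__main__":
    for name, value in sample_metrics().items():
        print(f"{name} {value:.2f}")
```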
18
Tools framework
- We adapted the EDG-WP4 tools for our needs
- With RH7 still hybrid with the old tools (SUE, ASIS); now clean on SLC3
- Improved and strengthened them in ELFms:
  - Quattor, with the SPMA and NCM configuration framework
  - CDB, the configuration database, with an SQL (r/o) interface
  - Lemon monitoring, including a web interface
  - LEAF, the SMS and HMS framework
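To illustrate the idea behind the SMS state management mentioned above (traceable states for machines not in production, with controlled transitions), here is a minimal sketch; the state names and the allowed transitions are assumptions, not the real LEAF/SMS workflow.

```python
# Minimal sketch of SMS-style state management: every machine has a
# traceable state and only certain transitions are allowed. State names
# and the transition table are illustrative assumptions.
ALLOWED = {
    "production":  {"draining", "standby"},
    "draining":    {"maintenance", "standby"},
    "maintenance": {"standby", "retired"},
    "standby":     {"production", "maintenance"},
    "retired":     set(),
}

class Machine:
    def __init__(self, name, state="standby"):
        self.name, self.state, self.history = name, state, [state]

    def set_state(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.name}: {self.state} -> {new_state} not allowed")
        self.state = new_state
        self.history.append(new_state)               # keep a trace of all transitions

node = Machine("lxb0001")                            # hypothetical node name
node.set_state("production")
node.set_state("draining")
node.set_state("maintenance")
print(node.name, "history:", " -> ".join(node.history))
```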
19
LEAF: CCTracker & HMS
20
Tools framework
We rely on tools from other groups:
- All Linux versions come from Linux Support: the need for new versions increases their workload, too
- AIMS: our boot/installation service
- LANDB: now a SOAP interface instead of the web
- Good collaboration; the scale increases the pressure for robust tools on their side as well
21
Operating System, Scalability, Tools framework, Scope
22
Scope
The original scope for the framework was LXBATCH/LXPLUS. The framework has been adapted to other clusters:
- LXGATE, LXBUILD: similar to LXPLUS/LXBATCH
- Disk servers, tape servers: different h/w, larger variety, more special configuration
- Non-FIO clusters: LXPARC; GM (EGEE): several clusters used for tests and prototyping; GD (LCG test clusters)

Originally, while going from RH6 to RH7, we were also diminishing our other platforms to reduce the diversity. With the new tools in place, the number of (diverse) clusters increased again, e.g. disk servers and tape servers. For these two new types of clusters the tools had to be enhanced: new configuration components had to be written, others were not used; the h/w variety is much bigger on disk servers, so more kernel drivers are needed and kernel parameters have to be tweaked. There is still an ongoing issue with some tape-server kernel drivers on SLC3. The level of automation is lower for disk/tape servers than for CPU servers.
23
Scope
These new clusters:
- Increase the scale even further
- Enlarge the requirements for the tools, e.g. new NCM components, new SMS/HMS states and workflows, additional local users, …
- Come with new OS requirements, e.g. RH ES for ORACLE servers, ia64 support for the new CASTORGRID machines
Proper testing of new s/w, OS and kernel versions has to be done at the cluster level.
24
Fabric Services as part of the GRID
Additional LCG s/w was incorporated into our framework:
- All SLC3 LXBATCH nodes (>800 MHz) are Worker Nodes (WN); CERN-PROD is the biggest site with >1800 CPUs
- UI available on LXPLUS
- CE on LXGATE, 2 at the moment
- SE: a cluster of 6 machines running SRM and CASTORGRID
- All upgraded to LCG_2_4_0
25
GOC Entry for CERN-PROD
26
GRID monitoring
27
GRID resource infos
28
Conclusions & outlook
- Not only migrated to one new OS; next one: SLC4 or SLC5? The tools are ready, no major problems foreseen
- We have overcome some scalability issues; prepared to go to LHC scale
- Tools: gone from machine automation to cluster automation; improve usability, increase robustness, decrease the necessary expert level
- Scope: from LXBATCH/LXPLUS to many different clusters; how to manage non-FIO clusters?