Installing and Running a CMS T3 Using OSG Software - UCR
Bill Strossman
Overview of Cluster
- 1 head node
- 4 storage nodes (top, bottom, strange, charm)
- 4 Apple Xserve RAID boxes
  - 2 RAID 0 arrays of 7 500 GB drives each
  - Arrays form a 7 TB logical volume (RAID 00?)
- 10 compute nodes
- Warewulf 2.4 clustering software (now 2.6)
- SL 3.0.5 installed initially (now SL 4.5, x86_64)
  - 32-bit compatibility libraries, compilers installed
- Compute nodes get their system from an image file via TFTP
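
The storage figures above follow from simple arithmetic. A minimal sketch, assuming nominal (decimal) drive sizes and ignoring filesystem overhead; the cluster-wide total is derived here and is not stated on the slide:

```shell
#!/bin/sh
# Capacity arithmetic for the Xserve RAID boxes, using the slide's figures.
drives_per_array=7   # drives in each RAID 0 array
arrays_per_box=2     # RAID 0 arrays per Xserve RAID box
drive_gb=500         # nominal size of one drive, in GB
boxes=4              # Xserve RAID boxes in the cluster

# RAID 0 has no redundancy, so nominal capacity is the sum of the drives.
box_gb=$((drives_per_array * arrays_per_box * drive_gb))
total_gb=$((box_gb * boxes))

echo "per box: $((box_gb / 1000)) TB"
echo "cluster: $((total_gb / 1000)) TB"
```

Because both arrays are RAID 0, the loss of any one of the 14 drives in a box would take out that box's whole logical volume.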
Head Node
- (2) dual-core AMD Opteron 275 CPUs
- 4 GB RAM
- (2) 250 GB drives, mirrored
- (3) 1 Gb Ethernet ports
Storage and Compute Nodes
- Storage nodes
  - (2) AMD Opteron 250
  - 4 GB RAM
  - (1) 250 GB disk
  - (2) Apple FC ports
  - (2) 1 Gb Ethernet ports
  - 7 TB of externally attached FC disk storage (Apple Xserve RAID)
- Compute nodes
  - (2) AMD Opteron 275
  - 4 GB RAM
  - (2) 250 GB disks
  - (2) 1 Gb Ethernet ports
UCR-HEP
Close-up
Challenges
- Hardware failures
- Installation of OSG software
  - Compute Element
  - GUMS
  - Squid
  - Condor
  - OSG Client
- Installation of CMS software
  - CMSSW
  - PhEDEx
- Operation and issues
Hardware Failures
- Many, many bad RAM sticks
  - At least one-third of the RAM has been replaced
  - Vendor acknowledges getting a bad batch
- Drive failures
  - 2 in Apple Xserve RAID boxes
  - 1 in storage node
- Fans
  - 5 power supply fans
  - 2 CPU fans
- Miscellaneous
  - Fibre Channel controller in Apple Xserve RAID
  - Heat-related incidents in Apple Xserve RAID box
Installation of OSG Software
- Compute Element
  - Started at 0.40, now at 0.80
  - Had a lot of help with the first install
  - Upgrades have been fairly smooth; had to make changes to site-local-conf.xml and storage.xml, among others (more on this later)
- GUMS
  - Was a nightmare to get going at first
  - All notices about adding a new VO, etc. list instructions only for sites using a gridmap file
  - Dies on rare occasions, which leaves the site unable to run grid jobs and produces “red” SAM results
Installation of OSG Software (continued)
- Squid
  - Very easy to get up and running
  - Documentation is good
- Condor
  - Had a lot of trouble getting it to work with both internal and external networks at first
  - Upgrades are fairly easy, as configuration files can be copied over for the most part
  - Does not handle group quotas in a desirable way
- OSG Client
  - Easy to install and configure
  - Need to configure Condor for this as well
Installation of CMS Software
- PhEDEx
  - A nightmare to get going initially
    - 32/64-bit incompatibilities (we were x86_64)
    - SL3 vs. SL4 (we were running SL4)
    - Documentation was sketchy; much better now
    - Had a lot of help to finally get it going
  - Upgrading from 2.5 to 3.0 was pretty painless
    - Needed to modify storage.xml, ConfigPart.Download, Config.Prod, etc. for srmv2 and the new site name
    - Had to get new database roles and update DBParam
  - Had to request a link to UCSD to retrieve data that was stored only there
Installation of CMS Software (continued)
- CMSSW
  - Easy to install once you match up the architecture
  - Not so easy to get working properly
  - After installing a few versions by hand, I jumped at the opportunity to have Bockjoo Kim install and maintain the many versions using a grid job
  - We do not have a Storage Element (yet), so it is necessary to maintain site-local-conf.xml and storage.xml versions that are separate from PhEDEx for stage-out to UCSD
Software Layout (storage nodes were a boon!)
- top
  - CE
  - Condor submit host for OSG users
- bottom
  - GUMS
  - PhEDEx
- charm
  - Squid
- strange
  - OSG Client
  - Condor submit host for local users
Operations and Issues
- PhEDEx
  - VOMS proxy vs. grid proxy
    - Grid proxies can be set not to expire for several months
    - Some sites accept only a VOMS proxy, which has a maximum lifetime of 8 days
    - Need a good, reliable way to automate renewal of the VOMS proxy (modify and use the provided MyProxy server configuration file?)
  - Need to make sure that enough space exists for the desired dataset
    - It has been necessary to move things around and create soft links to maintain the directory structure
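
Until renewal is fully automated, a periodic check of the proxy's remaining lifetime can at least warn before the 8-day limit is reached. The following is only a sketch under stated assumptions: the function name, the one-day threshold, and the mail address are hypothetical, and the commented usage assumes that `voms-proxy-info --timeleft` prints the remaining seconds on the local installation.

```shell
#!/bin/sh
# Decide whether a proxy needs renewing, given its remaining lifetime.
# $1 = seconds left, $2 = minimum acceptable seconds (hypothetical threshold).
# Returns 0 (true) when renewal is needed.
proxy_needs_renewal() {
    left=$1
    min=$2
    # Treat empty or non-numeric input (e.g. no proxy found) as "renew".
    case $left in
        ''|*[!0-9]*) return 0 ;;
    esac
    [ "$left" -lt "$min" ]
}

# Example cron-driven use (commented out; the command's availability and
# the notification step are assumptions about the local setup):
#   left=$(voms-proxy-info --timeleft 2>/dev/null)
#   if proxy_needs_renewal "$left" 86400; then
#       echo "PhEDEx proxy expires in under a day" | mail -s "proxy" admin@example.org
#   fi
```

Keeping the decision separate from the command that reads the proxy makes the threshold logic easy to check without a grid environment.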
Operations and Issues (continued)
- OSG software
  - GUMS must be restarted occasionally
  - Renew host and service certificates annually
  - Update CRLs (now automatic if enabled)
  - Getting our site to be fully “green”
    - Needed to obtain the file oneEvt.root in order to pass the SAM “mc” test
    - Needed to obtain the 24 QCD root files in order to pass the SAM “analysis” test
  - Making sure that it stays green
  - It can be difficult to enforce group quotas in Condor, so I modified a script to assign a group based on the mapped userid (UCSD)
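
The idea of assigning a group from the mapped userid can be illustrated as follows. This is not the script mentioned above, only a sketch of the approach: the userid patterns and group names are hypothetical placeholders, and it assumes that jobs carry the group in Condor's `+AccountingGroup` submit attribute.

```shell
#!/bin/sh
# Map a locally mapped userid to a Condor accounting group.
# The userid patterns and group names below are hypothetical placeholders.
group_for_user() {
    case $1 in
        uscms*) echo "group_cms" ;;
        osg*)   echo "group_osg" ;;
        *)      echo "group_local" ;;
    esac
}

# Build the submit-file line that tags a job with "<group>.<user>".
accounting_line() {
    user=$1
    printf '+AccountingGroup = "%s.%s"\n' "$(group_for_user "$user")" "$user"
}

accounting_line uscms01
```

Anything that does not match a known pattern falls through to a default group, so unmapped or local users are never left without one.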
Operations and Issues (continued)
- Other
  - Non-OSG upgrades
    - SL 3.0.5 to SL 4.5
    - Warewulf 2.4 to Warewulf 2.6
  - Regularly check that:
    - Disk volumes are not full
    - All nodes are up and all NFS mounts are intact
    - All fans are operational
    - System logs do not contain critical error messages
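
Two of the routine checks above, full volumes and missing NFS mounts, lend themselves to a small script. A minimal sketch, assuming POSIX `df -P` output and a mounts table in `/proc/mounts` format; the function names, the threshold, and the file holding the expected mount points are hypothetical:

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` formatted text on stdin so it can be fed canned input.
full_volumes() {
    threshold=$1
    awk -v max="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= max) print $6 " " use "%"
    }'
}

# Print expected mount points that are absent from a mounts table.
# $1 = file with one expected mount point per line
# $2 = file in /proc/mounts format
missing_mounts() {
    while read -r mp; do
        [ -z "$mp" ] && continue
        awk -v m="$mp" '$2 == m { found = 1 } END { exit !found }' "$2" \
            || echo "$mp"
    done < "$1"
}

# Typical use (commented out; paths are site-specific assumptions):
#   df -P | full_volumes 90
#   missing_mounts /etc/expected-mounts /proc/mounts
```

Run from cron on each node, any output from either function is a reason to look closer; empty output means both checks passed.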
Upgrade Notes
- OSG
  - mv osgce osgce-0.6; mkdir osgce
  - Update pacman, if necessary
  - Follow the instructions on the Compute Element Install twiki
  - Rename the new condor_config and copy the old one over
  - Certs, gsi-authz.conf, and prima-authz.conf are physically located in /etc/grid-security, so there is no need to worry about them
  - vdt-control --on
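
The directory rotation and the condor_config swap in the steps above can be wrapped in two small helpers so that an upgrade never overwrites an earlier backup. This is a sketch only: the function names and the `.dist` suffix are hypothetical, and the location of condor_config inside the install tree is left to the caller because the slide does not give it.

```shell
#!/bin/sh
# Set aside the current install directory and create an empty one for the
# new release. $1 = parent directory, $2 = install dir name,
# $3 = suffix for the old copy (e.g. the old version number).
rotate_install_dir() {
    parent=$1; name=$2; suffix=$3
    if [ -e "$parent/$name-$suffix" ]; then
        echo "refusing to overwrite $parent/$name-$suffix" >&2
        return 1
    fi
    mv "$parent/$name" "$parent/$name-$suffix" && mkdir "$parent/$name"
}

# Keep the freshly installed condor_config for reference and put the
# previous one back in its place. $1 = old config, $2 = new config.
restore_condor_config() {
    old=$1; new=$2
    mv "$new" "$new.dist" && cp -p "$old" "$new"
}

# Example (commented out; the parent directory is an assumption):
#   rotate_install_dir /opt osgce 0.6
```

Keeping the shipped condor_config under a different name makes it easy to diff against the restored one and spot settings introduced by the new release.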
Upgrade Notes (continued)
- PhEDEx (from 2.x to 3.x)
  - mv /home/phedex /home/phedex-old
  - mkdir /home/phedex
  - chown -R phedex.phedex /home/phedex
  - Follow the instructions at the Site deployment link under Documentation (PhEDEx home page)
  - Copy over and modify the configuration files mentioned earlier
Future Plans
- dCache Storage Element
  - Head node and two storage nodes
  - Two Xserve RAID boxes
- Move from Warewulf to Perceus
- SL 5.x (?)
- OSG-1.0
- MyProxy server to automate renewal of the PhEDEx VOMS proxy (?)
Acknowledgements
- Terrence Martin (UCSD)
- Frank Wuerthwein (UCSD)
- Brian Bockelman (UNL)
- Bockjoo Kim (UF)
- Burt Holzman (FNAL)
- Robert Clare (UCR)
- OSG Operations Staff
- PhEDEx Hypernews contributors
- UCR Network Operations
- Bob Grant (UCR Comp. & Comm.)