RHUL Site Report
Govind Songara, Antonio Perez, Simon George, Barry Green, Tom Crane
HEP Sysman meeting @ RAL, June 2018
Manpower
- Antonio - Tier-2/Tier-3 Sysadmin - 1 FTE
- Govind - Tier-2 Grid Admin - 0.5 FTE
- Simon - Site Manager
- Barry - Hardware/Network Specialist
- Tom - All-rounder
Group Activities
ATLAS
- Benefits from strong collaboration in software support.
- Large Tier-3 batch compute and storage resources for data analysis.
- DAQ test systems.
Dark matter detector development
- Lab DAQ systems.
- Growing need for compute and storage resources to analyse data.
- Needs help with things like software installation and data movement.
Accelerator
- Small DAQ systems.
- Many small activities around the world which generate unique and valuable data sets.
- Simulation: both embarrassingly parallel and multi-process (MPI) computing.
- Software development infrastructure (e.g. CDash server).
Theory
- Occasional use of the Tier-3 cluster.
- Interest in MPI.
Tier-2
- 9 racks @ Huntersdale site; upgraded from 8 to 9 racks. The additional rack is off to the right of the photo.
- The installation by Dell has been a struggle: wrong-length cables had to be redone, inability to change delivery location/instructions, and labelling instructions not passed on.
Tier-2 Grid
- 9 racks in a modern machine room provided by the central IT service; run with ~0.5 FTE.
- 48 kHS06 in 150 worker nodes.
- New kit: 10 x Dell PowerEdge C6420, Xeon Gold 6148 @ 2.40 GHz, 40 cores; HS06 score with HT: 966 (SL6) / 1004 (CC7). About 10 kHS06 added, taking the total to about 48 kHS06 (see the capacity sketch below).
- CREAM/Torque CE x 2; our intention is to move from Torque to ARC/HTCondor.
- 1.4 PB DPM (Disk Pool Manager) SE in 47 servers.
- 8 misc servers, including 3 VM hosts, running standard network services, provisioning and Grid services.
- Network restructured, using stacking to solve some reliability problems (see next slides).
- Vac-in-a-Box nodes updated to the latest version and CC7, thanks to Andrew McNab; currently running on about 10 old nodes. The need for custom routing to put storage traffic on our private network prevents us from expanding this.
- Storage and compute pledges: this is the projection of our capacity; we will be adding more CPU power in future instead of storage.
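As a rough illustration of the capacity arithmetic above (only the C6420 figures come from the benchmark quoted on this slide; the per-node number for the existing farm is an assumed placeholder, not a measured value):

```python
# Minimal capacity-accounting sketch. Only the C6420 entry comes from the
# slide; the "existing WNs" figure is an assumed placeholder.
inventory = [
    # (label, number of nodes, HS06 per node with HT)
    ("existing WNs (assumed average)", 140, 270),
    ("Dell C6420, Xeon Gold 6148, CC7", 10, 1004),
]

total_nodes = sum(n for _, n, _ in inventory)
total_hs06 = sum(n * hs06 for _, n, hs06 in inventory)
print(f"{total_nodes} worker nodes, ~{total_hs06 / 1000:.0f} kHS06 in total")
```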
Network design 2016
- Network design from last year.
- Cacti doesn't show any bottlenecks.
- Outages caused by individual routers.
- Connection to the Tier-3 is slow.
Network design 2018
- Moved from 2 switches trunked with 4 x 40 Gb/s links to a ring of 3 switches with 2 x 40 Gb/s links between them.
- This configuration reduces downtime and improves failover (see the connectivity sketch below).
- 10 new worker nodes added.
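To illustrate why the ring helps with failover, here is a toy connectivity check (the switch names sw1..sw3 are illustrative placeholders, not the real device names): removing any single inter-switch link still leaves all three switches reachable from each other.

```python
# Toy check: a ring of three switches stays fully connected when any one
# inter-switch link fails. Switch names are illustrative placeholders.
def connected(nodes, links):
    """Simple flood/BFS: can every node be reached from the first one?"""
    seen, frontier = {nodes[0]}, [nodes[0]]
    while frontier:
        n = frontier.pop()
        for a, b in links:
            for nxt in ((b,) if a == n else (a,) if b == n else ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen == set(nodes)

switches = ["sw1", "sw2", "sw3"]
ring = {("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw1")}
for dead_link in ring:
    ok = connected(switches, ring - {dead_link})
    print(f"{dead_link[0]}-{dead_link[1]} down -> fully connected: {ok}")
```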
Tier-2 Grid Issues
Issues
- Some services were impacted during the network upgrade.
- Long-term expansion is limited by rack space and cooling; old kit needs to be decommissioned to make room for new nodes.
- The need for custom routing to put storage traffic on our private network currently prevents us from expanding VIAB.
Planning
- CC7 deployment: server, services, cluster, etc.
- CC7 upgrade of the DPM pool nodes.
- IPv6 rollout still ongoing.
Tier-3
- 8 racks packed into a home-made machine room, a long way from the Tier-2 on a 1 Gb/s link.
- 6 kHS06 in ~100 old worker nodes, mostly hand-me-downs from the Tier-2, some upgraded; Torque batch system.
- Storage: 304 TB Hadoop using 90 WNs; 262 TB NFS scratch over 7 servers; 11 TB NFS home.
- 11 servers running standard network services, mainly as VMs: linappservs1, 2 and 3, squids, NAT, DB, CUPS.
- Hadoop updated from 2.4.1 to 2.7.3; grown from 69 to 90 datanodes and from 80 TB to 304 TB (capacity check sketch below).
- Scratch extended by 80 TB.
- Network mapping with NetDisco and LLDP on all hosts.
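A quick way to confirm the new HDFS capacity after adding datanodes is to parse the standard `hdfs dfsadmin -report` output; a minimal sketch, assuming it runs on a host with the Hadoop client configured and sufficient HDFS privileges:

```python
# Minimal HDFS capacity check: parse `hdfs dfsadmin -report`.
# Assumes Hadoop client tools and configuration are present on this host.
import re
import subprocess

def hdfs_configured_capacity_tb():
    report = subprocess.run(
        ["hdfs", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The report starts with a line like "Configured Capacity: <bytes> (...)".
    match = re.search(r"Configured Capacity:\s*(\d+)", report)
    return int(match.group(1)) / 1024**4 if match else None

if __name__ == "__main__":
    cap = hdfs_configured_capacity_tb()
    print(f"Configured HDFS capacity: {cap:.1f} TB" if cap else "Could not parse report")
```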
Tier-3 Issues
Issues
- Aircon failures forced us to shut down the worker nodes and leave only critical services running; someone had to go to the university to leave doors open to avoid more damage. This prompted the creation of a shutdown script (see the sketch below).
- A Hadoop version mismatch between the namenode and the datanodes prevented the balancer from working, so the upgrade of some nodes failed because their datanode could not start on a full filesystem.
- Issue found in the HTCondor setup: HTCondor was crashing only when running on AMD CPUs.
Planning
- Migrate the batch system from Torque (PBS) to HTCondor.
- Upgrade the Hadoop test environment to version 3.
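The shutdown script could look roughly like the sketch below: mark the Torque worker nodes offline, then power them off, leaving the critical service hosts alone. The node names, the note text and the use of passwordless root SSH are assumptions for illustration, not details of the actual script.

```python
# Rough emergency-shutdown sketch for an aircon failure: offline the Torque
# worker nodes, then power them down over SSH. Names and commands below are
# illustrative assumptions, not the site's real configuration.
import subprocess

WORKER_NODES = [f"node{i:03d}" for i in range(1, 101)]  # hypothetical names

def offline_and_shutdown(node):
    # Stop new jobs landing on the node (needs access to the pbs_server).
    subprocess.run(["pbsnodes", "-o", "-N", "aircon failure", node], check=False)
    # Power the node off; assumes passwordless root SSH to the worker node.
    subprocess.run(["ssh", f"root@{node}", "shutdown", "-h", "now"], check=False)

if __name__ == "__main__":
    for node in WORKER_NODES:
        offline_and_shutdown(node)
```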
HPC
- New HPC cluster located in Huntersdale.
- 1 xCAT node + 11 compute nodes: 20 x Intel Xeon E5-2640 v4 @ 2.40 GHz, 62 GB RAM each.
- 22 TB of storage.
- Running mainly SLURM and EasyBuild under CentOS 7.2, with Salt as the configuration manager (node-state summary sketch below).
- CentOS 7 AD authentication setup ongoing; tests to link it with the Computer Centre AD are ongoing.
- Big problem with a massive disk failure, leading to a system rebuild.
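For day-to-day checks on the new cluster, a small wrapper around SLURM's `sinfo` can summarise node states; a minimal sketch, assuming it runs on a host with the SLURM client tools configured:

```python
# Minimal SLURM health summary: count nodes per state via sinfo.
# Assumes the SLURM client tools and cluster configuration are present.
import subprocess
from collections import Counter

def node_state_counts():
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %t"],  # one line per node: name state
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line.split()[1] for line in out.splitlines() if line.strip())

if __name__ == "__main__":
    for state, count in sorted(node_state_counts().items()):
        print(f"{state:12s} {count}")
```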
Thank You