Liverpool Site Report
HepSysMan, RAL, June 2018
Ste Jones, John Bland, Rob Fay
Introduction
Changes to our server room.
Outline of the current rack layout.
Our storage system.
Our progress on IPv6.
Recent procurement.
Recent changes.
Things coming up.
Changes to server room
Paid for by the University. We gave up part of our cluster room in return - it used to hold an IBM 370 mainframe (anyone remember ISPF? TSO? JCL?) and it was big and blue, I'm told.
We got a 300 kW chiller on the roof.
We got 4 new air-con units, amounting to 100 kW.
We got new electrical panels, hot/cold-aisle Eaton racks, and EMIB09/EMIB16 PDUs.
It took months to get it built and to move things over; we turned things off for a few days to make the move.
We got some monitoring goodies.
Result: much more efficient cooling, more redundancy, much more reliable.
Photography
For professional, stylish, modern grid site photography, contact John Bland.
Site layout - clusters
Seven WN hardware types: E5-2630V2, E5-2630V3, E V4, E5620, L5420, L5530, X5650.
Two OSs (SL6, C7).
Four "different" clusters to look after: ARC/Condor (C7), ARC/Condor (SL6), VAC (local), VAC (remote).
Condor C7 – HS06
Condor SL6 – 7,670 HS06
VAC (total) – 8,168 HS06
No Docker. No Singularity.
Total 26,634 HS06 (when it's all on). Slots: 2,498.
Aside: it's funny how slots always work out at about 10 x HS06, whatever our set-up. Given the error margins, we could just as well dispense with all the benchmarking and just express power as slots provided … imagine all the hand-wringing we could cut out at one stroke.
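As a rough check on that aside, using only the figures above: 26,634 HS06 spread over 2,498 slots comes to about 10.7 HS06 per slot, which is where the "about 10 x" rule of thumb comes from.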
Site layout - clusters
Liverpool also hosts a local batch system for our researchers that runs SLURM. It has about 120 slots, say 1,200 HS06.
Users have the use of a Gluster file system, and also access to space on our main grid DPM storage system.
Site layout - storage
All on DPM 1.9.2-1: 1 head node, 17 storage servers. All on CentOS. All Supermicro.
Total of around 1.5 PB.
One storage server uses ZFS, about 16% of the total. The others use RAID6, various sizes, various cards - 3ware, Adaptec, Areca, MegaRAID (which now dominates).
Site layout - storage
Some general storage notes from John Bland (admin):
ZFS is fine and works well, but needs some TLC wrt admin.
MegaRAID is more stable and easier to use than the older RAID cards.
Some DOME is installed but not active. We're waiting for more info on how to co-ordinate, esp. wrt Quota Tokens vs. Space Tokens, VO co-ordination, etc. - i.e. we need to see a centralised transition plan to move from DPM Classic to DPM DOME.
Site layout - storage
CentOS 7 storage notes from John:
C7 DPM is installed using kickstart and central Puppet for local settings, e.g. firewalls, ssh access, monitoring, etc. All our DPM systems run C7.5, but mixing SL6/C7 and different versions of DPM has generally been OK.
We then use the DPM Puppet modules to configure DPM itself using 'puppet apply' (controllability, independence, known state). Local Puppet ensures httpd is running but otherwise doesn't touch it.
Some "jiggery-pokery" with systemd and system limits, like open-file numbers, was needed to get mysql to work properly with DPM. Not obvious (see the sketch after this slide).
DPM by default has a cron job that gracefully restarts httpd. We've not had any issues on pool nodes, but occasionally this leaves httpd on the head node in a broken state; restarting it fully fixes it. httpd stability is the only major problem we have with DPM.
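For illustration only - this is a sketch of the kind of systemd tweak meant above, not necessarily the exact change made at Liverpool, and the unit name (mariadb.service) is an assumption - the usual way to raise the open-file limit for a systemd-managed MySQL/MariaDB is a drop-in override, since limits set in /etc/security/limits.conf are ignored for systemd services:

    # /etc/systemd/system/mariadb.service.d/limits.conf  (unit name assumed)
    [Service]
    LimitNOFILE=65535

    # pick up the override and restart the database:
    #   systemctl daemon-reload
    #   systemctl restart mariadb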
Site layout - storage (continued)
Other than getting to grips with systemd/firewalld, there's not much difference between SL6 and C7 for servers.
Desktops were a much bigger challenge, with lots of silly bugs, workarounds, broken backwards compatibility, and generally opaque and unpredictable behaviour.
In general I've found the transition from SL6 to C7 to be more time-consuming, problematic and frustrating than any of the SL4 > SL5 > SL6 upgrades.
Site layout – real vs VM
The only "operational grid service servers" we use that are bare metal are the storage servers and the storage head-node. The other service servers are virtual (KVM). They comprise:
BDII
ARC/Condor C7 head-node
ARC/Condor SL6 head-node
ARGUS
All run CentOS 7.5. We use no APEL system, since both ARC and VAC transmit their own accounting data directly.
We use several powerful servers to host the VMs. Typical is hepvm3: 128 GB RAM, 32 CPUs, ½ TB local space.
We also use other "real" servers with huge disks (e.g. 10 TB, Linux SW RAID) to host a large number of QCOW2 images, plus yum repositories that are regularly mirrored from the software providers - UMD, Puppet, CVMFS, NorduGrid, WLCG, CAs, Linux distros, EPEL, local software, HTCondor, etc. These servers also act as squids. (A sketch of the mirroring pattern is below.)
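Purely as an illustration of that mirroring pattern - the actual tooling and paths at Liverpool aren't stated here, and the repo id and directory below are made up - a local yum mirror of this kind is often just a nightly cron entry running reposync and createrepo per upstream repository:

    # /etc/cron.d/repo-mirror  (illustrative)
    30 2 * * * root reposync --repoid=epel -p /srv/repos && createrepo /srv/repos/epel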
IPv6
The ice may be starting to break on this. We have an allocation.
We tested IPv6 bandwidth some time ago and found it to be insufficient for operations. At present, we only have IPv6 on Perfsonar.
Progress depended on more bandwidth, which was being throttled. We had negotiations with the central providers at Liverpool; questions of cost arose, but talks are still ongoing on the final baseline. JISC is involving itself nationally, we believe.
In the last couple of months, further tests show that bandwidth has increased "a lot", but central IT has a new network team leader and we now have delays. Once these formalities are done, we should be able to get some services ported over.
The idea of a Science DMZ is becoming a known thing.
We want/need to develop best practices for IPv6, e.g. how to deploy - DHCP, auto, manual, etc. (an illustrative manual example is below).
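As a hedged example of the "manual" option only - the interface name is illustrative and the addresses come from the 2001:db8::/32 documentation prefix, not our allocation - a static IPv6 address on a C7 host using network-scripts looks like this in the interface's ifcfg file:

    # /etc/sysconfig/network-scripts/ifcfg-eth0  (illustrative)
    IPV6INIT=yes
    IPV6_AUTOCONF=no        # don't also pick up a SLAAC address
    IPV6ADDR=2001:db8:10::25/64
    IPV6_DEFAULTGW=2001:db8:10::1

DHCPv6 and auto-configuration are the other options mentioned above; which of the three counts as best practice for WNs versus service nodes is exactly what still needs to be worked out.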
Last purchase round
Workers: 3 * quad-node E5-2630V4 (12 motherboards in total), 40 slots per node, 10.36 HS06 per slot, total HS06.
DPM head node: E v4, 16 cores, 64 GB RAM, MegaRAID controller.
New "deployment server" (yum repos, file, etc.): CPU E v6, 8 cores, 32 GB RAM, 10 TB disk, S/W RAID.
Stuff done
Site layout database.
More CentOS 7 adoption: CentOS 7.4 to 7.5 (sec. updates drop off).
Migrated to Puppet 3 (now we need to go to 4 or 5).
Migrated to a new file server/repository server (the old one was tight on space and starting to break).
Updates to CVMFS (sl6), s.5 (c7).
UMD4.
All that Meltdown stuff.
Stuff to do
IPv6 (starting to look real).
VAC 3, VAC mcore, VAC pipes…
DOME transition.
Puppet 4, 5.
Scrap the last SL6 Condor WNs; move VAC to C7.
DUNE, with Man, Edin, IC, Shef, …
Look at/test the usefulness of containers - Docker, Singularity, etc.
Minor accounting bugs/discrepancies to look at for NorduGrid/IC/all ARC sites.
New purchase round, maybe.
Tom's user guide update/more usability tests.
Perhaps test Robin's ARGUS approach.
Etc.
End
Fin