Pete Gronbech GridPP Project Manager April 2016

Oxford Experiences: Pete Gronbech, GridPP Project Manager, April 2016

Oxford Grid Cluster: GridPP4 status, Autumn 2015
Current capacity 16,768 HS06 and 1300TB. DPM storage is a mixture of older Supermicro 26, 24 and 36 bay servers, but the major capacity is provided by Dell 510s and 710s: 12 bay, with 2 or 4TB disks.
The majority of CPU nodes are 'twin-squared' Viglen Supermicro worker nodes. Intel E5 8-core CPUs (16 hyper-threaded cores each) provide 1300 job slots with 2GB RAM per slot.
The Grid Cluster now runs HTCondor behind an ARC CE. April 2016
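As a rough back-of-envelope illustration of those headline figures (not an official benchmark calculation), the capacity per job slot works out to roughly 13 HS06:

```python
# Rough back-of-envelope using the headline figures quoted above;
# purely illustrative, not an official benchmark result.
total_hs06 = 16_768
job_slots = 1_300

print(f"{total_hs06 / job_slots:.1f} HS06 per job slot")  # ~12.9
```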

Oxford’s Grid Cluster April 2016

Primary VOs
ATLAS: Main priority. 6th largest UK site (by metrics), with ~1300 CPU cores and 1.3PB storage.
LHCb: 6th largest UK site (by metrics).
CMS Tier-3: Supported by RALPPD's PhEDEx server. Useful for CMS and, for us, for keeping the site busy in quiet times; during 2015 we delivered the same percentage as Bristol. However, it can block ATLAS jobs, which is less desirable during accounting periods.
ALICE Support: There is a need to supplement the support given to ALICE by Birmingham. It made sense to keep this in SouthGrid, so Oxford has deployed an ALICE VO box. Oxford provides roughly 40% of ALICE CPU (no storage).
Others: 5th largest by metrics.
April 2016

Other Oxford Work
UK Regional Monitoring: Oxford runs the Nagios-based WLCG monitoring for the UK. This includes the Nagios server itself and its support nodes (SE, MyProxy and WMS/LB), plus multi-VO Nagios monitoring.
IPv6 Testing: We have taken a leading part in IPv6 testing, with many services enabled and tested by the community. perfSONAR is IPv6-enabled; the RIPE Atlas probe is also on IPv6.
Cloud Development: OpenStack test setup (has run ATLAS jobs). VAC setup (LHCb, ATLAS & GridPP DIRAC server jobs). Viab now installed.
April 2016
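As an illustration of the kind of dual-stack check this testing involves, here is a minimal Python sketch that verifies a service resolves and connects over both IPv4 and IPv6; the hostname and port are placeholders, not Oxford's actual service endpoints:

```python
# Minimal dual-stack connectivity probe; the hostname below is a
# placeholder, not a real Oxford/GridPP service endpoint.
import socket

def check_reachable(host, port, family):
    """Try a TCP connect over the given address family (AF_INET / AF_INET6)."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no A/AAAA record for this family
    for fam, stype, proto, _name, sockaddr in infos:
        try:
            with socket.socket(fam, stype, proto) as s:
                s.settimeout(5)
                s.connect(sockaddr)
                return True
        except OSError:
            continue
    return False

host, port = "perfsonar.example.ac.uk", 443  # hypothetical endpoint
print("IPv4:", check_reachable(host, port, socket.AF_INET))
print("IPv6:", check_reachable(host, port, socket.AF_INET6))
```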

Approx. FTE per Grid Task
HTCondor Batch (0.2): Monitoring and tuning in the batch system.
DPM Storage: Installation, testing, monitoring and maintenance of SPACETOKENS, and h/w.
Grid Ops & Management: Installation and management system (Cobbler, Puppet). GGUS tickets etc.
Ovirt VM infrastructure (0.1): SL-based oVirt system set up to provide VM infrastructure for all service and test nodes.
IPv6 testing: IPv6 UI used by others in GridPP. IPv6 SE, perfSONAR early adopter, and also ATLAS RIPE probes.
Security Team membership: Meetings and running UK security tests.
VAC testing: Early adopter and tester.
Viab testing (0.05): Recent testing.
National Nagios infrastructure: GridPP-wide essential service.
VO Nagios for UK: GridPP-wide service.
Backup VOMS Server.
DPM on SL7 testing: Part of the storage group; early testing and bug discovery.
OpenStack Cloud test setup: OpenStack setup used for testing.
Total FTE ≈ 1.5
April 2016

Juggling Many Tasks Successfully April 2016

Plan for GridPP5 Manpower
Ramp down from 1.5 FTE to 0.5 FTE (an average of 1 FTE over GridPP5). Need to simplify the setup and reduce the number of development tasks.
April 2016

Plan for GridPP5 Storage
Upgrade DPM to 1.8.10 running on SL6, fully puppetized. (We initially tried to install on SL7 but found too many incompatibilities to deal with in the short time available.)
We were the first site to install the latest DPM with the puppet modules supplied by the DPM developers. This is a good thing, but it meant we were on the bleeding edge and fed back many missing features and bugs. Ewan worked with Sam to find all the missing parts. We are still discovering issues now, such as with publishing.
Decommission out-of-warranty h/w.
April 2016

Storage directions: reduced emphasis on storage?
We are heading in the direction of a T2C, and were acting as a test site for this way of working. ATLAS FAX is intended to deal with cases where the files are missing: jobs could be sent to a site knowing the data is not there, to force use of FAX, but whether this scales is as yet unknown. Actually using xrootd redirection without an SE is as yet untested. We are currently short of staff, so we will continue in traditional mode.
A question needs answering before the next h/w round: do we continue in the T2C direction, or actually act as a T2D for SouthGrid sites? We do have good networking and a fair-sized resource currently, and rather too much storage for it to be used just as a cache. What do the experiments want from us?
We will continue running an SE for the foreseeable future.
April 2016

Decommissioning old storage
17 Supermicro servers removed from the DPM SE (a reduction of 320TB; new total 980TB).
April 2016

Storage, March 2016: Simplified. Old servers removed, switched off or repurposed. Software up to date. OS up to date. April 2016

OpenStack
[Diagram: worker-node VMs (ATLAS image) ask Panda "I'm here, any work?" and receive jobs; an OpenStack infrastructure head node, run under P. Love's ATLAS account, with storage holding the ATLAS and CMS VM images, plus OpenStack networking.]
This model is good if you have an existing OpenStack infrastructure. It helps with OS independence, but it is quite complex and not straightforward to set up.
April 2016
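The "I'm here, any work?" exchange is the usual pilot pattern: a VM boots from the stored image, asks the experiment's workload manager for a payload, runs it, and repeats. A minimal Python sketch of that loop follows; the endpoint URL and payload format are entirely invented for illustration and are not the real PanDA interfaces:

```python
# Schematic pilot loop; the URL and response format are invented for
# illustration and do not correspond to the real PanDA interfaces.
import json
import subprocess
import time
import urllib.request

WORK_URL = "https://workload-manager.example.org/getjob"  # hypothetical

def fetch_job():
    """Ask the workload manager for a payload; return None if there is none."""
    with urllib.request.urlopen(WORK_URL, timeout=30) as resp:
        job = json.load(resp)
    return job if job.get("command") else None

while True:
    job = fetch_job()
    if job is None:
        time.sleep(60)                               # nothing to do, poll again later
        continue
    subprocess.run(job["command"], shell=False)      # run the payload (a command list)
    # report status back, clean up scratch space, then ask again
```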

VAC
[Diagram: a VAC factory node (a.k.a. the WN) runs VMs that ask "I'm here, any work?" and receive jobs. Installation: install SL, install the VAC rpms, scp the config file.]
Layers: hypervisor - KVM; manager - libvirt; virt manager - VAC; vrsh - installation/management image.
The VAC layer decides when to start a VM: every 30 seconds it checks whether it has an empty slot and, if so, starts one VM. The VM type (ATLAS, LHCb, GridPP…) and the prioritisation are defined in /etc/vac.conf (actually now multiple config files). Uses the site squid.
April 2016
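A toy Python model of that decision loop, as described on the slide; the slot count, machinetype names, weights and the start_vm() call are invented for illustration and are not Vac's actual configuration keys or internals:

```python
# Toy model of the VAC-style decision described above: every 30 seconds,
# if there is a free slot, pick a VM type by configured priority and
# start one VM.  Slot count, names, weights and start_vm() are invented.
import random
import time

SLOTS = 8                                          # hypothetical slots on this factory
running = []                                       # machinetypes of currently running VMs
priorities = {"atlas": 5, "lhcb": 3, "gridpp": 1}  # stand-in for /etc/vac.conf settings

def start_vm(machinetype):
    print(f"starting one {machinetype} VM")
    running.append(machinetype)

while True:
    if len(running) < SLOTS:
        # choose which experiment's VM to boot, weighted by priority
        choice = random.choices(list(priorities), weights=priorities.values())[0]
        start_vm(choice)
    # (the real system would also remove finished VMs from the running list)
    time.sleep(30)                                 # re-evaluate every 30 seconds
```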

viab
Configured mainly by web pages. Only the first node has to be installed manually (e.g. from a USB stick); all the rest can boot from any of the other installed nodes. Each node runs dhcp, tftp and a squid cache so it can act as an installation server.
Everything comes from the web, including certificates; but having the private parts on the web would be bad, so they are encrypted with a passphrase that is stored locally and must be copied to each node. You only have to copy the contents of /etc/viabkeys and republish the rpm via the web page.
All nodes network boot and always reinstall. Overall a simpler setup.
April 2016
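A minimal sketch of the key handling described above, assuming the encrypted host key is fetched over HTTP and the passphrase lives under /etc/viabkeys; the URL, file names and cipher choice are illustrative assumptions, not viab's actual layout:

```python
# Illustrative only: fetch an encrypted host key from the install web
# server and decrypt it with the locally held passphrase.  The URL,
# file names and cipher are assumptions, not viab's real scheme.
import subprocess
import urllib.request

KEY_URL = "http://install.example.ac.uk/keys/hostkey.pem.enc"   # hypothetical
PASSFILE = "/etc/viabkeys/passphrase"                           # assumed location

# pull the encrypted private key from the web
urllib.request.urlretrieve(KEY_URL, "/tmp/hostkey.pem.enc")

# decrypt it with the passphrase that was copied to this node by hand
subprocess.run(
    ["openssl", "enc", "-d", "-aes-256-cbc",
     "-pass", f"file:{PASSFILE}",
     "-in", "/tmp/hostkey.pem.enc",
     "-out", "/etc/grid-security/hostkey.pem"],
    check=True,
)
```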

viab April 2016

Vcycle
Vcycle is VAC running on OpenStack; not tested at Oxford. It could be useful if the central Advanced Research Computing cluster starts offering an OpenStack setup.
April 2016

Plan for GridPP5 CPU GridPP4+ h/w money spent on CPU. April 2016

CPU Upgrade, March 2016
Lenovo NeXtScale: 25 nodes, each with dual E5-2640 v3 CPUs & 64GB RAM. 800 new cores (new total ~2200).
April 2016
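As a cross-check of that core count (assuming, as the earlier job-slot figures suggest, that the 800 counts hyper-threaded cores):

```python
# Each E5-2640 v3 has 8 physical cores; with hyper-threading that is
# 16 threads per socket.  25 dual-socket nodes therefore give:
nodes = 25
sockets_per_node = 2
cores_per_socket = 8          # E5-2640 v3
threads_per_core = 2          # hyper-threading

physical_cores = nodes * sockets_per_node * cores_per_socket   # 400
logical_cores = physical_cores * threads_per_core              # 800
print(physical_cores, logical_cores)
```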

Can be part of a much bigger cluster

Plan for GridPP5 CPU
GridPP4+ h/w money spent on CPU. This rack is identical to the kit used by Oxford Advanced Research Computing. The initial plan is to plug it into our HTCondor batch system as WNs; when staff levels and time allow, we will investigate integrating the rack into the ARC cluster.
Initially stick with what we know, i.e. HTCondor & viab; we will have to work out how to integrate with a shared cluster later.
Decommission out-of-warranty hardware.
Should we move away from the standard grid middleware to viab?
April 2016

Reduction in Staff Count
Recent loss of staff shows we …
April 2016

Approx. FTE per Grid Task
HTCondor Batch (0.2): Monitoring and tuning in the batch system.
DPM Storage (0.15): Installation, testing, monitoring and maintenance of SPACETOKENS, and h/w.
Grid Ops & Management: Installation and management system (Cobbler, Puppet). GGUS tickets etc.
Ovirt VM infrastructure (0.1): SL-based oVirt system providing VM infrastructure for all service and test nodes.
IPv6 testing: perfSONAR early adopter and also ATLAS RIPE probes.
Security Team membership.
VAC testing.
Viab testing (0.05).
National Nagios infrastructure: GridPP-wide essential service.
VO Nagios for UK: GridPP-wide service.
Backup VOMS Server.
DPM on SL7 testing.
OpenStack Cloud test setup.
Total FTE ≈ 1.05
April 2016

Conclusions
A time of streamlining and rationalisation. We can continue as a useful site by sticking to core tasks. We need to make a decision on the storage question. There is still a lot of work to do to investigate integrating with university resources: will this be possible, and will it save time or allow bursting to greater resources? There are possible benefits in cost savings on h/w maintenance and electricity.
April 2016