RHUL Site Report
Govind Songara, Antonio Perez, Simon George, Barry Green, Tom Crane
HEP Sysman meeting @ RAL, June 2018

Manpower
Antonio - Tier-3/Tier-2 Sysadmin - 1 FTE.
Govind - Tier-2 Grid Admin - 0.5 FTE.
Simon - Site Manager.
Barry - Hardware/Network Specialist.
Tom - All-rounder.

Group Activities
ATLAS: benefits from strong collaboration in software support; large Tier-3 batch compute and storage resources for data analysis; DAQ test systems.
Dark Matter: detector development (lab DAQ systems); growing need for compute and storage resources to analyse data; needs help with things like software installation and data movement.
Accelerator: small DAQ systems; many small activities around the world which generate unique and valuable data sets; simulation, both embarrassingly parallel and multi-process (MPI) computing; software development infrastructure (e.g. a CDash server).
Theory: occasional use of the Tier-3 cluster; interest in MPI.

Tier-2
9 racks @ the Huntersdale site, upgraded from 8 to 9; the additional rack is off to the right of the photo.
The installation by Dell has been a struggle (wrong-length cables had to be redone, inability to change delivery location/instructions, labelling instructions not passed on).

Tier-2 Grid
9 racks in a modern machine room provided by the central IT service.
New kit: 10 x Dell PowerEdge C6420, Xeon(R) Gold 6148 CPU @ 2.40GHz, 40 cores; HS06 score per node 966 (SL6) / 1004 (CC7) with HT. The 10 new worker nodes add about 10 kHS06, bringing the total to about 48 kHS06 in 150 WN (see the rough check below).
Cream/Torque CE x 2; our intention is to move from Torque to ARC/HTCondor.
1.4 PB DPM (Disk Pool Manager) SE in 47 servers. Our capacity projection is to add more CPU power in the future instead of storage.
8 misc servers, including 3 VM hosts, running standard network services, provisioning and Grid services.
Network restructured, using stacking to solve some reliability problems (see next slides).
Storage pledge – Compute pledge – 1 more rack.
Vac-in-a-Box nodes updated to the latest VIAB version and CC7 (thanks to Andrew McNab); currently running on about 10 old nodes. The need for custom routing to put storage traffic on our private network is preventing us from expanding this.
0.5 FTE of Grid admin effort.
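A minimal back-of-the-envelope check of the capacity figures above, assuming the per-node CC7 HS06 score quoted on the slide; the implied previous capacity is inferred, not stated:

```python
# Rough sanity check of the Tier-2 capacity figures quoted above.
# Per-node scores are from the slide; the previous total is inferred.
new_nodes = 10             # Dell PowerEdge C6420 worker nodes added
hs06_per_node_cc7 = 1004   # CC7 HS06 score per node, with HT

added_hs06 = new_nodes * hs06_per_node_cc7   # ~10 kHS06 from the new kit
previous_hs06 = 48_000 - added_hs06          # inferred: ~38 kHS06 before the upgrade

print(f"New kit adds ~{added_hs06 / 1000:.1f} kHS06")
print(f"Total ~{(previous_hs06 + added_hs06) / 1000:.0f} kHS06 across 150 WN")
```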

Network design 2016
The network design from the last report. Cacti doesn't show any bottlenecks. Outages were caused by individual routers. The connection to the Tier-3 is slow.

Network design 2018
Moved from 2 switches trunked with 4 x 40 Gb/s links to a ring of 3 switches with 2 x 40 Gb/s links between them. This configuration prevents downtime from a single link failure and improves failover (see the comparison below). 10 new worker nodes added.
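For illustration only, a quick comparison of the two inter-switch layouts using the link counts and speeds quoted above; the redundancy argument is a sketch, not a measurement:

```python
# Illustrative comparison of the 2016 and 2018 inter-switch layouts,
# using only the link counts and speeds quoted on the slides.
link_speed_gbps = 40

# 2016: two switches joined by a single 4 x 40 Gb/s trunk.
trunk_2016_gbps = 4 * link_speed_gbps    # 160 Gb/s, but one path: losing it splits the network.

# 2018: three switches in a ring with 2 x 40 Gb/s between each pair.
ring_segment_gbps = 2 * link_speed_gbps  # 80 Gb/s per segment; if one segment fails,
                                         # traffic still reaches every switch the long
                                         # way around the ring.

print(f"2016 trunk: {trunk_2016_gbps} Gb/s, no redundant path")
print(f"2018 ring:  {ring_segment_gbps} Gb/s per segment, survives a single segment failure")
```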

Tier-2 Grid Issues
Some services were impacted during the network upgrade.
Long-term expansion is limited by rack space and cooling; old kit needs to be decommissioned to make room.
The need for custom routing to put storage traffic on our private network currently prevents us from expanding VIAB (see the sketch below).
Planning
CC7 deployment: deployment server, services, cluster, etc. (update all our CC6 machines to CC7).
CC7 upgrade of the DPM pool nodes.
IPv6 roll-out still ongoing.
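To illustrate the custom-routing point above, a minimal sketch of adding a static route so storage traffic leaves via the private interface; the subnet, gateway and interface name are hypothetical placeholders, not the actual RHUL addressing:

```python
# Minimal sketch of steering storage traffic onto a private network by
# adding a static route on a worker/VIAB node. All addresses and the
# interface name below are hypothetical placeholders.
import subprocess

STORAGE_SUBNET = "10.99.0.0/16"   # hypothetical DPM/storage subnet
PRIVATE_GATEWAY = "10.98.0.1"     # hypothetical gateway on the private VLAN
PRIVATE_IFACE = "eth1"            # hypothetical private-network interface

def add_storage_route() -> None:
    """Add a static route so traffic to the storage subnet uses the private interface."""
    subprocess.run(
        ["ip", "route", "add", STORAGE_SUBNET,
         "via", PRIVATE_GATEWAY, "dev", PRIVATE_IFACE],
        check=True,
    )

if __name__ == "__main__":
    add_storage_route()
```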

Tier-3
8 racks packed into a home-made machine room, a long way from the Tier-2, on a 1 Gb/s link.
6 kHS06 in ~100 old WN, mostly hand-me-downs from the Tier-2, some upgraded; Torque batch system.
Storage: 304 TB Hadoop using 90 WNs (up from 69 nodes and 80 TB); 262 TB NFS scratch over 7 servers (a further 80 TB added); 11 TB NFS home.
11 servers running standard network services, mainly as VMs (linappservs1, 2 and 3, squids, NAT, DB, CUPS).
Hadoop updated from 2.4.1 to 2.7.3; the upgrade hit some problems (see the next slide).
Network mapping with NetDisco and lldp on all hosts.

Tier-3 Issues
Aircon failures forced us to shut down the worker nodes and keep only critical services running; someone had to go to the university to leave doors open to avoid more damage. We are creating a shut-down script (a sketch follows below).
A Hadoop version mismatch between the namenode and the datanodes prevented the load balancer from working, so the upgrade of some nodes failed because the datanode could not start on a full filesystem.
An issue was found in the HTCondor setup: HTCondor was crashing only when running on AMD CPUs.
Planning
Migrate the batch system from Torque (PBS) to HTCondor.
Update the Hadoop test environment to version 3.
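As a starting point for the shut-down script mentioned above, a minimal sketch that polls a temperature sensor and powers off worker nodes above a threshold; the sensor path, threshold and node names are illustrative assumptions only:

```python
# Sketch of an aircon-failure shutdown script: if the measured temperature
# exceeds a threshold, power off non-critical worker nodes over SSH.
# The hwmon path, threshold and node names are illustrative assumptions.
import subprocess

TEMP_SENSOR = "/sys/class/hwmon/hwmon0/temp1_input"  # kernel hwmon, millidegrees C (assumed sensor)
SHUTDOWN_THRESHOLD_C = 35.0                          # assumed trip point
WORKER_NODES = ["wn001", "wn002", "wn003"]           # hypothetical node names

def room_temperature_c() -> float:
    """Read the temperature sensor (hwmon reports millidegrees Celsius)."""
    with open(TEMP_SENSOR) as f:
        return int(f.read().strip()) / 1000.0

def power_off(node: str) -> None:
    """Cleanly power off a worker node over SSH (assumes passwordless root access)."""
    subprocess.run(["ssh", node, "poweroff"], check=False)

if __name__ == "__main__":
    temp = room_temperature_c()
    if temp > SHUTDOWN_THRESHOLD_C:
        print(f"{temp:.1f} C > {SHUTDOWN_THRESHOLD_C} C: shutting down worker nodes")
        for node in WORKER_NODES:
            power_off(node)
```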

HPC
New HPC cluster located at Huntersdale.
1 xCAT node + 11 compute nodes: 20 x Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz and 62 GB RAM per node; 22 TB of storage.
Running mainly SLURM and EasyBuild under CentOS 7.2, using Salt as the configuration manager (see the sketch below).
CentOS 7 AD authentication setup is ongoing, with tests to link it to the Computer Centre AD in progress.
A big problem with a massive disk failure led to a system rebuild.
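A minimal sketch of re-applying the Salt configuration to the HPC nodes from the master; the 'hpc*' minion target is a hypothetical example, not the real naming scheme:

```python
# Minimal sketch of driving the Salt master to (re)apply the configured
# states to the HPC nodes. Run on the Salt master; 'hpc*' is a
# hypothetical minion-name pattern.
import subprocess

TARGET = "hpc*"  # hypothetical target pattern for the HPC compute nodes

def check_minions() -> None:
    """Verify the targeted minions respond before applying states."""
    subprocess.run(["salt", TARGET, "test.ping"], check=True)

def apply_states() -> None:
    """Apply all assigned Salt states (highstate) to the HPC nodes."""
    subprocess.run(["salt", TARGET, "state.apply"], check=True)

if __name__ == "__main__":
    check_minions()
    apply_states()
```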

Thank You