RAL Site Report
HEP SYSMAN June 2016 – RAL
Gareth Smith, STFC-RAL
With thanks to Martin Bly, STFC-RAL


Tier1 Hardware
CPU: ~140k HS06 (~14.8k cores)
– FY 15/16: additional ~106k HS06 in test
– E5-2640v3 and E5-2630v3 (Supermicro, HPE); 4 sleds in a 2U chassis; back to 1G NICs; 4GB per virtual core; 2 hard disks (RAID0) per server
– ~250 kHS06 in use in July
Storage: ~16.5 PB disk
– FY 15/16: additional ~13.3PB raw
– Ceph spec: 2 x CPU, 64GB RAM, 36 x 6TB HDD, SAS HBA, 2 x 2-port 10G NICs, 4-year warranty
Tape: 10k slot SL8500 (one of two in system)
– 44PB (T10KC/D)
– Migrations to D-only started: ATLAS (~6PB) ongoing, LHCb (~3PB) to follow – estimated 1 month/PB

Disk Servers
Castor disk servers – RAID 6 data array:
– Recent batches use LSI MegaRAID cards
– In the past (some years ago) we generally did not update firmware unless there was a problem. Now we make sure a new version is tested in preprod and update opportunistically – but there are not many "opportunities"
– Older servers seem to require firmware updates
– Not all vendors' disk drives show the same rate of failures with age
Newest disk servers (for Ceph) – without RAID cards:
– Need to rework Nagios monitoring, as we currently get disk status by monitoring the RAID cards
– New test found using SMART (smartmontools) – needs some work to tailor it to what we want (see the sketch below)
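By way of illustration only – a minimal sketch, not the RAL production check – the snippet below shows a Nagios-style plugin that wraps smartctl (from smartmontools) to test drive health on servers without a RAID controller. The device list is a made-up placeholder.

#!/usr/bin/env python
# Minimal sketch of a Nagios-style SMART probe for disk servers without a
# RAID card. Assumes smartmontools is installed and the script has enough
# privilege to run smartctl. Device list below is a placeholder.
import subprocess
import sys

DISKS = ["/dev/sda", "/dev/sdb"]            # hypothetical device list
OK, CRITICAL = 0, 2                         # Nagios exit codes

def smart_health(device):
    """Return True if smartctl reports overall health PASSED for the device."""
    try:
        out = subprocess.check_output(
            ["smartctl", "-H", device], stderr=subprocess.STDOUT
        ).decode()
    except subprocess.CalledProcessError as err:
        out = err.output.decode()           # smartctl sets non-zero exit bits
    return "PASSED" in out

def main():
    failed = [d for d in DISKS if not smart_health(d)]
    if failed:
        print("CRITICAL: SMART health check failed on %s" % ", ".join(failed))
        return CRITICAL
    print("OK: all %d disks report SMART health PASSED" % len(DISKS))
    return OK

if __name__ == "__main__":
    sys.exit(main())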

Tier1 Network

OPN Link – Last Month

Networking
Tier1 LAN – router plans:
– Phase 3: move the firewall bypass and OPN links to the Tier1 routers – scrapped; the routers don't do what we need
– Plan B: replace the ancient UKLR router
  – Stacked S4810P
  – Will provide a 40Gb/s pipe to the border
  – Direct connection to the Tier1 core
  – Landing point for the OPN
Expansion capability of the Tier1 network topology is limited – need to look at moving from an L2 to an L3 mesh or other architectures
More frequent saturation of the 10Gbit OPN link to CERN
– Investigating options

Services
Batch system (HTCondor)
– Developed a new method for draining worker nodes for multi-core jobs, enabling us to run pre-emptable jobs on the cores that would otherwise be idle (see the sketch below)
  – Running in production since late last year
  – Using this to run ATLAS event service jobs
Mesos
– Project to investigate management of services using Mesos, a container cluster manager
CVMFS
– Soon: larger storage for the Stratum-1 service (WLCG etc.)
FTS
– Two instances: production and "test" (used by ATLAS)
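As an illustration only – not the actual RAL draining implementation – the sketch below uses the HTCondor Python bindings to report how many cores are currently idle on draining worker nodes, i.e. the capacity that pre-emptable backfill jobs could occupy. It assumes the standard partitionable-slot configuration and startd attributes (Draining, PartitionableSlot, Cpus).

#!/usr/bin/env python
# Minimal sketch: count idle cores on draining worker nodes via the
# HTCondor Python bindings. Not the RAL production tooling.
import htcondor

def idle_cores_on_draining_nodes(pool=None):
    """Return {machine: idle_cores} for partitionable startds that are draining."""
    coll = htcondor.Collector(pool)          # None -> local pool
    ads = coll.query(
        htcondor.AdTypes.Startd,
        constraint='Draining =?= True && PartitionableSlot =?= True',
        projection=["Machine", "Cpus"],
    )
    idle = {}
    for ad in ads:
        # On a partitionable slot, "Cpus" is the number of unclaimed cores.
        idle[ad["Machine"]] = idle.get(ad["Machine"], 0) + int(ad.get("Cpus", 0))
    return idle

if __name__ == "__main__":
    for machine, cores in sorted(idle_cores_on_draining_nodes().items()):
        print("%-40s %3d idle cores" % (machine, cores))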

Services – Continued
LFC – still running (for non-LHC VOs)
WMS
– Just decommissioning one, leaving two production WMS instances in service

Castor
Current version:
– Stable running for the start of Run 2
– Major improvements in data throughput from disk thanks to scheduling optimisations
– OS (SL6) and Oracle version upgrades for the entire system
New version in test:
– Had a series of problems – tests ongoing
– Updating the (SL5) SRMs is delayed until this is done
Starting integration work with the new Echo (Ceph) service

Cloud and Ceph
Cloud
– Production service using OpenNebula, with storage based on Ceph
– Department and wider use within STFC, including LOFAR
– Infrastructure contribution from ISIS
Ceph
– Production-level service underpinning the cloud infrastructure
– Working towards deployment for large-scale science data storage
  – Based on SL7 and the Jewel release of Ceph
  – Authorisation code for the xrootd and GridFTP plugins being worked on (see the librados sketch below)
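For context only – a sketch, not the RAL Echo gateway code – the snippet below shows the basic librados pattern a gateway plugin uses to store and fetch objects in a Ceph cluster. The pool name, object name and client name are placeholders.

#!/usr/bin/env python
# Minimal librados sketch: write and read back one object in a Ceph pool.
# Assumes the python rados bindings, a reachable cluster, and a keyring for
# the named client. Pool/object/client names below are made up.
import rados

def demo(pool="echo-test", obj="hello.txt"):
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf", name="client.admin")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)     # I/O context bound to one pool
        try:
            ioctx.write_full(obj, b"hello from the gateway sketch\n")
            data = ioctx.read(obj)
            print("read back %d bytes: %r" % (len(data), data))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    demo()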

Virtualisation
Windows Hyper-V 2012 failover cluster is production ready
– Beginning the move to Windows Hyper-V 2012, but we see VMware as the longer-term solution
– 5 hosts, 130 logical cores, 1280GB RAM in total
– Tiered EqualLogic arrays (82TB total)
  – Eql-varray2: 62TB, 7.2K SATA
  – Eql-varray3: 22TB, 10K SAS

IPv6
10-system testbed, dual stack
– UI, Ceph gateway, squid, message broker, GOCDB testing, …
10 more systems to be added
– Frontier testing to begin
OPN router replacement should enable better IPv6 connectivity options
(a dual-stack connectivity sketch follows below)
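As an illustration only – not RAL's actual test harness – the sketch below checks that a dual-stack testbed host resolves and answers on both IPv4 and IPv6. The hostname and port are placeholders.

#!/usr/bin/env python
# Minimal dual-stack connectivity check for one host:port.
import socket

def dual_stack_status(host, port, timeout=5.0):
    """Return {'IPv4': bool, 'IPv6': bool} TCP connectivity for host:port."""
    status = {"IPv4": False, "IPv6": False}
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        label = labels.get(family)
        if label is None or status[label]:
            continue                          # already verified this family
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            sock.close()
            status[label] = True
        except socket.error:
            pass                              # this address did not answer
    return status

if __name__ == "__main__":
    # Placeholder hostname for a dual-stack squid/UI node.
    print(dual_stack_status("dualstack-test.example.org", 3128))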

Miscellany
Tier1 funded by the GridPP project
– Grant renewed for FY16/17 to FY19/20 (GridPP5)
Some internal reorganisation of staff groupings within the Systems Division of the Scientific Computing Department
– No effect on the Tier1 service
New mobile provider for STFC (UK Gov)
– Phased migration in progress, no problems
Old-style RAL addresses withdrawn – no longer work in most cases

Diamond Data
Diamond Light Source (DLS)
– The UK's national synchrotron science facility
SCD provides an archive for DLS data
– 1.1 billion files / 4.7PB
– Metadata is catalogued by ICAT (icatproject.org)
– Data stored within CASTOR at RAL
– Experimental data can be recalled from tape and downloaded at a later date from scientists' home institutions