
Tier1 Site Report HEPSysMan 30 June, 1 July 2011 Martin Bly, STFC-RAL

Overview
– RAL stuff
– Building stuff
– Tier1 stuff

RAL
Addressing:
– Removal of old-style addresses in favour of the cross-site standard (significant resistance to this)
– No change in the aim to remove old-style addresses, but...
– ...mostly via natural wastage as staff leave or retire
– Staff can ask to have their old-style address terminated
Exchange:
– Migration from Exchange 2003 to 2010 went successfully
– Much more robust, with automatic failover in several places
– Mac users happy: Exchange 2010 works directly with Mac Mail, so no need for Outlook clones
– Issue for the Exchange servers with MNLB and the switch infrastructure providing load balancing: needed very precise set-up instructions to avoid significant network problems

Building Stuff
UPS problems:
– Leading power factor due to switch-mode PSUs in the hardware
– Causes 3 kHz ‘ringing’ on the current on all phases (the 61st harmonic of the 50 Hz mains)
– Load is small (80 kW) compared to the capacity of the UPS (480 kVA)
– Most kit is stable, but EMC AX4-5 FC arrays unpredictably detect a supply failure and shut their arrays down
– Previous candidate solutions abandoned in favour of local isolation transformers in the feed from the room distribution to the in-rack distribution: works!
– (a quick arithmetic check of the figures above follows this slide)
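As a sanity check on the numbers quoted above, a minimal Python sketch; pure arithmetic, no measurement involved, and the load comparison simply divides kW by kVA, i.e. it ignores power factor:

# Quick check of the figures quoted on this slide: the ringing frequency as a
# harmonic of the 50 Hz mains, and how lightly loaded the UPS is.
mains_hz, harmonic = 50, 61
load_kw, ups_kva = 80, 480

print(f"{harmonic}st harmonic: {mains_hz * harmonic} Hz (~3 kHz)")   # 3050 Hz
print(f"Load fraction: {100 * load_kw / ups_kva:.0f}% of capacity")  # ~17%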

Tier1
New structure within e-Science:
– Castor team moved into the Data Services group under Dave Corney
– Other Tier1 teams (Fabric, Services, Production) under Andrew Sansum
Some staff changes:
– James Thorne, Matt Hodges and Richard Hellier left
– Jonathan Wheeler passed away
– Derek Ross moved to SCT on secondment
– Recruiting replacements

Networking
Site:
– Sporadic packet loss (a few %) in the site core networking: began in December and got steadily worse
– Impacted connections to FTS control channels, LFC and other services; data via the LHCOPN was not affected other than by control failures
– Traced to traffic-shaping rules used in the firewall to limit bandwidth for the site's commercial tenants, which were (unintentionally) being inherited by other network segments
– Fixed by removing the shaping rules and using a hardware bandwidth limiter
– Currently a hardware issue in the link between the SAR and the firewall is causing packet loss; hardware intervention Tuesday next week to fix it
– (a minimal packet-loss probe sketch follows this slide)
LAN:
– Issue with a stack causing some ports to block access to some IP addresses: one of the stacking ports on the base switch was faulty
– Several failed 10GbE XFP transceivers
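To illustrate the kind of measurement behind the "few %" figure, a minimal sketch that pings a handful of service endpoints and parses the loss percentage from the Linux ping summary. The hostnames are placeholders rather than the real RAL services, and this is an illustration, not the monitoring actually used on site.

# Minimal packet-loss probe: send a burst of ICMP pings to each endpoint and
# report the loss percentage from ping's summary line (assumes Linux ping).
import re
import subprocess

ENDPOINTS = ["fts.example.org", "lfc.example.org"]   # hypothetical endpoints
COUNT = 100                                          # pings per burst

def loss_percent(host, count=COUNT):
    """Run ping and parse the 'X% packet loss' figure from its summary."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True, text=True
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    for host in ENDPOINTS:
        print(host, loss_percent(host), "% loss")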

Networking II
Looking at the structure of the Tier1 network:
– Core with big chassis switches, or
– Mesh with many top-of-rack switches
– Want to make use of the 40GbE capability in the new 10GbE switches
– Move to disk servers and virtualisation servers on 10GbE as standard
Site core network upgrades approved:
– New core structure with 100GbE backbones and 10/40GbE connectivity available
– Planned over the next few years

‘Pete Facts’
– Tier1 is a subnet of the RAL /16 network
– Two overlaid /21 subnets
– A third overlaid /22 subnet for the Facilities Data Service, to be physically split off later as traffic increases
– Monitoring: Cacti with weathermaps (see the link-utilisation sketch after this slide)
– Site SJ5 link: 20Gb/s + 20Gb/s failover, direct to the SJ5 core, two routes (Reading, London)
– T1 OPN link: 10Gb/s + 10Gb/s failover, two routes
– T1 core: 10GbE
– T1 SJ5 bypass: 10Gb/s
– T1 to PPD-T2: 10GbE
– Limited by line speeds and who else needs the bandwidth
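The weathermaps boil down to per-link utilisation derived from interface byte counters; a minimal sketch of that arithmetic (not Cacti itself, and the example numbers are illustrative only):

# Sketch of the arithmetic behind a Cacti-style weathermap link colour:
# given two readings of an interface byte counter, derive the average rate
# and express it as a percentage of the link capacity.
LINK_CAPACITY_GBPS = 20.0   # e.g. the 20Gb/s SJ5 link

def utilisation(bytes_t0, bytes_t1, interval_s, capacity_gbps=LINK_CAPACITY_GBPS):
    """Average utilisation over the polling interval, as (Gb/s, percent)."""
    bits = (bytes_t1 - bytes_t0) * 8          # bytes -> bits
    gbps = bits / interval_s / 1e9            # average rate in Gb/s
    return gbps, 100.0 * gbps / capacity_gbps

# Example: a 300 s poll that saw 450 GB transferred -> ~12 Gb/s, ~60%
print(utilisation(0, 450e9, 300))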


FY 10/11 Procurements
Summary of previous report(s):
– 36 x SuperMicro 4U 24-bay chassis with 2TB SATA HDDs (10GbE)
– 13 x SuperMicro Twin²: 2 x X5650, 4GB/core, 2 x 1TB HDD
– 13 x Dell C6100: 2 x X5650, 4GB/core, 2 x 1TB HDD
– Castor (Oracle) database server refresh: 13 x Dell R610
– Castor head nodes: 16 x Dell R410
– Virtualisation: 6 x Dell R510, 12 x 300GB SAS, 24GB RAM, 2 x E5640
New since November:
– 13 x Dell R610 tape servers (10GbE) for T10KC drives
– 14 x T10KC tape drives
– Arista 7124S 24-port 10GbE switch + twinax copper interconnects
– 5 x Avaya 5650 switches + various 10/100/1000 switches

Tier1 Hardware Summary
Batch:
– ~65,000 HS06 from ~6300 cores (roughly 10 HS06 per core; a quick derivation follows this slide)
– 750 systems (2, 4 or 6 cores per chip, 2 chips per system)
– 2, 3 or 4 GB RAM per core
– Typically at least 50GB of disk per core, some systems with two disks
– 1GbE per system
Storage:
– ~8000TB in ~500+ servers
– 6, 9, 18, 38 or 40TB(ish) per server
– 1GbE, with the 2010 generation on 10GbE
– 10,000-slot tape library; 500GB, 1TB or 5TB per cartridge
Network:
– Force10 C300 switch(es) in the core
– Stacks of Avaya (Nortel) 55xx and 56xx
– Arista and Fujitsu 10GbE switches
Services:
– Mix of mostly old IBM and Transtec, mid-age SuperMicro twins and Transtec, and newer Dell systems
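For a rough sense of scale, the per-unit figures implied by the totals above; the inputs are the approximate numbers from this slide, so the outputs are approximate too:

# Back-of-the-envelope figures derived from the quoted totals.
hs06_total, cores, batch_systems = 65_000, 6_300, 750
storage_tb, storage_servers = 8_000, 500

print(f"HS06 per core:      {hs06_total / cores:.1f}")            # ~10.3
print(f"Cores per system:   {cores / batch_systems:.1f}")         # ~8.4
print(f"TB per disk server: {storage_tb / storage_servers:.0f}")  # ~16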

Castor Status
Stats as of June 2011:
– 16 million files; used/total capacities: 3.2PB/7PB on tape and 3.7PB/7.5PB on disk
Recent news:
– Major upgrade (2.1.9) during late 2010, which brought us checksums for all files, xrootd support and proper integrated disk-server draining (a checksum sketch follows this slide)
– Minor upgrade ( ) during February 2011 with bugfixes
– Minor upgrade ( ) next week, which brings us T10KC support
– New (non-Tier1) production instance for the Diamond synchrotron, part of a complete new Facilities Data Service providing transparent data aggregation (StorageD), a metadata service (ICAT) and a web frontend to access the data (TopCAT)
Coming up:
– Move to new database hardware and a more resilient architecture (using Data Guard) later this year for the Tier-1 databases
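The checksum support means a checksum is recorded for every file. As a minimal illustration (not CASTOR's actual implementation), a sketch that streams a file and computes an Adler-32 checksum, Adler-32 being assumed here as the type commonly used for grid storage:

# Stream a file in 1 MiB chunks and return its Adler-32 as hex; swap in
# hashlib if a different checksum algorithm is required.
import zlib

def adler32_of(path, chunk_size=1 << 20):
    value = 1  # zlib's documented starting value for a running adler32
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"

# Example (hypothetical path):
# print(adler32_of("/some/disk/server/file"))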

Storage Issues
One of the two batches of FY09/10 capacity storage failed acceptance testing: 60/98 servers (~2.2PB)
– Cards swapped (LSI -> Adaptec)
– Released for production use
SL08 batch failing in production over an extended period:
– Single drive throws cause the array to lock up and crash (array and data loss)
– Whole batch (50/110) rotated out of production (data migrated)
– Updated the Areca firmware, recreated the arrays from scratch, new file systems, etc.
– Subjected to aggressive acceptance tests (a sketch of such a test follows this slide): passed, but...
– ...an issue remains with the controller crashing the ports of failed drives in ~50% of cases; seeking a fix from Areca
– Available for production in D0T1 service classes, accepting that some drive throws will need a reboot to see the new disks
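One strand of an aggressive acceptance test, sketched under stated assumptions: fill the array under test with large pseudo-random files, read everything back and verify the digests, repeating for long enough to shake out controller and drive faults. The mount point, file sizes and counts are illustrative, and this is not the Tier1's actual test suite.

import hashlib
import os

TEST_DIR = "/mnt/array_under_test/burnin"   # hypothetical mount point
FILE_SIZE = 1 << 30                         # 1 GiB per test file
CHUNK = 1 << 20                             # 1 MiB I/O size

def write_file(path, size=FILE_SIZE):
    """Write pseudo-random data and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            chunk = os.urandom(min(CHUNK, remaining))
            f.write(chunk)
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()

def verify_file(path):
    """Re-read the file and return the digest actually on disk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    os.makedirs(TEST_DIR, exist_ok=True)
    for i in range(8):                      # one pass; loop for days in practice
        path = os.path.join(TEST_DIR, f"burnin_{i:03d}.dat")
        expected = write_file(path)
        assert verify_file(path) == expected, f"data corruption on {path}"
    print("pass complete")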

Virtualisation
– Evaluated MS Hyper-V (inspired by CERN's successes) as the services virtualisation platform
– Offers sophisticated management, failover etc. without the punitive cost of VMware
– However, as Linux admins it is sometimes hard to know whether problems are due to ignorance of the MS world
– Struggled for a long time with Infortrend iSCSI storage arrays (and poor support); abandoned them recently and the problems seem resolved
– Evaluating Dell EqualLogic units on loan
– Have learnt a lot about administering Windows servers...
– Ready to implement the production platform for local-storage hypervisors: 6 x Dell R510, 12 x 300GB SAS, 24GB RAM, 2 x E5640 for local-storage HVs (plus 14 x Dell R410, 4 x 1TB SATA, 24GB RAM for shared-storage HVs)

Projects
Quattor:
– Batch and storage systems under Quattor management: ~6200 cores, 700+ systems (batch), 500+ systems (storage)
– Significant time saving
– Significant rollout on Grid services node types
CernVM-FS:
– Major deployment at RAL to cope with software distribution issues
– Details in the talk by Ian Collier (next!)
Databases:
– Students working on enhancements to the hardware database infrastructure

Questions?