Tier1 Site Report HEPSysMan, RAL 10-11 June 2010 Martin Bly, STFC-RAL.


Tier1 Site Report HEPSysMan, RAL June 2010 Martin Bly, STFC-RAL

10 June 2010 Tier1 Site Report - HEPSysMan June 2010

Overview
RAL Stuff
Building stuff
Tier1 Stuff

STFC Developments
Shared Services Centre
  – Provides common services to all the Research Councils: HR, Payroll, Procurement, Appraisal...
  – Aim is to provide savings over ‘doing it ourselves’ and by adopting common standards
STFC HR, payroll, procurement now running via SSC
  – Mostly positive experience for procurement
  – Program underway to look at ‘IT’ procurement needs and processes
      Science computing difficult to fit into this model
  – Framework agreements being set up
      Speedier procurement of small batches
      Wider selection of types of IT kit available, not just servers
      But may limit choice
      Need to track competitiveness against outsiders
  – Do other sites use frameworks and is it a benefit?

RAL Site
Plan to remove support for old-style addresses in favour of the cross-site standard - still under
  – significant resistance to this
  – no date as yet
International Space Innovation Centre
  “A new environment where government, universities and industry can work together to better exploit scientific opportunities and create new technologies, applications and IP that exploits UK, European and Global Space Programmes”
Will focus on three key areas initially:
  – Understanding and countering Climate Change
  – Ensuring the Security of space systems and services
  – Exploiting the data generated by Earth Observation satellites

Building Stuff
UPS problems
  – Leading power factor due to switch-mode PSUs in systems
  – Causes 3kHz ringing on current, all phases
      Load is small (80kW) compared to capacity of UPS (480kVA)
  – Most kit stable but EMC AX4-5 FC arrays unpredictably detect supply failure and shut down arrays
  – Longer feed cable from UPS to PDU has made 50% reduction in distortion
  – Proposed long-term solution: isolation transformer
New chillers + pumps
  – Additional 1500kW cooling capacity: 4 chillers in 3+1 redundant configuration to provide a total of 2250kW
  – Additional and replacement higher capacity pumps fitted
Dust in machine rooms
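The chiller figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that all four chillers are identical and that "3+1" means three duty units with one held in reserve; the function name and the derived per-unit rating are illustrative and do not come from the slides.

```python
def n_plus_one_capacity(units: int, spares: int, unit_kw: float) -> float:
    """Usable cooling capacity when `spares` units are held in reserve."""
    if spares >= units:
        raise ValueError("need at least one duty unit")
    return (units - spares) * unit_kw


# ASSUMPTION: four identical chillers, so 2250kW usable over three
# duty units implies 750kW per chiller.
UNIT_KW = 2250 / 3

usable_kw = n_plus_one_capacity(units=4, spares=1, unit_kw=UNIT_KW)
installed_kw = 4 * UNIT_KW
# The slide quotes "additional 1500kW", which implies the earlier capacity.
previous_kw = usable_kw - 1500

print(f"per chiller: {UNIT_KW:.0f}kW")
print(f"usable (3+1): {usable_kw:.0f}kW, installed: {installed_kw:.0f}kW")
print(f"implied capacity before the upgrade: {previous_kw:.0f}kW")
```

Under these assumptions the loss of any single chiller leaves the full 2250kW available, which is the point of the 3+1 arrangement.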


FY 09/10 Procurements
98 x SuperMicro 4U/24-bay chassis with 2TB SATA HDD
  – RAID6 (40TB) or RAID6+1 (38TB) on Adaptec or LSI controllers
  – ~3.8PB useable, Site => ~8PB
21 x SuperMicro Twin², 2 x E5520, 3GB/core, 1T HDD
21 x HP SL up chassis, 2 x E5520, 3GB/core, 1T HDD
  – ~15,500 HS06, Site => ~49,700 HS06, ~5260 cores
FC storage arrays: 7 x Infortrend
iSCSI arrays: 2 x Infortrend
40+ Dell R410 servers
  – Tape servers, virtualisation and services hosts
5 HP DL360 database servers with FC
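The disk-server capacities quoted above are consistent with a simple RAID6 calculation. The sketch below is only a plausibility check under stated assumptions: that 22 of the 24 bays hold array disks (for example, with two bays used for system disks), that "RAID6+1" means RAID6 plus one hot spare, and that capacities are in decimal TB. None of these is stated in the slides.

```python
DISK_TB = 2       # from the slide: 2TB SATA HDD
SERVERS = 98      # from the slide
ARRAY_BAYS = 22   # ASSUMPTION: 24 bays minus 2 non-array disks


def raid6_usable_tb(disks: int, hot_spares: int = 0, disk_tb: int = DISK_TB) -> int:
    """RAID6 gives up two disks' worth of space to parity; spares hold no data."""
    members = disks - hot_spares
    if members < 4:
        raise ValueError("RAID6 needs at least 4 member disks")
    return (members - 2) * disk_tb


raid6_tb = raid6_usable_tb(ARRAY_BAYS)                      # plain RAID6
raid6_spare_tb = raid6_usable_tb(ARRAY_BAYS, hot_spares=1)  # RAID6 + hot spare

# Range for the whole purchase, depending on which layout each server uses.
low_pb = SERVERS * raid6_spare_tb / 1000
high_pb = SERVERS * raid6_tb / 1000

print(f"RAID6: {raid6_tb}TB, RAID6+1: {raid6_spare_tb}TB per server")
print(f"{SERVERS} servers: {low_pb:.2f}PB to {high_pb:.2f}PB")
```

With these assumptions the per-server figures match the 40TB and 38TB on the slide, and the total brackets the quoted ~3.8PB.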

Networking
Additional Force10 C300
  – Twin for existing C300
  – ‘Spares’, testing
  – To be configured as a resilient routing pair
WAN developments
  – Existing 10Gb/s site link to SuperJanet5 being doubled to 20Gb/s
      Failover capacity maintained
  – LHCOPN failover 10Gb/s
  – Both being commissioned
LAN developments
  – RAL campus network-futures plan in development


Storage Commissioning I
One of two batches of the FY08/09 capacity storage failed acceptance testing: 60/110 servers (~1.2PB)
  – Rapid onset of failure mode
      Supplier involved
  – Significant testing with various ‘solutions’ unsuccessful
  – Escalated to HDD OEM
  – Additional testing program started
      HDD FW variants
  – Unable to identify cause of problems
      Supplier/OEM agree swap out of disks
  – Combination clearly doesn’t work despite compatibility list!

Storage Commissioning II
One of two batches of FY09/10 capacity storage delayed in proving tests: 60/98 servers (~2.3PB)
  – Multiple drive throws
  – Disks tested show no issues
Supplier and OEMs involved
New firmware for both controllers and drives
  – Flash in-situ
Successful proving test
Now in acceptance testing

Drive Failure Rates

Benchmarks

Fabric Management
Deployment with QUATTOR
Already Deployed: Worker nodes, Batch server, UK-BDII, LHCb VO box
In Progress now: Disk Servers, CEs, LFC, Local BDII
Plan: All system types Quattorised by December 2010
Final catch up rollout during end of year stop

Grid Services
SL4 to SL5 migration for batch workers
FroNTier deployment
LFC cleanup
FTS upgrade (to version 2.2.3)
glexec deployment, SCAS/pilot roles in progress.
Updates to hardware used for Top-BDII quintet.
Other improvements in resilience e.g. enabling hot-swappable disks on WMS…

Castor
Done:
  – Nameserver upgrade to (September)
  – SRM update to (October)
  – Database disk array problem and subsequent rewinding of database. (October)
  – Oracle database quarterly patching (November)
  – Move back to original disk arrays (January)
  – Update memory in database nodes (February)
To come:
  – New database architecture and hardware
  – Updates to Castor (2.1.8/2.1.9)

Questions?