Tier-1 Status Andrew Sansum GRIDPP18 21 March 2007

Staff Changes
Steve Traylen left in September.
Three new Tier-1 staff:
–Lex Holt (Fabric Team)
–James Thorne (Fabric Team)
–James Adams (Fabric Team)
One EGEE-funded post to operate a PPS (and work on integration with NGS):
–Marian Klein

Team Organisation
–Grid Services (Grid/Support): Ross, Condurache, Hodges, Klein (PPS), one vacancy
–Fabric (H/W and OS): Bly (team leader), Wheeler, Holt, Thorne, White (OS support), Adams (HW support)
–CASTOR SW/Robot: Corney (GL), Strong (Service Manager), Folkes (HW Manager), deWitt, Jensen, Kruk, Ketley, Bonnet (2.5 FTE effort)
–Machine Room operations
–Networking Support
–Database Support (Brown)
–Project Management (Sansum/Gordon/(Kelsey))

Hardware Deployment - CPU
64 dual-core/dual-CPU Intel Woodcrest 5130 systems delivered in November (about 550 KSI2K).
Completed acceptance tests over Christmas and went into production in mid January.
CPU farm capacity now (approximately; a rough cross-check of these figures is sketched below):
–600 systems
–1250 cores
–1500 KSI2K
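As a quick cross-check of those figures (a back-of-the-envelope sketch using only the numbers quoted above):

```python
# Rough cross-check of the quoted capacity numbers (uses only the slide's own figures).
new_systems = 64
cores_per_system = 2 * 2                       # dual-CPU, dual-core Woodcrest 5130
new_cores = new_systems * cores_per_system     # 256 cores in the new purchase
print(550 / new_cores)                         # ~2.1 KSI2K per new core

farm_cores, farm_ksi2k = 1250, 1500            # whole-farm totals quoted above
print(farm_ksi2k / farm_cores)                 # ~1.2 KSI2K per core averaged over older kit
```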

Hardware Deployment - Disk
2006 was a difficult year with deployment hold-ups:
–March 2006 delivery: 21 servers, Areca RAID controller, 24*400GB WD (RE2) drives. Available: January 2007.
–November 2006 delivery: 47 servers, 3Ware RAID controller, 16*500GB WD (RE2). Accepted February 2007 (but still deploying to CASTOR).
–January 2007 delivery: 39 servers, 3Ware RAID controller, 16*500GB WD (RE2). Accepted March 2007; ready to deploy to CASTOR.

Disk Deployment - Issues
March 2006 (Clustervision) delivery:
–Originally delivered with 400GB WD400YR drives.
–Many drive ejects under normal load test (had worked OK when we tested in January).
–Drive specification found to have changed – compatibility problems with the RAID controller (despite the drive being listed as compatible).
–Various firmware fixes tried – improvements, but not fixed.
–August 2006: WD offer to replace with the 500YS drive.
–September 2006: load tests of the new configuration begin to show occasional (but unacceptably frequent) drive ejects (a different problem).
–Major diagnostic effort by Western Digital; Clustervision also trying various fixes. Lots of theories – vibration, EM noise, protocol incompatibility – and various fixes tried (slow going as the failure rate was quite low).
–Fault hard to trace, but eventually traced (early December) to faulty firmware.
–Firmware updated; load test shows the problem fixed (mid December). Load test completes in early January and deployment begins.

Disk Deployment - Cause
Western Digital worked at two sites with logic analysers on the SATA interconnect. The fault was eventually traced to a missing return in the drive firmware (illustrated in the sketch below):
–If the drive head stays too long in one place, it repositions to allow lubricant to migrate.
–The problem only shows up under certain work patterns.
–With no return following the reposition, the command never completes, and 8 seconds later the controller ejects the drive.
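A purely illustrative sketch of that failure mode (all names are invented; the real Western Digital firmware is proprietary and was only characterised externally):

```python
"""Illustration of the 'missing return' firmware bug described above.

Hypothetical sketch only: invented names, not the real WD firmware.
"""
import time

CONTROLLER_TIMEOUT_S = 8  # controller ejects a drive whose command never completes


def head_idle_too_long() -> bool:
    return True  # the work pattern that triggers the preventive reposition


def reposition_head() -> None:
    print("repositioning head so lubricant can migrate")


def service_command(cmd: str, buggy: bool) -> None:
    if head_idle_too_long():
        reposition_head()
        if buggy:
            # Missing return to the normal command path: the command is never
            # executed or acknowledged, the controller sees no completion and
            # ejects the drive roughly CONTROLLER_TIMEOUT_S seconds later.
            time.sleep(CONTROLLER_TIMEOUT_S)
            print("controller: no completion seen -> drive ejected")
            return
    print(f"executing and acknowledging {cmd!r}")  # fixed firmware always reaches this


if __name__ == "__main__":
    service_command("READ LBA 0x1000", buggy=True)
```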

Disk Deployment
Servers and capacity (TB) deployed, by month:
–January: …
–February: …
–March: …
–Total: 138 servers, 750 TB

Hardware Deployment - Tape
SL8500 tape robot upgraded to … slots in August.
GRIDPP bought 3 additional T10K tape drives in February 2007 (now 6 drives owned by GRIDPP).
Further purchase of 350TB of tape media in February.
Total tape capacity now … TB (but not all immediately allocated – some to assist the CASTOR migration, some needed for CASTOR operations).

Hardware Deployment - Network
10Gb line from CERN available since August 2006.
RAL was scheduled to attach to the Thames Valley Network (TVN) at 10Gb by November 2006:
–Change of plan in November – I/O rates from the Tier-1 already visible to UKERNA. Decided to connect the Tier-1 by a resilient 10Gb connection direct into the SJ5 core (planned mid Q1).
–Connection delayed, but now scheduled for the end of March.
GRIDPP load tests identified several issues at the RAL firewall. These are resolved, but the plan is now to bypass the firewall for SRM traffic from SJ5.
A number of internal Tier-1 topology changes while we enhanced the LAN backbone to 10Gb in preparation for SJ5.

Tier-1 LAN
[Diagram of the RAL site network: OPN router carrying the 10Gb/s link to CERN, site Router A with a 1Gb/s link to SJ4, stacks of 5530 switches connecting the CPU and disk racks (N x 1Gb/s uplinks), ADS caches, Oracle systems and the RAL Tier 2.]

New Machine Room
Tender underway; planned completion in August.
… m² can accommodate 300 racks + 5 robots.
2.3MW power/cooling capacity (some UPS).
Office accommodation for all e-Science staff.
Combined Heat and Power generation (CHP) on site.
Not all for GRIDPP (but you get most)!

Tier-1 Capacity delivered to WLCG (2006)

Last 12 Months CPU Occupancy
[Chart of CPU occupancy over the last 12 months, annotated with capacity additions: +260 KSI2K in May 2006 and a further addition (the new Woodcrest capacity) in January 2007.]

Recent CPU Occupancy (4 weeks)
[Chart, annotated: air-conditioning work (300 KSI2K offline).]

CPU Efficiencies

–CMS merge jobs hang on CASTOR
–ATLAS/LHCb jobs hanging on dCache
–BaBar jobs running slow – reason unknown

3D Service
Used by ATLAS and LHCb to distribute conditions data via Oracle Streams.
RAL was one of 5 sites that deployed a production service during Phase I.
Small SAN cluster – 4 nodes, 1 Fibre Channel RAID array.
RAL takes a leading role in the project.

Reliability
Reliability matters to the experiments:
–Use the SAM monitoring to identify priority areas
–Also worrying about job loss rates
Priority at RAL is to improve reliability:
–Fix the faults that degrade our SAM availability
–New exception monitoring and automation system based on Nagios (see the sketch below)
Reliability is improving, but the work feels like an endless treadmill: fix one fault and find a new one.
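A minimal sketch of the kind of check plugin such a Nagios-based system runs. The exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) is the standard Nagios plugin protocol; the specific check shown here (load average) and its thresholds are only illustrative, not necessarily one of the checks RAL deployed:

```python
#!/usr/bin/env python
"""Illustrative Nagios-style check plugin: alert on high load average."""
import os
import sys

WARN = 8.0    # hypothetical thresholds
CRIT = 16.0


def main() -> int:
    try:
        load1, load5, load15 = os.getloadavg()
    except OSError:
        print("LOAD UNKNOWN - could not read load average")
        return 3
    status = f"load average: {load1:.2f}, {load5:.2f}, {load15:.2f}"
    if load5 >= CRIT:
        print(f"LOAD CRITICAL - {status}")
        return 2
    if load5 >= WARN:
        print(f"LOAD WARNING - {status}")
        return 1
    print(f"LOAD OK - {status}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```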

Reliability - CE
Split the PBS server and CE a long time ago.
Split the CE and local BDII.
Site BDII times out on the CE info provider:
–CPU usage very high on the CE – info provider starved
–Upgraded the CE to 2 cores
Site BDII still times out on the CE info provider:
–CE system disk I/O bound
–Reduced load (changed backups etc.)
–Finally replaced the system drive with a faster model

CE Load

Job Scheduling
SAM jobs failing to be scheduled by MAUI:
–SAM tests run under the operations VO but share a gid with dteam. dteam had used all its resource, so MAUI started no more jobs.
–Changed scheduling to favour the ops VO (long-term plan to split ops and dteam).
PBS server hanging after communications problems:
–A job stuck in the pending state jams the whole batch system (no jobs start – site unavailable!).
–Auto-detect the state of pending jobs and hold them (see the sketch below) – remaining jobs start and availability is good.
–But held jobs now impact the ETT and we receive less work from the RB – we have to delete the held jobs.
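A minimal sketch of the sort of automation described above. The exact detection rule RAL used is not given on the slide; this version simply holds jobs that have sat in the queued state longer than a threshold, using the standard Torque/PBS `qstat -f` and `qhold` commands:

```python
#!/usr/bin/env python
"""Hold batch jobs stuck in the queued state for too long (illustrative sketch)."""
import subprocess
import time

MAX_QUEUED_SECONDS = 6 * 3600   # hypothetical threshold


def stuck_jobs():
    """Yield job ids that have sat in state Q longer than the threshold."""
    out = subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout
    job_id, state, qtime = None, None, None
    for line in out.splitlines() + ["Job Id:"]:        # sentinel flushes the last record
        line = line.strip()
        if line.startswith("Job Id:"):
            if job_id and state == "Q" and qtime:
                queued = time.time() - time.mktime(
                    time.strptime(qtime, "%a %b %d %H:%M:%S %Y"))
                if queued > MAX_QUEUED_SECONDS:
                    yield job_id
            job_id = line.split(":", 1)[1].strip()
            state, qtime = None, None
        elif line.startswith("job_state ="):
            state = line.split("=", 1)[1].strip()
        elif line.startswith("qtime ="):
            qtime = line.split("=", 1)[1].strip()


if __name__ == "__main__":
    for jid in stuck_jobs():
        print(f"holding stuck job {jid}")
        subprocess.run(["qhold", jid])
```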

Jobs de-queued at CE
Jobs reach the CE and are successfully submitted to the scheduler, but shortly afterwards the CE decides to de-queue the job:
–Only impacts SAM monitoring occasionally
–May be impacting users more than SAM, but we cannot tell from our logs
–Logged a GGUS ticket but no resolution

RB
RB running very busy for extended periods during the summer:
–Second RB (rb02) added in early November, but there is no transparent way of advertising it – UIs need to be configured manually (see the GRIDPP wiki).
Jobs found to abort on rb01, linked to the size of its database:
–Database needed cleaning (was over 8GB).
Job cancels may (but not reproducibly) break the RB (it can go 100% CPU bound) – no fix on this ticket.

RB Load
[Chart of RB load, annotated: rb02 deployed; drained to fix hardware; rb02 high CPU load.]

Top Level BDII
Top level BDII not reliably responding to queries:
–Query rate too high
–UK sites failing SAM tests for extended periods
Upgraded the BDII to two servers on a DNS round robin:
–Sites still occasionally fail SAM tests
Upgraded the BDII to 3 servers (last Friday):
–We hope the problem is fixed – please report timeouts (a simple probe is sketched below).
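A minimal sketch of how one might probe every instance behind the round-robin alias individually. The alias name is hypothetical; port 2170 and the `o=grid` base are the usual top-level BDII conventions, and the OpenLDAP `ldapsearch` client is assumed to be on the PATH:

```python
#!/usr/bin/env python
"""Probe every instance behind a round-robin top-level BDII alias (illustrative)."""
import socket
import subprocess

BDII_ALIAS = "lcg-bdii.example.ac.uk"   # hypothetical round-robin alias
PORT = 2170
TIMEOUT = 15                            # seconds before we call it a timeout


def instances(alias):
    """Return the unique IP addresses behind the DNS round robin."""
    infos = socket.getaddrinfo(alias, PORT, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(ip):
    """Run a minimal LDAP query against one BDII instance; True if it answers."""
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", f"ldap://{ip}:{PORT}", "-b", "o=grid",
           "-s", "base", "objectClass"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=TIMEOUT)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    for ip in instances(BDII_ALIAS):
        print(f"{ip}: {'OK' if probe(ip) else 'TIMEOUT / ERROR'}")
```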

FTS
Reasonably reliable service:
–Based on a single server
–Monitoring and automation to watch for problems
At the next upgrade (soon) we move from a single server to two pairs:
–One pair will handle the transfer agents
–One pair will handle the web front end

dCache
Problems with gridftp doors hanging:
–Partly helped by changes to network tuning
–But still impacts SAM tests (and experiments)
Decided to move the SAM CE replica-manager test from dCache to CASTOR (a cynical manoeuvre to help SAM).
Had hoped this month's upgrade to version 1.7 would resolve the problem:
–It didn't help
–Have now upgraded all gridftp doors to Java 1.5 – no hangs since the upgrade last Thursday (a simple door probe is sketched below)
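A minimal sketch of a door liveness probe. The hostnames are hypothetical; port 2811 is the standard GSIFTP control port, and like plain FTP a healthy door is expected to send a "220" greeting once the TCP connection is up, so a door that accepts the connection but stays silent is treated as hung:

```python
#!/usr/bin/env python
"""Check that dCache gridftp doors answer on their control channel (illustrative)."""
import socket

DOORS = ["gftp01.example.ac.uk", "gftp02.example.ac.uk"]   # hypothetical door hosts
PORT = 2811
TIMEOUT = 10   # seconds to wait for the greeting before declaring a hang


def door_ok(host):
    """True if the door accepts a connection and sends its 220 greeting in time."""
    try:
        with socket.create_connection((host, PORT), timeout=TIMEOUT) as sock:
            sock.settimeout(TIMEOUT)
            banner = sock.recv(128).decode(errors="replace")
            return banner.startswith("220")
    except (OSError, socket.timeout):
        return False


if __name__ == "__main__":
    for door in DOORS:
        print(f"{door}: {'OK' if door_ok(door) else 'HUNG or unreachable'}")
```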

SAM Availability

CASTOR
Autumn/Winter 2005:
–Decided to migrate the tape service to CASTOR.
–Decision that CASTOR will eventually replace dCache for disk pool management – CASTOR2 deployment starts.
Spring/Summer 2006: major effort to deploy and understand CASTOR:
–Difficult to establish a stable pre-production service.
–Upgrades extremely difficult to make work – test service down for weeks at a time following an upgrade or patching.
September 2006:
–Originally planned to have a full production service by now.
–Eventually, after heroic effort, the CASTOR team established a pre-production service for CSA06.
October 2006:
–But we didn't have any disk – had to borrow – BIG THANK YOU PPD!
–CASTOR performed well in CSA06.
November/December 2006: work on a CASTOR upgrade, but the upgrade eventually fails.
January 2007: declared the CASTOR service production quality.
Feb/March 2007:
–Continue work with CMS as they expand the range of tasks expected of CASTOR – significant load-related operational issues identified (e.g. CMS merge jobs cause LSF meltdown).
–Start work with ATLAS/LHCb and MINOS to migrate to CASTOR.

CASTOR Layout
[Diagram mapping SRM v1 endpoints (ralsrma, ralsrmb, ralsrmc, ralsrmd, ralsrme, ralsrmf) onto disk pools and service classes: cmswanout (D1T0), CMSwanin, cmsFarmRead, D0T1prd, D0T1tmp, D0T1, lhcbD1T0, atlasD1T0prod, atlasD1T0usr, atlasD1T1, atlasD0T1test, atlasD1T0test.]

CMS

PhEDEx Rate to CASTOR (RAL Destination)

PhEDEx Rate to CASTOR (RAL Source)

SL4 and gLite
Preparing to migrate some batch workers to SL4 for experiment testing.
Some gLite testing (on SL3) is already underway, but we are becoming increasingly nervous about the risks associated with late deployment of the forthcoming SL4 gLite release.

Grid Only
Long-standing milestone that the Tier-1 will offer a Grid Only service by the end of August.
Discussed at the January UB – considerable discussion about what "Grid Only" means.
Basic target confirmed by the Tier-1 board, but details still to be fixed about exactly what remains as needed.

Conclusions
Last year was a tough year, but we eventually made good progress:
–A lot of problems encountered
–A lot accomplished
This year the focus will be on:
–Establishing a stable CASTOR service that meets the needs of the experiments
–Deploying the required releases of SL4/gLite
–Meeting (hopefully exceeding) availability targets
–Hardware ramp-up as we move towards GRIDPP3