CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Tier0 Status Tony Cass (With thanks to Miguel Coelho dos Santos & Alex Iribarren) LCG-LHCC.

Presentation transcript:

Tier0 Status
Tony Cass (With thanks to Miguel Coelho dos Santos & Alex Iribarren)
CERN IT Department, CH-1211 Genève 23, Switzerland
LCG-LHCC Mini Review, 1st July 2008

Agenda
Resource Ramp-up
CASTOR Performance and Metrics
Power Issues and Progress


Elonex Issues
Resource Ramp-up
  – Tier0 purchasing affected by two issues
      – Elonex bankruptcy
      – Disk server problems under heavy load
CASTOR Performance and Metrics
Power Issues and Progress

Ramp-up: Problems
Elonex Bankruptcy
  – One disk server order rapidly switched to alternative suppliers.
  – Second disk server order plus CPU order switched to alternative suppliers after FC in March.
Disk server load issues
  – Problems brought to light by improvements to hardware burn-in procedure.
  – Load to provoke issues significantly exceeds normal load on disk servers.
  – Previous generation servers also show the problems with extremely high load.
New capacity now released to deployment (and many servers have run well for some time with no issues).

Ramp-up: Current state
CPU
  – 100% of pledge delivered in early May, i.e. with one month delay
Disk
  – 52% of pledge delivered to experiments in early May.
  – Balance of pledge is at CERN and will be deployed progressively in coming weeks.
  – Delay, but minimal impact on CCRC exercise
Tape
  – 100% of pledge available well before April 2009
Procurements
  – On schedule: tenders for September FC adjudication opened; tenders for December adjudication to be sent out shortly.


Agenda
Resource Ramp-up
CASTOR Performance and Metrics
  – CASTOR Service
  – SRM Interface
  – CASTOR metrics
Power Issues and Progress

CASTOR Service
[Plots: overall disk cache throughput; CASTORCMS disk cache throughput]
Three problems with the garbage collection mechanism for CMS contributed to large peaks of internal traffic, but showed the I/O capacity of the CMS setup (peaks of 9GB/s, in+out).
CASTOR ran well throughout the May CCRC, although end-user load seemed low.
In general, incidents (<10) were detected rapidly, resolved quickly and only affected a single CASTOR instance.
  – The exception was a problem on the CASTOR public service which impacted ATLAS, as the SRM interface was shared; this configuration was due to hardware (non-)availability; the planned dedicated SRM interfaces are being deployed.
Desired operational improvements (notably tape request prioritisation) are being deployed with promising initial results.

SRM Interface --- I
When it works, it works well
  – A large volume of data was transferred
  – The average rate was high
Reliability is still an issue
  – ~10 incidents with impact ranging from service degradation to complete unavailability

SRM Interface --- II
May 5 – redundant SRM back-ends lock each other in database [ALL VOs]
May 13th – lack of space on SRM DB [LHCb]
May 13th – DB “extreme locking” / DB deadlocks [ALL VOs]
May 9th, May 14th, May 19th – SRM ‘stuck’ / no threads to handle requests [ATLAS]
May 21st, May 24th – slow stager backend causes SRM stuck / DB overload [ALL VOs]
May 30th – get Timeouts due to slowness on Castor backend [ATLAS, LHCb]
3 times in May – problematic use of soft pinning caused GC problems [CMS]
June 6th – patch update crashed backend servers [ATLAS, ALICE, CMS]
To be improved:
  – Better resiliency to problems
  – More service decoupling
  – Some bugs need to be fixed
  – Better testing needs to be done

SRM Interface --- III
Separate out LHC VOs from shared instance
Migrate all SRM databases to Oracle RAC (done for ATLAS)
Upgrade to SRM 2.7 and deploy on SLC4
  – Redundant backends
  – Uses CASTOR API which allows deployment of redundant stager daemons
  – Deploy fixes for identified bugs
Configure SRM DLF to send logs to appropriate stager DLF
  – Improve our debugging response time
Continue improving service monitoring
[Slide legend: Done; In test; In (very early) test. The per-item status markers are not preserved in this transcript.]
* Required for “time to turl” metric; could be delivered in ~1 month.

CASTOR Metrics
Metric implementation continues as a collaboration between developers and operations teams.
Improved instrumentation rolled out during CCRC in May
  – so few measurements for LHCb or ALICE
  – no upload to Lemon; Excel plots only
New Lemon sensor for CASTOR will be deployed in the near future to deliver automatic generation of metric plots
  – (this version centralises all daemon monitoring, so needs the CASTOR release.)
The following slides show a selection of metric plots which cover performance and issues during the May CCRC.

File size and performance
Date            Alice     Atlas      CMS        LHCb
CCRC May ’08    322 MB    1291 MB    872 MB     1327 MB
March ’08       143 MB    230 MB     1490 MB    865 MB
CCRC Feb ’08    340 MB    320 MB     1470 MB    550 MB
Jan ’08         200 MB    250 MB     2000 MB    200 MB

Files migrated per day (CMS & ATLAS)
[Plot: files migrated per day]
From 11/05 to 13/05, ~95K files were written to default pool.
Top users, number of files: … files, 6080 files, 4938 files
The average file size was 48.8MBytes.
Such issues are followed up.
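The scale of this small-file episode can be cross-checked by multiplying out the two figures quoted on the slide. This is only a rough sketch: both inputs are approximate ("~95K" files, an average size), decimal units (1 GB = 1000 MB) are an assumption, and the names below are illustrative and not taken from any CASTOR tool.

```python
# Figures quoted on the slide; both are approximate.
FILES_WRITTEN = 95_000      # "~95K files" written to the default pool
AVG_FILE_SIZE_MB = 48.8     # "The average file size was 48.8MBytes"


def total_volume_gb(n_files: int, avg_size_mb: float) -> float:
    """Total data volume in GB, assuming decimal units (1 GB = 1000 MB)."""
    return n_files * avg_size_mb / 1000.0


volume_gb = total_volume_gb(FILES_WRITTEN, AVG_FILE_SIZE_MB)
print(f"Roughly {volume_gb:.0f} GB spread over {FILES_WRITTEN} files")
```

Under these assumptions the volume is a few terabytes, so it is the number of files and not the amount of data that makes the episode notable.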

ATLAS Migrations
[Plots: migrations for the default and t0atlas pools]
T0ATLAS is working well. Already mentioned issue on default pool...

Repeat Tape Mounts
[Plot: repeated (read) tape mounts per VO; axis: avg repeat mount per VO per day]
High value of repeat mounts for reading...

Get/Put Latency (Median)
[Plot annotations: Tier-0 exercise; GC Problem; Tier1 Data Import; Power Cut!]


Agenda
Resource Ramp-up
CASTOR Performance and Metrics
Power Issues and Progress
  – B513 status
  – New Computer Centre Planning
  – Covering the Gap

B513 Status
(Re)Design capacity was 2.5MW, but electrical subsystem was sized at 3.6MW for redundancy, so extra capacity should be available.
Practical limits upstream (VA vs W issues, avoid running UPS systems at 100% capacity) limit maximum power for computing equipment to 2.9MW, i.e. +400kW.
Attention during purchasing has significantly increased power efficiency: from 15SI2K/VA originally to 37SI2K/VA in early 2008 and 54SI2K/VA for the latest equipment offered.
  – Aggressive retirement schedule leads to significant power reduction, perhaps as much as 800kW.
Current projection is that B513 can [only] accommodate all equipment to be delivered before end [year missing in transcript].
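The power figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable and function names are illustrative assumptions, and the efficiency ratio simply compares the quoted SI2K/VA values without modelling the actual equipment mix.

```python
# Figures quoted on the slide, expressed in kW.
DESIGN_CAPACITY_KW = 2500     # (re)design capacity, 2.5MW
ELECTRICAL_SIZING_KW = 3600   # electrical subsystem sizing, 3.6MW
PRACTICAL_LIMIT_KW = 2900     # practical maximum for computing equipment, 2.9MW

# Power efficiency of purchased equipment, in SI2K per VA.
EFFICIENCY_SI2K_PER_VA = {
    "original": 15,
    "early 2008": 37,
    "latest offered": 54,
}


def headroom_kw(limit_kw: int, design_kw: int) -> int:
    """Extra power available above the design capacity."""
    return limit_kw - design_kw


def efficiency_gain(new: float, old: float) -> float:
    """Factor by which computing capacity per VA has improved."""
    return new / old


extra_kw = headroom_kw(PRACTICAL_LIMIT_KW, DESIGN_CAPACITY_KW)
unused_sizing_kw = ELECTRICAL_SIZING_KW - PRACTICAL_LIMIT_KW
gain_latest = efficiency_gain(
    EFFICIENCY_SI2K_PER_VA["latest offered"],
    EFFICIENCY_SI2K_PER_VA["original"],
)

print(f"Headroom above design capacity: {extra_kw} kW")
print(f"Sizing not usable in practice: {unused_sizing_kw} kW")
print(f"Efficiency gain, original to latest: {gain_latest:.1f}x")
```

The headroom reproduces the "+400kW" quoted on the slide, and the efficiency ratio shows why replacing old equipment frees a substantial share of the power budget.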

New Computer Centre Planning
In-house design and construction of new centre not possible (TS effort focussed on Linac4).
No desire to tender for turn-key design and construction
  – Lowest cost bidder wins...
Four phase process developed:
  1. Request (many) conceptual designs
  2. Commission 3-4 companies submitting conceptual designs to develop an outline design
  3. In-house, turn a selected outline design into plans and documents enabling
  4. Single tender for overall construction.
“Call for proposals” for the conceptual designs sent out (deadline July 18th); process could lead to negotiation of construction contract end [year missing in transcript].
  – Estimate subsequent detailed design phase of ~6 months and construction phase of ~18 months
  – New centre available for equipment installation in Jan 2012
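The schedule estimates above can be worked backwards from the availability date. This is a sketch under stated assumptions: the two phases are taken to run strictly one after the other with no slack, the "~6" and "~18" month estimates are treated as exact, and the helper `months_before` is a hypothetical name introduced only for this illustration.

```python
from datetime import date


def months_before(d: date, months: int) -> date:
    """Return the first day of the month that lies `months` months before `d`."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)


# Figures quoted on the slide.
AVAILABLE_FOR_INSTALLATION = date(2012, 1, 1)  # "Jan 2012"
DESIGN_MONTHS = 6                              # "~6 months" detailed design
CONSTRUCTION_MONTHS = 18                       # "~18 months" construction

# Work backwards, assuming strictly sequential phases.
construction_start = months_before(AVAILABLE_FOR_INSTALLATION, CONSTRUCTION_MONTHS)
design_start = months_before(construction_start, DESIGN_MONTHS)

print(f"Latest construction start: {construction_start:%b %Y}")
print(f"Latest detailed design start: {design_start:%b %Y}")
```

Any delay in starting the detailed design phase beyond the computed date would, under these assumptions, push availability past January 2012.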

Covering the Gap
B513 OK until end [year missing in transcript]; New Centre from Jan 2012 ==>
  – Need to cover 2011 installations
  – Plus 1H2012 installations in case of construction delays.
Tier1 centres asked for possible spare capacity in this window.
  – Oslo could be a possibility: completing 2MW facility end-2009, but only need 1MW initially. Discussions on modalities (hardware requirements, operation model, ...) to start soon.
Reviewing co-lo options within ~1hr of CERN.
  – No spare capacity at present,
  – Options possible on 2011 timeframe, but still at ~2kW/m², so likely very expensive.

Questions? Comments?