The LCG Service Challenges
Focus on SC3 Re-run; Outlook for 2006
Jamie Shiers, LCG Service Manager
CHEP – Mumbai, February 2006

The LHC Computing Grid – (The Worldwide LCG)

Abstract
The LCG Service Challenges are aimed at achieving the goal of a production-quality worldwide Grid that meets the requirements of the LHC experiments in terms of functionality and scale. This talk highlights the main goals of the Service Challenge programme, significant milestones, and the key services that have been validated in production by the LHC experiments. The programme currently involves the four LHC experiments and many sites, including the Tier0, all Tier1s and a number of key Tier2s, allowing all primary data flows to be demonstrated. The functionality so far addresses all primary offline Use Cases of the experiments except for analysis; the latter is addressed in the final challenge, scheduled to run from April until September, prior to delivery of the full production Worldwide LHC Computing Service.

Agenda
- Overview of the Service Challenge Programme
- What we have – and have not – achieved so far
- The SC3 re-run (focusing on Tier0–Tier1 aspects)
  - Other aspects are also very important, but this is a key area where we did not meet our 2005 targets
- Timetable and high-level tasks for the future
- Summary and Conclusions

Introduction
- The (W)LCG Service Challenges are about preparing, hardening and delivering the production Worldwide LHC Computing Environment (WLCG)
- The date for delivery of the production LHC Computing Environment is 30 September 2006
- Production Services are required as from 1 September 2005 (service phase of Service Challenge 3) and from 1 June 2006 (service phase of Service Challenge 4 – note new schedule)
- "This is not a drill." (GDB, October 2005)

SC3 Goals
- Much more than just a throughput test!
- More data management services:
  - SRM required at all sites
  - Reliable File Transfer Service based on gLite FTS
  - LFC file catalog deployed as required: global catalog for LHCb, site-local for ALICE + ATLAS
  - [ Other services as per BSWG ]
- More sites:
  - All Tier1s took part – this was better than foreseen!
  - Many Tier2s – now above 20 sites, covering most regions. This too is working well!
  - Workshops held in many countries / regions (T1 + T2s + experiments) – this should continue!
    UK, Italy, France, Germany, Asia-Pacific, North America, Nordic region, … (A-P w/s early May 2006; North American w/s around September GDB???)
- All experiments:
  - Clear goals established together with metrics for measuring success
  - List of issues / requirements has been produced – plan for addressing remaining issues
- Throughput targets (see the sketch below):
  - 50% higher than SC2, but using SRM & FTS as above (150MB/s to disk at T1s)
  - 60MB/s to tape at Tier1s (following disk – disk tests)
  - Modest T2->T1 targets, representing MC upload (3 x 1GB file / hour)
- Activities kicked off to address these issues were successful
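To put the targets in perspective, here is a back-of-the-envelope sketch (Python, not part of the original slides) converting the per-site targets into daily volumes and expressing the T2→T1 MC-upload target as an average rate; the 1 GB = 10^9 bytes convention is an assumption.

```python
# Illustrative arithmetic only: converts the SC3 per-site targets quoted above
# into daily data volumes, and the T2->T1 target (3 x 1 GB files per hour)
# into an average rate. Assumes 1 GB = 10^9 bytes, 1 MB = 10^6 bytes.

SECONDS_PER_DAY = 86_400

def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Data volume in TB moved in one day at a sustained rate in MB/s."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1e6

disk_target_mb_s = 150   # T0 -> T1 disk target per site
tape_target_mb_s = 60    # T0 -> T1 tape target per site (July 2005 figure)

# T2 -> T1 Monte Carlo upload: 3 files of 1 GB per hour
t2_upload_mb_s = 3 * 1_000 / 3_600

print(f"Disk target: {daily_volume_tb(disk_target_mb_s):.1f} TB/day per Tier1")
print(f"Tape target: {daily_volume_tb(tape_target_mb_s):.1f} TB/day per Tier1")
print(f"T2 MC upload: ~{t2_upload_mb_s:.2f} MB/s average per Tier2")
```

At 150 MB/s a Tier1 absorbs roughly 13 TB per day, which is why sustained stability, rather than peak rate, is the recurring concern in the slides that follow.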

SC3(4) Service Summary
- Services identified through a combination of the Baseline Services Working Group, the Storage Management Workshop and 1-1 discussions with the experiments
  - The BSWG timeline and the service setup lead time did not allow us to wait for the final report before starting to implement services
- For the new services (LFC, FTS), two flavours were established at CERN:
  - Pilot – to allow experiments to gain experience with functionality, adapt their s/w to the interfaces etc.
  - SC3 – full production services
  - This separation proved useful and needs to continue with the Pre-Production System!
- New services for sites: LFC (most sites), FTS (T1s), SRM (DPM, dCache at T2s)
- Support lists established for these services, plus a global catch-all
  - A concrete plan for moving VOs to GGUS is being discussed with the VOs
- SC3 services are being re-deployed for full production
  - Some of this work was done during the end-Oct / early-Nov intervention
  - Services by site; VO variations; detailed functionality and timeline exist

SC3 Throughput Tests
- Unfortunately, the July throughput tests did not meet their targets
  - Compounded by service instability
  - Continued through to the end, i.e. disk – disk, disk – tape and T2 – T1 components
- Spent most of August debugging and fixing
  - dCache workshop held at DESY identified concrete actions / configurations / dCache improvements
  - Improvements also in CASTOR SRM & gLite FTS
  - All software upgrades now released & deployed
- Disk – disk rates obtained in July were around 1/2 of target, without stability!

SC3 Throughput: Disk & Tape
[Chart slide, James Casey] Disk target: 150MB/s per site, 1GB/s out of CERN; tape target: 60MB/s per site (not all sites).

Results of SC3 in terms of Transfers
- Target data rates 50% higher than during SC2
- All T1s (most supporting T2s) participated in this challenge
- Transfers between SRMs (not the case in SC1/2)
- Important step to gain experience with the services before SC4

Rates during the July throughput tests (daily averages to disk, against the MoU targets):

  Site           MoU target (tape, MB/s)   Daily average (disk, MB/s)
  ASGC           100                       10
  BNL            …                         …
  FNAL           …                         …
  GridKa         200                       42
  CC-IN2P3       …                         …
  CNAF           200                       50
  NDGF           50                        129
  PIC            100                       54
  RAL            150                       52
  SARA/NIKHEF    …                         …
  TRIUMF         50                        34

Better single-site rates have been achieved since, but we need to rerun the tests. For this we need dCache 1.6.6(+) to be released/deployed, the latest FTS (now), network upgrades etc. January 06 (< CHEP)
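As a rough illustration (a sketch, not from the slides), the table rows whose numbers survive in the transcript can be turned into fraction-of-target figures, which is consistent with the "around 1/2 of target" remark above:

```python
# Illustrative sketch: fraction of the MoU target reached by the daily disk
# averages during the July 2005 SC3 throughput tests, using only the rows
# whose numbers survive in the table above.

july_rates = {
    # site: (MoU target MB/s, daily average MB/s)
    "ASGC":   (100, 10),
    "GridKa": (200, 42),
    "CNAF":   (200, 50),
    "NDGF":   (50, 129),
    "PIC":    (100, 54),
    "RAL":    (150, 52),
    "TRIUMF": (50, 34),
}

for site, (target, achieved) in july_rates.items():
    print(f"{site:8s} {achieved:4d} MB/s = {100 * achieved / target:5.1f}% of target")
```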

Post-debugging
[Chart slide, James Casey] We can now achieve the same rate as before with fewer sites. Still need to add in the other sites, and see what the new upper limit is.

Pre-Requisites for Re-Run of Throughput Tests
- Deployment of gLite FTS 1.4 (srmcp support)
  - Done at CERN in a recent intervention
- dCache 1.6.6 (or later) released and deployed at all dCache sites
  - Released – some sites already planning the upgrade
- CASTOR2 clients and CASTORSRM version (or later) at all CASTOR sites (ASGC, CNAF, PIC)
- Upgrade to the CERN internal network infrastructure
  - Partly done – remainder during the Christmas shutdown
  - N.B. we intend to keep the Grid running over Xmas! (Close to the last chance…)
- 10Gbit/s network connections operational at the following sites: IN2P3, GridKA, CNAF, NIKHEF/SARA, BNL, FNAL

dCache – the Upgrade (CHEP 2006)
For the last two years, the dCache/SRM Storage Element has been successfully integrated into the LCG framework and is in heavy production at several dozen sites, spanning a range from single-host installations up to those with some hundreds of TB of disk space, delivering more than 50 TB per day to clients. Based on the permanent feedback from our users and the detailed reports given by representatives of large dCache sites during our workshop at DESY at the end of August 2005, the dCache team has identified important areas of improvement. These include more sophisticated handling of the various supported tape back-ends, the introduction of multiple I/O queues per pool with different properties to account for the diverse behaviours of the different I/O protocols, and the possibility to have one dCache instance spread over more than one physical site. … changes in the name-space management, as short- and long-term perspectives, to keep up with future requirements. … initiatives to make dCache a widely scalable storage element, including the introduction of "dCache, the Book", plans for improved packaging and more convenient source-code license terms. Finally I would like to cover the dCache part of the German e-science project, D-Grid, which will allow for improved scheduling of tape-to-disk restore operations as well as advanced job scheduling by providing extended information exchange between storage elements and the job scheduler.

Disk – Disk Rates (SC3 Repeat), January 2006
These are the nominal data rates, capped at 150MB/s.

  Centre (nominal rate)          Experiments served (of ALICE, ATLAS, CMS, LHCb)   Target rate MB/s (site target)
  Canada, TRIUMF                 X                                                 50 (++)
  France, CC-IN2P3 (200MB/s)     X X X X                                           150 (100?)
  Germany, GridKA (200MB/s)      X X X X                                           150
  Italy, CNAF (200MB/s)          X X X X                                           150
  Netherlands, NIKHEF/SARA       X X X                                             150
  Nordic Data Grid Facility      X X X                                             50
  Spain, PIC Barcelona           X X X                                             100 (30)
  Taipei, ASGC                   X X                                               100 (75)
  UK, RAL                        X X X X                                           150
  USA, BNL (200MB/s)             X                                                 150
  USA, FNAL (200MB/s)            X                                                 150
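Summing the capped per-site targets gives a feel for the aggregate export rate the Tier0 has to sustain (illustrative arithmetic only; the ~1 GB/s CERN figure appears on a later slide):

```python
# Illustrative sketch: aggregate Tier0 export rate implied by the capped
# per-site disk-disk targets in the table above (values in MB/s).

site_targets_mb_s = {
    "TRIUMF": 50, "CC-IN2P3": 150, "GridKA": 150, "CNAF": 150,
    "NIKHEF/SARA": 150, "NDGF": 50, "PIC": 100, "ASGC": 100,
    "RAL": 150, "BNL": 150, "FNAL": 150,
}

total = sum(site_targets_mb_s.values())
print(f"Aggregate target out of CERN: {total} MB/s (~{total / 1000:.2f} GB/s)")
```

The capped targets already add up to roughly 1.35 GB/s, somewhat above the ~1 GB/s that the CERN hardware configuration could deliver at the time, so not every site can run at its cap simultaneously.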

Disk – Tape Rates (SC3 Re-run), February 2006
Target: ~5 drives of current technology; rate per site is 50MB/s (60MB/s was the July target).

  Centre                         Experiments served (of ALICE, ATLAS, CMS, LHCb)   Target rate MB/s
  Canada, TRIUMF                 X                                                 50
  France, CC-IN2P3               X X X X                                           50
  Germany, GridKA                X X X X                                           50
  Italy, CNAF                    X X X X                                           50
  Netherlands, NIKHEF/SARA       X X X                                             50
  Nordic Data Grid Facility      X X X                                             50
  Spain, PIC Barcelona           X X X                                             50
  Taipei, ASGC                   X X                                               50
  UK, RAL                        X X X X                                           50
  USA, BNL                       X                                                 50
  USA, FNAL                      X                                                 50
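For completeness, the aggregate tape rate implied by this table (illustrative arithmetic only):

```python
# Illustrative sketch: aggregate disk-to-tape rate across the 11 Tier1s at the
# 50 MB/s per-site target from the table above, and the corresponding daily volume.

sites, per_site_mb_s = 11, 50
aggregate = sites * per_site_mb_s
print(f"Aggregate: {aggregate} MB/s -> ~{aggregate * 86_400 / 1e6:.1f} TB/day to tape")
```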

SC3 Summary
- Underlined the complexity of the reliable, high-rate sustained transfers that must be met during LHC data taking
- Many issues understood and resolved – need to confirm by a re-run of the throughput exercise
- We are now well into the Service Phase (Sep – Dec)
  - Collaboration with sites & experiments has been excellent
  - We are continuing to address problems as they are raised
  - Together with preparing for SC4 and the WLCG pilot / production
- The experiment views have been summarised by Nick Brook
  - See also the GDB presentations of this week

SC3 Tier0 – Tier1 Disk – Disk Rerun
- Involved all Tier1 sites + DESY
  - BNL, CNAF, DESY, FNAL, FZK, IN2P3, NDGF, PIC, RAL, SARA, TAIWAN, TRIUMF
- Preparation phase significantly smoother than previously
  - Although a number of the problems seen had occurred before…
  - As usual, better documentation…
- Sites clearly have a much (much) better handle on the systems now…
  - What to tune, preferred settings etc.
- We still do not have the stability required / desired…
  - The daily average needs to meet / exceed targets
  - We need to handle this without heroic efforts at all times of day / night!
  - We need to sustain this over many (100) days
  - We need to test recovery from problems (individual sites – also Tier0)
- But a big improvement, also in the rates achieved
  - Limited by the h/w configuration at CERN to ~1GB/s – the target (average)
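To illustrate what "sustain this over many (100) days" means in volume terms, a small sketch (not from the slides):

```python
# Illustrative sketch: total data exported from the Tier0 if the ~1 GB/s
# average is sustained for 100 days, as the slide above calls for.

rate_gb_s = 1.0          # average export rate out of CERN (GB/s)
days = 100

total_pb = rate_gb_s * 86_400 * days / 1e6   # 1 PB = 10^6 GB
print(f"~{total_pb:.1f} PB moved in {days} days at {rate_gb_s} GB/s")
```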

SC3 Re-run Results
- We managed to obtain close to 1GB/s for extended periods
- Several sites exceeded the targets agreed
- Several sites reached or even exceeded their nominal rates
- We still see quite a few operational problems
  - And the software is not completely debugged yet…
- But it's an encouraging step towards reaching the final goals…


Achieved (Nominal) pp Data Rates
Disk – disk (SRM) rates into the Tier1s, in MB/s; nominal pp rates in parentheses.

  ASGC, Taipei                  80 (100) (have hit 110)
  CNAF, Italy                   200
  PIC, Spain                    >30 (100) (network constraints)
  IN2P3, Lyon                   200
  GridKA, Germany               200
  RAL, UK                       200 (150)
  BNL, USA                      (200)
  FNAL, USA                     >200 (200)
  TRIUMF, Canada                (50)
  SARA, NL                      (150)
  Nordic Data Grid Facility     (50)

Key: meeting or exceeding nominal rate (disk – disk); meeting the target rate for the SC3 disk-disk re-run.
- Missing (globally): rock-solid stability, tape back-end
- SC4 T0-T1 throughput goals: nominal rates to disk (April) and tape (July)
- To come: srm copy support in FTS; CASTOR2 at remote sites; SLC4 at CERN; network upgrades etc.

Currently Scheduled Throughput Tests
- January 2006 – rerun of SC3 disk – disk transfers (max 150MB/s)
  - All Tier1s and DESY participated. Achieved ~1GB/s out of CERN; good rates to sites
- February 2006 – rerun of SC3 disk – tape transfers (50MB/s – was 60MB/s in July)
  - Sites should allocate 5 current-generation drives and understand the issues involved
- March 2006 – T0-T1 loop-back tests at 2 x nominal rate
  - At CERN, using new tape technology and the corresponding infrastructure
- April 2006 – T0-T1 disk-disk (nominal rates), disk-tape (50-75MB/s)
  - All Tier1s – disk rates at BNL, FNAL, CNAF, FZK, IN2P3 go up to 200MB/s
- July 2006 – T0-T1 disk-tape (nominal rates)
  - All Tier1s – rates 50 – 200MB/s depending on VOs supported & resources provided
- T1-T1, T1-T2, T2-T1 and other rates TBD according to the CTDRs
  - All Tier1s; 20 – 40 Tier2s; all VOs; all offline Use Cases
- Still significant work ahead for experiments, T0, T1s and T2s!

SC3 Re-Run: Tape Throughput Tests
- DESY achieved a cracking 100MB/s; later throttled back to 80MB/s
- In general, transfers not as smooth as the disk-disk ones, but this is probably to be expected
  - At least until things are properly optimised…
- The target of 50MB/s sustained per site with ~5 drives of the current technology was met (see the per-drive arithmetic below)
  - Although not all sites took part (some had already met this target…)
- Next disk-tape throughput test scheduled for April 2006 (same rates; all sites); full nominal rates scheduled for July 2006
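A minimal sketch (illustrative only) of what the 50 MB/s-per-site tape target implies per drive and per day:

```python
# Illustrative sketch: per-drive rate implied by the SC3 re-run tape target
# of 50 MB/s sustained per site using ~5 drives of then-current technology.

site_target_mb_s = 50
drives_per_site = 5

per_drive = site_target_mb_s / drives_per_site
print(f"~{per_drive:.0f} MB/s sustained per drive")            # ~10 MB/s
print(f"~{site_target_mb_s * 86_400 / 1e6:.1f} TB/day to tape per site")
```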

Further Throughput Tests
- Need to define throughput tests for TierX – TierY transfers
  - Mumbai (February) & CERN (June) workshops
- Need to demonstrate sustained average throughput at or above MoU targets from the Tier0 to tape at all Tier1s
- Need to demonstrate recovery of backlog(s) – see the sketch below
  - ~4 hour downtimes at individual Tier1s; ~1 day downtime? more?
  - ~4 hour downtime at the Tier0 (!); ~1 day downtime?
  - Use the operations log to establish realistic Use Cases
- Full end-to-end throughput demonstration
  - Need to simulate a full data-taking period (and repeat until it works…)
- All of the above confirmed using experiment-driven transfers
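The backlog-recovery requirement can be quantified with a simple model (an illustrative sketch with assumed example numbers, not figures from the slides): if a site is down for a time T while the Tier0 keeps producing at the nominal rate R, the backlog is R*T, and clearing it within a window W requires the site to sustain R*(1 + T/W).

```python
# Illustrative backlog-recovery model with assumed example numbers:
# during a downtime of `downtime_h` hours, data accumulates at the nominal
# rate; to clear that backlog within `recovery_window_h` hours the site must
# sustain nominal_rate * (1 + downtime_h / recovery_window_h).

def recovery_rate(nominal_mb_s: float, downtime_h: float, recovery_window_h: float) -> float:
    """Sustained rate (MB/s) needed to absorb new data and clear the backlog."""
    return nominal_mb_s * (1 + downtime_h / recovery_window_h)

nominal = 150  # MB/s, example Tier1 disk target from the earlier table

for downtime, window in [(4, 24), (24, 72)]:
    needed = recovery_rate(nominal, downtime, window)
    print(f"{downtime:2d} h down, {window} h to recover -> ~{needed:.0f} MB/s "
          f"({needed / nominal:.2f} x nominal)")
```

Even a 4-hour outage followed by a one-day recovery window already asks for roughly 15% headroom above nominal, which is why the slides stress recovery tests alongside steady-state throughput.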

SC3 Services – Lessons (re-)Learnt
- It takes a L O N G time to put services into (full) production
- A lot of experience gained in running these services Grid-wide
  - The merge of the SC and CERN daily operations meetings has been good
- Still need to improve Grid operations and Grid support
  - A CERN Grid Operations Room needs to be established
- Need to be more rigorous about:
  - Announcing scheduled downtimes; reporting unscheduled ones
  - Announcing experiment plans; reporting experiment results
  - Attendance at V-meetings; …
  - (Being addressed now)
- A daily OPS meeting is foreseen for LHC preparation / commissioning

WLCG – Major Challenges Ahead
1. Get data rates at all Tier1s up to the MoU values
   - This is currently our biggest challenge – but good progress recently!
   - The plan is to work with a few key sites and gradually expand (focusing on the highest data-rate sites initially…)
2. (Re-)deploy the required services at sites so that they meet the MoU targets
   - The Tier0 will have all services re-deployed prior to the SC4 Service Phase (WLCG Pilot)
   - Plans are being shared with Tier1s and Tier2s, as will be the experience
   - The LCG Service Coordination team will be proactive in driving this forward
   - A lot of work, but no major show-stopper foreseen
3. Understand the other key Use Cases for verification / validation
   - This also includes other inter-Tier transfers (see workshops)
   - Many will be tested by experiment production
   - Which should be explicitly tested as dedicated Service Tests?

Timeline
- January: SC3 disk – disk repeat – nominal rates capped at 150MB/s
- February: CHEP w/s – T1-T1 Use Cases; SC3 disk – tape repeat (50MB/s, 5 drives)
- March: Detailed plan for the SC4 service agreed (M/W + DM service enhancements)
- April: SC4 disk – disk (nominal) and disk – tape (reduced) throughput tests
- May: Deployment of new m/w release and services across sites
- June: Start of SC4 production; tests by experiments of T1 Use Cases; Tier2 workshop – identification of key Use Cases and milestones for T2s
- July: Tape throughput tests at full nominal rates!
- August: T2 milestones – debugging of tape results if needed
- September: LHCC review – rerun of tape tests if required?
- October: WLCG Service officially opened. Capacity continues to build up.
- November: 1st WLCG conference. All sites have network / tape h/w in production(?)
- December: Final service / middleware review leading to early-2007 upgrades for LHC data taking??
- O/S upgrade? Sometime before April 2007!

Conclusions
- A great deal of progress in less than one year…
  - Which is all we have left until FULL PRODUCTION
- The focus now is on SERVICE and STABILITY
  - Service levels & functionality (including data transfers) defined in the WLCG MoU
- A huge amount of work by many people… Thanks to all! (From the CSO & LHCC Referees too!)

Acknowledgements
The above represents the (hard) work of a large number of people across many sites and within all LHC collaborations. Thanks to everybody for their participation, dedication and enthusiasm!
