WLCG Status
Ian Bird, Overview Board, 3rd December 2010

Agenda for today
Project report:
– Brief status update: preparation for and experience with the HI run
– Resource usage
– Tier 0 – for information
– EGI/EMI – follow-up from last OB
– Future actions: middleware review … data management … virtualisation … SIR analysis
– Follow-up on the action from the last meeting
Planning 2011, 2012, 2013:
– Considerations in view of the prospects for data

Useful information
Following the request at the last OB:
– Board/Shared%20Documents/Forms/AllItems.aspx
– Contains material and pointers to places where WLCG is mentioned/thanked/…

Preparations for Heavy Ion run
In September, just before the LHCC meeting, we learned of the intention for ALICE and CMS to run at much higher rates (the arithmetic behind the TB/day figures is sketched below):
– ALICE: 12.5 MB/evt * 200 Hz (2,500 MB/s) → 200 TB/day (originally 12.5 MB/evt * 100 Hz, 1,250 MB/s → 100 TB/day)
– ATLAS: 2.2 MB/evt * … Hz (… MB/s) → … TB/day
– CMS: 12 MB/evt * 150 Hz (1,800 MB/s) → 150 TB/day
– LHCb: no HI data taking
NB: the large event sizes are due to no zero suppression for HI 2010, and no trigger
– Doubling of the ALICE rate (same integrated data volume as planned)
– New large CMS request – only arriving now
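A minimal sketch of the rate arithmetic behind these numbers; the event sizes and trigger rates are taken from the slide, while the continuous-running assumption and the script itself are illustrative only:

```python
# Illustrative check of the HI data-rate arithmetic quoted above.
# Assumes continuous data taking (86,400 s/day), which overstates real volumes.

def daily_volume_tb(event_size_mb: float, rate_hz: float) -> tuple[float, float]:
    """Return (throughput in MB/s, volume in TB/day) for a given event size and trigger rate."""
    throughput_mb_s = event_size_mb * rate_hz
    volume_tb_day = throughput_mb_s * 86_400 / 1_000_000  # MB -> TB, per day
    return throughput_mb_s, volume_tb_day

for expt, size_mb, rate_hz in [("ALICE", 12.5, 200), ("CMS", 12.0, 150)]:
    mb_s, tb_day = daily_volume_tb(size_mb, rate_hz)
    print(f"{expt}: {mb_s:,.0f} MB/s, ~{tb_day:,.0f} TB/day")
# ALICE: 2,500 MB/s, ~216 TB/day  (quoted as ~200 TB/day)
# CMS:   1,800 MB/s, ~156 TB/day  (quoted as ~150 TB/day)
```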

Heavy Ion processing
ALICE:
– No complete export/processing during the heavy-ion run; export to be completed in the months after HI
– During the HI run: ~20% of the RAW rate (hence ~200 MB/s); ~20% RAW reconstruction (but NO ESD recording)
ATLAS:
– Complete RAW export to all Tier 1s, as in pp running
– Some processing: some ESD+DESD(+AOD) to tape and export
– NB: (ESD+DESD)/RAW is ~6 (very large ESD and DESD)
– Assumption: at CERN, express stream only – the rest is done at the Tier 1s
CMS:
– Complete RAW export to 1 site (FNAL)
– Full processing at CERN
Reminder: following the data loss incident in May, we realised the risk to ALICE data during the weeks following data taking, when there would be only a single copy. The mitigation was to expand the ALICE disk pool to keep 2 copies at CERN until the data had been exported to the Tier 1s. This had not been planned for CMS, as the expectation was that all HI data would be copied to FNAL.

Heavy Ion preparations
Castor pools were expanded for ALICE + CMS:
– To ensure 2 full copies at CERN until the data can be migrated to the Tier 1s
– CMS: possible because the 2011 resources were already available
Tests were planned for the technical stop:
– Required an upgrade of Castor, plus individual and combined experiment testing at the full expected rates
The tests had to be brought forward urgently as the technical stop was changed:
– Many people were at the CHEP conference
– The tests were nevertheless successful, but only very limited combined testing was possible

Castor tape subsystem
> 5 GB/s to tape this Sunday
(Plot: data written to tape over the last month; 1 k GiB = 1 TB = 1 tape; annotations mark the ALICE+CMS HI test and HI data taking)
The Heavy Ion run is going on smoothly at impressive data rates (the tapes-per-day conversion is sketched below):
– Oct 2009 to April 2010: 0.7 PB/month (23 1-TB tapes/day)
– May 2010 to October: 1.9 PB/month (62 1-TB tapes/day)
– November (Heavy Ion run): ~3-4 PB (120 1-TB tapes/day)
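A quick check of the tapes-per-day figures quoted above; the 1 TB-per-tape figure is from the slide, while the 30-day month and the 3.5 PB midpoint for November are my own assumptions:

```python
# Check of the PB/month -> tapes/day conversion quoted above.
# Assumes 1 tape = 1 TB (as on the slide) and a 30-day month (my assumption).

def tapes_per_day(pb_per_month: float, days: int = 30, tape_tb: float = 1.0) -> float:
    """Number of 1-TB tapes written per day for a given monthly volume in PB."""
    return pb_per_month * 1_000 / (days * tape_tb)

# 3.5 PB is the midpoint of the slide's ~3-4 PB estimate for November.
for period, pb in [("Oct 2009 - Apr 2010", 0.7), ("May - Oct 2010", 1.9), ("Nov 2010 (HI)", 3.5)]:
    print(f"{period}: ~{tapes_per_day(pb):.0f} tapes/day")
# Oct 2009 - Apr 2010: ~23 tapes/day   (slide: 23)
# May - Oct 2010: ~63 tapes/day        (slide: 62)
# Nov 2010 (HI): ~117 tapes/day        (slide: ~120)
```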

Update on tape writing
Last Friday we exceeded 220 TB (220 tapes) written in a single day, which is a new HEP world record.
There was a Castor name-server DB problem during this day.

Castor disk usage
(Plots: disk-server throughput in GB/s over the last month, and per-experiment throughput in early November for ALICE, ATLAS and CMS)

Summary – Tier 0 for HI
The system is behaving very well and can accept and manage the very high data rates:
– Wrote ~3.4 PB to tape in November (pp running was ~2 PB/month)
– Peak rates: 220 TB/day to tape; 5.9 GB/s of data coming in
– Tape drives are used very efficiently (this had been a major concern), because the Castor upgrade improved scheduling and large files are being written (up to 40 GB)
– Using only ~50 drives even at peak rates (we had worried that many more would be needed); write speeds are close to the native drive speed (average of 80 MB/s) – a rough consistency check follows below
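A rough sanity check of the drive-count figure; the 220 TB/day peak and the 80 MB/s average write speed are from the slide, while the assumption of fully continuous writing over 24 hours is mine:

```python
# Rough consistency check: how many drives does the quoted peak day need?
# Assumes drives write continuously at the quoted 80 MB/s average (an idealisation).

PEAK_TB_PER_DAY = 220      # peak daily volume quoted on the slide
DRIVE_SPEED_MB_S = 80      # average write speed quoted on the slide
SECONDS_PER_DAY = 86_400

required_mb_s = PEAK_TB_PER_DAY * 1_000_000 / SECONDS_PER_DAY
drives_needed = required_mb_s / DRIVE_SPEED_MB_S
print(f"Sustained rate: {required_mb_s:,.0f} MB/s -> ~{drives_needed:.0f} drives")
# Sustained rate: 2,546 MB/s -> ~32 drives, comfortably within the ~50 drives in use.
```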

EXPERIMENT EXPERIENCES

Pb+Pb at √s = 2.76 TeV

ALICE (figure-only slides)

CMS – HI Test (LHCC, 17/11/10)
‣ As advised, we ran an I/O test with ALICE during a technical stop in the proton-proton run
‣ By the second half of the test we got to 1 GB/s out of the T0export buffer to tape
‣ This test did not include re-reconstruction
‣ The technical stop was moved at short notice to during CHEP, but the test was generally successful

Tier-0 in Heavy Ions
‣ Prompt reconstruction of Heavy Ion data is currently able to keep up with data taking
‣ Good machine live time this week
‣ Should be able to finish the run with 100% prompt reconstruction
(Plot annotation: prompt reco from the previous fill completing)

Reconstruction in Heavy Ion
‣ Reconstruction from earlier in the run
‣ Clean triggers

IO
‣ Castor IO is holding up well
‣ IO tracks with CPU
‣ Averaging about 1 GB/s to tape; peaks while writing are 2 GB/s as expected; read rates from disk peak at ~5 GB/s
‣ IO was optimized in CMSSW_3_8, which makes this Heavy Ion processing possible

First Heavy Ion Collisions
First collisions on Sunday, Nov. 7, at 00:30
No stable beams – the Pixel detector was switched off, SCT at standby voltage
Setting up of the machine started on November 1

First Heavy Ion Stable Beam Collisions
1st stable beams (1 bunch colliding): Monday, Nov. 8, ~11:20
4 colliding bunches: Tuesday, 01:01
Now routinely running with 121 colliding bunches

Submitted to Phys. Rev. Letters last Thursday; accepted on Friday, 10 hours later

Fall re-processing finished
– Aug 30: Release 16 tag deadline
– Sept 1-24: Release validation
– Sept 24: Conditions data deadline; extra week for Inner Detector alignment
– Oct 8-14: Express line re-processing
– Oct 29 – Nov 21: Bulk re-processing
– Nov: Crashed job recovery (very few!)
(Plot annotations: Fall re-processing, May re-processing, group production 40,000 jobs, express line, 970,000 jobs bulk, 900,000,000 events)

All production
(Plots: simulation, reconstruction, re-processing, group production and user analysis; month view ~50,000 jobs; year view ~20,000 jobs; separate year view of user analysis)
Fully occupying job slots in Tier 1s and Tier 2s

ATLAS – data movement

ATLAS HI
Aim for an average rate of … MB/s into Castor
Tier 0 event processing is keeping up, but only just; it could get backed up if the LHC continues to deliver at a higher duty cycle
– A backup plan to do prompt processing at Tier 1s has been tested, in order to alleviate a Tier 0 backlog
Raw HI data is distributed like pp data to the Tier 1s
Derived data produced at the Tier 0 is delivered on request to special HI centres (BNL, Weizmann and Cracow)

LHCb

2010 reprocessing
Started 27/11; 45% done in 5 days; 28 CPU years
(Plot: number of jobs)

2010 reprocessing efficiency
(Plots: job success rate in % and CPU/wall-clock efficiency)

LHCb concerns
Pile-up:
– LHCb is running at higher pile-up than the design value
– Larger event sizes and longer processing times are putting a strain on disk and CPU resources
– Use of Tier 2s is envisaged for reconstruction in 2011
– This does not currently limit the physics reach, but LHCb will not be able to go much beyond the current 2 kHz HLT rate with the existing/planned resources

Milestones
CREAM CE:
– More instances available; all experiments have tested it; beginning to be used more
Automated gathering of installed capacity:
– Tool available and in use
– Tier 0/1 OK for overall information; not complete for experiment-dependent capacities
– Tier 2 reporting being verified (led by the Tier 1s)
Data management prototypes:
– Progress presented in the GDB
– Review in January
Multi-user pilot jobs:
– Software (glexec, SCAS, ARGUS) available and largely deployed
– Experiment framework reviews done and followed up
– Testing and use has been on hold during data taking
– Has not been a (reported) problem …

Whole node scheduling
Whole node scheduling means:
– Allocating an entire machine to the experiment (not just a job slot) to allow the experiment to optimise the use of all the cores, potentially with a mix of jobs (a sketch of the idea follows below)
– Responsibility for full utilisation is then the experiments'
Informally requested by the experiments at the July workshop; eventually (October) the computing coordinators agreed that this is a real request from the experiments.
Pere Mato has been charged with setting up a working group to:
– Track commissioning of multi-core jobs in the experiments; mediate between experiments and sites on specific problems found; understand realistic requirements for eventual deployment
– Understand the changes needed in the job submission chain, from user interface to batch system configurations
– Prepare a roadmap for deployment of end-to-end submission
– Propose the needed updates for accounting and monitoring
Status:
– Requirement & mandate agreed by 3 experiments
– Working group members being proposed
– Process to be agreed in the MB next week
Timescale:
– Working group to conclude by May 2011
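A minimal, hypothetical sketch of what whole-node allocation lets an experiment do: receive all the cores of one machine and pack them itself with a mix of multi-core and single-core jobs; the job mix, core count and greedy packing policy here are illustrative only:

```python
# Hypothetical illustration of whole-node scheduling: the experiment receives
# an entire worker node and decides how to fill its cores with a mix of jobs.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int  # number of cores the job wants

def pack_node(total_cores: int, queue: list[Job]) -> list[Job]:
    """Greedily fill one whole node from a job queue, largest jobs first."""
    scheduled, free = [], total_cores
    for job in sorted(queue, key=lambda j: j.cores, reverse=True):
        if job.cores <= free:
            scheduled.append(job)
            free -= job.cores
    return scheduled

# Example: an 8-core node and a made-up mixed queue.
queue = [Job("reco-multicore", 4), Job("simulation", 2),
         Job("analysis-1", 1), Job("analysis-2", 1), Job("merge", 1)]
for job in pack_node(8, queue):
    print(f"run {job.name} on {job.cores} core(s)")
```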

Resource usage
Tier 1s and Tier 2s are now starting to be fully occupied, as planned, with reprocessing, analysis and simulation loads.
(Table: Tier 1 disk and CPU usage for October – CPU use/pledge and disk use/pledge for ALICE, ATLAS, CMS, LHCb and overall)
NB: assumed efficiency factors of 0.85 for CPU and 0.70 for disk (the use/pledge calculation is sketched below).
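A minimal sketch of how a use/pledge figure of this kind is computed once the efficiency factors are applied to the pledge; only the 0.85 and 0.70 factors come from the slide, and the usage and pledge numbers are invented for illustration:

```python
# Illustrative use/pledge calculation with the efficiency factors quoted on the slide.
# The usage and pledge numbers below are invented; only the 0.85/0.70 factors are from the slide.

CPU_EFFICIENCY = 0.85   # assumed usable fraction of the CPU pledge
DISK_EFFICIENCY = 0.70  # assumed usable fraction of the disk pledge

def use_over_pledge(used: float, pledged: float, efficiency: float) -> float:
    """Ratio of resources used to the effectively available pledge."""
    return used / (pledged * efficiency)

# Hypothetical Tier 1 numbers for one experiment.
cpu_ratio = use_over_pledge(used=34_000, pledged=50_000, efficiency=CPU_EFFICIENCY)   # HEP-SPEC06
disk_ratio = use_over_pledge(used=5.6, pledged=10.0, efficiency=DISK_EFFICIENCY)      # PB
print(f"CPU use/pledge: {cpu_ratio:.2f}, disk use/pledge: {disk_ratio:.2f}")
```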

Resources
At the October RRB, the message heard by the funding agencies was that there are plenty of resources.
Several delegates picked up on this and made it clear that it would be difficult to provide additional resources without first filling what we have.
But we are already at that point, and we are surely still in a ramp-up as the luminosity increases.
See later on planning for 2012 and beyond.

Some concerns
We have had a very successful first year of data taking, processing and analysis. However, we have not yet seen a situation where resource contention or overloading has been a problem. With the potentially large amounts of data foreseen for 2011 and beyond, this is likely to happen, and we must be prepared to react and address these issues.
We need to be pro-active in being ready for this:
– Is site monitoring providing all it can?
– Do sites get alarms for such conditions (overload, …)?
– What happens … graceful failure, or hang/timeout/…?
We urgently need to start activities to address this:
– A study for the Tier 0 is already being planned

Status of Tier 0 planning
The end of November was the deadline for proposals for a remote Tier 0 (extended from Oct 31).
We have received 23!
Analysis is ongoing.

Future developments
Intend to urgently review with the experiments:
– What middleware functions are essential to them now?
– Where should effort be put?
– E.g. how best/better to support pilot jobs; can we simplify the information system; etc.
– cvmfs replacing software installation
Data management:
– Actions from the "demonstrator" projects will be reviewed in January
– It is clear that the various efforts on remote file access/location are coalescing
– Caching mechanisms are needed
– What will help the experiments most in the next 2 years?
Virtualisation:
– CERN testing completed for the moment – we have an "lxcloud" pilot
– The HEPiX working group has agreed mechanisms to allow trust of images

EGI
Discussions are ongoing on WLCG agreeing an MoU with EGI as a Virtual Research Community (perhaps as a seed for a HEP community):
– Agreed that wording can be found to work around the problem of blanket acceptance of policy
– Not really clear what the overall benefit to WLCG might be
EGI operations are still very quiet …
– Areas where EGI-InSPIRE is not providing funding are thus not covered …
– Staged rollout of software does not work, i.e. no sites are test-deploying any of the updates
So far there has been no real problem for WLCG.

EMI
Positive points:
– Provides effort for important support topics such as data management (dCache, FTS, …), etc.
– Agreed to have a direct channel for WLCG needs into EMI
Less positive:
– The primary goal of EMI is "harmonisation" between gLite/ARC/UNICORE, which is largely irrelevant for WLCG
– The EMI release plan is of concern: it goes back to the idea of yearly major releases, whereas WLCG needs continuous support for releases, as we have been doing
– The EMI-0 and EMI-1 releases are starting to divert effort (building, certification) from what we need
– We now need to re-focus on the core middleware that we will need in the next few years; EMI will need to adapt to support this

Summary
Very successful support of the first year of LHC data taking:
– HI testing was successful and the HI run is going well – unprecedented data rates are being managed efficiently
Resource use has grown in the last months:
– Tier 1 and Tier 2 resources are well used
– However, we must prepare for full loading and potential resource contention
Concern over 2012 planning – additional resources need to be foreseen if the LHC runs in 2012:
– See later talk
Various plans for future improvements are ongoing.