WLCG Status
Ian Bird
Overview Board, 3rd December 2010
Agenda for today
Project report:
– Brief status update: preparation for and experience with HI
– Resource usage
– Tier 0 – for information
– EGI/EMI – follow-up from last OB
– Future actions: middleware review …, data management …, virtualisation …
– SIR analysis – follow-up on action from last meeting
Planning 2011, 2012, 2013:
– Considerations in view of prospects for data
Useful information
Following request at last OB:
– https://espace.cern.ch/WLCG-Overview-Board/Shared%20Documents/Forms/AllItems.aspx
– Contains material/pointers to places where WLCG is mentioned/thanked/…
Preparations for Heavy Ion run
In September, just before the LHCC meeting, we learned of the intention for ALICE and CMS to run at much higher rates:

Experiment            Event size × rate         Throughput      Daily volume
ALICE (original plan) 12.5 MB/evt × 100 Hz      1,250 MB/s      100 TB/day
ALICE (doubled)       12.5 MB/evt × 200 Hz      2,500 MB/s      200 TB/day
ATLAS                 2.2 MB/evt × 50-100 Hz    110-220 MB/s    10-20 TB/day
CMS                   12 MB/evt × 150 Hz        1,800 MB/s      150 TB/day
LHCb                  No HI data taking

NB: large event size due to no zero suppression for HI 2010, and no trigger
– Doubling of ALICE rate (same integrated data volume as planned)
– New large CMS request – only arriving now
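For orientation, a minimal sketch of the arithmetic behind the table: event size times trigger rate gives the throughput, and scaling by the seconds in a day gives the daily volume. The assumption of continuous data taking over a full 86,400-second day is mine; the slide's TB/day figures are rounded.

```python
# Rough check of the HI rate arithmetic in the table above.
# Assumes continuous data taking over a full 86,400 s day (an
# approximation; the slide's TB/day figures are rounded).

SECONDS_PER_DAY = 86_400

requests = {
    # experiment/scenario: (event size in MB, trigger rate in Hz)
    "ALICE (planned)": (12.5, 100),
    "ALICE (doubled)": (12.5, 200),
    "ATLAS (low)":     (2.2, 50),
    "ATLAS (high)":    (2.2, 100),
    "CMS":             (12.0, 150),
}

for name, (mb_per_event, rate_hz) in requests.items():
    mb_per_s = mb_per_event * rate_hz                 # throughput into the Tier 0
    tb_per_day = mb_per_s * SECONDS_PER_DAY / 1e6     # 1 TB = 1e6 MB
    print(f"{name:16s} {mb_per_s:7.0f} MB/s  ~{tb_per_day:4.0f} TB/day")
```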
Heavy Ion processing
ALICE:
– No complete export/processing during the heavy-ion run; export to be completed in the months after HI
– During the HI run: export at ~20% of the RAW rate (hence ~200 MB/s); ~20% RAW reconstruction (but NO ESD recording)
ATLAS:
– Complete RAW export to all Tier 1s as in pp running
– Some processing: some ESD+DESD(+AOD) to tape and export; NB: (ESD+DESD)/RAW is ~6 (very large ESD and DESD)
– Assumption: at CERN express stream only – the rest is done at the Tier 1s
CMS:
– Complete RAW export to 1 site (FNAL); full processing at CERN
Reminder – following the data loss incident in May:
– Realised risk to ALICE data during the weeks following data taking, when there would only be a single copy
– Mitigation was to expand the ALICE disk pool to keep 2 copies at CERN until the data had been exported to Tier 1s
– Had not planned this for CMS, as the expectation was that all HI data would be copied to FNAL
Heavy Ion preparations
Castor pools were expanded for ALICE + CMS:
– To ensure 2 full copies at CERN until data can be migrated to Tier 1s
– CMS: possible since the 2011 resources were already available
Planned tests for the technical stop:
– Required upgrade of Castor, and individual and combined experiment testing at full expected rates
The test had to be brought forward urgently as the technical stop was changed:
– Many people were at the CHEP conference
– Nevertheless the tests were successful, although only very limited combined testing was possible
Castor tape subsystem – HI data taking
> 5 GB/s to tape this Sunday
Data written to tape – last month (1 k GiB ≈ 1 TB = 1 tape)
Heavy Ion run going on smoothly at impressive data rates:
– Oct 2009 to April 2010: 0.7 PB/month (23 1 TB tapes/day)
– May 2010 to October: 1.9 PB/month (62 1 TB tapes/day)
– November (Heavy Ion run): ~3-4 PB? (120 1 TB tapes/day)
[Plot: data written to tape over the last month, with the ALICE+CMS HI test and the HI data taking periods marked]
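The tapes/day numbers quoted above follow from the PB/month figures under simple assumptions (roughly 1 TB per tape and a 30-day month, both approximations); a small check:

```python
# Check of the tapes/day figures above, assuming ~1 TB per tape
# and a 30-day month (both approximations from the slide's note).

TAPE_CAPACITY_TB = 1.0
DAYS_PER_MONTH = 30

rates_pb_per_month = {
    "Oct 2009 - Apr 2010": 0.7,
    "May 2010 - Oct 2010": 1.9,
    "Nov 2010 (HI run)":   3.5,   # slide quotes ~3-4 PB; midpoint taken here
}

for period, pb in rates_pb_per_month.items():
    tapes_per_day = pb * 1000 / TAPE_CAPACITY_TB / DAYS_PER_MONTH
    print(f"{period:22s} ~{tapes_per_day:5.1f} tapes/day")
```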
Update on tape writing …
– Last Friday we exceeded 220 TB (tapes) written in a single day, which is the new HEP world record
– Castor name-server DB problem during this day
Castor disk usage
[Plots: disk server throughput (GB/s) over the last month, and per-experiment usage in early November for ALICE, ATLAS and CMS]
Summary – Tier 0 for HI
System behaving very well; it can accept and manage the very high data rates:
– Wrote ~3.4 PB to tape in November
– Peak rates: 220 TB/day to tape; 5.9 GB/s data in (for comparison, pp running was ~2 PB/month)
Tape drives used very efficiently (this had been a major concern), because:
– Castor upgrade improved scheduling
– Large files are being written (up to 40 GB)
– Using only ~50 drives even at peak rates (we had worried that many more would be needed); write speeds close to native drive speed (average of 80 MB/s)
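As a sanity check on the drive figures above (arithmetic only, not from the slide): ~50 drives writing at ~80 MB/s each give about 4 GB/s aggregate, comfortably above the ~2.5 GB/s needed to sustain the 220 TB/day peak if writing were spread over the whole day.

```python
# Sanity check of the tape-drive figures above. The assumption that writing
# is spread evenly over the day is mine; real operation is bursty.

drives = 50
per_drive_mb_s = 80                      # average write speed from the slide

aggregate_gb_s = drives * per_drive_mb_s / 1000
peak_tb_per_day = 220
sustained_gb_s = peak_tb_per_day * 1e6 / 86_400 / 1000   # TB/day -> GB/s

print(f"aggregate drive bandwidth : {aggregate_gb_s:.1f} GB/s")
print(f"needed for {peak_tb_per_day} TB/day     : {sustained_gb_s:.1f} GB/s")
```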
EXPERIMENT EXPERIENCES
Pb+Pb at √s = 2.76 TeV
ALICE
[ALICE status plots]
CMS – HI Test
‣ As advised, we ran an IO test with ALICE during a technical stop in the proton-proton run
‣ By the second half of the test we reached 1 GB/s out of the T0export buffer to tape
‣ This test did not include re-reconstruction
‣ The technical stop was moved at short notice to coincide with CHEP, but the test was generally successful
Tier-0 in Heavy Ions
‣ Prompt reconstruction of Heavy Ion data is currently able to keep up with data taking
‣ Good machine live time this week
‣ Should be able to finish the run with 100% prompt reconstruction
[Plot annotation: prompt reco from the previous fill completing]
Reconstruction in Heavy Ion
‣ Reconstruction from earlier in the run
‣ Clean triggers
IO
‣ Castor IO holding up well
‣ IO tracks with CPU
‣ Averaging about 1 GB/s to tape; peaks while writing are 2 GB/s as expected; read rates from disk peak at ~5 GB/s
‣ IO was optimized in CMSSW_3_8 and makes this Heavy Ion processing possible
First Heavy Ion Collisions (ATLAS)
– First collisions on Sunday, Nov. 7, at 00:30
– No stable beams – Pixel detector was switched off, SCT at standby voltage
– Setting up of the machine started on November 1
First Heavy Ion Stable Beam Collisions
– 1st stable beams (1 bunch colliding): Monday, Nov. 8, ~11:20
– 4 colliding bunches: Tuesday, 01:01
– Now routinely running with 121 colliding bunches
Submitted to Phys. Rev. Letters last Thursday; accepted on Friday, 10 hours later
Fall re-processing finished (ATLAS)
[Plot: jobs over time for Fall re-processing, May re-processing and group production; express line ~40,000 jobs; bulk ~970,000 jobs, ~900,000,000 events]
Timeline:
– Aug 30: Release 16 tag deadline
– Sept 1-24: Release validation
– Sept 24: Conditions data deadline (extra week for Inner Detector alignment)
– Oct 8-14: Express line re-processing
– Oct 29 - Nov 21: Bulk re-processing
– Nov 21-30: Crashed job (very few!) recovery
[Plots (ATLAS): all production – simulation, reconstruction, re-processing, group production, user analysis – month view (~50,000 jobs) and year view (~20,000 jobs); user analysis year view]
Fully occupying job slots in Tier 1 and Tier 2
ATLAS – data movement
ATLAS HI
– Aim for an average rate of 500-600 MB/s into Castor
– Tier 0 event processing is keeping up, but only just; it could get backed up if the LHC continues to deliver at a higher duty cycle
– Tested a backup plan to do prompt processing at Tier 1s in order to alleviate a Tier 0 backlog
– Raw HI data is distributed to Tier 1s like pp data
– Derived data produced at the Tier 0 is delivered on request to special HI centres (BNL, Weizmann and Cracow)
LHCb
2010 reprocessing (LHCb)
– Started 27/11; 45% done in 5 days; 28 CPU years
[Plot: number of jobs over time]
2010 reprocessing efficiency
– 99.4% job success rate
[Plot: CPU/wall-clock efficiency]
LHCb concerns
Pile-up:
– LHCb is running at higher pile-up than its design value
– Larger event sizes and longer processing times are putting a strain on disk and CPU resources; use of Tier 2s is envisaged for reconstruction in 2011
– Currently this does not limit the physics reach, but LHCb will not be able to go much beyond the current 2 kHz HLT rate with existing/planned resources
Milestones
CREAM CE:
– More instances available; all experiments have tested it; beginning to be used more
Automated gathering of installed capacity:
– Tool available and in use
– Tier 0/1 OK for overall information; not complete for experiment-dependent capacities
– Tier 2 reporting being verified (led by Tier 1s)
Data management prototypes:
– Progress presented in GDB
– Review in January
Multi-user pilot jobs:
– Software (glexec, SCAS, ARGUS) available and largely deployed
– Experiment framework reviews done and followed up
– Testing and use has been on hold during data taking
– Has not been a (reported) problem …
Whole node scheduling
Whole node scheduling is:
– Allocating an entire machine to the experiment (not just a job slot) to allow the experiment to optimise the use of all the cores (potentially with a mix of jobs)
– Responsibility for full utilization is then the experiments'
Informally requested by the experiments at the July workshop
Eventually (Oct) agreed by the computing coordinators that this is a real request from the experiments
Pere Mato charged with setting up a working group (wg) to:
– Track commissioning of multi-core jobs in the experiments; mediate between experiments and sites on specific problems found; understand realistic requirements for eventual deployment
– Understand changes needed in the job submission chain, from user interface to batch system configurations
– Prepare a roadmap for deployment of end-to-end submission
– Propose needed updates for accounting and monitoring
Status:
– Requirement & mandate agreed by 3 experiments
– Proposing wg members
– Process to be agreed in MB next week
Timescale:
– WG to conclude by May 2011
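As a purely illustrative sketch (not the experiments' actual pilot frameworks), this is the kind of pattern whole-node scheduling enables: once the batch system hands over the entire machine, the pilot detects how many cores it has and schedules its own mix of payloads across them.

```python
# Hypothetical illustration of whole-node use by an experiment pilot:
# the batch system allocates the full machine, and the pilot decides how
# to spread work over the cores (here: one independent payload per core).
# This is NOT actual experiment framework code, just a sketch.
import multiprocessing as mp

def run_payload(core_id: int) -> str:
    # Placeholder for a real payload (simulation, reconstruction, analysis ...).
    return f"payload on core {core_id} finished"

if __name__ == "__main__":
    n_cores = mp.cpu_count()          # the whole node is ours to schedule
    with mp.Pool(processes=n_cores) as pool:
        for result in pool.map(run_payload, range(n_cores)):
            print(result)
```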
Resource usage
Tier 1s and Tier 2s are now starting to be fully occupied, as planned, with reprocessing, analysis and simulation loads.

Tier 1 disk and CPU usage – Oct

Tier 1 use - Oct   CPU use/pledge   Disk use/pledge
ALICE              1.04             0.25
ATLAS              0.94             0.89
CMS                0.54             0.74
LHCb               0.27             0.79
Overall            0.78             0.75

NB: Assumed efficiency factors: 0.85 for CPU, 0.70 for disk
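For illustration only: the slide does not spell out how the efficiency factors enter these ratios, but one common convention is to compare recorded usage against the pledge scaled by the assumed efficiency. A toy calculation under that assumption, with hypothetical numbers:

```python
# Toy illustration of a use/pledge ratio with an assumed efficiency factor.
# The convention (usage divided by the efficiency-scaled pledge) and the
# input numbers are assumptions for illustration, not taken from the slide.

CPU_EFFICIENCY = 0.85    # assumed CPU efficiency factor quoted on the slide

def use_over_pledge(used: float, pledged: float, efficiency: float) -> float:
    """Ratio of delivered work to the effective (efficiency-scaled) pledge."""
    return used / (pledged * efficiency)

# Hypothetical HS06-hours, chosen only to show the calculation:
print(round(use_over_pledge(used=8.0e6, pledged=10.0e6, efficiency=CPU_EFFICIENCY), 2))
```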
Resources
– At the October RRB the message heard by the funding agencies was that there are plenty of resources
– Several delegates picked up on this and made it clear that it would be difficult to provide additional resources without first filling what we have
– But we are already at that point, and we are surely still in a ramp-up as the luminosity increases
– See the later talk on planning for 2012 and beyond
Some concerns
– We have had a very successful first year of data taking, processing and analysis
– However, we have not yet seen a situation where resource contention or overloading has been a problem
– With the potentially large amounts of data foreseen for 2011 this is likely to happen; we must be prepared to react and address these issues
– Need to be pro-active in being ready for this:
  – Is site monitoring providing all it can?
  – Do sites get alarms for such conditions (overload …)?
  – What happens … graceful failure … or hang/timeout/…?
– Urgently need to start activities to address this; already planning a study for the Tier 0
Status of Tier 0 planning
– End of November was the deadline for proposals for a remote Tier 0 (extended from Oct 31)
– Have received 23!! Analysis ongoing
Future developments
Middleware – intend to urgently review with the experiments:
– What middleware functions are essential to them now?
– What should effort be put into?
– E.g. how best/better to support pilot jobs; can we simplify the information system; etc.
– cvmfs replacing software installation
Data management:
– Actions from the "demonstrator" projects will be reviewed in January
– Clear that the various efforts on remote file access/location are coalescing
– Caching mechanisms needed
– What will help the experiments most in the next 2 years?
Virtualisation:
– CERN testing completed for the moment – have an "lxcloud" pilot
– HEPiX wg agreed mechanisms to allow trust of images
EGI
Discussions ongoing on WLCG agreeing an MoU with EGI as a Virtual Research Community (perhaps as a seed for a HEP community):
– Agreed that wording can be found to work around the problem of blanket acceptance of policy
– Not really clear what the overall benefit to WLCG might be
EGI operations still very quiet …
– Areas where EGI-InSPIRE is not providing funding are thus not covered …
– Staged rollout of software does not work, i.e. no sites are test-deploying any of the updates
– So far for WLCG there has been no real problem
EMI
Positive points:
– Provides effort for important support topics such as data management (dCache, FTS, …), etc.
– Agreed to have a direct channel for WLCG needs into EMI
Less positive:
– The primary goal of EMI is "harmonisation" between gLite/ARC/UNICORE; this is ~irrelevant for WLCG
– The EMI release plan is of concern: back to the idea of yearly major releases, whereas WLCG needs continuous support for releases as we have been doing; EMI-0 and EMI-1 are starting to divert effort (building, certification) from what we need
– We now need to re-focus on the core middleware that we will need in the next few years; EMI will need to adapt to support this
Summary
Very successful support of the first year of LHC data taking:
– HI testing successful and HI run going well – unprecedented data rates are being managed efficiently
Resource use has grown in recent months:
– Tier 1 and Tier 2 resources well used
– However, must prepare for full loading and potential resource contention
Concern over 2012 planning:
– Need to foresee additional resources if the LHC runs in 2012 – see later talk
Various plans for future improvements ongoing