Ian Bird, LCG Project Leader. WLCG Collaboration Issues. WLCG Collaboration Board, 24th April 2008
2 Strategic Issues
A number of aspects of WLCG where we see the need for some structuring of dialogue with the Tier 2 federations:
- Reliabilities
- Accounting
- Resource pledges/installed capacity
- Milestones
Other issues that are arising:
- Engagement in EGI/NGI (etc) for future infrastructures
- Resource procurement schedules/delays/process
General aspects of Tier 2 coordination/information flow:
- Information from MB, engagement in GDB
Technical points (how to discuss with Tier 2s):
- Move to SL5/6; pilot jobs; fabric monitoring/tools; what tools do Tier 2s miss?
What is the voice of the Tier 2s?
3 Recent grid use
Across all grid infrastructures:
- Preparation for, and execution of, CCRC’08 phase 1
- Move of simulations to Tier 2s
Share by tier: Tier 2: 54%; Tier 1: 35%; CERN: 11%
Federations not yet reporting: Finland, India (IN-INDIACMS-TIFR), Norway, Sweden, Ukraine
4 Accounting for Tier-2s (1)
- Test reporting took place in summer 2007 and formal reporting started from September.
- Monthly reports are now produced, circulated for comment and published on the LCG Project Planning website.
- Currently 52 of the 57 Federations are reporting accounting data, covering a total of 107 sites.
- Changes to site names are still being signaled, so the situation is not yet fully stable.
- Some Federations provided pledge information from 2008 onwards and will be included in the reporting from April.
- Follow-up is required with Finland, India, Norway, Sweden and Ukraine to include them in the accounting reporting.
- Slide 5 shows the global picture of reporting by country from September 2007 to February.
- Slides 6 and 7 show the comparison of MoU pledge with CPU provided, split according to size of pledge.
Sue Foffano, CERN-IT
5 Accounting for Tier-2s (2)
6 Accounting for Tier-2s (3)
7 Accounting for Tier-2s (4) What we don’t see here is the installed capacity
9 Computing Resource Pledge Responsibilities
Following the pledge revision exercise of Autumn 2007, a reminder of the process is felt necessary.
At the Autumn C-RRB meeting each Federation is expected to provide:
- Firm commitment to pledge values for the following year
- Planned pledge values for the subsequent 4 years
At the Spring C-RRB meeting each Federation is expected to:
- Confirm that pledge values for the current year are installed and running a production service, or explain any problems for the current year or changes for future years
The following is therefore required 2 weeks before the next C-RRB on 11/11/08:
- Confirmed 2009 pledge values (confirmation of the already communicated value, or revised upwards)
- Planned pledge values inclusive (confirmation or revision of already communicated values)
10 Tier 0/Tier 1 Site reliability
Target:
- Sites: 91%, and 93% from December
- 8 best: 93%, and 95% from December
See QR for full status
        Sep 07  Oct 07  Nov 07  Dec 07  Jan 08  Feb 08
All     89%     86%     92%     87%     89%     84%
8 best  93% 95% 96%
Above target (+>90% target)
Follow up process in MB over many months with individual sites
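The comparison against the targets can be checked mechanically. The sketch below is illustrative only: it assumes the per-site target (91%, then 93% from December) is applied to the monthly "All" averages quoted on the slide, which is a simplification, and the names `site_target` and `months_meeting_target` are hypothetical.

```python
# Monthly "All" averages for Tier 0/Tier 1 reliability, in percent (from the slide).
ALL_SITES = {
    "Sep 07": 89, "Oct 07": 86, "Nov 07": 92,
    "Dec 07": 87, "Jan 08": 89, "Feb 08": 84,
}

# Months in which the higher target applies (December onwards).
FROM_DECEMBER = {"Dec 07", "Jan 08", "Feb 08"}


def site_target(month):
    """Return the site reliability target (percent) for a given month."""
    return 93 if month in FROM_DECEMBER else 91


def months_meeting_target(averages):
    """List the months whose average reaches the target for that month."""
    return [m for m, value in averages.items() if value >= site_target(m)]


if __name__ == "__main__":
    print(months_meeting_target(ALL_SITES))
```

Under this assumption only one of the six months reaches its target on average, which is consistent with the need for a follow-up process with individual sites.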
11 Tier 2 Reliabilities
Reliabilities published regularly since October
In February 47 sites had > 90% reliability
Overall  Top 50%  Top 20%  Sites
76%      95%      99%      89 100
For the Tier 2 sites reporting:
For Tier 2 sites not reporting, 12 are in top 20 for CPU delivered
Sites  Top 50%  Top 20%  Sites > 90%
%CPU   72%      40%      70%
Jan 08
How do we address this?
13 How should the federations be reported - weighted?
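To make the weighting question concrete, the sketch below contrasts a plain average of site reliabilities with an average weighted by CPU delivered. It is an illustration, not an agreed WLCG method: the choice of CPU delivered as the weight, the function names, and the example federation are all assumptions.

```python
def unweighted_reliability(sites):
    """Plain mean of site reliabilities; every site counts equally."""
    return sum(r for r, _ in sites) / len(sites)


def weighted_reliability(sites):
    """Mean of site reliabilities weighted by each site's CPU contribution."""
    total_weight = sum(w for _, w in sites)
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return sum(r * w for r, w in sites) / total_weight


# Hypothetical federation: (reliability, CPU delivered in arbitrary units).
# One large reliable site and one small unreliable site.
example_federation = [(0.99, 800), (0.60, 200)]

if __name__ == "__main__":
    print(round(unweighted_reliability(example_federation), 3))
    print(round(weighted_reliability(example_federation), 3))
```

With these made-up numbers the weighted figure is noticeably higher than the plain mean, because the small unreliable site contributes little CPU. The choice of scheme therefore changes how a federation appears in the reports.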
14 Reliability reporting
Currently (Feb 08) all Tier 1 and 100 Tier 2 sites report reliabilities
Recent progress:
- MB set up group to Agreement on equivalence of NDGF tests with those used at EGEE and all other Tier 1 sites; now in production at NDGF
- Should also be used for Nordic Tier 2 sites
- Similar process with OSG (for US Tier 2 sites): tests only for CE so far, agreement on equivalence, tests are in production, publication to SAM in progress
- Missing: SE/SRM testing
- Expect full production May 2008 (new milestone introduced)
Important that we have all Tier 2s regularly tested and reporting
Important that we have correct Tier 2 federation contact to follow up these issues
15 Reporting
Urgent now that:
- Remaining Tier 2 federations start reporting on reliabilities and accounting
- Follow up monthly in checking the published data; we have to understand if there are problems in the process
- If the site names are wrong, please tell us what they should be (and how they map to the physical site host names)
Resource installation:
- We also need to gather information about installed resources at Tier 2s
Follow up process:
- For Tier 1s this was done monthly in the MB, site by site; this was manageable but slow. With Tier 2s this process is unwieldy (110+ sites)
- Need a contact person for each federation; it would be far more convenient to have a contact for each country
WLCG April 2008: Tier 0 and 1 Resources
16 Updated Resource Status Summary for May CCRC’08
For 5 May not all sites will now have their full 2008 CPU pledges available, a total of KSi2K (9600 KSi2K more than in 1Q2008 but a drop of 8000 KSi2K from Feb plans). Largest missing sites are KSi2K at NL-T1 due November 2008, KSi2K at CNAF due June, KSi2K at US-CMS due end May and KSi2K at US-ATLAS due early June.
For disk and tape many sites will catch up later in the year as need expands:
- 2008 disk requirements are 23 PB, and 12.4 PB are expected to be available for 5 May (3 PB more than in 1Q2008 but a drop of 3.1 PB from Feb plans).
- 2008 tape requirements are 24 PB, and 13.6 PB are expected to be available for 5 May (4.8 PB more than in 1Q2008 but a drop of 1.4 PB from Feb plans).
Disk and tape storage for the May full-scale dress rehearsal run of CCRC’08 is probably better modelled by requiring 55% (accelerator efficiency) times 30/100 (days running) of the increased resource requirements for 2008/9 over those of 2007/8, giving 2.8 PB of disk and 3 PB of tape. Globally this is not a problem, but some sites will not be able to fully contribute to the May CCRC if this model is correct.
These requirements are to be modified with the specific April 2008 experiment requirements to be given in the next talks.
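The scaling model for the May run can be written out as a short calculation. The sketch below uses only the factors quoted on the slide (55% accelerator efficiency, 30 days of running out of 100). The year-on-year increases themselves are not given on the slide, so `implied_increase_pb` back-calculates them from the quoted 2.8 PB and 3 PB; those derived values are an inference, and all function names are hypothetical.

```python
ACCELERATOR_EFFICIENCY = 0.55  # from the slide
DAYS_RUNNING = 30              # from the slide
DAYS_NOMINAL = 100             # from the slide


def scaling_factor(efficiency=ACCELERATOR_EFFICIENCY,
                   days_running=DAYS_RUNNING,
                   days_nominal=DAYS_NOMINAL):
    """Fraction of the year-on-year increase needed for the May run."""
    return efficiency * days_running / days_nominal


def may_requirement_pb(increase_pb):
    """Storage needed in May, given the 2008/9 increase over 2007/8 in PB."""
    return scaling_factor() * increase_pb


def implied_increase_pb(requirement_pb):
    """Back-calculate the increase that would yield a given May requirement."""
    return requirement_pb / scaling_factor()


if __name__ == "__main__":
    print(f"scaling factor: {scaling_factor():.3f}")
    # Inferred, not stated on the slide:
    print(f"implied disk increase: {implied_increase_pb(2.8):.1f} PB")
    print(f"implied tape increase: {implied_increase_pb(3.0):.1f} PB")
```

The factor works out to 0.165, so the quoted May requirements correspond to roughly one sixth of the growth in requirements between the two years.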
17 Summary of Disk Space Plans
As usual the most critical resource:
- ASGC: Last 300 TB delivery end June
- CC-IN2P3: Last 880 TB planned for September
- FZK: Last 650 TB planned for October (600 ALICE, 50 CMS)
- CNAF: Last 730 TB planned for June/July
- NDGF: Grow as needed reaching last 700 TB by Autumn
- NL-T1: Add 800 TB by end May and last 1450 TB in November
- PIC: Last 370 TB planned for early June
- RAL: Last 800 TB in acceptance, ready for end May
- TRIUMF: Full pledge for May CCRC
- US-ATLAS: Add 1200 TB by end May and last 1000 TB in October
- US-CMS: Full pledge for May CCRC
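The outstanding deliveries listed above can be tallied per site and in total. The sketch below simply transcribes the TB figures from the slide. Treating NDGF's "last 700 TB" as a single outstanding delivery is an assumption, and the data structure and function names are hypothetical.

```python
# (site, TB outstanding, due date as quoted on the slide)
DELIVERIES = [
    ("ASGC", 300, "end June"),
    ("CC-IN2P3", 880, "September"),
    ("FZK", 650, "October"),
    ("CNAF", 730, "June/July"),
    ("NDGF", 700, "Autumn"),  # assumption: counted as one delivery
    ("NL-T1", 800, "end May"),
    ("NL-T1", 1450, "November"),
    ("PIC", 370, "early June"),
    ("RAL", 800, "end May"),
    ("US-ATLAS", 1200, "end May"),
    ("US-ATLAS", 1000, "October"),
]


def outstanding_by_site(deliveries):
    """Sum the outstanding TB for each site."""
    totals = {}
    for site, tb, _due in deliveries:
        totals[site] = totals.get(site, 0) + tb
    return totals


def total_outstanding_tb(deliveries):
    """Total TB still to be delivered across all listed sites."""
    return sum(tb for _site, tb, _due in deliveries)


if __name__ == "__main__":
    for site, tb in sorted(outstanding_by_site(DELIVERIES).items()):
        print(f"{site}: {tb} TB")
    print(f"total: {total_outstanding_tb(DELIVERIES)} TB")
```

TRIUMF and US-CMS are omitted because the slide reports their full pledge as available for the May CCRC.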
18 Resource procurement
This risks becoming a major problem in the coming years
Important to work around the procurement processes so that we can be ready for the accelerator running each year
Has been a problem for almost all Tier 1s. Is this also an issue for Tier 2s?
19 Milestones
So far the project has mostly had formal milestones associated with the project itself, Tier 0 and the Tier 1s
It is now time to start to impose milestones on the Tier 2s for specific issues, e.g. reliability, resource installation, etc.
Again, it will be important to have the appropriate technical coordinators to report and follow up on these issues
20 Communication
Apart from the issues raised above:
- How are the Tier 2s kept informed, and does it work?
- Flow of information from the Management Board: do Tier 2s read the minutes?
- Is everyone engaged in the GDB (or even aware that they can be)?
- How can we structure the communication with the great number of Tier 2 sites, so that we can have a workable process to communicate problems and follow up (in both directions)?
- How can we aggregate Tier 2 status to report in LHCC/OB/RRB/CB etc?
Today it is extremely difficult to get an overview of Tier 2 status and problems
21 Miscellaneous technical issues
- Move to new versions of the OS: SL5/SL6
- Pilot jobs/glexec: is it OK for sites to deploy this now?
- Fabric monitoring: do Tier 2s do this sufficiently? Do they have the tools?
- Security tools: are sites appropriately protected?
- What tools do Tier 2s miss?
How do Tier 2s keep abreast of these developments?
- Should participate in the GDB
- Is more needed?
22 Comments on EGI design study
Goal is to have a fairly complete blueprint in June
Main functions presented to NGIs in Rome workshop in March
Essential for WLCG that EGI/NGI continue to provide support for the production infrastructure after EGEE-III
- We need to see a clear transition and assurance of appropriate levels of support
- The transition will come at exactly the time that LHC services should not be disrupted
Concerns:
- NGIs agreed that a large European production-quality infrastructure is a goal
- Not clear that there is agreement on the scope
- Reluctance to accept level of functionality required
- Tier 1 sites (and existing EGEE expertise) not well represented by many NGIs
WLCG representatives must approach their NGI reps and ensure that EGI/NGIs provide the support we need
These comments apply equally to Tier 2s: they really need to engage with the NGI in their countries
23 EGI/NGI cont.
While WLCG should work hard to make sure that the EGI design study goes in the right direction, strategically the project must be prepared to plan for a fall-back
- Tier 1s were questioned in the OB; all replied that they had some plan in place if there were no EGI/NGI, albeit with a potential reduction in what they could contribute
- We need to start thinking about what the Tier 2s can do
It will be clear in June whether the EGI_DS blueprint provides what we need
Put together a group to begin to look at fallback plans for Tier 2s?
24 Summary
A number of aspects of WLCG where we see the need for some structuring of dialogue with the Tier 2 federations:
- General aspects of Tier 2 coordination/information flow: information from MB, engagement in GDB
- Technical points: move to SL5/6; pilot jobs; fabric monitoring/tools; what tools do Tier 2s miss?
What is the voice of the Tier 2s?
Do we need a group to start looking at Tier 2 fallback plans if EGI_DS does not deliver?
And what is the situation in the US with OSG?