1 CHEP – Mumbai, February 2006 State of Readiness of LHC Computing Infrastructure Jamie Shiers, CERN

2 Introduction  Some attempts to define what “readiness” could mean  How we (will) actually measure it…  Where we stand today  What we have left to do – or can do in the time remaining…  Timeline to First Data  Related Talks  Summary & Conclusions

3 What are the requirements?  Since the last CHEP, we have seen:  The LHC Computing Model documents and Technical Design Reports;  The associated LCG Technical Design Report;  The finalisation of the LCG Memorandum of Understanding (MoU)  Together, these define not only the functionality required (Use Cases), but also the requirements in terms of Computing, Storage (disk & tape) and Network  But not necessarily in a site-accessible format…  We also have close-to-agreement on the Services that must be run at each participating site  Tier0, Tier1, Tier2, VO-variations (few) and specific requirements  We also have close-to-agreement on the roll-out of Service upgrades to address critical missing functionality  We have an on-going programme to ensure that the service delivered meets the requirements, including the essential validation by the experiments themselves

4 How do we measure success?  By measuring the service we deliver against the MoU targets  Data transfer rates;  Service availability and time to resolve problems;  Resources provisioned across the sites as well as measured usage…  By the “challenge” established at CHEP 2004:  [ The service ] “should not limit ability of physicist to exploit performance of detectors nor LHC’s physics potential“  “…whilst being stable, reliable and easy to use”  Preferably both…  Equally important is our state of readiness for startup / commissioning, that we know will be anything but steady state  [ Oh yes, and that favourite metric I’ve been saving… ]

5 LHC Startup  Startup schedule expected to be confirmed around March 2006  Working hypothesis remains ‘Summer 2007’  Lower than design luminosity & energy expected initially  But triggers will be opened so that data rate = nominal  Machine efficiency still an open question – look at previous machines???  Current targets:  Pilot production services from June 2006  Full production services from October 2006  Ramp up in capacity & throughput to TWICE NOMINAL by April 2007

6 LHC Commissioning Expect to be characterised by:  Poorly understood detectors, calibration, software, triggers etc.  Most likely no AOD or TAG from first pass – but ESD will be larger?  The pressure will be on to produce some results as soon as possible!  There will not be sufficient resources at CERN to handle the load  We need a fully functional distributed system, aka Grid  There are many Use Cases that we have not yet clearly identified, nor indeed tested --- this remains to be done in the coming 9 months!

7 LCG Service Hierarchy (Les Robertson)
Tier-0 – the accelerator centre: data acquisition & initial processing; long-term data curation; distribution of data to the Tier-1 centres.
Tier-1 – "online" to the data acquisition process: high availability; managed mass storage – grid-enabled data service; data-intensive analysis; national, regional support; continual reprocessing activity (or is that continuous?).
Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Didcot); US – FermiLab (Illinois) and Brookhaven (NY).
Tier-2 – ~100 centres in ~40 countries: simulation; end-user analysis – batch and interactive.

8 The Dashboard  Sounds like a conventional problem for a ‘dashboard’  But there is not one single viewpoint…  Funding agency – how well are the resources provided being used?  VO manager – how well is my production proceeding?  Site administrator – are my services up and running? MoU targets?  Operations team – are there any alarms?  LHCC referee – how is the overall preparation progressing? Areas of concern?  …  Nevertheless, much of the information that would need to be collected is common…  So separate the collection from presentation (views…)  As well as the discussion on metrics…
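A minimal sketch of the "separate collection from presentation" idea; all metric names, fields and views below are hypothetical illustrations, not the actual dashboard schema:

```python
# Minimal sketch: one collection step, many audience-specific views.
# All names, fields and values are illustrative assumptions, not the real dashboard schema.

metrics = [
    {"site": "CERN", "metric": "cpu_used_ksi2k",     "value": 1200.0},
    {"site": "RAL",  "metric": "transfer_rate_mb_s", "value": 150.0},
    {"site": "RAL",  "metric": "service_up",         "value": 1.0},
]

def funding_agency_view(records):
    """How well are the provided resources being used?"""
    return [r for r in records if r["metric"] == "cpu_used_ksi2k"]

def site_admin_view(records, site):
    """Are my services up and running? Meeting MoU targets?"""
    return [r for r in records if r["site"] == site]

print(funding_agency_view(metrics))
print(site_admin_view(metrics, "RAL"))
```

The point is simply that a single collection pipeline can feed many audience-specific views.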

9 The Requirements  Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network  Look at the Computing TDRs;  Look at the resources pledged by the sites (MoU etc.);  Look at the plans submitted by the sites regarding acquisition, installation and commissioning;  Measure what is currently (and historically) available; signal anomalies.  Functional requirements, in terms of services and service levels, including operations, problem resolution and support  Implicit / explicit requirements in Computing Models;  Agreements from Baseline Services Working Group and Task Forces;  Service Level definitions in MoU;  Measure what is currently (and historically) delivered; signal anomalies.  Data transfer rates – the TierX → TierY matrix  Understand Use Cases;  Measure … And test extensively, both ‘dteam’ and other VOs

10 The Requirements  Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network  Look at the Computing TDRs;  Look at the resources pledged by the sites (MoU etc.);  Look at the plans submitted by the sites regarding acquisition, installation and commissioning;  Measure what is currently (and historically) available.  Functional requirements, in terms of services and service levels, including operations, problem resolution and support  Implicit / explicit requirements in Computing Models;  Agreements from Baseline Services Working Group and Task Forces;  Service Level definitions in MoU;  Measure what is currently (and historically) delivered; signal anomalies.  Data transfer rates – the TierX → TierY matrix  Understand Use Cases;  Measure … And test extensively, both ‘dteam’ and other VOs

11 Resource Deployment and Usage [chart: Resource Requirements for 2008]

12 ATLAS Resource Ramp-Up Needs [chart: Tier-0, CERN Analysis Facility, Tier-1s, Tier-2s]

13 Site Planning Coordination  Site plans coordinated by LCG Planning Officer, Alberto Aimar  Plans are now collected in a standard format, updated quarterly  These allow tracking of progress towards agreed targets  Capacity ramp-up to MoU deliverables;  Installation and testing of key services;  Preparation for milestones, such as LCG Service Challenges…

14 Measured Delivered Capacity Various accounting summaries:  LHC View http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php  Data Aggregation across Countries  EGEE View http://www2.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php  Data Aggregation across EGEE ROC  GridPP View http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php  Specific view for GridPP accounting summaries for Tier-2s

15 The Requirements  Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network  Look at the Computing TDRs;  Look at the resources pledged by the sites (MoU etc.);  Look at the plans submitted by the sites regarding acquisition, installation and commissioning;  Measure what is currently (and historically) available.  Functional requirements, in terms of services and service levels, including operations, problem resolution and support  Implicit / explicit requirements in Computing Models;  Agreements from Baseline Services Working Group and Task Forces;  Service Level definitions in MoU;  Measure what is currently (and historically) delivered; signal anomalies.  Data transfer rates – the TierX → TierY matrix  Understand Use Cases;  Measure … And test extensively, both ‘dteam’ and other VOs

16 Reaching the MoU Service Targets  These define the (high level) services that must be provided by the different Tiers  They also define average availability targets and intervention / resolution times for downtime & degradation  These differ from TierN to TierN+1 (less stringent as N increases) but refer to the ‘compound services’, such as “acceptance of raw data from the Tier0 during accelerator operation”  Thus they depend on the availability of specific components – managed storage, reliable file transfer service, database services, …  Can only be addressed through a combination of appropriate:  Hardware; Middleware and Procedures  Careful Planning & Preparation  Well understood operational & support procedures & staffing
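A back-of-envelope illustration of why these compound services are demanding: if a compound service needs all of its components, and failures are treated as independent, its availability is roughly the product of the component availabilities. The figures below are purely illustrative, not MoU numbers:

```python
# Back-of-envelope: availability of a compound service as the product of its
# component availabilities (assumes independent failures; illustrative figures only).
components = {
    "managed storage":              0.99,
    "reliable file transfer (FTS)": 0.99,
    "database services":            0.99,
    "network":                      0.995,
}

compound = 1.0
for availability in components.values():
    compound *= availability

print(f"Compound availability: {compound:.1%}")  # ~96.5%, even though each piece is ~99%
```

Hence the emphasis on careful planning and well-understood operational procedures: several individually "good" components can still miss a 99% target together.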

17 Service Monitoring – Introduction (SAME, COD-6, Barcelona)
Service Availability Monitoring Environment (SAME) – uniform platform for monitoring all core services, based on SFT experience.
Two main end users (and use cases):
– project management – overall metrics
– operators – alarms, detailed info for debugging, problem tracking
A lot of work already done:
– SFT and GStat are monitoring CEs and Site-BDIIs
– Data schema (R-GMA) established
– Basic displays in place (SFT report, CIC-on-duty dashboard, GStat) and can be reused

18 Service Level Definitions

Class | Description | Downtime | Reduced | Degraded | Availability
C | Critical | 1 hour | 1 hour | 4 hours | 99%
H | High | 4 hours | 6 hours | 6 hours | 99%
M | Medium | 6 hours | 6 hours | 12 hours | 99%
L | Low | 12 hours | 24 hours | 48 hours | 98%
U | Unmanaged | None

Tier0 services: C/H; Tier1 services: H/M; Tier2 services: M/L

MoU targets – maximum delay in responding to operational problems, and average availability measured on an annual basis:

Service | Service interruption | Degradation > 50% | Degradation > 20% | During accelerator operation | At all other times
Acceptance of data from the Tier-0 Centre during accelerator operation | 12 hours | 12 hours | 24 hours | 99% | n/a
Networking service to the Tier-0 Centre during accelerator operation | 12 hours | 24 hours | 48 hours | 98% | n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres, outside accelerator operation | 24 hours | 48 hours | 48 hours | n/a | 98%
All other services – prime service hours | 2 hours | 2 hours | 4 hours | 98% | 98%
All other services – outside prime service hours | 24 hours | 48 hours | 48 hours | 97% | 97%
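For orientation, the annual availability targets in the tables translate into downtime budgets roughly as follows (a simple conversion, ignoring the split between scheduled and unscheduled downtime):

```python
# Convert an annual availability target into an allowed-downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8760

for target in (0.99, 0.98, 0.97):
    allowed_hours = HOURS_PER_YEAR * (1 - target)
    print(f"{target:.0%} availability -> ~{allowed_hours:.0f} h "
          f"(~{allowed_hours / 24:.1f} days) of downtime per year")
# 99% -> ~88 h (3.7 days); 98% -> ~175 h (7.3 days); 97% -> ~263 h (11.0 days)
```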

19 Service Functionality
https://twiki.cern.ch/twiki/bin/view/LCG/Planning

SC4 Services – Planning. Legend (colour coding in the original): feature available with the next deployed release of LCG; feature that will be deployed in 2006 as available; feature not available. References on: https://uimon.cern.ch/twiki/bin/view/LCG/SummaryOpenIssuesTF

GLITE RUNTIME ENVIRONMENT (ALL):
– Run-time environment compatible with experiments and application software [8.a] – ongoing effort with Dirk and Andrea
– Software configuration compatible with experiments and application software [8.a]

AUTHORIZATION AND AUTHENTICATION (VOMS v1.6.15 available features):
– Groups and roles implemented by services [1.b] – VOMS 1.6.15: LFC 1.4.3 yes; DPM 1.4.3 no (1.5.0 yes)
– User metadata [1.c] – VOMS 1.6.15: user alias retrievable from VOMS server with provided script; not available in user proxy
– Mirroring available [1.a] – VOMS 1.6.15: high availability but not mirroring off site; memory leaks in server

myproxy v0.6.1:
– Automatic proxy renewal for services (not VOMS enabled) [1.d] – LFC 1.4.3 not needed; DPM 1.4.3 no; FTS 1.5.0 no; RB gLite 3.0 OK
– Proxy renewal service, VOMS enabled [1.d] – provided by gLite 3.0 WMS; no services besides WMS are interfaced with it
– Automatic proxy renewal for Kerberos [1.e] – not available

INFORMATION SYSTEM:
– Cached access to static information [2.a] – not available

20 Breakdown of a Normal Year (R. Bailey, Chamonix XV, January 2006)
~ 140-160 days for physics per year
Not forgetting ion and TOTEM operation
Leaves ~ 100-120 days for proton luminosity running
? Efficiency for physics 50% ?
~ 50 days ~ 1200 h ~ 4 × 10^6 s of proton luminosity running / year
– From Chamonix XIV –
Service upgrade slots?
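As a quick check, the "~ 4 × 10^6 s" figure follows directly from the day count and assumed efficiency:

```python
# Quick check of the estimate: days of proton running x efficiency
# -> effective seconds of luminosity running per year.
days_for_proton_running = 100   # lower end of the 100-120 day range above
efficiency_for_physics  = 0.5   # ~50% assumed on the slide

effective_days    = days_for_proton_running * efficiency_for_physics  # ~50 days
effective_hours   = effective_days * 24                               # ~1200 h
effective_seconds = effective_hours * 3600                            # ~4.3e6 s

print(f"~{effective_days:.0f} days ~ {effective_hours:.0f} h ~ {effective_seconds:.1e} s / year")
```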

21 Site & User Support  Ready to move to single entry point now  Target is to replace all interim mailing lists prior to SC4 Service Phase  i.e. by end May for 1st June start  Send mail to helpdesk@ggus.org | VO-user-support@ggus.org  Also portal at www.ggus.org

22 PPS & WLCG Operations (EGEE – Enabling Grids for E-sciencE)
Production-like operation procedures and tools need to be introduced in PPS:
– Must re-use as much as possible from the production service.
– This has already started (SFT, site registration) but we need to finish this very quickly – end of February?
PPS operations must be taken over by COD:
– Target proposed at the last "COD meeting" was end March 2006.
This is a natural step also for "WLCG production operations", and is consistent with the SC4 schedule:
– Production Services from beginning of June 2006.

23 The Requirements  Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network  Look at the Computing TDRs;  Look at the resources pledged by the sites (MoU etc.);  Look at the plans submitted by the sites regarding acquisition, installation and commissioning;  Measure what is currently (and historically) available.  Functional requirements, in terms of services and service levels, including operations, problem resolution and support  Implicit / explicit requirements in Computing Models;  Agreements from Baseline Services Working Group and Task Forces;  Service Level definitions in MoU;  Measure what is currently (and historically) delivered; signal anomalies.  Data transfer rates – the TierX → TierY matrix  Understand Use Cases;  Measure … And test extensively, both ‘dteam’ and other VOs

24 Summary of Tier0/1/2 Roles  Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction, distribution of RAW data and reconstruction output to Tier1; reprocessing of data during LHC down-times;  Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale reprocessing and safe keeping of corresponding output; distribution of data products to Tier2s and safe keeping of a share of simulated data produced at these Tier2s;  Tier2: handling analysis requirements and a proportional share of simulated event production and reconstruction. N.B. there are differences in roles by experiment – essential to test using the complete production chain of each!

25 Sustained Average Data Rates to Tier1 Sites (To Tape)

Centre | Rate into T1 (pp) MB/s
ASGC, Taipei | 100
CNAF, Italy | 200
PIC, Spain | 100
IN2P3, Lyon | 200
GridKA, Germany | 200
RAL, UK | 150
BNL, USA | 200
FNAL, USA | 200
TRIUMF, Canada | 50
NIKHEF/SARA, NL | 150
Nordic Data Grid Facility | 50
Total | 1,600

(Per-site rates reflect which of ALICE, ATLAS, CMS and LHCb each centre serves.)
Need additional capacity to recover from inevitable interruptions…
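For scale, sustaining the 1,600 MB/s aggregate around the clock corresponds to the following daily volume (a rough estimate):

```python
# What a sustained 1,600 MB/s aggregate Tier-0 -> Tier-1 rate means per day.
aggregate_rate_mb_s = 1600
seconds_per_day = 86_400

daily_volume_tb = aggregate_rate_mb_s * seconds_per_day / 1e6  # MB -> TB
print(f"~{daily_volume_tb:.0f} TB/day to tape across the Tier-1s")  # ~138 TB/day
```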

26 LCG OPN Status  Based on expected data rates during pp and AA running, 10Gbit/s networks are required between the Tier0 and all Tier1s  Inter-Tier1 traffic (reprocessing and other Use Cases) was one of the key topics discussed at the SC4 workshop this weekend, together with TierX → TierY needs for analysis data, calibration activities and other studies  A number of sites already have their 10Gbit/s links in operation  The remaining links are expected during the course of the year

27 LHC Parameters (Computing Models)

Year | pp beam time (s/year) | pp luminosity (cm^-2 s^-1) | Heavy-ion beam time (s/year) | Heavy-ion luminosity (cm^-2 s^-1)
2007 | 5 x 10^6 | 5 x 10^32 | – | –
2008 | (1.8 x) 10^7 | 2 x 10^33 | (2.6 x) 10^6 | 5 x 10^26
2009 | 10^7 | 2 x 10^33 | 10^6 | 5 x 10^26
2010 | 10^7 | 10^34 | 10^6 | 5 x 10^26

(Real time given in brackets above)

28 Nominal: the raw figures produced by multiplying e.g. event size x trigger rate.
Headroom: a factor of 1.5 applied to cater for peak rates.
Efficiency: a factor of 2 to ensure networks run at less than 50% load.
Recovery: a factor of 2 to ensure that backlogs can be cleared within 24-48 hours and to allow the load from a failed Tier1 to be switched over to others.
Total requirement: a factor of 6 must be applied to the nominal values to obtain the bandwidth that must be provisioned.
Arguably this is an over-estimate, as "Recovery" and "Peak load" conditions are presumably relatively infrequent, and can also be smoothed out using appropriately sized transfer buffers. But as there may be under-estimates elsewhere…
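A short sketch of how the factor of 6 composes, applied for illustration to the 1,600 MB/s aggregate from the rates table above:

```python
# How the provisioning factor of 6 builds up, applied to the aggregate rate above.
nominal_mb_s = 1600   # aggregate Tier-0 -> Tier-1 rate (previous table)

headroom   = 1.5      # cater for peak rates
efficiency = 2.0      # keep network load below ~50%
recovery   = 2.0      # clear 24-48 h backlogs / absorb a failed Tier-1

total_factor = headroom * efficiency * recovery      # = 6
provisioned_mb_s = nominal_mb_s * total_factor

print(f"Factor {total_factor:.0f}x -> provision ~{provisioned_mb_s:.0f} MB/s "
      f"(~{provisioned_mb_s * 8 / 1000:.0f} Gb/s aggregate)")  # ~9600 MB/s, ~77 Gb/s
```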

29 Service Challenges: Key Principles  Service challenges result in a series of services that exist in parallel with baseline production service  Rapidly and successively approach production needs of LHC  Initial focus: core (data management) services  Swiftly expand out to cover full spectrum of production and analysis chain  Must be as realistic as possible, including end-end testing of key experiment use-cases over extended periods with recovery from glitches and longer-term outages  Necessary resources and commitment pre-requisite to success!  Effort should not be under-estimated!

30 Service Challenge Throughput Tests  Currently focussing on Tier0 → Tier1 transfers with modest Tier2 → Tier1 upload (simulated data) Recently achieved target of 1GB/s out of CERN with rates into Tier1s at or close to nominal rates  Still much work to do!  We still do not have the stability required / desired…  The daily average needs to meet / exceed targets  We need to handle this without “heroic efforts” at all times of day / night!  We need to sustain this over many (100) days  We need to test recovery from problems (individual sites – also Tier0)  We need these rates to tape at Tier1s (currently disk)  Agree on milestones for TierX → TierY transfers & demonstrate readiness

31 Achieved (Nominal) pp Data Rates

Centre | Disk-disk (SRM) rate in MB/s, achieved (nominal)
ASGC, Taipei | 80 (100) – have hit 140
CNAF, Italy | 200
PIC, Spain | >30 (100) – network constraints
IN2P3, Lyon | 200
GridKA, Germany | 200
RAL, UK | 200 (150)
BNL, USA | 150 (200)
FNAL, USA | >200 (200)
TRIUMF, Canada | 140 (50)
SARA, NL | 250 (150)
Nordic Data Grid Facility | 150 (50)

Meeting or exceeding nominal rate (disk – disk); met target rate for SC3 (disk & tape) re-run.
Missing: rock-solid stability at nominal tape rates.
SC4 T0-T1 throughput goals: nominal rates to disk (April) and tape (July).
To come: SRM copy support in FTS; CASTOR2 at remote sites; SLC4 at CERN; network upgrades etc.
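The "signal anomalies" step this table implies can be as simple as comparing achieved against nominal and flagging the shortfalls. A sketch, with figures copied from the tables above and qualifiers such as ">" dropped:

```python
# Flag Tier-1s whose achieved disk-disk rate is still below nominal.
# Figures copied from the tables above; ">" qualifiers and notes are dropped.
rates = {              # site: (achieved MB/s, nominal MB/s)
    "ASGC":   (80,  100),
    "CNAF":   (200, 200),
    "PIC":    (30,  100),
    "IN2P3":  (200, 200),
    "GridKA": (200, 200),
    "RAL":    (200, 150),
    "BNL":    (150, 200),
    "FNAL":   (200, 200),
    "TRIUMF": (140, 50),
    "SARA":   (250, 150),
    "NDGF":   (150, 50),
}

for site, (achieved, nominal) in sorted(rates.items()):
    if achieved < nominal:
        print(f"{site}: {achieved} MB/s achieved vs {nominal} MB/s nominal "
              f"({achieved / nominal:.0%} of target)")
# -> ASGC (80%), BNL (75%), PIC (30%)
```

In practice this is exactly the kind of check a dashboard view (slide 8) would automate.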

32 CMS Tier1 – Tier1 Transfers (Ian Fisk)
In the CMS computing model the Tier-1 to Tier-1 transfers are reasonably small.
The Tier-1 centres are used for re-reconstruction of events, so reconstructed events from some samples and analysis objects from all samples are replicated between Tier-1 centres.
Goals for Tier-1 to Tier-1 transfers:
– FNAL -> one Tier-1: 1 TB per day, February 2006
– FNAL -> two Tier-1s: 1 TB per day each, March 2006
– FNAL -> six Tier-1 centres: 1 TB per day each, July 2006
– FNAL -> one Tier-1: 4 TB per day, July 2006
– FNAL -> two Tier-1s: 4 TB per day each, November 2006
1 day = 86,400 s ~ 10^5 s
ATLAS – 2 copies of ESD?
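With 1 day ≈ 10^5 s, as noted above, the daily-volume goals translate into quite modest sustained rates (a quick conversion):

```python
# Convert the daily-volume goals into sustained transfer rates.
SECONDS_PER_DAY = 86_400  # ~1e5 s, as noted on the slide

for tb_per_day in (1, 4):
    mb_s = tb_per_day * 1e6 / SECONDS_PER_DAY   # 1 TB = 1e6 MB
    print(f"{tb_per_day} TB/day -> ~{mb_s:.0f} MB/s sustained")
# 1 TB/day -> ~12 MB/s; 4 TB/day -> ~46 MB/s
```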

33 SC4 Milestones (2)

Tier-1 to Tier-2 transfers (target rate 300-500 Mb/s):
– Sustained transfer of 1 TB data to 20% of sites by end December
– Sustained transfer of 1 TB data from 20% of sites by end December
– Sustained transfer of 1 TB data to 50% of sites by end January
– Sustained transfer of 1 TB data from 50% of sites by end January
– Peak rate tests undertaken for the two largest Tier-2 sites in each Tier-2 by end February
– Sustained individual transfers (>1 TB continuous) to all sites completed by mid-March
– Sustained individual transfers (>1 TB continuous) from all sites completed by mid-March
– Peak rate tests undertaken for all sites by end March
– Aggregate Tier-2 to Tier-1 tests completed at target rate (rate TBC) by end March

Tier-2 transfers (target rate 100 Mb/s):
– Sustained transfer of 1 TB data between the largest site in each Tier-2 and that of another Tier-2 by end February
– Peak rate tests undertaken for 50% of sites in each Tier-2 by end February
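At the 300-500 Mb/s target, each sustained 1 TB transfer in the milestones above corresponds to a run of several hours (a quick estimate):

```python
# How long a sustained 1 TB Tier-1 <-> Tier-2 transfer takes at the target rates.
data_tb = 1
for rate_mbit_s in (300, 500):                  # target range in Mb/s
    seconds = data_tb * 8e6 / rate_mbit_s       # 1 TB = 8e6 Mb
    print(f"{data_tb} TB at {rate_mbit_s} Mb/s -> ~{seconds / 3600:.1f} hours")
# ~7.4 h at 300 Mb/s; ~4.4 h at 500 Mb/s
```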

34 June 12-14 2006 “Tier2” Workshop  Focus on analysis Use Cases and Tier2s in particular  List of Tier2s reasonably well established  Try to attract as many as possible!  Some 20+ already active – target of 40 by September 2006!  Still many to bring up to speed – re-use experience of existing sites!  Important to understand key data flows  How experiments will decide which data goes where  Where does a Tier2 archive its MC data?  Where does it download the relevant Analysis data?  The models have evolved significantly over the past year!  Two-three day workshop followed by 1-2 days of tutorials Bringing remaining sites into play: Identifying remaining Use Cases

35 Summary of Key Issues  There are clearly many areas where a great deal still remains to be done, including:  Getting stable, reliable, data transfers up to full rates  Identifying and testing all other data transfer needs  Understanding experiments’ data placement policy  Bringing services up to required level – functionality, availability, (operations, support, upgrade schedule, …)  Delivery and commissioning of needed resources  Enabling remaining sites to rapidly and effectively participate  Accurate and concise monitoring, reporting and accounting  Documentation, training, information dissemination…

36 And Those Other Use Cases?
1. A small 1 TB dataset transported at "highest priority" to a Tier1 or a Tier2 or even a user group where CPU resources are available. I would give it 3 Gbps so I can support 2 of them at once (max in the presence of other flows and some headroom). So this takes 45 minutes.
2. 10 TB needs to be moved from one Tier1 to another or a large Tier2. It takes 450 minutes, as above, so only ~two per day can be supported per 10G link.
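The timings quoted follow directly from the assumed 3 Gb/s share of a 10 Gb/s link (a quick check):

```python
# Transfer times behind the figures above: a 3 Gb/s share of a 10 Gb/s link.
share_gbit_s = 3                                  # Gb/s allotted to the priority transfer

for dataset_tb in (1, 10):
    seconds = dataset_tb * 8000 / share_gbit_s    # 1 TB = 8000 Gb
    print(f"{dataset_tb} TB at {share_gbit_s} Gb/s -> ~{seconds / 60:.0f} minutes")
# 1 TB -> ~44 min; 10 TB -> ~444 min (close to the 45 / 450 minutes quoted)
```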

37 Timeline – 2006

January: SC3 disk repeat – nominal rates capped at 150 MB/s; SRM 2.1 delivered (?)
February: CHEP w/s – T1-T1 Use Cases; SC3 disk – tape repeat (50 MB/s, 5 drives)
March: Detailed plan for SC4 service agreed (M/W + DM service enhancements)
April: SC4 disk – disk (nominal) and disk – tape (reduced) throughput tests
May: Deployment of new M/W and DM services across sites – extensive testing
June: SC4 production – tests by experiments of ‘T1 Use Cases’; ‘Tier2 workshop’ – identification of key Use Cases and Milestones for T2s
July: Tape throughput tests at full nominal rates!
August: T2 milestones – debugging of tape results if needed
September: LHCC review – rerun of tape tests if required?
October: WLCG Service officially opened. Capacity continues to build up.
November: 1st WLCG ‘conference’. All sites have network / tape h/w in production (?)
December: ‘Final’ service / middleware review leading to early 2007 upgrades for LHC data taking??
O/S upgrade? (SLC4) – sometime before April 2007!

38 The Dashboard Again…

39 (Some) Related Talks  The LHC Computing Grid Service (plenary)  BNL Wide Area Data Transfer for RHIC and ATLAS: Experience and Plans  CMS experience in LCG SC3  The LCG Service Challenges - Results from the Throughput Tests and Service Deployment  Global Grid User Support: the model and experience in the Worldwide LHC Computing Grid  The gLite File Transfer Service: Middleware Lessons Learned from the Service Challenges

40 Summary  In the 3 key areas addressed by the WLCG MoU:  Data transfer rates;  Service availability and time to resolve problems;  Resources provisioned. In all three we have made good – sometimes excellent – progress over the last year.  There still remains a huge amount to do, but we have a clear plan of how to address these issues.  Need to be pragmatic, focussed and work together on our common goals.
