Slide 1: GridPP Status: “How Ready are We for LHC Data-Taking?”
Tony Doyle, University of Glasgow. GridPP18 Collaboration Meeting, 20 March 2007.
Slide 2: Outline
- Exec 2 Summary
- 2006 Outturn
- The Year Ahead..
- 2007 Status
- Some problems to solve..
- “All 6’s and 7’s”?
Slide 3: Exec 2 Summary
- 2006 was the second full year for the UK Production Grid: more than 5,000 CPUs and more than 1/2 petabyte of disk storage.
- The UK is the largest CPU provider on the EGEE Grid, with total CPU used of 15 GSI2k-hours in 2006.
- The GridPP2 project has met 69% of its original targets, with 92% of the metrics within specification.
- The initial LCG Grid service is now starting and will run for the first 6 months of 2007. The aim is to continue to improve reliability and performance, ready for startup of the full Grid service on 1 July 2007.
- The GridPP2 project has been extended by 7 months, to April 2008. The outcome of the GridPP3 proposal is now known.
- We anticipate a challenging period in the year ahead.
Slide 4: Grid Overview
Aim by 2008 (a full year's data taking):
- CPU ~100 MSI2k (100,000 CPUs)
- Storage ~80 PB
- Involving >100 institutes worldwide
- Built on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT)
Milestones so far:
1. Prototype went live in September 2003 in 12 countries.
2. Extensively tested by the LHC experiments in September 2004.
3. February 2006: 25,547 CPUs, 4,398 TB of storage.
Status in 2007: 177 sites, 32,412 CPUs, 13,282 TB of storage. Monitoring via the Grid Operations Centre.
Slide 5: Resources
[Chart: 2006 CPU usage by region, via APEL accounting.]
http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
Slide 6: 2006 Outturn
Definitions:
- "Promised" is the total that was planned at the Tier-1/A (in the March 2005 planning) and at the Tier-2s (in the October 2004 Tier-2 MoU) for CPU and storage.
- "Delivered" is the total that was physically installed for use by GridPP, including LCG and SAMGrid at Tier-2 and LCG and BaBar at Tier-1/A.
- "Available" is what is available for LCG Grid use, i.e. declared via the EGEE mechanisms, with storage behind an SRM interface.
- "Used" is as accounted for by the Grid Operations Centre.
Slide 7: 2006 Outturn

| Site | CPU promised (KSI2K) | CPU delivered | Ratio | Storage promised (TB) | Storage delivered | Ratio |
|---|---|---|---|---|---|---|
| Brunel | 155 | 480 | 310% | 21 | 6.3 | 30% |
| Imperial | 1165 | 807 | 69% | 93.3 | 60.3 | 65% |
| QMUL | 917 | 1209 | 132% | 58.5 | 18 | 31% |
| RHUL | 204 | 163 | 80% | 23.2 | 8.8 | 38% |
| UCL | 60 | 121 | 202% | 0.7 | 1.1 | 149% |
| Lancaster | 510 | 484 | 95% | 86.7 | 72 | 83% |
| Liverpool | 605 | 592 | 98% | 80.3 | 2.8 | 3% |
| Manchester | 1305 | 1840 | 141% | 372.6 | 145 | 39% |
| Sheffield | 183 | 183 | 100% | 3 | 2 | 67% |
| Durham | 86 | 99 | 115% | 5 | 4 | 79% |
| Edinburgh | 7 | 11 | 152% | 70.5 | 20 | 28% |
| Glasgow | 246 | 800 | 325% | 14.8 | 40 | 270% |
| Birmingham | 196 | 223 | 114% | 9.3 | 9.3 | 100% |
| Bristol | 39 | 12 | 31% | 1.9 | 3.8 | 200% |
| Cambridge | 33 | 40 | 123% | 4.4 | 4.4 | 101% |
| Oxford | 414 | 150 | 36% | 24.5 | 27 | 110% |
| RAL PPD | 199 | 320 | 161% | 17.4 | 66.1 | 381% |
| London | 2501 | 2780 | 111% | 196.7 | 94.4 | 48% |
| NorthGrid | 2602 | 3099 | 119% | 542.6 | 221.8 | 41% |
| ScotGrid | 340 | 910 | 268% | 90.3 | 64 | 71% |
| SouthGrid | 880 | 745 | 85% | 57.5 | 110.6 | 192% |
| Total (Tier-2) | 6322 | 7534 | 119% | 887.1 | 490.8 | 55% |
| Tier-1 | 1604 | 1034 | 64% | 1495 | 712 | 48% |

Tier-1 and Tier-2 total delivery is impressive, and usage has improved.
- Available: 8.5 MSI2k CPU; 1.7 PB storage, of which 0.54 PB disk.
- Used: 15 GSI2k-hours CPU; 0.26 PB disk.
- Points flagged on the slide: delivery of Tier-1 disk; usage of Tier-2 CPU and disk.
Request: PPARC acceptance of the 2006 outturn (next week).
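The Ratio columns above are simply delivered as a fraction of promised; a minimal sketch of the arithmetic, reusing two rows of the table (the helper names are hypothetical, not any GridPP tool):

```python
# Sketch: recompute the outturn Ratio columns from per-site records.
# Site figures are taken from the table above; names are illustrative.

sites = {
    # site: (cpu_promised_ksi2k, cpu_delivered_ksi2k, disk_promised_tb, disk_delivered_tb)
    "Glasgow": (246, 800, 14.8, 40),
    "Oxford": (414, 150, 24.5, 27),
}

def ratio(promised, delivered):
    """Delivered as a percentage of promised (the table's Ratio column)."""
    return 100.0 * delivered / promised

for site, (cpu_p, cpu_d, disk_p, disk_d) in sorted(sites.items()):
    print(f"{site}: CPU {ratio(cpu_p, cpu_d):.0f}%, storage {ratio(disk_p, disk_d):.0f}%")
```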
Slide 8: LCG CPU Usage

| Site | Available (KSI2K): 1Q06 | 2Q06 | 3Q06 | 4Q06 | Used (KSI2K hours): 1Q06 | 2Q06 | 3Q06 | 4Q06 | Ratio: 1Q06 | 2Q06 | 3Q06 | 4Q06 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brunel | 116 | 116 | 116 | 480 | 12,811 | 105,014 | 159,082 | 643,806 | 0.7% | 41.3% | 62.6% | 61.2% |
| Imperial | 203 | 203 | 203 | 642 | 16,828 | 83,627 | 82,593 | 557,943 | 2.2% | 18.8% | 18.6% | 39.7% |
| QMUL | 281 | 1209 | 1209 | 1209 | 214,335 | 612,564 | 459,427 | 1,259,446 | 13.7% | 23.1% | 17.4% | 47.6% |
| RHUL | 163 | 163 | 163 | 163 | 25,085 | 21,940 | 176,046 | 147,360 | 17.2% | 6.1% | 49.3% | 41.3% |
| UCL | 121 | 121 | 121 | 121 | 42,217 | 51,106 | 73,763 | 156,576 | 16.0% | 19.3% | 27.8% | 59.1% |
| Lancaster | 485 | 476 | 473 | 473 | 248,463 | 402,774 | 210,432 | 297,550 | 14.9% | 38.6% | 20.3% | 28.7% |
| Liverpool | 572 | 592 | 592 | 592 | 56,218 | 455,727 | 40,551 | 164,222 | 11.8% | 35.2% | 3.1% | 12.7% |
| Manchester | 480 | 720 | 1152 | 1840 | 380,857 | 1,042,154 | 248,704 | 370,567 | 0.3% | 66.1% | 9.9% | 9.2% |
| Sheffield | 240 | 183 | 183 | 183 | 38,411 | 59,860 | 78,795 | 127,039 | 2.0% | 15.0% | 19.7% | 31.8% |
| Durham | 72 | 80 | 80 | 80 | 36,699 | 58,185 | 33,671 | 59,123 | 4.3% | 33.2% | 19.2% | 33.7% |
| Edinburgh | 6 | 6 | 6 | 6 | 14,829 | 4,637 | 3,641 | 4,918 | 34.4% | 35.3% | 27.7% | 37.4% |
| Glasgow | 104 | 104 | 47 | 800 | 75,774 | 50,462 | 72,105 | 155,986 | 6.3% | 22.2% | 70.1% | 8.9% |
| Birmingham | 62 | 23 | 23 | 23 | 38,473 | 31,795 | 28,299 | 53,930 | 13.2% | 62.0% | 55.2% | 105.2% |
| Bristol | 1 | 7 | 7 | 7 | 842 | 7,208 | 8,982 | 6,593 | 15.1% | 45.7% | 57.0% | 41.8% |
| Cambridge | 40 | 38 | 38 | 38 | 654 | 2,228 | 2,442 | 1,811 | 0.0% | 2.7% | 2.9% | 2.2% |
| Oxford | 67 | 65 | 65 | 65 | 94,093 | 92,841 | 82,284 | 63,959 | 34.3% | 65.4% | 58.0% | 45.1% |
| RAL PPD | 73 | 73 | 320 | 320 | 106,919 | 132,046 | 143,648 | 235,172 | 59.9% | 82.1% | 20.5% | 33.6% |
| London | 884 | 1812 | 1812 | 2615 | 311,276 | 874,251 | 950,911 | 2,765,131 | 10.3% | 22.0% | 24.0% | 48.3% |
| NorthGrid | 1777 | 1971 | 2399 | 3087 | 723,949 | 1,960,515 | 578,482 | 959,378 | 10.1% | 45.4% | 11.0% | 14.2% |
| ScotGrid | 182 | 190 | 133 | 886 | 127,302 | 113,284 | 109,417 | 220,027 | 6.4% | 27.2% | 37.6% | 11.3% |
| SouthGrid | 243 | 207 | 453 | 453 | 240,981 | 266,118 | 265,655 | 361,465 | 31.0% | 58.8% | 26.8% | 36.4% |
| Total Tier-2 | 3086 | 4179 | 4797 | 7041 | 1,403,508 | 3,214,168 | 1,904,465 | 4,306,001 | 12.2% | 35.1% | 18.1% | 27.9% |
| Tier-1 | 415 | 620 | 651 | 848 | 624,636 | 1,089,917 | 1,393,022 | 992,106 | 68.7% | 80.2% | 97.6% | 53.4% |
Slide 9: Efficiency
(Measured by the UK Tier-1 for all VOs.)
- ~90% CPU efficiency, with the loss due to i/o bottlenecks, is OK.
- The concern is that this fell to ~75% (target line shown on the plot).
- Each experiment needs to work to improve its system and deployment practice, anticipating e.g. hanging gridftp connections during batch work.
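CPU efficiency here is total CPU time divided by total wall-clock time, summed per VO; a minimal sketch of the bookkeeping (the job records and field layout are invented, not APEL's schema):

```python
# Sketch: flag VOs whose CPU/wall-time efficiency falls below target.
# Job records are illustrative; accounting publishes equivalent fields.

jobs = [
    # (vo, cpu_seconds, wall_seconds)
    ("atlas", 8200, 9000),
    ("lhcb", 3100, 4200),
]

TARGET = 0.90  # ~90% is acceptable given i/o bottlenecks

totals = {}
for vo, cpu, wall in jobs:
    c, w = totals.get(vo, (0, 0))
    totals[vo] = (c + cpu, w + wall)

for vo, (cpu, wall) in sorted(totals.items()):
    eff = cpu / wall
    flag = "" if eff >= TARGET else "  <-- below target"
    print(f"{vo}: {100 * eff:.1f}%{flag}")
```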
Slide 10: CPU by experiment

| Experiment | Used at Tier-2 (KSI2K hours): 1Q06 | 2Q06 | 3Q06 | 4Q06 | Used at Tier-1 (KSI2K hours): 1Q06 | 2Q06 | 3Q06 | 4Q06 |
|---|---|---|---|---|---|---|---|---|
| ALICE | | 0 | 432 | 187 | 9,031 | 17,122 | 30,139 | 36,757 |
| ATLAS | 852,014 | 1,131,060 | 569,879 | 800,194 | 156,114 | 323,195 | 256,979 | 253,869 |
| CMS | 125,794 | 236,409 | 489,122 | 368,427 | 77,025 | 176,198 | 407,072 | 170,784 |
| LHCb | 164,339 | 1,237,858 | 718,268 | 1,072,838 | 21,210 | 404,707 | 634,341 | 396,417 |
| BaBar | 41,854 | 72,932 | 9,159 | 31,454 | 254,775 | 61,636 | 15,853 | 501 |
| CDF | | 1,517 | | | | | | |
| D0 | 93,373 | 102,602 | 40,541 | 221,069 | 95,963 | 53,091 | 22,433 | 27,515 |
| H1 | 3,851 | 44,018 | 23,013 | 8,460 | 3,058 | 17,083 | 18,459 | 80 |
| ZEUS | 4,965 | 23,170 | 4,140 | 60,736 | 6,906 | 20,953 | 1,353 | 19,815 |
| Other | 115,407 | 232,299 | 4,867 | 11,341 | 548 | 15,932 | 6,384 | 478 |
| LHC combined | 1,142,147 | 2,605,327 | 1,777,701 | 2,241,646 | 263,380 | 921,222 | 1,328,531 | 857,827 |
| Total | 1,401,597 | 3,080,349 | 1,859,426 | 2,574,723 | 624,630 | 1,089,917 | 1,393,013 | 906,216 |
Slide 11: UK Resources
[Chart: 2006 CPU usage by experiment.]
Slide 12: LCG Disk Usage

| Site | Available (TB): 1Q06 | 2Q06 | 3Q06 | 4Q06 | Used (TB): 1Q06 | 2Q06 | 3Q06 | 4Q06 | Ratio: 1Q06 | 2Q06 | 3Q06 | 4Q06 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brunel | | 1.5 | 1.1 | 4.7 | | 0.1 | 0.2 | 4.3 | | 6.7% | 18.1% | 91.1% |
| Imperial | 0.3 | 3.2 | 5.6 | 35.4 | 0.3 | 2.2 | 2.9 | 25.5 | 88.8% | 69.4% | 51.7% | 72.0% |
| QMUL | 18.2 | 15.9 | 18.2 | 18.2 | 14.3 | 3.6 | 3.4 | 4.8 | 78.4% | 22.6% | 18.4% | 26.4% |
| RHUL | 2.7 | 2.7 | 2.7 | 5.5 | 2.5 | 0.3 | 0.2 | 1.5 | 90.5% | 10.6% | 7.7% | 27.3% |
| UCL | 1.1 | 0 | 1 | 2 | 0.9 | 0 | 0.3 | 1.4 | 82.6% | 54.3% | 32.6% | 70.0% |
| Lancaster | 63.4 | 53.1 | 47.7 | 60 | 29.9 | 13.1 | 26.9 | 12.8 | 47.1% | 24.7% | 56.3% | 21.3% |
| Liverpool | | 2.8 | 0.6 | 2.8 | 0 | 0 | 0.1 | 1.4 | | 0.8% | 16.3% | 50.0% |
| Manchester | | 66.9 | 67.6 | 176.8 | 0 | 1.9 | 3.9 | 5.4 | | 2.8% | 5.8% | 3.1% |
| Sheffield | 4.5 | 3.9 | 2.3 | 2.2 | 4.4 | 1.2 | 0.3 | 0.1 | 95.8% | 32.1% | 12.4% | 4.5% |
| Durham | 1.9 | 1.9 | 3.5 | 3.5 | 0.6 | 1.3 | 0.9 | 1.2 | 30.9% | 68.1% | 25.4% | 34.3% |
| Edinburgh | 31 | 30 | 29 | 20 | 16.6 | 13.5 | 2.8 | 3.9 | 53.6% | 45.1% | 9.5% | 19.5% |
| Glasgow | 4.3 | 4.3 | 1.6 | 34 | 3.8 | 0.6 | 1.1 | 4.1 | 89.9% | 15.0% | 70.8% | 12.1% |
| Birmingham | 1.8 | 1.8 | 1.9 | 1.8 | 1.3 | 0.6 | 0.8 | 1.3 | 73.3% | 31.8% | 41.6% | 72.2% |
| Bristol | 0.2 | 0.2 | 2.1 | 1.8 | 0.2 | 0 | 0.3 | 0.4 | 89.6% | 12.0% | 16.0% | 22.2% |
| Cambridge | 3.2 | 3.2 | 3 | 3.1 | 3.0 | 0 | 0.8 | 2.1 | 94.7% | 0.6% | 26.3% | 67.7% |
| Oxford | 3.2 | 1.6 | 3.2 | 3.2 | 2.5 | 0 | 0 | 0.5 | 80.1% | 1.1% | 0.0% | 15.6% |
| RAL PPD | 6.8 | 6.8 | 6.4 | 16.6 | 6.4 | 0.6 | 0.3 | 13.5 | 93.5% | 9.4% | 4.2% | 81.3% |
| London | 22.4 | 23.4 | 28.7 | 65.8 | 17.9 | 6.2 | 7 | 37.5 | 80.3% | 26.6% | 24.4% | 57.0% |
| NorthGrid | 67.9 | 126.7 | 118.2 | 241.8 | 34.2 | 16.2 | 31.2 | 19.7 | 50.4% | 12.8% | 26.4% | 8.1% |
| ScotGrid | 37.1 | 36.2 | 34.1 | 57.5 | 21 | 15.5 | 4.8 | 9.2 | 56.6% | 42.8% | 14.0% | 16.0% |
| SouthGrid | 15.2 | 13.6 | 16.6 | 26.5 | 13.4 | 1.3 | 2.2 | 17.8 | 88.6% | 9.3% | 13.2% | 67.2% |
| Total Tier-2 | 142.5 | 199.8 | 197.5 | 391.6 | 86.6 | 39.2 | 45.1 | 84.2 | 60.7% | 19.6% | 22.8% | 21.5% |
| Tier-1 | 121.1 | 114.4 | 123.1 | 145.3 | 56.4 | 107.2 | 149.4 | 177.7 | 46.6% | 93.7% | 121.4% | 122.3% |
Slide 13: File Transfers
Aim: to maintain data transfers at a sustainable level as part of the experiment service challenges (individual site rates).
http://www.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Test_Summary
Current goals:
- >250 Mb/s inbound-only
- >300-500 Mb/s outbound-only
- >200 Mb/s inbound and outbound
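Goals like ">250 Mb/s inbound" are averages over a test window; a sketch of the arithmetic (the bytes moved and window length are invented numbers):

```python
# Sketch: average transfer rate over a test window, compared with the
# inbound goal above. Bytes moved and duration are invented.

bytes_moved = 1.4e12          # bytes transferred into the site
window_hours = 12
seconds = window_hours * 3600

rate_mbit = bytes_moved * 8 / 1e6 / seconds  # megabits per second
GOAL_INBOUND = 250  # Mb/s

verdict = "meets" if rate_mbit >= GOAL_INBOUND else "is below"
print(f"average inbound rate: {rate_mbit:.0f} Mb/s "
      f"({verdict} the {GOAL_INBOUND} Mb/s goal)")
```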
Slide 14: LHC: The Year Ahead.. ("Don't Panic")
[Timeline graphic, 2001-2008: EDG, then EGEE-I and EGEE-II (EGI to follow?); GridPP1, GridPP2 and the GridPP2+ extension, then GridPP3; first collisions at 900 GeV in 2007 under GridPP2+, and 14 TeV collisions with LHC data taking in 2008 under GridPP3.]
Slide 15: CMS
[Photos: magnet & cosmics test (August 2006); detector lowering (January 2007).]
Slide 16: DIRAC WMS
[Architecture diagram. A job (JDL plus input sandbox) arrives at the Job Receiver and is stored in the JobDB; the Data Optimizer checks data locations against the LFC (getReplicas) and places the job in the Task Queue. The Agent Director submits Pilot Jobs through the LCG Resource Broker to a site CE; on the worker node the Pilot Agent calls back to the Matcher (via the VO-box), pulls a matching job JDL from the Task Queue, and runs it under a Job Wrapper, which forks and executes the user application, fetches the sandbox, uploads output data to an SE and puts requests. The Agent Monitor checks pilots and the Job Monitor checks jobs. DIRAC services and LCG services are shown as distinct layers; the workload runs on the WN.]
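The heart of the diagram is the pilot pattern: the LCG broker only ever sees generic pilot jobs, and the real workload is pulled from the Task Queue by the Pilot Agent once it is safely running on a worker node. A schematic sketch of that pull step, with invented names rather than DIRAC's real interfaces:

```python
# Schematic of the pilot-agent pull model shown above: the broker sees
# only generic pilots; the pilot fetches a matching job at run time.
# All names here are illustrative, not the real DIRAC API.

task_queue = [
    {"id": 1, "vo": "lhcb", "site_ok": {"RAL", "Glasgow"}},
    {"id": 2, "vo": "lhcb", "site_ok": {"CERN"}},
]

def matcher(site):
    """Central service: hand over the first queued job this site can run."""
    for job in task_queue:
        if site in job["site_ok"]:
            task_queue.remove(job)
            return job
    return None  # pilot exits harmlessly if there is no work

def pilot_agent(site):
    """Runs on the worker node after the pilot job starts."""
    job = matcher(site)
    if job is None:
        print(f"pilot at {site}: no matching work, exiting")
    else:
        print(f"pilot at {site}: wrapping and executing job {job['id']}")

pilot_agent("Glasgow")   # pulls job 1
pilot_agent("Glasgow")   # job 2 needs CERN, so this pilot exits empty
```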
Slide 17: ALICE PDC’06
- The longest-running data challenge in ALICE: a comprehensive test of the ALICE computing model.
- Already running for 9 months non-stop, approaching the data-taking regime of operation.
- Participating: 55 computing centres on 4 continents (6 Tier-1s, 49 Tier-2s).
- 7 MSI2k-hours: 1,500 CPUs running continuously.
- 685K Grid jobs in total: 530K production, 53K DAQ, 102K user (!).
- 40M events; 0.5 PB generated, reconstructed and stored; user analysis ongoing.
- FTS tests T0->T1, Sep-Dec: the design goal of 300 MB/s was reached but not maintained; 0.7 PB of DAQ data registered.
Slide 18: The Year Ahead.. WLCG Commissioning Schedule
2006: SC4 becomes the initial service once reliability and performance goals are met. Continued testing of computing models and basic services; testing DAQ to Tier-0 and integrating into the DAQ to Tier-0 to Tier-1 data flow; building up end-user analysis support.
2007: Initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring and 24x7 operation. Introduce the residual services: full FTS services; 3D; gLite 3.x; SRM v2.2; VOMS roles; SL(C)4. Exercise the computing systems, ramping up job rates and data management performance. 01 Jul 07: service commissioned, at full 2007 capacity and performance. Full FTS services demonstrated at 2008 data rates for all required Tx-Ty channels, over extended periods, including recovery (T0-T1).
2008: first collisions in the LHC; continue in data-challenge mode, as per WLCG commissioning.
(ALICE milestones along the same axis: AliRoot & Condition frameworks; SEs & job priorities; combined T0 test; DA for calibration ready; finalisation of CAF & Grid; the real thing.)
Slide 19: Resources
[Chart: 2007 CPU usage by region, via APEL accounting.]
http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
Slides 20-21: UK Resources
[Charts: 2007 CPU usage by experiment.]
Slide 22: The Year Ahead.. Hardware Outlook
- Planning for 2007: a profiled ramp-up of resources is planned throughout 2007 to meet the UK requirements of the LHC and other experiments. The results are available for the Tier-1 and the Tier-2s.
- The Tier-1/A Board reviewed UK input to the international MoU negotiations for the LHC experiments, as well as providing input to the International Finance Committee for BaBar.
- For LCG, the 2007 commitment for disk and CPU capacity can be met from existing hardware already delivered.
Slide 23: T2 Resources (needed for LHC start-up)
- e.g. Glasgow: UKI-SCOTGRID-GLASGOW, 800 KSI2K, 100 TB DPM (commissioning photos: August 28, September 1, October 13, October 23).
- IC-HEP: 440 KSI2K, 52 TB dCache.
- Brunel: 260 KSI2K, 5 TB DPM.
Slide 24: Efficiency
(Measured by the UK Tier-1 for all VOs.)
- ~90% CPU efficiency, with the loss due to i/o bottlenecks, is OK; the concern is that this is falling further.
- The current transition from dCache to CASTOR at the Tier-1 contributes to the problem [see Andrew's talk].
- NB: the March number is a mid-month figure.
- Each experiment needs to work to improve its system.
Slide 25: ATLAS User Tests
- Many problems have been identified and fixed at individual sites (GridPP DTeam); more sites and tests are being introduced.
- Other, 'generic' system failures need to be addressed before the Grid is fit for widespread use by inexperienced users: production teams mostly work around these; users can't, or won't.
Slide 26: CMS User Jobs
- Status (Sunday): production jobs are now outnumbered by analysis and unknown jobs.
- Analysis (CRAB) efficiency OK? e.g. RAL 93.3%.
http://lxarda09.cern.ch/dashboard/request.py/jobsummary
Slide 27: Popular(?) Messages
Data recorded in the experiment dashboards: initially only from CMS, now more and more from ATLAS as well. CMS is mostly analysis; ATLAS is dominated by production. We expect to have "all" types of jobs soon.

| Message | Count |
|---|---|
| "Job RetryCount" family | 48,757 |
| Job proxy is expired | 17,465 |
| Cannot plan: BrokerHelper: no compatible resources | 16,646 |
| Job got an error while in the CondorG queue | 5,694 |
| Cannot retrieve previous matches for … | 2,410 |
| Job successfully submitted to Globus | 948 |
| Unable to receive data | 291 |
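Counts like these come from classifying free-text status strings; a sketch of how such a tally can be made (patterns abbreviated from the table above, log lines invented):

```python
# Sketch: tally dashboard failure reasons by pattern, as in the table
# above. The log lines are invented; patterns echo the real messages.
import re
from collections import Counter

patterns = [
    ("RetryCount family", re.compile(r"RetryCount")),
    ("proxy expired", re.compile(r"proxy is expired")),
    ("no compatible resources", re.compile(r"no compatible resources")),
]

log_lines = [
    "Job RetryCount (0) hit",
    "Job proxy is expired",
    "Cannot plan: BrokerHelper: no compatible resources",
    "Job proxy is expired",
]

counts = Counter()
for line in log_lines:
    for name, pat in patterns:
        if pat.search(line):
            counts[name] += 1
            break

for name, n in counts.most_common():
    print(f"{name}: {n}")
```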
Slide 28: Resource Broker
- We use RBs at RAL (2) and Imperial. They broke about once a week, with all jobs lost or left in limbo, and it is never clear to the user why.
- Switching to a different RB is the workaround, but users don't know how to do this.
- Barely usable for bulk submission; there is too much latency. One can barely submit and query ~20 jobs in 20 minutes before the next submission, and users will want to do more than this.
- Cancelling jobs doesn't work properly: it often fails, and repeated attempts cause the RB to fall over, so users will not cancel jobs.
- (We know the EDG RB is deprecated, but the gLite RB isn't currently deployed.)
- Work is ongoing to improve RB availability and the BDII (failover system) at the Tier-1.
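Faced with this, users end up scripting a throttled loop around the CLI themselves; a sketch of such a wrapper with failover between RBs (edg-job-submit is the real command, but the config flag and the pacing here are assumptions, not a recommended setup):

```python
# Sketch: throttled bulk submission with failover between RBs, the kind
# of ad-hoc wrapper users write around edg-job-submit. The per-RB config
# files and the pause are illustrative.
import subprocess
import time

RB_CONFIGS = ["rb1-ral.conf", "rb2-imperial.conf"]  # hypothetical configs

def submit(jdl, config):
    """One submission attempt; returns the printed job ID or None."""
    result = subprocess.run(
        ["edg-job-submit", "--config-vo", config, jdl],
        capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else None

def submit_all(jdls, pause=60):
    for jdl in jdls:
        for config in RB_CONFIGS:          # fall back to the next RB
            job_id = submit(jdl, config)
            if job_id:
                print(f"{jdl}: {job_id}")
                break
        else:
            print(f"{jdl}: failed on all RBs")
        time.sleep(pause)                  # ~20 jobs in 20 minutes, as above
```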
Slide 29: Information System
- lcg-info is used to find out which version of the ATLAS software is available before submitting a job to a site, but it is too unreliable, and the previous answer has to be kept track of.
- An ldap query typically gives a quick, reliable answer; lcg-info doesn't. The lcg-info command is very slow (querying *.ac.uk or xxx.ac.uk) and often fails completely.
- Different BDIIs seem to give different results, and it is not clear to users which one to use (if the default fails).
- Many problems with UK SEs have made the creation of replicas painful; frequent BDII timeouts do not help.
- The FDR freedom-of-choice tool causes some problems, because sites fail SAM tests when their job queues are full.
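The underlying ldap query is simple; a sketch using python-ldap against a top-level BDII (the host is hypothetical and the GLUE 1.x attribute spellings are from memory, so treat them as assumptions to verify):

```python
# Sketch: ask a BDII which subclusters publish a given ATLAS software
# tag, i.e. the query lcg-info wraps. Host, port and attribute names
# follow the GLUE 1.x schema as remembered here; verify before relying.
import ldap

BDII = "ldap://lcg-bdii.gridpp.ac.uk:2170"  # hypothetical top-level BDII
TAG = "VO-atlas-offline-12.0.6"             # example software tag

con = ldap.initialize(BDII)
results = con.search_s(
    "o=grid", ldap.SCOPE_SUBTREE,
    f"(GlueHostApplicationSoftwareRunTimeEnvironment={TAG})",
    ["GlueSubClusterUniqueID"])

for dn, attrs in results:
    for sub_id in attrs.get("GlueSubClusterUniqueID", []):
        print(sub_id.decode() if isinstance(sub_id, bytes) else sub_id)
```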
Slide 30: UI and Proxies
User Interface:
- Users need local UIs (where their files are). These can be set up by local system managers, but generally those managers are not Grid experts.
- The local UI setup controls which RB, BDII, LFC etc. all users of that UI get, and these appear to be pretty random. There needs to be clear guidance on which of these to use and how to change them if things go wrong.
Proxy certificates:
- These cause a lot of grief, as the default 12 hours is not long enough. If the certificate expires, it is not always clear from the error messages why running jobs fail.
- Proxies can be created with longer lifetimes, but this starts to violate security policies, and users will violate those policies.
- Maybe MyProxy solves this, but do users know?
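A defensive submission script can at least check the remaining proxy lifetime before doing anything; a sketch (that voms-proxy-info accepts -timeleft and prints remaining seconds is remembered behaviour, so treat it as an assumption):

```python
# Sketch: refuse to submit when the proxy is nearly expired, avoiding
# the opaque mid-job failures described above. Assumes voms-proxy-info
# supports -timeleft (prints remaining seconds), as remembered.
import subprocess
import sys

MIN_SECONDS = 4 * 3600  # demand at least four hours of proxy left

def proxy_seconds_left():
    result = subprocess.run(["voms-proxy-info", "-timeleft"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return 0  # no valid proxy at all
    return int(result.stdout.strip() or 0)

left = proxy_seconds_left()
if left < MIN_SECONDS:
    sys.exit(f"proxy has {left}s left; renew it (voms-proxy-init) first")
print(f"proxy ok: {left // 3600}h remaining")
```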
Slide 31: GGUS
- GGUS is used to report some of these problems, but it is not very satisfactory. The initial response is usually quite quick, saying the ticket has been passed to X, but the response after that is very patchy: usually some sort of acknowledgement, rarely a solution, and often the ticket is never closed, even if the problem was transitory and is now irrelevant.
- There are two particular cases which GGUS does not handle well:
a) Something breaks and probably just needs to be rebooted: the system is too slow, and it is better to email someone (if you know whom).
b) Something breaks and is rebooted/repaired, but the underlying cause is a bug in the middleware: this doesn't seem to be fed back to the developers.
- There are also, of course, some known problems that take ages to be fixed (e.g. the globus port-range bug, the rfio libraries, ...).
- More generally, the GGUS system is working at the level of tens (up to 100) of tickets per week, but may not scale as new users start using the system.
Slide 32: Usability
- The Grid is a great success for Monte Carlo production; however, it is not in a fit state for basic user analysis.
- The tools are not suitable for bulk operations by normal users, so current users set up ad-hoc scripts that can be mis-configured.
- 'System failures' are too frequent (largely independent of the VO, and probably location-independent). The user experience is poor.
- Improved overall system stability is needed. Template UI configuration (being worked on) will help; wider adoption of VO-specific user interfaces may help; users need more (directed) guidance.
- There is not long to go: is a usability task force required?
Slide 33: The Year Ahead..
Tasks (with responsibilities spread across the Tier-1, the Tier-1 & Tier-2s together, the experiments and the GOC):
- 3D tested by ATLAS and LHCb; 3D used for the conditions DB
- SRM 2.2 implementations; SRM 2.2 tested by the experiments
- SLC4 migration
- gLite CE; new RB; FTS v2
- VOMS scheduling priorities
- 24x7 definition and 24x7 test scenario
- VO boxes SLA; VO boxes implementation
- Accounting data into the APEL repository; automated accounting reports
Slide 34: The Year Ahead.. Example: FTS 2.0 schedule
- FTS 2.0 is currently deployed on the pilot service at CERN, in testing since December: running dteam tests to stress-test it. This is the 'uncertified' code.
- Next step: open the pilot to the experiments to verify full backwards compatibility with experiment code; arranging this now.
- Deployment at the CERN T0 is scheduled for April 2007. The goal is April 1, but this is tight, and subject to successful verification.
- Roll-out to T1 sites a month after that; we expect at least one update will be needed to the April version.
Slide 35: The Year Ahead.. Site Reliability
Tier-0 and Tier-1, 2007 SAM targets (monthly average):
- Target for each site: 91% by Jun 07; 93% by Dec 07.
- Taking the 8 best sites: 93% by Jun 07; 95% by Dec 07.
Tier-2s: "Begin reporting the monthly averages, but do not set targets yet": ~80% by Jun 07; ~90% by Dec 07.
SAM tests (critical tests are a subset): BDII (top-level BDII), sBDII (site BDII), FTS (File Transfer Service), gCE (gLite Computing Element), LFC (global LFC), VOMS, CE (Computing Element), SRM, gRB (gLite Resource Broker), MyProxy, RB (Resource Broker), VOBOX (VO box), SE (Storage Element), RGMA (R-GMA registry).
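Each monthly figure is just the fraction of critical-test samples a site passes; a sketch of the target comparison (the outcome lists are invented, standing in for a month of SAM results):

```python
# Sketch: monthly SAM availability per site versus the Jun 07 target.
# Pass/fail samples are invented; real inputs are the critical-test
# results collected by SAM over the month.

SITE_TARGET = 0.91  # per-site target for June 2007

samples = {
    # site: critical-test outcomes over the month (True = pass)
    "RAL": [True] * 270 + [False] * 18,
    "Glasgow": [True] * 250 + [False] * 38,
}

for site, results in sorted(samples.items()):
    availability = sum(results) / len(results)
    verdict = "meets" if availability >= SITE_TARGET else "misses"
    print(f"{site}: {100 * availability:.1f}% ({verdict} the 91% target)")
```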
Slide 36: The Year Ahead.. Transfer milestones
CMS:
- 65% of the Tier-0 -> Tier-1 peak rate for a week.
- Tier-1 -> each Tier-2: sustain 10 MB/s for 12 hours.
- Each Tier-2 -> Tier-1: sustain 5 MB/s for 12 hours.
- Aggregate for the Tier-1, SIMULTANEOUSLY: Tier-0 -> Tier-1 at 50% of the average rate for 12 hours; Tier-1 -> Tier-2s at 50% of the sum of average rates for 12 hours; Tier-2s -> Tier-1 at 50% of the sum of average rates for 12 hours.
ATLAS:
- 65% of the Tier-0 -> Tier-1 nominal rate for a week.
- Tier-1 -> each Tier-2: sustain 65% of the nominal rate for 12 hours.
- Each Tier-2 -> Tier-1: sustain 65% of the nominal rate for 12 hours.
- Aggregate for the Tier-1, SIMULTANEOUSLY: Tier-0 -> Tier-1 at 100% of the average rate for 12 hours; Tier-1 -> Tier-2s at 50% of the sum of average rates for 12 hours; Tier-2s -> Tier-1 at 50% of the sum of average rates for 12 hours.
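"Sustain 10 MB/s for 12 hours" can be checked against rate samples; a sketch of a sliding-window test (the samples are invented, and reading "sustain" as a 12-hour windowed average is an assumption):

```python
# Sketch: does any 12-hour sliding window of hourly rate samples
# average at least the milestone floor? Samples are invented; the
# windowed-average interpretation of "sustain" is an assumption.

hourly_mb_s = [11, 12, 9, 10, 13, 12, 11, 10, 9, 12, 13, 11, 10, 12]
WINDOW = 12   # hours
FLOOR = 10.0  # the Tier-1 -> Tier-2 milestone above, in MB/s

def sustained(samples, window, floor):
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        if sum(chunk) / window >= floor:
            return True
    return False

print("milestone met" if sustained(hourly_mb_s, WINDOW, FLOOR)
      else "milestone not met")
```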
Slide 37: Summary
- Exec 2 Summary: status OK.
- 2006 Outturn: some issues.
- The Year Ahead.. some problems to solve (AKA challenges).
- The weather is fine.
- We need to set some targets for 2007.
Slide 38
- Devoncove Hotel, 931 Sauchiehall Street, Glasgow, G3 7TQ. Tel: 0141 334 4000.
- Sandyford Hotel, 904 Sauchiehall Street, Glasgow, G3 7TF. Tel: 0141 334 0000.
- You can see ~home if you look up.