GridPP Deployment Status GridPP15 Jeremy Coles 11th January 2006

Overview
1. An update on some of the high-level metrics
2. Some new sources of information
3. General deployment news
4. Expectations for the coming months
5. Preparing for SC4 and our status
6. Summary

Prototype metric report: UKI is still contributing well, but according to the SFT data our proportion of failing sites is relatively high.

Snapshot of recent numbers

Region                  # sites   Average CPU
Asia Pacific            8         450
CERN
Central Europe          15        30
France                  8         1100
Germany & Switzerland
Italy
Northern Europe
Russia                  14        430
South East Europe       20        370
South West Europe       14        700
UKI

Most of the unavailable sites have been in Ireland as they make the move over to LCG.

Average job slots have increased gradually. UK job slots have increased by about 10% since GridPP14 (see Steve Lloyd's talk for how this looks against the targets).

Therefore our contribution to EGEE CPU resources remains at ~20%

However, usage of job slots is still not consistently high.

The largest GridPP users by VO for 2005: LHCb, ATLAS, BABAR, CMS, BIOMED, DZERO, ZEUS. NB: excludes data from Cambridge – for Condor support in APEL see Dave Kant's talk.

Storage has seen a healthy increase – but usage remains low. At the GridPP Project Management and Deployment Boards yesterday we discussed ways to encourage the experiments to make more use of Tier-2 disk space – the Tier-1 will be unable to meet allocation requests. One of the underlying concerns is what the data flags mean to Tier-2 sites.

Scheduled downtime: views of the data will be available from the CIC portal from today!

Scheduled downtime: congratulations to Lancaster for being the only site to have no scheduled downtime.

SFT review: it was probably already clear that the majority of our failures (and those of other large ROCs) are lcg-rm (failure points: the replica catalogue, the configured BDII, CERN storage for replication, or a local SE problem) and rgma (generally a badly configured site). We know the tests need to improve and become more reliable and accurate too!
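To make those failure points concrete, here is a minimal sketch, assuming the lcg-utils command-line tools and a valid grid proxy, of the kind of replica-management round trip the lcg-rm test exercises; the VO, SE hostname and LFN are hypothetical placeholders rather than the actual SFT configuration. Running something like this by hand against the local SE is a quick way to separate a local SE problem from a catalogue or BDII problem.

```python
# Minimal sketch of an lcg-rm-style replica management check, assuming the
# lcg-utils CLI tools are installed and a valid grid proxy exists.
# The VO, storage element and LFN below are hypothetical placeholders.
import subprocess
import sys
import tempfile

VO = "dteam"                           # test VO, as typically used by site tests
SE = "se.example.ac.uk"                # hypothetical storage element hostname
LFN = "lfn:/grid/dteam/sft-rm-check"   # hypothetical logical file name


def run(cmd):
    """Run a command, echo it, and return its exit status."""
    print("+", " ".join(cmd))
    return subprocess.call(cmd)


def main():
    # 1. Create a small local file and copy-and-register it on the SE
    #    (this exercises the SE, the BDII lookup and the replica catalogue).
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"lcg-rm style test payload\n")
        local = f.name
    if run(["lcg-cr", "--vo", VO, "-d", SE, "-l", LFN, "file://" + local]) != 0:
        sys.exit("copy-and-register failed (SE, BDII or catalogue problem)")

    # 2. Copy the file back again via its LFN.
    if run(["lcg-cp", "--vo", VO, LFN, "file:///tmp/sft-rm-check.out"]) != 0:
        sys.exit("copy back failed")

    # 3. Delete all replicas and the catalogue entry to clean up.
    if run(["lcg-del", "--vo", VO, "-a", LFN]) != 0:
        sys.exit("delete failed")
    print("replica management round trip OK")


if __name__ == "__main__":
    main()
```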

Overall EGEE statistics: the same problems cover the majority of EGEE resources. Hours of impact will be available soon and will help us evaluate the true significance of these results.

Completing the weekly reports: ALL site administrators have now been asked to complete information related to problems observed at their sites, as recorded in the weekly operations report. This will affect our message at the weekly EGEE operations reviews and YOUR Tier-2 performance figures!

Performance measures: the GridPP Oversight Committee has asked us to investigate why some sites perform better than others. As well as looking at the SFT and ticket-response data, the Imperial College group will help pull data from their LCG2 Real Time Monitor.
Daily summary reports:

EGEE metrics
While we have defined GridPP metrics, many are not automatically produced. EGEE now has metrics as a priority, and at EGEE3 a number of metrics were agreed for the project and assigned.

ROC measures:
SUPPORT – user ticket response time; number of "supporters"; # tickets escalated; % tickets wrongly assigned
SERVICE NODES (testing) – RB: submit-to-CE time; BDII: average query time; MyProxy: register/access/delete; SRM-SE: test file movement; catalogue test; VOMS; R-GMA
SIZE – # of sites in production; # of job slots (see the sketch below); total available kSpecInt; storage (disc); mass storage; # EGAP-approved VOs; # active VOs; # active users; total % used resources
DEPLOYMENT – speed of m/w security update
OPERATIONS – site responsiveness to COD; site response to tickets; site tests failed; % availability of SE, CE; # days downtime per ROC
USAGE – jobs per VO (submitted, completed, failed); data transfer per VO; CPU and storage usage per VO; % sites blacklisted/whitelisted; # CE/SE available to VO
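As an illustration of how a SIZE metric such as the number of job slots can be produced automatically, the sketch below queries a top-level BDII for the Glue 1.x GlueCEInfoTotalCPUs attribute using python-ldap; the BDII hostname is an assumption, and a production metric would also need to de-duplicate CEs that publish the same cluster.

```python
# Minimal sketch: count published job slots by querying a top-level BDII
# for GlueCEInfoTotalCPUs (Glue 1.x schema). The BDII host is an assumption;
# python-ldap must be installed.
import ldap

BDII = "ldap://lcg-bdii.cern.ch:2170"   # assumed top-level BDII endpoint


def total_job_slots(bdii_url=BDII):
    conn = ldap.initialize(bdii_url)
    # All CE entries sit under the standard o=grid base.
    results = conn.search_s(
        "o=grid",
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueCE)",
        ["GlueCEUniqueID", "GlueCEInfoTotalCPUs"],
    )
    total = 0
    for _dn, attrs in results:
        # NB: several queues on one cluster publish the same CPU count,
        # so a real metric would de-duplicate per cluster.
        total += int(attrs.get("GlueCEInfoTotalCPUs", [b"0"])[0])
    return total


if __name__ == "__main__":
    print("Published job slots:", total_job_slots())
```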

The status of the next release
- Mon 9th Jan – tag and begin local testing of installations and upgrades on mini testbeds, complete documentation
- Mon 16th Jan – pre-release to >3 ROCs for a week of further testing
- Mon 23rd Jan – incorporate results of ROC testing and release ASAP
Release (at the end of January!?). Expect:
- Bug fixes – RB, BDII, CE, SE-classic, R-GMA, GFAL, LFC, SE_DPM
- VOMS – new server/client version
- VO-BOX – various functionality changes
- LFC/DPM updates
- lcg_utils/GFAL – new version & bug fixes
- RB – new functionality for job status checking
- Security issues – pool account recycling, signed RPM distribution
- FTS clients & dCache
- Some "VO management via YAIM" additions
Details of release work:

Outcomes of the security challenge
Comments (thanks to Alessandra Forti):
- Test suites should be asynchronous
- The security contacts mail list is not up to date
- 4 sites' CSIRTs did not pass on information – site security contacts should be the administrators and not the site CSIRTs
- 1 site did not understand what to do
- The majority of sites acknowledged tickets within a few hours once the site administrator received the ticket
- On average sites responded with CE data in less than 2 days (some admins were unsure about contacting the RB staff)
- 2 sites do not use the lcgpbs jobmanager and were unable to find the information in the log files (also 1 using Condor)
- Some sites received more than one SSC job in the 3-hour timeframe and were unable to return an exact answer, but gave several
- Mistake in date – admins spotted inconsistencies
- The ROC struggled with ticket management and caused delays in processing tickets!
Aside: the EGEE proposed Security Incident handbook is being reviewed by the deployment team:

Other areas of interest!
- The Footprints version (UKI ROC ticketing system) will be upgraded on 23rd January. This will improve our interoperation with GGUS and other ROCs (using XML). There should be little observable impact, but we do ask: PLEASE SOLVE & CLOSE as many currently open tickets as possible by 23rd January.
- Culham (the place which hosted the last operations workshop) will be adding a new UKI site in the near future. They will join or host the Fusion VO.
- Most sites have now completed the "10 Easy Network Questions" responses. This has proved a useful exercise. What do you think?
- The deployment team has identified a number of operational areas to improve. These include experiment software installation, VO support, and the availability of certain information on processes (like where to start for new sites).
- Pre-production service: UKI now has 3 sites with gLite (components) either deployed or in the process of being deployed.
- REMINDER & REQUEST – Please enable more VOs! The GridPP PMB requests that 0.5% (1% in EGEE-2) of resources be used to support wider VOs – like BioMed. This will also help raise our utilisation. Feedback is going to developers on making adding VOs easier.

Our focus is now on Service Challenge 4
GridPP links, progress and status are being logged in the GridPP wiki:
A number of milestones (previously discussed at the 15th November UKI Monthly Operations Meeting) have been set. Red text means a milestone is at risk (generally due to external dependencies) and green text signifies done.
SRM
- 80% of sites have a working SRM (file transfers with 2 other sites successful – see the sketch after this list) by end of December
- All sites have a working SRM by end of January
- 40% of sites (using FTS) able to transfer files using an SRM 2.1 API by end of February
- All sites (using FTS) able to transfer files using an SRM 2.1 API by end of March
- Interoperability tests between SRM versions at the Tier-1 and Tier-2s (TBC)
FTS channels
- FTS channel to be created for all T1-T2 connections by end of January
- FTS client configured for 40% of sites by end of January
- FTS channels created for one intra-Tier-2 test for each Tier-2 by end of January
- FTS client configured for all sites by end of March
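To illustrate the "file transfers with 2 other sites successful" check, here is a minimal sketch assuming the lcg-utils tools and a valid grid proxy; the VO, SE hostnames and LFN are hypothetical placeholders.

```python
# Minimal sketch of the "working SRM" milestone check: copy-and-register a
# test file at the local SE, then replicate it to two other sites' SEs.
# The VO, SE hostnames and LFN are hypothetical placeholders.
import subprocess

VO = "dteam"
LOCAL_SE = "se.local-site.ac.uk"                     # hypothetical local SE
REMOTE_SES = ["se.site-a.ac.uk", "se.site-b.ac.uk"]  # hypothetical remote SEs
LFN = "lfn:/grid/dteam/sc4-srm-check"
LOCAL_FILE = "/tmp/sc4-srm-check.dat"


def ok(cmd):
    print("+", " ".join(cmd))
    return subprocess.call(cmd) == 0


# Create a small test file and copy-and-register it at our own SE.
with open(LOCAL_FILE, "wb") as f:
    f.write(b"SC4 SRM milestone test payload\n")
passed = ok(["lcg-cr", "--vo", VO, "-d", LOCAL_SE, "-l", LFN,
             "file://" + LOCAL_FILE])

# Replicate the registered file to each remote SE in turn
# ("file transfers with 2 other sites successful").
for se in REMOTE_SES:
    passed = ok(["lcg-rep", "--vo", VO, "-d", se, LFN]) and passed

print("SRM milestone check:", "passed" if passed else "failed")
```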

Core to these are the… data transfers
Tier-1 to Tier-2 transfers (target rate Mb/s)
- Sustained transfer of 1TB data to 20% of sites by end of December
- Sustained transfer of 1TB data from 20% of sites by end of December
- Sustained transfer of 1TB data to 50% of sites by end of January
- Sustained transfer of 1TB data from 50% of sites by end of January
- Sustained individual transfers (>1TB continuous) to all sites completed by mid-March
- Sustained individual transfers (>1TB continuous) from all sites by mid-March
- Peak rate tests undertaken for all sites by end of March
- Aggregate Tier-2 to Tier-1 tests completed at target rate (rate TBC) by end of March
Inter-Tier-2 transfers (target rate 100 Mb/s)
- Sustained transfer of 1TB data between the largest site in each Tier-2 and that of another Tier-2 by end of February
- Peak rate tests undertaken for 50% of sites in each Tier-2 by end of February
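For a rough feel of what the 1TB sustained-transfer milestones imply, the small sketch below converts a sustained rate into the time needed to move 1TB; the 100 Mb/s figure is the inter-Tier-2 target from the list above, while the other rates are purely illustrative.

```python
# Rough arithmetic for the 1TB sustained-transfer milestones:
# how long does 1TB take at a given sustained network rate?
def hours_for_terabyte(rate_mbit_s, terabytes=1.0):
    bits = terabytes * 1e12 * 8           # 1 TB = 1e12 bytes = 8e12 bits
    seconds = bits / (rate_mbit_s * 1e6)  # rate is in megabits per second
    return seconds / 3600.0


# 100 Mb/s is the inter-Tier-2 target rate; the other rates are illustrative.
for rate in (100, 300, 800):
    print("1 TB at %4d Mb/s sustained: %5.1f hours"
          % (rate, hours_for_terabyte(rate)))

# At the 100 Mb/s inter-Tier-2 target, 1TB takes just over 22 hours,
# so a "sustained transfer of 1TB" is roughly a day-long test.
```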

The current status
Transfer test matrix between RAL Tier-1, Lancaster, Manchester, Edinburgh, Glasgow, IC-HEP and RAL-PPD (one axis of the matrix is the receiving site). Rates achieved so far:
- RAL Tier-1: ~800 Mb/s, 350 Mb/s, 156 Mb/s, 84 Mb/s, 309 Mb/s, 397 Mb/s
- Lancaster: 0 Mb/s
- Edinburgh: 422 Mb/s, 210 Mb/s, 224 Mb/s
- Glasgow: 331 Mb/s, 122 Mb/s
- Manchester, IC-HEP, RAL-PPD: no rates yet
NEXT SITES: London – RHUL & QMUL; ScotGrid – Durham; SouthGrid – Birmingham & Oxford?; NorthGrid – Sheffield? & Liverpool
KEY: black figures indicate a 1TB transfer; blue figures indicate a <1TB transfer (e.g. 10 GB)

Additional milestones
LCG File Catalog (see the sketch below)
- LFC document available by end of November
- LFC installed at 1 site in each Tier-2 by end of December
- LFC installed at 50% of sites by end of January
- LFC installed at all sites by end of February
- Database update tests (TBC)
VO Boxes (depending on experiment responses to the security and operations questionnaire and the GridPP position on VO Boxes)
- VOBs available (for agreed VOs only) at 1 site in each Tier-2 by mid-January
- VOBs available at 50% of sites by mid-February
- VOBs available at all (participating) sites by end of March
Experiment-specific tests (TBC)
- To be developed in conjunction with experiment plans – please make suggestions!
VO Box status: LHCb & ALICE questionnaires received and accepted, and VO boxes deployed at the Tier-1; little use so far – ALICE has not had a disk allocation. ATLAS's original response was not accepted; they have since tried to implement VO boxes, found problems, and are now looking at a centralised model. CMS do not have VO Boxes but they DO require local VO persistent processes.
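As a complement to the LFC installation milestones above, here is a minimal smoke-test sketch a site could record in its testing log, assuming the standard LFC command-line clients and a valid grid proxy; the LFC hostname and directory are hypothetical placeholders.

```python
# Minimal LFC installation smoke test: create a test directory and list it.
# The LFC hostname and directory are hypothetical placeholders.
import os
import subprocess

env = dict(os.environ, LFC_HOST="lfc.example.ac.uk")  # hypothetical LFC host
TEST_DIR = "/grid/dteam/lfc-smoke-test"


def lfc(cmd):
    print("+", " ".join(cmd))
    return subprocess.call(cmd, env=env)


lfc(["lfc-mkdir", TEST_DIR])         # fails harmlessly if it already exists
rc = lfc(["lfc-ls", "-l", TEST_DIR])
print("LFC smoke test:", "OK" if rc == 0 else "FAILED")
```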

Getting informed & involved!
The deployment team are working to make sure sites have sufficient information. Coordinate your activities with your Tier-2 Coordinator.
1) Stay up to date via the Storage Group work:
2) General Tier-1 support:
3) Understand and set up FTS (channels); see the sketch after this list:
4) VO Boxes go via the Tier-1 first:
5) Catalogues (& data management):
The status of sites is being tracked here:
Some particular references worth checking out when taking the next step:
6) What RALPP did to get involved:
8) Edinburgh dCache tests:
9) Glasgow DPM testing:
PLEASE CREATE SITE TESTING LOGS – it helps with debugging and information sharing
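For item 3, a minimal sketch of exercising an FTS channel from Python with the glite-transfer command-line clients; the FTS web-service endpoint and the source/destination SURLs are hypothetical placeholders rather than the real GridPP endpoints.

```python
# Minimal sketch of exercising an FTS channel: submit one SURL-to-SURL copy
# and poll its status until it finishes. The FTS endpoint and SURLs are
# hypothetical placeholders; assumes the glite-transfer-* clients and a
# valid grid proxy (with delegation/MyProxy set up as the service requires).
import subprocess
import time

FTS = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer"
SRC = "srm://se.site-a.ac.uk/dpm/site-a.ac.uk/home/dteam/1gb-test-file"
DST = "srm://se.site-b.ac.uk/dpm/site-b.ac.uk/home/dteam/1gb-test-file"

# Submit the transfer job; the client prints the job identifier.
job_id = subprocess.check_output(
    ["glite-transfer-submit", "-s", FTS, SRC, DST]).decode().strip()
print("submitted FTS job", job_id)

# Poll until the job reaches a terminal state (state names may vary by version).
while True:
    state = subprocess.check_output(
        ["glite-transfer-status", "-s", FTS, job_id]).decode().strip()
    print("status:", state)
    if state in ("Done", "Finished", "FinishedDirty", "Failed", "Canceled"):
        break
    time.sleep(30)
```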

Summary
1. Metrics show stability and areas where we can improve
2. EGEE work will add to information which is published & analysed
3. GridPP & experiments need to work at better use of Tier-2 disk
4. There are changes coming with the new release & helpdesk upgrade
5. Focus has shifted to Service Challenge work (including security)
6. Sites asked to complete reports, reduce tickets & get involved in SC4!