GridPP: Executive Summary


GridPP: Executive Summary
Tony Doyle
1 July 2005, Oversight Committee

Exec2 Summary

The GridPP Project is now underway and has met 21% of its original targets, with 86% of the metrics within specification. This summary highlights issues in resource monitoring and project management, and describes the status of GridPP2 planning in the context of the “get fit” plan requested by the Oversight Committee. gLite 1 was released in April as planned, but its components have not yet been deployed or had their robustness tested by the experiments. Service Challenge 2 (SC2), which addressed networking, was a success at CERN and the Tier-1; SC3, addressing file transfers for the experiments, is about to commence. Hardware levels at the Tier-1 in 2007-08 and the deployment of Tier-2 resources are current concerns, although these are mitigated by the present under-utilisation of resources.

RAL joins labs worldwide in successful Service Challenge 2

The GridPP team at Rutherford Appleton Laboratory (RAL) in Oxfordshire recently joined computing centres around the world in a networking challenge that saw RAL transfer 60 terabytes of data over a ten-day period. A home user with a 512 kilobit per second broadband connection would wait around 30 years to complete a download of the same size.
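The 30-year figure checks out on the back of an envelope. A minimal Python sketch, assuming decimal terabytes, a fully saturated 512 kbit/s link and no protocol overhead:

```python
# Back-of-envelope check of the "30 years" claim: time to move 60 TB
# over a 512 kbit/s link, assuming decimal units and no overhead.
data_bits = 60e12 * 8          # 60 terabytes in bits
rate_bps = 512e3               # 512 kilobits per second
seconds = data_bits / rate_bps
years = seconds / (3600 * 24 * 365.25)
print(f"Home download time: {years:.1f} years")   # ~29.7 years

# For comparison, the sustained rate RAL achieved during SC2:
sc2_rate = data_bits / (10 * 24 * 3600)           # 60 TB in ten days
print(f"SC2 sustained rate: {sc2_rate/1e6:.0f} Mbit/s")  # ~556 Mbit/s
```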

gLite 1

100 green sites sitting on a grid (Thu 16 Jun 2005)

Last week the UK CIC-on-duty team celebrated the milestone of having 100 sites passing the Site Functional Test. Thanks to all the sites that acted promptly on trouble tickets raised by the UK team during their shift.

Current concern 1: under-utilisation

Under-utilisation of existing Tier-1/A resources is improving, both overall and in the Grid fraction, from 2004 to 2005.
[Chart: Tier-1/A usage, 2004 vs 2005, split into Grid and non-Grid]

Current concern 2: under-delivery

The current situation is somewhat better than these 2005 Q1 numbers indicate:
- some late procurements (acceptable given the under-utilisation)
- technical problems (being overcome)

GridPP Deployment Status: Measurable Improvements, 30/6/05 (9/1/05 in parentheses)

totalCPU:   2966  (2029)
freeCPU:    1666  (1402)
runJob:      843    (95)
waitJob:      31   (480)
seAvail TB: 74.28 (8.69)
seUsed TB:  16.54 (4.55)
maxCPU:     3145  (2549)
avgCPU:     2802  (1994)
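The improvement between the two snapshots can be quantified directly. A minimal Python sketch (figures copied from the table above; the dictionary layout is just for illustration):

```python
# Relative change in each deployment metric between the two snapshots
# (9/1/05 -> 30/6/05), using the figures from the table above.
snapshots = {
    "totalCPU":   (2029, 2966),
    "freeCPU":    (1402, 1666),
    "runJob":     (95, 843),
    "waitJob":    (480, 31),
    "seAvail TB": (8.69, 74.28),
    "seUsed TB":  (4.55, 16.54),
    "maxCPU":     (2549, 3145),
    "avgCPU":     (1994, 2802),
}

for name, (jan, jun) in snapshots.items():
    change = 100 * (jun - jan) / jan
    print(f"{name:10s} {jan:>8} -> {jun:>8} ({change:+.0f}%)")
# e.g. runJob +787%, seAvail TB +755%, waitJob -94%
```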

Actions

- GridPP to submit the proposal for LCG phase 2 funding to the Committee prior to its submission to Science Committee (minute 4.9). Done: 27-page report, incorporating input from the OC, at http://www.gridpp.ac.uk/docs/gridpp2/SC_GridPP2_LCG_1.0.doc (unfunded).
- GridPP to clarify the situation with regard to ATLAS production run tests for the next physics workshop (minute 5.3). See news item http://www.gridpp.ac.uk/news/-1119651840.463358.wlg (and slide).
- GridPP to provide an update on progress resolving problems caused by mismatches between local batch systems and the capabilities of the grid Resource Broker (minute 6.3). (See slide.)
- GridPP to more fully document its alignment with each of the individual experiments (minute 15.2). An experiment engagement questionnaire has been used (initial input in February, updated input in June). See http://www.gridpp.ac.uk/eb/workdoc/gridusebyexpts_0605.doc

ATLAS steps up Grid production

RB Action

GridPP to provide an update on progress resolving problems caused by mismatches between local batch systems and the capabilities of the grid Resource Broker (minute 6.3).

- The problem of connecting the local CE to a batch queue is largely overcome: many sites (all shared ones) now do this.
- There were subsequent problems deploying the accounting system (APEL) to point at the local batch system. These are overcome at 13 of 18 sites, but the process is not as straightforward as it could be.
- The JDL from the job is not passed to the local system, so the local scheduler has no way to use information from the Grid scheduler. This is a limitation from a shared site's viewpoint (when attempting to balance Grid and local jobs); the short-term solution is to set up separate batch queues. It is not a limitation for the experiments, though it affects efficiency. It is noted as a requirement, and it is intended that this will be delivered in Year 2 of JRA1 for the WMS. A sketch of the kind of JDL involved follows below.
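For context, a minimal illustrative JDL sketch (the attribute names are standard LCG-2 JDL; the file names and values are hypothetical). The Requirements and Rank expressions are evaluated by the Resource Broker when matching the job to a CE, but none of this reaches the site's local batch scheduler, which is exactly the mismatch described above:

```
// Illustrative LCG-2 JDL sketch; file names and values are hypothetical.
// The RB uses Requirements/Rank to pick a CE, but the local batch
// scheduler at the chosen site never sees these attributes.
Executable    = "run_analysis.sh";
Arguments     = "dataset.list";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"run_analysis.sh", "dataset.list"};
OutputSandbox = {"stdout.log", "stderr.log"};
Requirements  = other.GlueCEPolicyMaxWallClockTime > 720;
Rank          = -other.GlueCEStateEstimatedResponseTime;
```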

Actions

- GridPP to define its usage policy with respect to Tier-1 allocations (minute 15.4). See http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-57-Tier1A_1.0.doc and the documents within (“fair shares” using PPARC Form X information).
- GridPP to produce an updated risk register (minute 15.5). Incorporated in the new Project Map (with 7 “high” risks) at http://www.gridpp.ac.uk/pmb/ProjectManagement/GridPP2_ProjectMap_2.htm
- GridPP to produce a “get-fit” plan for production metrics (minute 15.6). See the Metrics and Deployment document http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-64-Metrics.doc and its incorporation into the Project Map.
- GridPP to define its metrics for job success (minute 15.7). Adopted the EGEE-wide definition at http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php (see slides).
- GridPP to produce a statement of intent regarding its adoption of gLite (minute 15.8). See the Middleware Selection document http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-65-Middleware.doc

Metrics Action

GridPP to define its metrics for job success (minute 15.7). GridPP adopts the EGEE-wide definition at http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php

The (web-based) QA system counts job successes registered by the Workload Management System (WMS), which can then be categorised by Virtual Organisation or Resource Broker. Before introducing the figures, some caveats should be understood:

- The system only measures what the WMS “sees”: it does not catch a failure of the WMS to register the job in the first place (a rare occurrence).
- If a job's script fails part-way through (for example, it tries but fails to copy a file) yet the script itself completes, the WMS sees everything as OK.
- If a VO (e.g. LHCb) deploys an agent, the WMS only registers the success of the initial (Python) script. This strategy enables higher overall LHCb performance (a combined push-pull model), though it currently causes other problems in overall accounting should contention become an issue.

Overall, an end user may see either:
1. a worse efficiency, if a job fails for other, hidden reasons (e.g. data management problems); or
2. a better efficiency, by choosing selected sites according to the Site Functional Test performance index, or by deploying an agent to initiate real jobs at sites where the agent succeeded.

Physicists are “smart” and now “see” > 90% efficiency, but that is a definition made within a given VO adopting its own methods (and based on informed input from people currently submitting jobs to the system).
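As a concrete illustration of the WMS-level definition, a minimal Python sketch (the job records and their layout are hypothetical; only the final state recorded by the WMS is used, so all the caveats above apply):

```python
# Minimal sketch of the EGEE-style WMS-level success metric.
# Only jobs the WMS registered appear here, and a job whose script
# exits cleanly counts as a success even if a step inside it failed.
from collections import defaultdict

# Hypothetical (vo, final_wms_state) records.
jobs = [
    ("lhcb", "Done (Success)"),
    ("lhcb", "Done (Success)"),
    ("atlas", "Aborted"),
    ("atlas", "Done (Success)"),
    ("cms", "Done (Exit Code !=0)"),
]

totals = defaultdict(int)
successes = defaultdict(int)
for vo, state in jobs:
    totals[vo] += 1
    if state == "Done (Success)":
        successes[vo] += 1

for vo in sorted(totals):
    pct = 100 * successes[vo] / totals[vo]
    print(f"{vo}: {successes[vo]}/{totals[vo]} = {pct:.0f}%")
```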

Overview

Integrated over all VOs and RBs for the first half of 2005:

Successes/Day: 13806
Success %: 64%

Key point: improving from 42% to 78% during 2005.

For the UK RB (lcgrb01.gridpp.rl.ac.uk):
Successes/Day: 319
Success %: 69%

LHC VOs

               ALICE  ATLAS  CMS   LHCb
Successes/Day  N/A    2796   452   3463
Success %      42%    83%    61%   68%

Other VOs

               BaBar  CDF  D0    BioMed
Successes/Day  37     1    207   1074
Success %      76%    30%  84%   76%

The “Get Fit” Plan

Set SMART (Specific, Measurable, Achievable, Realistic, Time-phased) goals.

“I take it plea bargaining is out of the question?”

Our problems…

- 0.104: Number of LCG/EGEE job slots published by the UK. The current total is 2477; the target was 3000.
- 0.106: GridPP KSI2K available. By the end of March 2005 the combined Tier-1 and Tier-2 CPU power was expected to be 5184 KSI2K, compared to 2277 KSI2K achieved. This number is dominated by the 4397 KSI2K expected from the Tier-2s, which has been slowly becoming available.
- 0.108: GridPP disk storage available. Similar to 0.106 above: only 280 TB available compared to 968 TB anticipated, but the situation is improving.
- 0.111: GridPP tape storage made available to LCG/EGEE. At present the tape storage is being used, but not really via the Grid route.
- 0.112: Fraction of available KSI2K used in quarter. A rough estimate shows about 42% of the available CPU was used, compared to a target of 70%.
- 0.113: Fraction of available disk used in quarter. Estimated at 64% compared to the target of 70%.
- 0.114: Fraction of available tape used in quarter. Estimated at 61% compared to the target of 70%.
- 0.131: Tier-1 service disaster recovery plans up to date. These have not been updated within the last 6 months.
- 0.143: Accumulated scheduled downtime in the last quarter. The current value of 418 days is almost identical to the current target of 411 days. The metric expects the 25% figure to reduce to 5% by the third year.
- 3.6.3: LCG deployment evaluation reports. The first report, due in March 2005, was delayed to the second quarter.
- 5.2.4: Tier-2 hardware realisation. This flags the same issue as 0.106 and 0.108 above: Tier-2 hardware has been delayed, but the situation is improving.
- 5.2.7: Quarterly reports received within 1 month of the end of the quarter. The 05Q1 reports were received late; some of the delay was due to the unfortunate timing of EGEE meetings.
- 6.2.11: Non-HEP applications tested on the GridPP Grid (submitted via the NGS submission mechanism). The NGS submission mechanism is not yet adequate.

The “Get Fit” Plan

…not (yet) “The Final Solution”. We hope this drives the right behaviour. Plea bargaining is (probably) OK…

Some Problem Solving Strategies

Problem Solving and Improved Communication

“Communication, in essence, is the shift of a particle from one part of space to another part of space. A particle is the thing being communicated. It can be an object, a written message, a spoken word or an idea. In its crudest definition, this is communication. This simple view of communication leads to the full definition: Communication is the consideration and action of impelling an impulse or particle from source-point across a distance to receipt-point, with the intention of bringing into being at the receipt-point a duplication and understanding of that which emanated from the source-point.” (from The Scientology Handbook)

This may be a clue to how we will overcome our problems. But we can always improve this…