Tony Doyle GridPP – From Prototype To Production, HEPiX Meeting, Edinburgh, 25 May 2004.

1 Tony Doyle a.doyle@physics.gla.ac.uk GridPP – From Prototype To Production, HEPiX Meeting, Edinburgh, 25 May 2004

2 Tony Doyle - University of Glasgow
Outline
GridPP Project Introduction
UK Context
Components:
A. Management
B. Middleware
C. Applications
D. Tier-2
E. Tier-1
F. Tier-0
Challenges:
– Middleware Validation
– Improving Efficiency
– Meeting Experiment Requirements
– ..Via The Grid?
– Work Group Computing
– Events.. To Files.. To Events
– Software Distribution
– Distributed Analysis
Historical Perspective
What is the Grid Anyway?
Is GridPP a Grid?
Summary

3 GridPP – A UK Computing Grid for Particle Physics
GridPP: 19 UK Universities, CCLRC (RAL & Daresbury) and CERN
Funded by the Particle Physics and Astronomy Research Council (PPARC)
GridPP1 – Sept. 2001-2004, £17m, "From Web to Grid"
GridPP2 – Sept. 2004-2007, £16(+1)m, "From Prototype to Production"

4 GridPP in Context
[Diagram, not to scale. Labels: UK Core e-Science Programme; Institutes; Tier-2 Centres; CERN; LCG; EGEE; GridPP; Tier-1/A; Middleware, Security, Networking; Experiments; Grid Support Centre; Apps Dev; Apps Int]

5 GridPP1 Components
LHC Computing Grid Project (LCG): Applications, Fabrics, Technology and Deployment
European DataGrid (EDG): Middleware Development
UK Tier-1/A Regional Centre: Hardware and Manpower
Grid Application Development: LHC and US Experiments + Lattice QCD
Management: Travel etc

6 GridPP2 Components
A. Management, Travel, Operations
B. Middleware, Security, Network Development
C. Grid Application Development: LHC and US Experiments + Lattice QCD + Phenomenology
D. Tier-2 Deployment: 4 Regional Centres - M/S/N support and System Management
E. Tier-1/A Deployment: Hardware, System Management, Experiment Support
F. LHC Computing Grid Project (LCG Phase 2) [review]

7 A. GridPP Management
[Organisation chart. Labels: Collaboration Board; Project Management Board; Project Leader; Project Manager; Technical (Deployment) Board; Experiments (User) Board; (Production Manager); (Dissemination Officer); GGF, LCG, EDG (EGEE), UK e-Science, Liaison; GridPP1 (GridPP2); Project Map; Risk Register]

8 A. Management Structure In LCG Context
[Diagram. Labels: ARDA; Expmts; EGEE; LCG; PMB; CB; Deployment Board (Tier1/Tier2, Testbeds, Rollout; Service specification & provision); User Board (Requirements; Application Development; User feedback); Metadata; Workload; Network; Security; Info. Mon.; Storage]

9 B. Middleware, Security and Network Development: Managing the Middleware
[Diagram of the GridPP2 Project. Layers: I. Experiment Layer; II. Application Middleware; III. Grid Middleware; IV. Facilities and Fabrics. Labels: ARDA; Expmts; EGEE; LCG; PMB; Deployment Board (Tier1/Tier2, Testbeds, Rollout; Service specification & provision); User Board (Requirements; Application Development; User feedback); Metadata; Workload; Network; Security; Info. Mon.; Storage]

10 B. Middleware, Security and Network Development
M/S/N builds upon UK strengths as part of International development
[Diagram. Groups: Security; Middleware; Networking. Labels: Configuration Management; Storage Interfaces; Network Monitoring; Security; Information Services; Grid Data Management]

11 C. Application Development
GANGA, SAMGrid, Lattice QCD, AliEn, ARDA, CMS, BaBar

12 D. UK Tier-2 Centres
NorthGrid ****: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
SouthGrid *: Birmingham, Bristol, Cambridge, Oxford, RAL PPD, Warwick
ScotGrid *: Durham, Edinburgh, Glasgow
LondonGrid ***: Brunel, Imperial, QMUL, RHUL, UCL
Current UK Status: 10 Sites via LCG

13 D. The UK Testbed: Hidden Sector

14 E. The UK Tier-1/A Centre
High quality data services
National and International Role
UK focus for International Grid development
Experiments: LHCb, ATLAS, CMS, BaBar
April 2004: 700 Dual CPU, 80TB Disk, 60TB Tape (Capacity 1PB)
Grid Operations Centre

15 Real Time Grid Monitoring: LCG2, 24 May 2004

16 E. Grid Operations
Grid Operations Centre
– Core Operational Tasks
– Monitor infrastructure, components and services
– Troubleshooting
– Verification of new sites joining Grid
– Acceptance tests of new middleware releases
– Verify suppliers are meeting SLA
– Performance tuning and optimisation
– Publishing use figures and accounts
– Grid information services
– Monitoring services
– Resource brokering
– Allocation and scheduling services
– Replica data catalogues
– Authorisation services
– Accounting services
Grid Support Centre
– Core Support Tasks
– Running UK Certificate Authority

17 F. Tier 0 and LCG: Foundation Programme
Aim: build upon Phase 1
Ensure development programmes are linked
Project management: GridPP LCG
Shared expertise:
LCG establishes the global computing infrastructure
Allows all participating physicists to exploit LHC data
Earmarked UK funding to be reviewed in Autumn 2004
Required Foundation: LCG Fabric, Technology and Deployment

18 The Challenges Ahead I: Implementing the Validation Process
[Flow diagram of the release process.
Stages: Unit Test; Build; Integration; Certification; Production.
Groups: WPs; Build System; Integration Team; Test Group; Apps. Representatives; Users.
Testbeds: Development Testbed ~15 CPU; Certification Testbed ~40 CPU; Application Testbed ~1000 CPU.
Steps: add unit tested code to repository; run nightly build & auto. tests; individual WP tests; overall release tests; tagged package; tagged release selected for certification; Grid certification; certified release selected for deployment; application certification; certified public release for use by apps.; problem reports; fix problems; 24x7.
Releases: release candidate; tagged releases; certified releases.]
Process to: Test frameworks, Test support, Test policies, Test documentation, Test platforms/compilers
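The validation process on slide 18 can be read as a release moving through successive stages, with problem reports sending it back for fixes. The following is only a minimal sketch of that reading: the `Stage` names, the `advance` function and the rule that a failed test returns the release to the first stage are illustrative assumptions, not part of the EDG/LCG tooling. The testbed sizes are the approximate figures quoted on the slide.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names, loosely following the slide's flow."""
    UNIT_TESTED = 1   # unit tested code added to the repository
    TAGGED = 2        # nightly build and automatic tests give a tagged release
    CERTIFIED = 3     # Grid certification gives a certified release
    PUBLIC = 4        # application certification gives a certified public release


# Approximate testbed sizes quoted on the slide (CPU counts).
TESTBED_CPUS = {"development": 15, "certification": 40, "application": 1000}


def advance(stage: Stage, tests_passed: bool) -> Stage:
    """Move a release one stage forward, or back to the start on failure.

    Assumption: any failure produces a problem report and the fixed code
    re-enters the process as unit tested code.
    """
    if not tests_passed:
        return Stage.UNIT_TESTED
    if stage is Stage.PUBLIC:
        return stage  # already released; nothing further to do
    return Stage(stage.value + 1)


if __name__ == "__main__":
    release = Stage.UNIT_TESTED
    for outcome in (True, True, False, True, True, True):
        release = advance(release, outcome)
        print(release.name)
```

The point of the sketch is that each larger testbed is only reached after the smaller one has been passed, so a failure late in the chain is costly because the release has to pass every earlier stage again.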

19 The Challenges Ahead II: Improving Grid Efficiency

20 The Challenges Ahead III: Meeting Experiment Requirements
(UK) Total Requirement:
In International Context - Q2 2004 LCG Resources:

21 The Challenges Ahead IV: Using (Anticipated) Grid Resources
Dynamic Grid Optimisation over JANET Network
2004: ~7,000 1GHz CPUs, ~400 TB disk
2007: ~30,000 1GHz CPUs, ~2200 TB disk
(note x2 scale change)
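The 2004 and 2007 figures on slide 21 imply a growth factor that is easy to work out. The sketch below does the arithmetic; the `growth` helper and the idea of expressing the change as a constant compound annual rate over three years are assumptions made here for illustration, not something the slide states.

```python
# Approximate resource figures quoted on slide 21.
CPUS_2004, CPUS_2007 = 7_000, 30_000          # 1 GHz CPU equivalents
DISK_TB_2004, DISK_TB_2007 = 400, 2_200       # TB of disk
YEARS = 3                                     # 2004 to 2007


def growth(start: float, end: float, years: int) -> tuple[float, float]:
    """Return (total growth factor, equivalent constant annual factor)."""
    total = end / start
    annual = total ** (1 / years)
    return total, annual


cpu_total, cpu_annual = growth(CPUS_2004, CPUS_2007, YEARS)
disk_total, disk_annual = growth(DISK_TB_2004, DISK_TB_2007, YEARS)

if __name__ == "__main__":
    print(f"CPU:  x{cpu_total:.1f} overall, x{cpu_annual:.2f} per year")
    print(f"Disk: x{disk_total:.1f} overall, x{disk_annual:.2f} per year")
```

On these numbers CPU capacity grows a little over fourfold and disk by a factor of 5.5, so disk is anticipated to grow somewhat faster than CPU.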

22 The Challenges Ahead V: Work Group Computing

23 The Challenges Ahead VI: Events.. to Files.. to Events
[Diagram. Tiers: Tier-0 (International); Tier-1 (National); Tier-2 (Regional); Tier-3 (Local). Labels: RAW Data Files; ESD Data Files; AOD Data Files; TAG Data Files; Interesting Events List; Event 1, Event 2, Event 3]
VOMS-enhanced Grid certificates to access databases via metadata
Non-Trivial..
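The "events to files to events" problem on slide 23 is that a physicist selects individual events, but the data is stored and moved as files. A minimal sketch of that lookup follows. The `CATALOGUE` contents, the file names and the `files_for_events` function are all hypothetical; a real system would query a metadata database (with VOMS-based authorisation, as the slide notes) rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical catalogue: data format -> {event id -> logical file name}.
CATALOGUE = {
    "AOD": {1: "aod_001", 2: "aod_001", 3: "aod_002"},
    "ESD": {1: "esd_001", 2: "esd_002", 3: "esd_003"},
    "RAW": {1: "raw_001", 2: "raw_002", 3: "raw_003"},
}


def files_for_events(event_ids, data_format):
    """Group the selected events by the file that holds them.

    The result says which files must be opened, and which events
    to read from each, for the requested data format.
    """
    files = defaultdict(list)
    for event_id in event_ids:
        files[CATALOGUE[data_format][event_id]].append(event_id)
    return dict(files)


if __name__ == "__main__":
    interesting = [1, 3]  # e.g. an "interesting events list" from a TAG query
    for fmt in ("AOD", "ESD", "RAW"):
        print(fmt, files_for_events(interesting, fmt))
```

In this toy catalogue the more detailed formats spread the same events over more files, which is one way to see why going from a short event list back to the underlying data is non-trivial.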

24 The Challenges Ahead VII: software distribution
ATLAS Data Challenge (DC2) this year to validate world-wide computing model
Packaging, distribution and installation:
Scale: one release build takes 10 hours and produces 2.5 GB of files
Complexity: 500 packages, Mloc, 100s of developers and 1000s of users
– ATLAS collaboration is widely distributed: 140 institutes, all wanting to use the software
– needs push-button easy installation..
[Data-flow diagram.
Step 1: Monte Carlo Data Challenges. Labels: Physics Models; Monte Carlo Truth Data; Detector Simulation; MC Raw Data; Reconstruction; MC Event Summary Data; MC Event Tags.
Step 2: Real Data. Labels: Trigger System; Data Acquisition; Level 3 trigger; Run Conditions; Calibration Data; Raw Data; Trigger Tags; Reconstruction; Event Summary Data (ESD); Event Tags]

25 The Challenges Ahead VIII: distributed analysis
Complex workflow…
LCG/ARDA Development
1. AliEn (ALICE Grid) provided a pre-Grid implementation [Perl scripts]
2. ARDA provides a framework for PP application middleware

26 Historical Perspective
I wrote in 1990 a program called "WorlDwidEweb", a point and click hypertext editor which ran on the "NeXT" machine. This, together with the first Web server, I released to the High Energy Physics community at first, and to the hypertext and NeXT communities in the summer of 1991.
The first three years were a phase of persuasion, aided by my colleague and first convert Robert Cailliau, to get the Web adopted… We needed seed servers to provide incentive and examples, and all over the world inspired people put up all kinds of things… Between the summers of 1991 and 1994, the load on the first Web server ("info.cern.ch") rose steadily by a factor of 10 every year…
Tim Berners-Lee

27 What is The Grid Anyway?
From Particle Physics Perspective, The Grid is:
– not hype, but surrounded by it
– a working prototype running on testbed(s)…
– about seamless discovery of PC resources around the world
– using evolving standards for interoperation
– the basis for particle physics computing in the 21st Century
– not (yet) as transparent as end-users want it to be

28 What is The Grid Anyway? Is GridPP a Grid?
What is The Grid Anyway? (http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf)
1. Coordinates resources that are not subject to centralized control
2. … using standard, open, general-purpose protocols and interfaces
3. … to deliver nontrivial qualities of service
Is GridPP a Grid?
1. YES. This is why development and maintenance of a UK-EU-US testbed is important
2. YES... Globus/CondorG/EDG meet this requirement. Common experiment application layers are also important here.
3. NO(T YET)… Experiments define whether this is true - currently only ~100,000 jobs submitted via the testbed c.f. internal component tests of up to 10,000 jobs per day. Next step: LCG-2 deployment outcome… this year

29 GridPP Summary: From Web to Grid
GridPP – Theory and Experiment
UK GridPP started 1/9/01
EU DataGrid: First Middleware ~1/9/01
Development requires a testbed with feedback
– Operational Grid
Fit into UK e-Science structures
Experience in distributed computing essential to build and exploit the Grid
Scale in UK? 0.5 PBytes and 2,000 distributed CPUs GridPP in Sept 2004
Grid jobs are being submitted now.. user feedback loop is important..
All experiments have immediate requirements
Current Experiment Production: The Grid is a small component
Non-technical issues:
– Recognising context
– Building upon expertise
– Defining roles
– Sharing resources
Major deployment activity is LCG
– We contribute significantly to LCG and our success depends critically on LCG
Production Grid will be difficult to realise: GridPP2 planning underway as part of LCG/EGEE
Many Challenges Ahead..

30 GridPP Summary: From Prototype to Production
[Timeline diagram. Years: 2001; 2004; 2007. Stages: Separate Experiments, Resources, Multiple Accounts; Prototype Grids; 'One' Production Grid.
Labels: BaBar; D0; CDF; ATLAS; CMS; LHCb; ALICE; 19 UK Institutes; RAL Computer Centre; CERN Computer Centre; SAMGrid; BaBarGrid; LCG; EDG; GANGA; EGEE; ARDA; UK Prototype Tier-1/A Centre; CERN Prototype Tier-0 Centre; 4 UK Prototype Tier-2 Centres; UK Tier-1/A Centre; CERN Tier-0 Centre; 4 UK Tier-2 Centres; LCG]

31 Why was the failure rate ~20%?
Component Testing, e.g. RB Stress Tests (LCG)
RB never crashed
ran without problems at load for several days in a row
20 streams with 100 jobs each (typical error rate ~2% still present)
RB stress test in a job storm of 50 streams, 20 jobs each:
– 50% of the streams ran out of connections between UI and RB (configuration parameter, but machine constraints)
– Remaining 50% of streams finished normally (2% error rate)
– Time between job-submit and return of the command (acceptance by the RB) is 3.5 seconds (independent of number of streams)
PROBLEMS ARE END-TO-END: e.g. Site advertisement communicated via class ads to all sites (inc. e.g. CNAF) results in RB sending application jobs (e.g. AliEn for ALICE) to black hole – these are recorded as failures (application corrects for these via re-submission)
OTHER PROBLEM IS INCORPORATION OF ADDED FUNCTIONALITY
– ~Resolved by adherence to software process coupled to testbed structure… improved significantly within LCG (leading to EGEE)
[Layer diagram: I. Experiment Layer; II. Application Middleware; III. Grid Middleware; IV. Facilities and Fabrics]
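The RB stress-test figures on slide 31 can be turned into job counts with a little arithmetic. The sketch below does that; the `expected_failures` helper is hypothetical, and it assumes that every job in a stream that ran out of connections counts as not completed, which the slide does not state explicitly. It is not offered as an explanation of the ~20% end-to-end failure rate.

```python
def expected_failures(streams, jobs_per_stream, lost_stream_fraction, error_rate):
    """Return (total jobs, jobs expected not to complete).

    Assumption: jobs in streams that lose their UI-RB connection are all
    counted as not completed; the remaining jobs fail at `error_rate`.
    """
    total = streams * jobs_per_stream
    lost = total * lost_stream_fraction
    failed_in_surviving = (total - lost) * error_rate
    return total, lost + failed_in_surviving


# Sustained load test: 20 streams of 100 jobs, ~2% error rate, no lost streams.
steady_total, steady_failed = expected_failures(20, 100, 0.0, 0.02)

# Job storm: 50 streams of 20 jobs, half the streams run out of connections,
# the other half finish with a ~2% error rate.
storm_total, storm_failed = expected_failures(50, 20, 0.5, 0.02)

if __name__ == "__main__":
    print(f"steady: {steady_failed:.0f} of {steady_total} jobs")
    print(f"storm:  {storm_failed:.0f} of {storm_total} jobs")
```

Under this assumption the sustained test loses about 40 of 2,000 jobs, while the job storm loses about 510 of 1,000. Almost all of the storm losses come from the connection limit rather than from the RB's intrinsic error rate.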

32 What is the GridPP1 Project Status?
76% of the 190 GridPP1 tasks have been successfully completed

