Operating the World’s largest grid infrastructure

Operating the World’s largest grid infrastructure. Roberto Barbera, University of Catania and INFN. Grids@work, Sophia Antipolis, 10.10.2005

Contents: Scope and purpose of the activity, Organisation, Major tasks, Interaction points

The EGEE Activities: 48% Services, 28% Networking, 24% Joint Research
Joint Research: JRA1: Middleware Engineering and Integration; JRA2: Quality Assurance; JRA3: Security; JRA4: Network Services Development
Services: SA1: Grid Operations, Support and Management; SA2: Network Resource Provision
Networking: NA1: Management; NA2: Dissemination and Outreach; NA3: User Training and Education; NA4: Application Identification and Support; NA5: Policy and International Cooperation
The emphasis in EGEE is on operating a production Grid and on supporting the end-users.

Overall Description: Provide access to and operate a production grid infrastructure. Different user communities imply multiple VOs. Facilities in Europe and other collaborating sites. Make best use of existing grid initiatives. Build upon EGEE 1 experience. What is needed to achieve this?

Key Objectives
Core infrastructure services: information system (IS), data management, VO services (driven by the VOs).
Monitoring and control: performance and operational state; initiate corrective actions.
Middleware deployment: integrate, certify and package middleware components; support for new resources, setup and operation; feedback with middleware activities in and outside of EGEE.
User and resource support: receive problem reports; coordinate operational problem resolution.
Grid management: coordination of the implementation with the ROCs; negotiation of SLAs; keep in contact with the wider Grid community; liaison with and participation in standards bodies.
International collaboration: interoperability with large-scale grids in the US and the Asia-Pacific region; seamless access for the EGEE user community to resources.
Capture and provide requirements relevant for operations, deployment and (some aspects of) security, and follow them up.

Contents: Scope and purpose of the activity, Organisation, Major tasks, Interaction points

Organisation: Simplification
EGEE 1 structure: OMC, CICs, ROCs, RCs.
EGEE 2 (the future): Operations Coordination Centre, Regional Operations Centres, Resource Centres.
What happened to the CICs? All CICs are co-located with ROCs, and some ROCs provide CIC services, so the structure is adjusted to current practice: basic ROCs and ROCs with CIC functions. Easy transition, different set of services.

Operations Coordination Centre (located at CERN)
Core responsibilities: middleware integration, certification and distribution packs.
Coordinate: deployment and support; grid operation and support; user support activity; operational security activity; SLAs (negotiate and monitor); interoperability with non-EGEE regions (the ROCs are more focused on national/regional grids).
Act as a ROC: current CIC functions (10+ RBs….); ROC for RCs in non-EGEE regions.

Regional Operations Centres I
Support ALL sites in their region: EGEE partners and friends.
Core responsibilities (incomplete) of ALL ROCs:
First-line user support (call centre, regional training, ..).
First-line operational support (the ROC “owns” operational problems).
Coordination: deployment of middleware releases to its RCs; with national and regional grid projects; regional Grid security (incident response teams, together with the RCs).
Negotiate resources for new VOs.
Manage SLAs.
Run infrastructure services.
Support the EGEE production AND pre-production services.

Regional Operations Centres II: Additional Roles
Who? Current CICs and ROCs with sufficient resources and expertise.
Operations management: Operations Centre on-duty shifts; monitoring, management, troubleshooting; improve, develop and run tools.
User support management.
Coordinate the Joint Security Policy Group (now @ RAL).
Run additional grid services (including VO-specific ones).
Collaborate in the release process: specific aspects of certification, porting, …
Security vulnerability and risk analysis (NEW): coordinate (partial) code reviews, best practice, …
The ROC concept can serve as an operations model in non-EGEE regions.

Contents: Scope and purpose of the activity, Organisation, Major tasks, Interaction points

Tasks I: Middleware testing and certification
Overall: middleware testing and certification; operate the production and pre-production service. Some tasks are described implicitly with the ROC and OCC roles.
Where? Central coordination, some external contribution.
Expected results: middleware distributions for production; select components from internal and external sources; negotiate support.
Integration and testing could be a joint activity with JRA1. Testing needs to start from day 1 (sufficiently staffed).
Certification covers the integrated system: co-existence/interoperability; deployability, functionality, configuration and management of components; an extended set of OSs; optional integration with the Virtual Data Toolkit (VDT), which ensures US interoperability (see the sketch below).
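As a rough illustration only, the certification step can be thought of as running a matrix of checks over a set of OS targets. The sketch below is a minimal Python rendering of that idea; the check names, OS labels and pass/fail logic are hypothetical placeholders, not the actual SA1 certification suite.

```python
# Hypothetical certification-matrix runner: each check is a callable that
# returns True/False for a given OS target. All names are illustrative.
from typing import Callable, Dict, List

Check = Callable[[str], bool]

def run_certification(os_targets: List[str], checks: Dict[str, Check]) -> bool:
    """Run every check on every OS target and report an overall verdict."""
    all_ok = True
    for target in os_targets:
        for name, check in checks.items():
            ok = check(target)
            print(f"[{target}] {name}: {'PASS' if ok else 'FAIL'}")
            all_ok = all_ok and ok
    return all_ok

if __name__ == "__main__":
    # Placeholder checks; a real certification suite would exercise the
    # deployed middleware (deployability, functionality, configuration, ...).
    checks = {
        "deployability": lambda os: True,
        "functionality": lambda os: True,
        "configuration": lambda os: True,
    }
    release_ok = run_certification(["SLC3", "SLC4"], checks)
    print("release certified" if release_ok else "release rejected")
```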

Tasks II: Testbeds
A set of testbeds at CERN for rapid setup. Regions contribute to well-defined aspects: deployment tests, MPI support, batch systems, ports to different architectures.

Tasks III: Middleware deployment and support
OCC coordination; ROCs coordinate and support their RCs.
Expected results: deploy the agreed set to all sites; a region can support supersets (but NOT subsets); stick to the agreed schedule.
Service layers (new):
Core services (CE, SE, local catalogues, …): long update cycles (1 to 2 times a year plus security-driven updates); at all sites.
Additional services (central catalogues, IS, monitoring, RBs): not present at all sites (mainly some ROCs); shorter update cycles (on demand?).
Client tools on WNs: installed in user space; new versions made available by a central team; VOs select their preferred version (see the sketch below). Ongoing work on simplification of installation and configuration.
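For the client tools installed in user space, the sketch below illustrates one way a VO-selected version could be activated on a worker node; the directory layout, version string, environment variable and tool name are all hypothetical, not taken from the presentation.

```python
# Hypothetical selection of a user-space client-tools version on a worker node.
# Layout, versions and tool name are illustrative only.
import os
import subprocess

CLIENT_ROOT = "/opt/grid-clients"  # hypothetical user-space install area

def client_env(version: str) -> dict:
    """Return an environment with PATH/LD_LIBRARY_PATH pointing at one version."""
    base = os.path.join(CLIENT_ROOT, version)
    env = os.environ.copy()
    env["PATH"] = os.path.join(base, "bin") + os.pathsep + env.get("PATH", "")
    env["LD_LIBRARY_PATH"] = (os.path.join(base, "lib") + os.pathsep
                              + env.get("LD_LIBRARY_PATH", ""))
    return env

if __name__ == "__main__":
    # A VO job picks the version it was validated against, e.g. "1.4.2".
    env = client_env(os.environ.get("VO_CLIENT_VERSION", "1.4.2"))
    subprocess.run(["some-client-tool", "--version"], env=env, check=False)
```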

Tasks IV: Grid operations and support. OCC and ROCs. Expected result: manage the grid operation. This has been included in the description of the ROC and OCC roles.

Tasks V: Grid security and incident response
Security Coordination Group: central coordination of incident response. Led by the EGEE Security Head (a PEB member), together with the Middleware Security Architect, the chair of the Joint Security Policy Group (SA1) and the chair of the EUGridPMA.
Expected results: coordination of the security-related aspects of architecture, deployment and operation, including standardization work.

Tasks VI: Grid security and incident response (continued)
Incident response is coordinated at the OCC; the ROCs coordinate the incident response in their region. This requires resources at all RCs and ROCs, and needs a strong mandate.
Expected results: minimize security risks through fast response; ensure best practice; an EGEE-wide team, with members from the ROCs/RCs, to react to security incidents.

Tasks VII: Support for Virtual Organizations, Applications and Users
Central coordination at the OCC and all ROCs.
Expected results: user support is distributed. Each ROC provides front-line support for local users and contributes experts to the overall user support. VOs provide user support and filter problems. Existing help desks at major centres should be integrated into the support structure, filtering and injecting problems into the grid support.
User support: call centres and helpdesks (ROCs), training, VO support and integration (NA4 with teams like the LCG-EIS).
We currently do not have a good model for user support; there is some experience from LCG (can this be mapped???). Needs resources from the ROCs, the OCC and the VOs.

Tasks VIII: Grid management and interoperation
Grid management: see the OCC and ROC roles; the ROC coordinator must have a strong presence at the OCC.
Interoperation: the ROCs focus on national/regional grids, the OCC on non-EGEE regions. Coexistence and common policies have to be clarified; NA4 has to participate in the definition of “seamless”.
Application <----> Resource Provider coordination: some resources should be made available to most (if not all) applications; this could become part of the SLAs (opportunistic usage?); needs clarification.
Application <-> RC <-> Middleware coordination: SA1 needs to be part of this; the ROCs aggregate regional feedback. Coordination?

Contents: Scope and purpose of the activity, Organisation, Major tasks, Interaction points

Interactions
JRA1: integration and testing; security; deployment and operational requirements; training.
JRA2: work on QA metrics for operations; link of QA and monitoring.
NA4: resource negotiation; production middleware stack definition; user support.
NA3: receiving and providing training (SA1 has provided significant training).
SA2: link between the network operation centre and grid operations.

EGEE Infrastructure Site Map (status 25 July 2005), in collaboration with LCG, NorduGrid and Grid3/OSG.

Grid monitoring
Operation of the production service: real-time display of grid operations; accounting information.
Selection of monitoring tools: GIIS Monitor + Monitor Graphs; Site Functional Tests; GOC Data Base; Scheduled Downtimes; Live Job Monitor; GridIce (VO + fabric view); Certificate Lifetime Monitor.
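To illustrate the kind of aggregation these monitoring tools perform, the following minimal sketch rolls per-site functional-test results into a site status summary; the record format, test names and site names are invented for the example and do not reflect the actual Site Functional Tests interfaces.

```python
# Hypothetical aggregation of per-site functional-test results into a status view.
from collections import defaultdict

# Illustrative records: (site, test name, passed?)
results = [
    ("SITE-A", "job-submission", True),
    ("SITE-A", "replica-lookup", False),
    ("SITE-B", "job-submission", True),
    ("SITE-B", "replica-lookup", True),
]

def site_status(records):
    """A site is 'OK' only if every functional test passed."""
    per_site = defaultdict(list)
    for site, test, passed in records:
        per_site[site].append((test, passed))
    return {site: ("OK" if all(p for _, p in tests) else "DEGRADED")
            for site, tests in per_site.items()}

if __name__ == "__main__":
    for site, status in sorted(site_status(results).items()):
        print(f"{site}: {status}")
```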

Service Usage
VOs and users on the production service. Active VOs: HEP (the 4 LHC experiments, D0, CDF, Zeus, Babar), Biomed, ESR (Earth Sciences), computational chemistry, Magic (astronomy), EGEODE (geophysics). Registered users in these VOs: 800+. Many local VOs, supported by their ROCs.
Scale of work performed in the LHC data challenges 2004: >1 M SI2K years of CPU time (~1000 CPU years); 400 TB of data generated, moved and stored; one VO achieved ~4000 simultaneous jobs (~4 times CERN's grid capacity).
Number of jobs processed per month (April 2004 - April 2005).
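A quick consistency check of the quoted CPU time, assuming an average CPU rating of roughly 1 kSI2K (a typical figure for the era, but an assumption, not stated on the slide):

```python
# Rough consistency check of ">1 M SI2K years ~ 1000 CPU years".
SI2K_YEARS = 1_000_000          # quoted CPU time in SpecInt2000 * years
AVG_CPU_RATING_SI2K = 1_000     # assumed average rating per CPU (~1 kSI2K)

cpu_years = SI2K_YEARS / AVG_CPU_RATING_SI2K
print(f"~{cpu_years:.0f} CPU years")   # -> ~1000 CPU years, matching the slide
```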

EGEE infrastructure usage Average job duration January 2005 – June 2005 for the main VOs

Service Challenges: ramp up to the LHC start-up service
Milestones: June 05, Technical Design Report; Sep 05, SC3 service phase; May 06, SC4 service phase; Sep 06, initial LHC service in stable operation; Apr 07, LHC service commissioned.
SC2: reliable data transfer (disk-network-disk); 5 Tier-1s, aggregate 500 MB/sec sustained at CERN.
SC3: reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period).
SC4: all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain including analysis; sustain the nominal final grid data throughput.
LHC service in operation from September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput.
(Timeline chart: 2005-2008, cosmics, first beams, first physics, full physics run; SC2, SC3, SC4, LHC Service Operation.)
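As a back-of-the-envelope check, the 500 MB/sec sustained target translates into the daily volume and line rate below; the calculation is plain unit conversion, and the ~10 Gbit/sec comparison refers to the bandwidth figure on the next slide.

```python
# Unit-conversion check of the sustained-throughput targets.
MB_PER_SEC = 500                       # SC2/SC3 aggregate target at CERN
SECONDS_PER_DAY = 24 * 3600

daily_volume_tb = MB_PER_SEC * SECONDS_PER_DAY / 1_000_000   # MB -> TB (decimal)
line_rate_gbit = MB_PER_SEC * 8 / 1000                       # MB/s -> Gbit/s

print(f"{daily_volume_tb:.0f} TB/day sustained")   # ~43 TB per day
print(f"{line_rate_gbit:.0f} Gbit/s on the wire")  # ~4 Gbit/s, within a ~10 Gbit/s link
```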

Why Service Challenges? To test Tier-0 -> Tier-1 -> Tier-2 services.
Network service: sufficient bandwidth (~10 Gbit/sec); backup path; quality of service: security, help desk, error reporting, bug fixing, ..
Robust file transfer service: file servers; file transfer software (GridFTP); data management software (SRM, dCache); archiving service: tape servers, tape robots, tapes, tape drives, ..
Sustainability: weeks in a row of uninterrupted 24/7 operation; manpower implications: ~7 FTE/site; quality of service: helpdesk, error reporting, bug fixing, ..
Towards a stable production environment for the experiments.

SC2 met its throughput targets: a >600 MB/s daily average for 10 days was achieved, from midday 23rd March to midday 2nd April. Not without outages, but the system showed it could recover the rate again after outages. Load was reasonably evenly divided over the sites (given the network bandwidth constraints of the Tier-1 sites).
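A similar back-of-the-envelope check for the SC2 result: sustaining a >600 MB/s daily average for 10 days corresponds to roughly half a petabyte moved in total.

```python
# Total volume implied by the SC2 result (unit conversion only).
MB_PER_SEC = 600
DAYS = 10

total_tb = MB_PER_SEC * DAYS * 24 * 3600 / 1_000_000   # MB -> TB (decimal)
print(f"~{total_tb:.0f} TB moved over {DAYS} days")    # ~518 TB, about 0.5 PB
```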

Straw-man migration plan to gLite
Certification test-bed: core functionality tested, some stress tests. Thresholds for moving to pre-production: functionality of gLite at least that of LCG-2; stability not worse than (80)% of LCG-2 on the same test-bed; performance of the core functions (job submission, file registration, file lookup, file delete, data movement, ..) not less than (50)% of LCG-2 (see the sketch below).
Pre-production: thresholds as for the certification test-bed, plus scalability testing. Applications: the overall perceived usability has to be comparable with LCG-2.
Once the thresholds are achieved: LCG-2 is frozen, except for security fixes, with no porting to new OS releases. This ensures that LCG-2 will be phased out with the current version of the OS.
Introduction to production: major sites deploy gLite CEs in parallel with the LCG-2 CEs; WNs provide client libraries for both stacks; some of the smaller sites convert fully to gLite. Proceed incrementally until (50)% of the resources are available through gLite, re-applying the threshold tests as on pre-production (stricter?).
Final steps: migrate catalogues and data if needed (takes ~3 months); all smaller sites convert to gLite; larger sites continue to provide access to LCG-2 data, but the LCG-2 SEs are made read-only to encourage migration of applications.
This keeps LCG-2 as a viable fall-back, avoids having to state a drop-dead date for LCG-2 (but sets conditions), and provides a migration environment for applications.
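The promotion thresholds in the straw-man plan can be read as a simple decision function. The sketch below encodes the (80)% stability and (50)% performance criteria relative to LCG-2; the metric definitions and sample values are illustrative only.

```python
# Hypothetical gate for promoting a gLite build to pre-production, encoding the
# slide's criteria: functionality >= LCG-2, stability >= 80% of LCG-2,
# core-function performance >= 50% of LCG-2.
from dataclasses import dataclass

@dataclass
class Metrics:
    functionality: float   # fraction of the LCG-2 functional tests passed
    stability: float       # e.g. fraction of time the test-bed stayed usable
    performance: float     # aggregate rate of core functions (jobs/s, files/s, ...)

def ready_for_preproduction(glite: Metrics, lcg2: Metrics) -> bool:
    return (glite.functionality >= lcg2.functionality
            and glite.stability >= 0.80 * lcg2.stability
            and glite.performance >= 0.50 * lcg2.performance)

if __name__ == "__main__":
    # Illustrative numbers only.
    lcg2 = Metrics(functionality=1.00, stability=0.95, performance=100.0)
    glite = Metrics(functionality=1.00, stability=0.90, performance=60.0)
    print("promote to pre-production" if ready_for_preproduction(glite, lcg2)
          else "keep on certification test-bed")
```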