Published by Felix Nash
1
LCG Operations
Ian Bird
LCG Deployment Area Manager & EGEE Operations Manager
IT Department, CERN
Presentation to HEPiX, 22nd October 2004
2
Grid Operations: Scope of Responsibilities
Certification activities
  - Certification of middleware as a coherent set of services
  - Preparing that package for deployment
Operational and support activities
  - Coordinating and supporting the deployment to collaborating computer centres
  - Coordinating Grid Operations activities
  - Providing operational support
  - Providing operational security support
  - Providing user support
  - CA management
  - VO registration and management
Policy
  - CA and user registration policies
  - Operational policy
  - Security policies
  - Resource usage and access policies
3
LHC Computing Model (simplified!!)
Tier-0: the accelerator centre
  - Filter raw data; reconstruct event summary data (ESD)
  - Record the master copy of raw and ESD
Tier-1:
  - Managed Mass Storage: permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases; grid-enabled data service
  - Data-heavy (ESD-based) analysis
  - Re-processing of raw data
  - National, regional support
  - "online" to the data acquisition process: high availability, long-term commitment
Tier-2:
  - Well-managed, grid-enabled disk storage
  - End-user analysis, batch and interactive
  - Simulation
[Figure: diagram of Tier-1 centres, Tier-2 centres, small centres, desktops and portables. Site labels: RAL, IN2P3, FNAL, USC, …, Krakow, CIEMAT, Rome, Taipei, LIP, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC, NIKHEF, TRIUMF, CNAF, FZK, BNL, PIC, ICEPP, Nordic, …]
4
[Figure: map of grid sites. Labels, in source order: LCG; LCG-2; 25 Universities, 4 National Labs, 2800 CPUs; Grid3; 30 sites, 3200 CPUs]
Total: 78 sites, ~9000 CPUs, 6.5 PByte
5
Operations services for LCG
Operational support
  - Hierarchical model
    - CERN acts as 1st level support for the Tier 1 centres
    - Tier 1 centres provide 1st level support for associated Tier 2s
    - Tier 1 "Primary sites"
  - Grid Operations Centres (GOC)
    - Provide operational monitoring, troubleshooting, coordination of incident response, etc.
    - RAL (UK) led sub-project to prototype a GOC
    - 2nd GOC in Taipei now in prototype
User support
  - Central model
    - FZK provides user support portal
      - Problem tracking system, web-based and available to all LCG participants
    - Experiments provide triage of problems
    - CERN team provide in-depth support and support for integration of experiment sw with grid middleware
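The hierarchical model above can be read as a simple lookup chain: each site has one first-level support contact, and problems move up that chain. The following is only a minimal sketch of that idea. The site names are hypothetical placeholders, since the slides do not say which Tier 2 is associated with which Tier 1.

```python
# Hypothetical site names: the slides do not list Tier 1 / Tier 2 associations.
FIRST_LEVEL_SUPPORT = {
    "tier2-a": "tier1-x",   # Tier 1 centres support their associated Tier 2s
    "tier2-b": "tier1-x",
    "tier2-c": "tier1-y",
    "tier1-x": "CERN",      # CERN is 1st level support for the Tier 1 centres
    "tier1-y": "CERN",
}


def escalation_path(site):
    """Return the chain of support contacts for a site, nearest first."""
    path = []
    current = site
    # Walk upwards until we reach a site with no parent (CERN in this sketch).
    while current in FIRST_LEVEL_SUPPORT:
        current = FIRST_LEVEL_SUPPORT[current]
        path.append(current)
    return path


print(escalation_path("tier2-a"))  # ['tier1-x', 'CERN']
```

A flat dictionary is enough here because each site has exactly one first-level contact in the model described on the slide.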
6
Support Teams within LCG
Support teams and the problems they handle:
  - CERN Deployment Support (CDS): Middleware Problems
  - Grid Operations Center (GOC): Operations Problems
  - Resource Centers (RC): Hardware Problems
  - Experiment Specific User Support (ESUS): Software Problems
  - Global Grid User Support (GGUS): Single Point of Contact, Coordination of User Support
User communities:
  - 4 LHC experiments (Alice, Atlas, CMS, LHCb)
  - 4 non-LHC experiments (BaBar, CDF, Compass, D0)
  - Other Communities (VOs)
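The team layout above amounts to a routing table keyed by problem type, with GGUS as the single entry point. A minimal sketch follows. The category keys and the fallback behaviour (unclassified problems stay with GGUS) are assumptions made for illustration and are not stated on the slide.

```python
# Problem type -> responsible team, as listed on the slide.
# The lowercase category keys are an illustrative assumption.
SUPPORT_TEAMS = {
    "middleware": "CERN Deployment Support (CDS)",
    "operations": "Grid Operations Center (GOC)",
    "hardware": "Resource Centers (RC)",
    "software": "Experiment Specific User Support (ESUS)",
}

SINGLE_POINT_OF_CONTACT = "Global Grid User Support (GGUS)"


def route_problem(category):
    """Return the team that handles a problem of the given category."""
    # Assumption: anything that cannot be classified stays with GGUS,
    # which coordinates user support.
    return SUPPORT_TEAMS.get(category.lower(), SINGLE_POINT_OF_CONTACT)


print(route_problem("Hardware"))
print(route_problem("unknown"))
```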
7
Experiences in deployment
  - LCG covers many sites (>70) now, both large and small
    - Large sites have existing infrastructures and need to add grid interfaces etc. on top
    - Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
    - Satisfying both simultaneously is hard: it requires very flexible packaging, installation, and configuration tools and procedures
      - A lot of effort had to be invested in this area
  - There are many problems, but in the end we are quite successful
    - System is stable and reliable
    - System is used in production
    - System is reasonably easy to install now: 60 sites
    - Now have a basis on which to incrementally build essential functionality
  - This infrastructure forms the basis of the initial EGEE production service
8
LCG Operations
EGEE Operations
9
What is EGEE? (I)
EGEE (Enabling Grids for E-science in Europe) is a seamless Grid infrastructure for the support of scientific research, which:
  - Integrates current national, regional and thematic Grid efforts
  - Provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location
[Figure: layer diagram. Labels: Applications, Geant network, Grid infrastructure]
10
What is EGEE? (II)
  - 70 leading institutions in 28 countries, federated in regional Grids
  - 32 M Euros EU funding (2004-5), O(100 M) total budget
  - Aiming for a combined capacity of over 8000 CPUs (the largest international Grid infrastructure ever assembled)
  - ~300 persons
11
EGEE Activities
Emphasis on operating a production grid and supporting the end-users
  - 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
  - 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
  - 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
12
LCG and EGEE Operations
  - EGEE is funded to operate and support a research grid infrastructure in Europe
  - The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
    - LCG includes US and Asia-Pacific, EGEE includes other sciences
    - Substantial part of infrastructure common to both
  - LCG Deployment Manager is the EGEE Operations Manager
    - CERN team (Operations Management Centre) provides coordination, management, and 2nd level support
  - Support activities are expanded with the provision of
    - Core Infrastructure Centres (CIC) (4)
    - Regional Operations Centres (ROC) (9)
      - ROCs are coordinated by Italy, outside of CERN (which has no ROC)
13
LCG EGEE in Europe
User support:
  - Becomes hierarchical
  - Through the Regional Operations Centres (ROC)
    - Act as front-line support for user and operations issues
    - Provide local knowledge and adaptations
Coordination:
  - At CERN (Operations Management Centre) and CIC for HEP
Operational support:
  - The LCG GOC is the model for the EGEE CICs
    - CICs replace the European GOC at RAL
    - Also run essential infrastructure services
    - Provide support for other (non-LHC) applications
    - Provide 2nd level support to ROCs
14
Summary
  - Data challenges demonstrated:
    - Many m/w functional and performance issues (documented)
    - Main problem is service stability
      - Site fabric management, configuration, change control
      - Etc.
    - Grid3 report similar problems …
    - User support process needs improvement
  - Now moving into continuous production + service & data challenges
15
How to move forward - 1
  - Build an agreed operations model for the next year
    - Should be able to evolve
  - Operations/Fabric workshop, Nov 2-4
    - HEPiX ½ day: input from some sites and Grid3/OSG on their plans
    - Documenting use-cases (based on experience), propose support mechanisms for each
    - EGEE SA1 infrastructure
    - 5 working groups:
      - Operations support
      - User support
      - Operational security
      - Fabric management issues
      - SW needs and tools requirements from operations
  - Need fabric management training for many sites
16
Some issues
Resource Centres:
  - Large sites have operations staff and/or on-call support
  - Small sites have no on-call and often little support at all
Regional Operations Centres:
  - Probably do not provide after-hours or on-call support. If they did, the support model could rely more on the ROCs. However, it is clear that most ROCs will not have this level of support.
Core Infrastructure Centres:
  - Must have on-call support after-hours
  - To be rotated through the 4 or 5 active CICs
Thus, a basic question to answer is how much power or control the CICs can have in order to deal with problems when staff at RCs and ROCs are not available:
  - Either CICs have rights to manage critical services on sites where there is no support, or
  - CICs have the right to remove "broken" sites and services from the infrastructure.
  - Likely that we have all combinations of these …
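The open question above is a policy choice with two options. As a rough illustration only, the sketch below encodes both options and shows how an after-hours problem might be handled under each. The function and enum names are hypothetical, and the slides do not say which option (or combination) was adopted.

```python
from enum import Enum


class CicAuthority(Enum):
    """The two options raised on the slide for what a CIC may do."""
    MANAGE_SERVICES = "manage critical services on the site"
    REMOVE_SITE = "remove the broken site or service from the infrastructure"


def after_hours_action(site_staff_available, authority):
    """Describe who acts on a problem outside working hours.

    site_staff_available: True if RC or ROC staff can be reached.
    authority: the CicAuthority agreed for this site (an assumption:
    the slides leave open which one applies, possibly per site).
    """
    if site_staff_available:
        return "site staff handle the problem"
    return "on-call CIC may " + authority.value


print(after_hours_action(False, CicAuthority.REMOVE_SITE))
```

Passing the authority per call reflects the slide's remark that all combinations are likely to exist across sites.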
17
Immediate actions
  - Weekly operations meeting (Monday afternoon)
    - Weekly reports from ROCs, CICs, other Tier 1s etc.
  - Operations Manager
    - Role rotates through 4 EGEE CICs: manage problem reporting and follow up
    - Hand over responsibility in weekly meeting
  - Operational security team
    - Being set up, led by Ian Neilson; strong collaboration between US and Europe on these issues.
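A weekly rotation through four centres, handed over at the Monday meeting, can be expressed as a round-robin over week numbers. The sketch below is illustrative only: the CIC names are placeholders (the slides do not name the four centres), and the rotation start date is an assumed Monday chosen for the example.

```python
from datetime import date

# Placeholder names: the slides do not identify the four CICs.
CICS = ["CIC-1", "CIC-2", "CIC-3", "CIC-4"]

# Assumed start of the rotation (a Monday); not taken from the slides.
ROTATION_START = date(2004, 10, 25)


def operations_manager_on(day):
    """Return the CIC holding the Operations Manager role on a given day.

    Simplification: the handover at the Monday meeting is treated as
    taking effect for the whole of that Monday.
    """
    weeks_elapsed = (day - ROTATION_START).days // 7
    return CICS[weeks_elapsed % len(CICS)]


print(operations_manager_on(date(2004, 11, 1)))  # CIC-2
```

After four weeks the role returns to the first centre, so each CIC carries the duty one week in four under these assumptions.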