Published by Ariel Strickland. Modified over 9 years ago.
1
EGEE Grid Infrastructure and Operations
Maite Barroso Lopez, CERN, SA1 Deputy Activity Leader
1st EELA conference, Santiago de Chile, 4th September 2006
Enabling Grids for E-sciencE, EGEE-II INFSO-RI-031688, www.eu-egee.org
EGEE and gLite are registered trademarks
2
Outline
EGEE: SA1/SA3
EGEE infrastructure: status
Grid Operations
User Support
Security & Policy
Summary
SA: 54% of total
– SA1 (operations): 86%
– SA2 (network): 3%
– SA3 (certification): 11%
3
A global, federated e-Infrastructure
Related projects and infrastructures: EUIndiaGrid, EUMedGrid, SEE-GRID, EELA, BalticGrid, EUChinaGrid, OSG, NAREGI
EGEE:
– > 192 sites, 40 countries, 11 ROCs
– > 28,000 processors
– ~2,500 TB storage
– > 20,000 concurrent jobs per day
4
Infrastructure status
5
Some statistics
~6000 cpu-months/month
6
Grid Operations
7
EGEE Operations Structure
Operations Coordination Centre (OCC)
Regional Operations Centres (ROC)
– Front-line support for user and operations issues
– Provide local knowledge and adaptations
– One in each region; many are distributed (inc. A-P)
– Manage daily grid operations: oversight, troubleshooting, “Operator on Duty”
– Run infrastructure services
User Support Centre (GGUS)
– In FZK: provide single point of contact (service desk) + portal
8
EGEE Operations Process
Grid operator on duty
Grid monitoring tools
Geographically distributed responsibility for operations:
– There is no “central” operation
– Tools are developed/hosted at different sites: GOCDB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
Procedures described in Operations Manual
– Linked from the CIC portal: https://edms.cern.ch/document/701575
9
Grid Operator on Duty
Role:
– Watch the problems detected by the grid monitoring tools
– Problem diagnosis
– Report these problems (GGUS tickets)
– Follow and escalate them if needed (well-defined procedure)
– Provide help, propose solutions
– Build and maintain a central knowledge database (wiki)
Who does it?
– 9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
– CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern Europe, Central Europe, Germany-Switzerland
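
The weekly lead/backup rotation described above can be illustrated with a short sketch. The slide only says that nine teams work in pairs on a weekly rotation; the ordering of the teams and the rule that each week's backup becomes the next week's lead are assumptions made for this example, and `weekly_rota` is a hypothetical helper, not an EGEE tool.

```python
# ROC teams as listed on the slide; the rotation order is an assumption.
ROC_TEAMS = [
    "CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
    "Southeastern Europe", "Central Europe", "Germany-Switzerland",
]


def weekly_rota(teams, weeks):
    """Return one (week, lead, backup) tuple per week, starting at week 1.

    Assumption: the backup team of one week becomes the lead of the next,
    so every team gets a handover week before taking the lead.
    """
    n = len(teams)
    return [
        (week, teams[(week - 1) % n], teams[week % n])
        for week in range(1, weeks + 1)
    ]


for week, lead, backup in weekly_rota(ROC_TEAMS, 3):
    print(f"week {week}: lead={lead}, backup={backup}")
```

With nine teams the schedule repeats every nine weeks, so each team leads once per cycle.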
10
Grid monitoring tools
Tools used by the Grid Operator on Duty team to detect problems
Distributed responsibility
CIC portal
– Single entry point
– Integrated view of monitoring tools
Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
GIIS monitor (GStat)
GOC certificate lifetime
GOC job monitor
11
Site Functional Tests
Site Functional Tests (SFT)
– Framework to test (sample) services at all sites
– Shows results matrix
– Detailed test log available for troubleshooting and debugging
– History of individual tests is kept
– Can include VO-specific tests (e.g. sw environment)
– Normally >80% of sites pass SFTs
    – NB: of 190 sites, some are not well managed
Very important in stabilising sites:
– Apps use only good sites
– Bad sites are automatically excluded
– Sites work hard to fix problems
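
As a rough illustration of how a results matrix can be used to exclude bad sites, here is a minimal sketch. The site names, test names and the rule that a site is "good" only when every test passed are assumptions for the example; the slides do not describe the actual SFT selection logic.

```python
# Hypothetical results matrix: site -> {test name -> passed?}.
RESULTS = {
    "site-a": {"job-submit": True, "replica-mgmt": True, "sw-version": True},
    "site-b": {"job-submit": True, "replica-mgmt": False, "sw-version": True},
    "site-c": {"job-submit": True, "replica-mgmt": True, "sw-version": True},
    "site-d": {"job-submit": False, "replica-mgmt": False, "sw-version": True},
}


def good_sites(results):
    """Sites where every test passed (assumed exclusion rule)."""
    return sorted(site for site, tests in results.items() if all(tests.values()))


def pass_rate(results):
    """Fraction of sites that pass all tests."""
    return len(good_sites(results)) / len(results)


print(good_sites(RESULTS))
print(f"{pass_rate(RESULTS):.0%} of sites pass")
```

An application would then submit work only to the sites returned by `good_sites`.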
12
Service Availability Monitoring
Service Availability Monitoring (SAM)
Will cover all grid core services
– Measure availability by service, site, VO
– Each service has an associated service class defining required availability (critical, highly available, etc.)
Will be used to
– Generate alarms
– Generate trouble tickets
– Call out support staff
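
The idea of measuring availability per service, site and VO and comparing it with a per-class requirement can be sketched as follows. The threshold values, the sample data and the function names are assumptions for illustration only; the slides give no numbers and this is not the SAM implementation.

```python
from collections import defaultdict

# Assumed required availability per service class (values are illustrative).
REQUIRED = {"critical": 0.99, "highly available": 0.95}

# Assumed mapping of services to service classes.
SERVICE_CLASS = {"BDII": "critical", "CE": "highly available"}

# Hypothetical test samples: (service, site, VO, test passed?).
SAMPLES = [
    ("CE", "site-a", "vo-x", True),
    ("CE", "site-a", "vo-x", True),
    ("CE", "site-a", "vo-x", False),
    ("CE", "site-a", "vo-x", True),
    ("BDII", "site-a", "vo-x", True),
    ("BDII", "site-a", "vo-x", True),
]


def availability(samples):
    """Fraction of passed tests per (service, site, VO)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for service, site, vo, ok in samples:
        entry = counts[(service, site, vo)]
        entry[0] += int(ok)
        entry[1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def alarms(samples, service_class):
    """List (service, site, VO, measured, required) below the requirement."""
    raised = []
    for (service, site, vo), value in sorted(availability(samples).items()):
        required = REQUIRED[service_class[service]]
        if value < required:
            raised.append((service, site, vo, value, required))
    return raised


for alarm in alarms(SAMPLES, SERVICE_CLASS):
    print("ALARM", alarm)
```

Each alarm tuple could then feed a trouble ticket or a call-out, as the slide lists.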
13
Operations support workflows
[Diagram: Regional Operations Centres, each with several Resource Centres; OSCT; Grid Operator on duty; 1st level support; 2nd level support]
Monitoring shows a problem
Operator submits a GGUS ticket against the ROC and cc’s the site
The ticket is followed until it is solved
ROC and site work to resolve the problem
14
Escalation procedures
Action taken, with the values listed under the priority columns (low, normal, high):
– 1st mail to site admin and ROC: 3 days, 1 day
– 2nd mail to ROC: 3 days, 1 day
– Phone call to ROC: 3 days, 1 day
– Final mail to ROC: immediate
– Weekly operations meeting call: asap
– Mail to OCC for validation: asap
– Site suspension: asap
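
The escalation ladder can be read as a timeline for a ticket that is never resolved. The sketch below makes two assumptions that the flattened table does not confirm: that "3 days" is the waiting time for low and normal priority and "1 day" for high priority, and that the "immediate" and "asap" steps all fall on the day the last waiting period expires. `escalation_schedule` is a hypothetical helper.

```python
from datetime import date, timedelta

# Assumption: "3 days" applies to low and normal priority, "1 day" to high.
WAIT_DAYS = {"low": 3, "normal": 3, "high": 1}

# Steps followed by a waiting period before the next step.
TIMED_STEPS = [
    "1st mail to site admin and ROC",
    "2nd mail to ROC",
    "phone call to ROC",
]

# Steps marked "immediate" or "asap" on the slide (assumed: no extra wait).
FINAL_STEPS = [
    "final mail to ROC",
    "weekly operations meeting call",
    "mail to OCC for validation",
    "site suspension",
]


def escalation_schedule(opened, priority):
    """Return (day, action) pairs for a ticket that is never resolved."""
    wait = timedelta(days=WAIT_DAYS[priority])
    schedule = []
    when = opened
    for action in TIMED_STEPS:
        schedule.append((when, action))
        when += wait
    for action in FINAL_STEPS:
        schedule.append((when, action))
    return schedule


for day, action in escalation_schedule(date(2006, 9, 4), "normal"):
    print(day.isoformat(), action)
```

In practice the ladder stops as soon as the site or ROC resolves the problem; the sketch only shows the worst case.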
15
Site related procedures
Introducing a new site
– The ROC is the contact point
– ROC registers the site and sets the initial site status to uncertified
– After SFTs run OK for a week -> certified
Site downtime scheduling
– EGEE resources need to be switched off properly in order not to disturb operations
– Set downtime period in GOCDB and tick off “monitoring” for the affected nodes
– Announce the downtime through the EGEE broadcast tool
– “GlueCEStateStatus: Closed”
Required site contacts
– Stored in GOCDB
Suspending a site
– The site is then removed from the top-level BDII and monitoring is turned off
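
The site life cycle above (uncertified, certified after a week of good SFT runs, suspended) can be modelled as a small state machine. This is an in-memory sketch only: it does not talk to GOCDB or a BDII, and treating "a week" as seven consecutive daily passes is an assumption, as are all class, field and function names.

```python
from dataclasses import dataclass

# Assumption: "SFTs run OK for a week" means 7 consecutive daily passes.
CERTIFY_AFTER_DAYS = 7


@dataclass
class Site:
    name: str
    status: str = "uncertified"      # initial status set by the ROC
    ok_days: int = 0                 # consecutive days with passing SFTs
    in_top_level_bdii: bool = True   # simplification: published by default
    monitored: bool = True


def record_sft_day(site, passed):
    """Count consecutive good days and certify the site after a week."""
    if site.status != "uncertified":
        return
    site.ok_days = site.ok_days + 1 if passed else 0
    if site.ok_days >= CERTIFY_AFTER_DAYS:
        site.status = "certified"


def suspend(site):
    """Suspension removes the site from the top-level BDII and monitoring."""
    site.status = "suspended"
    site.in_top_level_bdii = False
    site.monitored = False


site = Site("new-site")
for passed in [True] * 6 + [False] + [True] * 7:
    record_sft_day(site, passed)
print(site.status)
```

A failed day resets the counter, so the example site is certified only after the second, uninterrupted run of seven good days.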
16
Operations coordination
ROC managers meeting
– Biweekly
– Minutes: https://edms.cern.ch/document/753088
– Discuss inter-ROC issues, general coordination, interfaces with other activities
Operations meeting
– Weekly, Mondays at 16:00 (Swiss time)
– Agendas, minutes: http://agenda.cern.ch/displayLevel.php?fid=258
– WLCG/OSG/EGEE
– Pre-reports from sites, ROCs and VOs through CIC portal
– Discuss, track and solve operation-related issues from the previous week
Operation Workshops
– Twice per year
– Next one: Spring 2007
– Agenda of last one: http://agenda.cern.ch/fullAgenda.php?ida=a062031
17
Checklist for a new service
User support procedures (GGUS)
– Troubleshooting guides + FAQs
– User guides
Operations Team Training
– Site admins
– CIC personnel
– GGUS personnel
Monitoring
– Service status reporting
– Performance data
Accounting
– Usage data
Service Parameters
– Scope: Global/Local/Regional
– SLAs
– Impact of service outage
– Security implications
Contact Info
– Developers
– Support Contact
– Escalation procedure to developers
Interoperation
– Documented issues
First level support procedures
– How to start/stop/restart service
– How to check it’s up
– Which logs are useful to send to CIC/Developers and where they are
SFT Tests
– Client validation
– Server validation
– Procedure to analyse these error messages and likely causes
Tools for ROC to spot problems
– GIIS monitor validation rules (e.g. only one “global” component)
– Definition of normal behaviour
Metrics
ROC Dashboard
– Alarms
Deployment Info
– RPM list
– Configuration details
– Security audit
This is what it takes to make a reliable production service from a middleware component
Not much middleware is delivered with all this … yet
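
One way to make such a checklist actionable is to keep it as data and report what a new service still lacks. The sketch below encodes only a few of the categories from the slide; the data layout and the `missing_items` helper are assumptions for illustration.

```python
# A subset of the checklist categories and items from the slide.
CHECKLIST = {
    "First level support procedures": [
        "How to start/stop/restart service",
        "How to check it's up",
        "Which logs are useful and where they are",
    ],
    "Monitoring": ["Service status reporting", "Performance data"],
    "Accounting": ["Usage data"],
    "Deployment Info": ["RPM list", "Configuration details", "Security audit"],
}


def missing_items(delivered):
    """Return (category, item) pairs not present in the delivered set."""
    return [
        (category, item)
        for category, items in CHECKLIST.items()
        for item in items
        if (category, item) not in delivered
    ]


# Hypothetical example: a component delivered with deployment info only.
delivered = {
    ("Deployment Info", "RPM list"),
    ("Deployment Info", "Configuration details"),
}
for category, item in missing_items(delivered):
    print(f"missing: {category} / {item}")
```

An empty result would mean the component ships with everything the encoded part of the checklist asks for.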
18
User Support
19
User support in EGEE
Global Grid User Support (GGUS) is the EGEE support infrastructure for Grid users and for deployment and operations problems
It offers a large variety of services to satisfy user needs at all levels
It does not replace existing support infrastructures; it integrates them and coordinates support efforts
20
The Support Model
“Regional Support with Central Coordination”
[Diagram: Central Application (GGUS) with interfaces and web portal; TPM; Deployment Support, Middleware Support, Network Support, Operations Support; ROC 1 … ROC 10; VO Support; other grids (e.g. OSG). Unit groups: regional support units, user support units, technical support units]
The ROCs, VOs and other project-wide groups, such as the middleware groups (JRA), network groups (NA) and service groups (SA), are connected via a central integration platform provided by GGUS.
21
The GGUS System
22
GGUS Portal: user services
– Browsable tickets
– Search through solved tickets
– Useful links (Wiki FAQ)
– Broadcast tools
– Latest news
– GGUS search engine
– Updated documentation (Wiki FAQ)
23
EGEE and EELA: Operations
Cooperation between EGEE and EELA, in all areas, is very important
This conference is an opportunity to explore some points where we can work better
Starting discussions to apply standard EGEE operations procedures and tools in EELA:
– Creation of own ROC
    – To support EELA sites
    – Initial support from CERN ROC
– Site monitoring: SAM server being deployed by Alexandre Duarte
24
Summary
EGEE operates the world’s largest multi-disciplinary grid infrastructure for scientific research
– In constant and significant production use
– EELA as part of this production infrastructure
Operations procedures and tools under constant evolution
– Much is being learned, but much remains to be done to achieve long-term sustainability
– EELA is starting to use some of these tools/procedures; feedback plus additions are welcome!
We have gained significant experience in what it takes to deploy, operate and manage a large distributed infrastructure
– Next steps: Service Availability Monitoring, Service Level Agreements
Importance of interoperability/interoperations with related projects, EELA