Published by Marybeth York. Modified over 8 years ago.
1
www.egi.eu EGI-InSPIRE RI-261323
SA1 and JRA1: Operations and Operational Tools
D. Cesini, JRA1 Activity Manager, INFN
T. Ferrari, SA1 Activity Manager, EGI.eu
2
SA1 Activity Overview
This slide will be provided by the PO.
It will summarise the activity in tables by:
- 51 partners, 5121 PMs
- # people, # countries
- The # PM and # FTE per country
It will summarise the activity in graphics by:
- The % effort of the activity within the project
- The geographical spread across Europe
3
JRA1 Activity Overview
This slide will be provided by the PO.
It will summarise the activity in tables by:
- The # partners, # people, # countries
- The # PM and # FTE per country
It will summarise the activity in graphics by:
- The % effort of the activity within the project
- The geographical spread across Europe
4
SA1 Tasks and resource distribution
- TSA1.1 Activity Management: 0.70%
- TSA1.2 Secure Infrastructure (M. Ma/STFC): 8.60%
- TSA1.3 Service Deployment Validation (M. David/LIP): 11.00%
- TSA1.4 Infrastructure for Grid Management (E. Imamagic/SRCE): 20.66%
- TSA1.5 Accounting (J. Gordon/STFC): 5.81%
- TSA1.6 Helpdesk infrastructure (T. Antoni/KIT): 8.76%
- TSA1.7 Support Teams (R. Trompert/SARA): 28.16%
- TSA1.8 Providing a Reliable Grid Infrastructure (C. Kanellopoulos/AUTH): 16.31%
5
JRA1 Tasks and resource distribution
- TJRA1.1 Activity Management (D. Cesini/INFN): 7.6%
- TJRA1.2 Maintenance and development of the deployed operational tools (T. Antoni/KIT): 41.6%
- TJRA1.3 Supporting National Deployment models (P. Solagna/EGI.eu): 5.7%
- TJRA1.4 Accounting for usage of different resource types (J. Gordon/STFC): 28.5%
  - Cloud, HPC, DesktopGrid
  - Storage/Data Usage
  - Application Usage
  - Billing system
- TJRA1.5 Integrated Operations Portal (C. L'Orphelin/CNRS): 16.6%
  - Service Oriented model
  - Harmonization with GOCDB
  - Porting to Symfony
  - New DCI integration
JRA1 tasks sequencing (one √ per active project year, Years 1 to 4)
- TJRA1.1 √√√√
- TJRA1.2 √√√√
- TJRA1.3 √
- TJRA1.4 √√√
- TJRA1.5 √√√
6
SA1 and JRA1 Objectives
- Operate a secure, reliable, Europe-wide federated production grid infrastructure that is integrated and interoperates with other grids worldwide
  - Maintain a secure infrastructure
  - Validate new technology releases (tools and middleware)
  - Support users and Resource Centre administrators
  - Quality control, grid oversight, documentation and procedures
  - Tools
- Operate tools, the accounting infrastructure and the EGI helpdesk
- Evolve the operational tools used by the production infrastructure
  - Maintenance, development and support of national deployment
  - Accounting for the use of new resources (desktop, virtualization, storage, data, application and billing)
7
Operations Architecture 1/2
- Resource Centre (RC), also known as Site: the smallest resource administration domain in EGI. It can be either localised or geographically distributed. It provides a minimum set of local or remote UMD-compliant capabilities to access resources.
- Resource Infrastructure: a federation of Resource Centres
- Resource Infrastructure Provider (RP): the legal entity responsible for the proper running of the infrastructure
  - EGI participants: NGIs, EIROs
  - Integrated infrastructure: operated by a non-EGI-InSPIRE partner but relying on EGI operational services (MoU)
  - Peer infrastructure: accessible to EGI users, but relying on its own operational services
8
Operations Architecture 2/2
- Operational services
  - Infrastructure services
  - Technical services
  - Support services
  - Human services
- The Operations Centre (OC) offers operational services on behalf of the Resource Infrastructure Provider, in collaboration
  - with the Resource Centres (Local Services)
  - with EGI.eu (Global Services)
9
EGI Resource Infrastructure
[Diagram labels: EGI.eu; MoUs; Network; National Grid Initiative; Integrated Resource Infrastructure Provider; Peer Resource Infrastructure Provider; each provider runs a Resource Infrastructure made up of Resource Centres]
10
EGI Service Infrastructure 1/2
Infrastructure services, delivered as Local Services (OCs) and Global Services (EGI.eu)
[Diagram labels: Accounting (sensors, repositories, portals); EGI Helpdesk; Central Tools; Monitoring and other distributed tools]
SA1 tasks
- TSA1.4 Infrastructure for Grid Management (SRCE): 20.66%
- TSA1.5 Accounting (STFC): 5.81%
- TSA1.6 Helpdesk infrastructure (KIT): 8.76%
JRA1 tasks (Operational Tools)
- JRA1.1 Activity Management, INFN, 7.6%
- JRA1.2 Maintenance and development, KIT, 41.6%
- JRA1.3 Support of national deployment, EGI.eu, 5.7%, Y1
- JRA1.4 Accounting, STFC, 28.5%, Y2-Y4
- JRA1.5 Operations portal, CNRS, 16.6%
11
EGI Service Infrastructure 2/2
Each service group is delivered as Local Services (OCs) and Global Services (EGI.eu)
Technical services
- TSA1.3 Service Deployment Validation (LIP): 11.00%
- TSA1.8 Reliable Grid Infrastructure (AUTH): 16.31%
Support services
- TSA1.7 Support Teams (SARA): 28.16%
Human services
- TSA1.1 Activity Management: 0.70%
- TSA1.2 Secure Infrastructure (STFC): 8.60%
- TSA1.8 Reliable Grid Infrastructure (AUTH)
[Diagram labels: Technology early adoption; Interoperability; Requirements; Local core Grid services; Central core Grid Services; Coordination; Quality control; Security and Incident Response; Documentation; Operations Management; Coordination; 1st and 2nd line; 3rd line (SA2); Grid Oversight]
12
Resource Infrastructure (yearly increase)
- 338 Resource Centres (+6.8%) in Europe, Asia Pacific, North and South America
- 96 Resource Centres supporting MPI (+31.5%)
- 51 countries (57 with integrated RPs) (+18.75%)
- Capacity
  - 240,000 CPU cores (339,00 with integrated and peer RPs), 24.9%
  - 1.89 Million HEP-SPEC 06*
  - 102 PB disk, 89 PB tape
* Million HEP-SPEC 06: CPU time normalised to a reference value of Mega HEP-SPEC 06
13
From federations to NGIs
- April 2010: 12 federated Operations Centres (OCs)
- April 2011: 32 OCs operating 40 European NGIs and 1 EIRO (CERN)
- 5 federated integrated Resource Infrastructures
  - Asia Pacific
  - Canada
  - Latin America and Caribbean
- Transition completed in January 2011
14
Usage statistics

| Metric | Unit | Per month | Per day |
|---|---|---|---|
| AVG number of jobs (all VOs) | number | 26.6 million | 873,400 |
| AVG number of jobs (non-HEP VOs) | number | 1.52 million (from 0.97 million, +56.7%) | |
| CPU wall clock (all VOs) | hours | 74.6 million | 2.4 million |
| Normalized CPU wall clock (all VOs) | HEP-SPEC 06 hours | 551 million | 18.13 million |
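The figures on this slide can be cross-checked with simple arithmetic. The sketch below is only a consistency check on the quoted numbers, not part of any EGI accounting tool; the function names are hypothetical. It derives the average HEP-SPEC 06 factor implied by the raw and normalized wall-clock hours, the month length implied by the per-month and per-day job counts, and the relative increase from 0.97 million to 1.52 million non-HEP jobs.

```python
# Values quoted on the "Usage statistics" slide.
MONTHLY_WALL_HOURS = 74.6e6      # CPU wall clock, all VOs, hours per month
MONTHLY_HS06_HOURS = 551e6       # normalized CPU wall clock, HEP-SPEC 06 hours per month
MONTHLY_JOBS = 26.6e6            # jobs per month, all VOs
DAILY_JOBS = 873_400             # jobs per day, all VOs
NON_HEP_JOBS_BEFORE = 0.97e6     # non-HEP jobs, earlier value
NON_HEP_JOBS_NOW = 1.52e6        # non-HEP jobs per month


def implied_hs06_per_core(normalized_hours: float, wall_hours: float) -> float:
    """Average HEP-SPEC 06 rating per core implied by the two hour totals."""
    return normalized_hours / wall_hours


def implied_days_per_month(per_month: float, per_day: float) -> float:
    """Number of days that turns the daily average into the monthly figure."""
    return per_month / per_day


def relative_increase(before: float, after: float) -> float:
    """Fractional growth from `before` to `after`."""
    return after / before - 1.0


if __name__ == "__main__":
    print(f"HS06 per core: {implied_hs06_per_core(MONTHLY_HS06_HOURS, MONTHLY_WALL_HOURS):.2f}")
    print(f"Days per month: {implied_days_per_month(MONTHLY_JOBS, DAILY_JOBS):.1f}")
    print(f"Non-HEP growth: {relative_increase(NON_HEP_JOBS_BEFORE, NON_HEP_JOBS_NOW):.1%}")
```

The implied factor is an infrastructure-wide average; individual Resource Centres publish their own benchmark values.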
15
Availability and Reliability
- RC availability
  - An RC is available if all of its service types are available (logical AND across service types)
  - A service type is available if any of its instances is available (logical OR across instances)
  - Minimum availability: 70%
- 6 RCs suspended
  - Stricter suspension policy since April 2011: availability < 70% for three consecutive months
- RC reliability: like availability, but excluding scheduled interventions
- Monthly performance reports per Operations Centre and per RC
- New ticket-based procedure for monitoring underperforming RCs

| Period | Avg. monthly availability | Avg. monthly reliability |
|---|---|---|
| May-July 2010 | 93.3% | 94.3% |
| August-October 2010 | 90.7% | 91.9% |
| November 2010-January 2011 | 92.3% | 93.3% |
| February-April 2011 | 94.5% | 95.8% |
| May 2010-April 2011 | 92.73% | 93.85% |
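The availability rules on this slide can be sketched in a few lines. This is a minimal illustration, not the production A/R computation: the data layout (a dictionary of service types mapping to per-instance status flags, and equally spaced boolean samples) and all function names are assumptions made for the example.

```python
from typing import Dict, List


def rc_available(status_by_service_type: Dict[str, List[bool]]) -> bool:
    """Logical AND over service types of the logical OR over instances."""
    return all(any(instances) for instances in status_by_service_type.values())


def availability(up_samples: List[bool]) -> float:
    """Fraction of samples in which the RC was available."""
    return sum(up_samples) / len(up_samples)


def reliability(up_samples: List[bool], scheduled: List[bool]) -> float:
    """Like availability, but samples taken during scheduled interventions are ignored.

    Assumption: if every sample falls in a scheduled intervention, report 1.0.
    """
    considered = [up for up, planned in zip(up_samples, scheduled) if not planned]
    return sum(considered) / len(considered) if considered else 1.0


def should_suspend(monthly_availability: List[float],
                   threshold: float = 0.70,
                   consecutive_months: int = 3) -> bool:
    """True if availability stayed below the threshold for N consecutive months."""
    run = 0
    for value in monthly_availability:
        run = run + 1 if value < threshold else 0
        if run >= consecutive_months:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical RC with two computing elements and one storage element.
    print(rc_available({"CE": [False, True], "SE": [True]}))
    print(should_suspend([0.65, 0.60, 0.69]))
```

In this sketch a service type with no registered instances counts as unavailable, because `any([])` is `False`; whether that matches the production rule is not stated in the slides.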
16
Infrastructure Services: Operations Portal
Operations Portal (CNRS)
- Broadcast tool
- Operational Dashboard
- VO Identity Cards
Achievements
- 8 releases
- Package for local deployment released and updated
- Porting to a new web framework almost completed
- Improvements to all the modules
  - VO ID Cards module implementation driven by NA3 requirements
- Integration with the security dashboard started
- New COD view released
17
Infrastructure Services: GOCDB
GOCDB (STFC)
- Management of general and semi-static information about EGI RCs
Achievements
- Decommissioning of GOCDB3, release and deployment of GOCDB4
- Prototype for local deployment available, but without a synchronization system
- Naming schema modified to integrate UNICORE services
- GLUE 2.0 compatibility for service names (ongoing)
18
Infrastructure Services: EGI Helpdesk
EGI Helpdesk (KIT)
- Distributed helpdesk with central coordination
- Global Grid User Support (GGUS)
Achievements
- 9 releases
- Refresh of support teams and units
- Integration of the new NGIs (31 NGIs interfaced, 22 as support units, 6 with a local helpdesk)
- Definition and implementation of new workflows for technology support (1st line, 2nd line, and 3rd line provided by the Technology Providers: EMI, IGE etc.) and the respective access privileges
- Definition and implementation of new workflows for the support of software provisioning and bug reporting processes that involve EGI and its external technology providers
- Integration with SNOW at CERN
- Regional view (xGUS) in production
19
Infrastructure Services: Service Availability Monitoring (SAM) 1/2
SAM (CERN, AUTH, SRCE)
- Monitoring framework for RCs and services
- Data source for A/R statistics
- One of the main data sources for the Operations Dashboard
- Composed of several components:
  - The test submission framework, based on the NAGIOS system, set up and customized by the NAGIOS Configurator (NCG)
  - The database components: the Aggregated Topology Provider (ATP), the Metric Description DataBase (MDDB) and the Metrics Result DataBase (MRDB)
  - A message bus to publish the monitoring results
  - A visualization tool (GUI): MyEGI
20
Infrastructure Services: Service Availability Monitoring (SAM) 2/2
Achievements
- 8 releases (new EGI release procedure)
- MyEGI visualization portal in production
  - MyEGI Web Service available
  - GridMap style plots added
  - New look and feel
- Re-engineering of the DB components
  - ATP as new topology provider (replacing the old SAM database)
- Probes
  - Integration of ARC and GLOBUS5 probes (UNICORE in progress)
  - New CA distribution probe with automatic discovery of the right version
- Support for
  - robot certificates
  - monitoring of uncertified sites
  - authorization plugin (messaging infrastructure) for denial of all broker-to-broker communications (for accounting)
- Other
  - Creation of 2nd level support and handover of probe development to EMI and IGE (in progress)
21
Infrastructure Services: Accounting Repository and Portal
Accounting Repository (STFC)
- Usage of compute resources within the production infrastructure
- Based on gLite-APEL
Accounting Portal (FCTSG)
- GUI for the Accounting Repository
Achievements
- Integration of the APEL accounting system with the message broker network
- Porting of APEL tests to Nagios
- Design and implementation of a distributable Regional Accounting Server
- Portal modified to support the new GOCDB4 PI and Ops Portal XML feeds
- NGI View added in the portal
- Decommissioning of central R-GMA accounting services (Feb 2011)
22
Infrastructure Services: Metrics Portal
Metrics Portal (FCTSG)
- Manual and automatic collection of a set of metrics from different resources, to measure project performance and keep track of the project's evolution
23
Technical Services: Early adoption
- Re-engineering of the process for deploying new software releases (Grid middleware and operational tools) and tuning of the tools for its automation
  - Staged Rollout
  - Early Adopters
- Achievements
  - Max number of components tested/rejected in staged rollout per PQ: 29/3
  - Max number of staged rollout tests undertaken: 40 (PQ4)
  - Number of EA teams: 45
  - Middleware stacks: ARC, gLite, UNICORE, GLOBUS (in progress)
  - Testing of the new staged rollout process in preparation for UMD 1.0
24
Technical Services: Interoperability
Deployed middleware
- ARC, gLite, UNICORE
- More ARC and UNICORE installations expected in 2011
- Germany, Poland, Romania, The Netherlands and the UK are integrating GLOBUS and/or UNICORE
Accomplishments
- ARC fully integrated into GOCDB, accounting and SAM
- Integration of UNICORE and GLOBUS in progress
- OGF Production Grid Infrastructure WG
- Grid Interoperability Now WG
- Infrastructure Policy Group
25
Technical Services: Requirements gathering
- New process for requirements gathering (tools and deployed software) every 3 months
  - Gathering, prioritization, discussion at TCB
26
Support Services
Accomplishments
- Training and dissemination channels for new NGI support teams, monthly newsletter
- Most of the new NGIs successfully established their own local support structures
- Support for network performance issues in place (relying on tools for monitoring and troubleshooting), with a contact point with NREN PERT teams
But
- Grid oversight workload affected by new Operations Centres starting operations, now progressively reducing
- Support problems faced in some NGIs (Armenia, Israel, Romania, Russia) solved or being resolved

| Metric | Value |
|---|---|
| AVG number of EGI tickets created per month | 965 tickets (~constant) |
| AVG monthly response time | 13.5 hours |
| Median of monthly solution time [min, max] | [22, 39] hours |
27
Human Services: Security and incident response
- Computer Security and Incident Response Team (CSIRT): Grid security monitoring, training and dissemination, and improvements in responses to incidents
  - Maintenance and development of Pakiti and Nagios
  - Security Service Challenge 4: 13 RCs tested (including WLCG Tier1 sites) and performance evaluated
  - Security training session at EGI TF 2010
- Software Vulnerability Group (SVG): proactive examination of Grid middleware to find vulnerabilities that may exist (joint effort with EMI)
- Procedures for
  - Software vulnerability handling
  - Security incident (exploited vulnerability) handling
  - Critical vulnerability handling
- Achievements
  - EGI CSIRT: 9 security incidents handled, 12 advisories issued (3 critical), support to RCs to mitigate the 3 critical vulnerabilities within the 7-day deadline
  - SVG: 29 software vulnerabilities reported (15 concerning Grid middleware, 4 fixed; the others have not passed their Target Date yet)
  - No RCs suspended
28
Human Services: Documentation
- Documentation collected at the EGI wiki (160 operations pages)
- 9 new procedures defined and approved
- 3 new manuals and 4 how-tos (in progress)
- Migration and update of existing legacy technical documentation (in progress)
- Mirroring of the EGI wiki at ASGC
29
Human Services: Quality control
- Operational Level Agreements for the definition of duties, services and the related quality parameters
  - Resource Centre OLA (RC Services): approved
  - Resource Provider OLA (Local Services): in progress
  - EGI.eu OLA (Global Services)
- Central Grid oversight team responsible for performance monitoring (new EGI procedure)
30
Issues
- SA1: Integration of new NGIs
  - Albania, Moldova
- JRA1
  - Development for local deployment of tools delayed (hiring issues)
    - TJRA1.3 ended in PY1 with 63% of effort used: extension?
  - Limited effort for accounting development (89 PMs over 3 years) against an ambitious roadmap: prioritization needed
  - Lack of effort for 2nd level support of the tools that are locally deployed (e.g. SAM and Operations Portal)
31
Use of Resources
- SA1
  - 98% PMs achieved (aggregated)
  - EGI.eu Global Services
    - Some marginal cases of overspending due to transition
    - TSA1.8E: 59% achieved due to issues in claiming effort within the JRU (but activities successfully delivered)
  - NGI Local Services
    - Few cases of under/overspending that will be compensated over the duration of the project
- JRA1
  - 69% PMs achieved (aggregated)
  - Underspending in TJRA1.2 and TJRA1.3 (next slide), and TJRA1.5 (CNRS): 76%
    - Harmonization of operations portal with GOCDB not started yet
32
Use of Resources (JRA1)
TJRA1.3: total spent 63%
- Underspent by almost all the partners
- Development not completed
- Use case definition not completed in some cases (GOCDB)
- Regional accounting portal not needed without a regional accounting repository
- Extension of TJRA1.3 duration?
TJRA1.2: total spent 86%
- CERN underspent: no particular issues
- GRNET: contractual problems at the beginning of the project
- Can be reabsorbed during the coming years (4-year task)
33
Plans for next year
- SA1
  - Increasing participation in Staged Rollout
  - Integration
    - New NGIs and MoUs with new integrated RPs
    - UNICORE and GLOBUS (completed)
    - Desktop grids and PRACE (pilots)
  - Operational tools availability reports (Global and Local)
  - Automation of availability control processes
  - Day-by-day operations (security, support, oversight)
- JRA1
  - Accounting
    - New APEL Publisher (STOMP-based): September 2011
    - Regional Accounting Server packaged and released to NGIs: December 2011
    - New resources and billing (roadmap under discussion)
  - Regionalization completed (synchronisation system for regional GOCDB)
  - Operations Portal: integration of security dashboard, creation of VO dashboards
34
Summary
- Operations Management Board (OMB) and Operational Tools Advisory Group (OTAG) successfully established with members from participants and integrated RPs (18 meetings in total, 5 of which face-to-face)
- 6 task forces: UNICORE and GLOBUS integration, tool regionalization, OLA, network support
- New operations architecture
- New procedures and processes for software deployment and change management
- NGIs at different levels of maturity but active, increasingly sustainable and offering quality services even during transition
- 28 operational tool releases (8 in PQ1, 6 in PQ2, 6 in PQ3, 8 in PQ4)
35
Backup slides
36
From federations to NGIs (30/05/2011)
[Map labels]
- Denmark, Finland, Norway, Sweden
- Estonia, Latvia, Lithuania
- Poland, Slovenia
- Croatia, Slovakia, Czech Republic, Belarus, Hungary, Austria
- Germany, Switzerland
- Portugal-Spain
- Greece, Serbia, Turkey, Romania, Cyprus, Georgia, FYR of Macedonia, Bosnia and H., Montenegro, Bulgaria, Armenia
- Italy
Timeline
- PQ1: May 2010, creation of NGI_NDGF
- PQ2: stop of Central Europe ROC and DE-CH ROC; creation of IberGrid
- PQ3: Jan 2011, stop of South East Europe ROC
- PQ4: Feb-March 2011, stop of South West Europe ROC and Italy ROC
37
Technology Helpdesk
[Diagram: two workflows, "Workflow for bugs found in production" and "Technology release workflow". Labels: GGUS; TPM; DMSU; EGI-SA2; Technology Provider (EMI / IGE); RT; Technology Helpdesk; announce; accept/reject]
38
GGUS BACKUP
GGUS HIGH AVAILABILITY
- Active-active high availability concept
- Data layer done
- Logic and presentation layer ready by the end of the first half of 2011
39
SAM BACKUP
40
OPS PORTAL BACKUP
41
OPS PORTAL BACKUP
Envisaged solution for GOCDB/OPS_PORTAL harmonization
[Diagram panels: Current Situation; Harmonized tools]
42
GOCDB BACKUP
[Diagram: a central GOCDB4 and a regional/NGI GOCDB4, each with WS, GUI and GOCDB module. Labels: local users; central users; INPUT; EGI tools; Read/Write; Read only; GOCDBPI_v4]
GOCDB4 DATABASE
- The GOCDB4 data schema is designed in an object-oriented fashion. This object model is implemented at database level using a methodology known as Pseudo-Relational Object Model (PROM).
GOCDB PI
- The GOCDB Programmatic Interface is a REST (Representational State Transfer) based interface over https.
- REST URLs are secured in transit. Some of the methods are nonetheless public and do not require client-side authentication.
- Output format is XML.
- There are currently 3 protection levels for the methods of the interface (public, private and protected).
https://wiki.egi.eu/wiki/GOCDB/Release4/Architecture
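A client of a REST interface that returns XML, as described above, could look roughly like the following sketch. Everything specific in it is an assumption made for illustration: the base URL, the `method` query parameter, the `get_site_list` method name, and the `SITE`/`NAME` layout of the sample response are hypothetical and should be checked against the GOCDB PI documentation. The sketch parses an in-memory sample instead of contacting a server.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Hypothetical endpoint; the real host and path are not given in the slides.
BASE_URL = "https://gocdb.example.org/gocdbpi"
PROTECTION_LEVELS = ("public", "private", "protected")

# Hypothetical response shape, used only to demonstrate XML parsing.
SAMPLE_XML = """<results>
  <SITE NAME="SITE-A" COUNTRY="Italy"/>
  <SITE NAME="SITE-B" COUNTRY="Greece"/>
</results>"""


def build_url(level: str, method: str, **params: str) -> str:
    """Build a query URL for one of the three protection levels."""
    if level not in PROTECTION_LEVELS:
        raise ValueError(f"unknown protection level: {level}")
    query = urlencode({"method": method, **params})
    return f"{BASE_URL}/{level}/?{query}"


def site_names(xml_text: str) -> List[str]:
    """Extract the NAME attribute of every SITE element in the response."""
    root = ET.fromstring(xml_text)
    return [site.get("NAME") for site in root.iter("SITE")]


if __name__ == "__main__":
    print(build_url("public", "get_site_list"))
    print(site_names(SAMPLE_XML))
```

Public methods would need no client-side authentication; for the other two levels a real client would also have to present credentials, which this sketch leaves out.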
43
Accounting
44
Staged Rollout workflow
[Diagram: Technology Providers submit a release (repository URL reference) to the EGI RT staged-rollout queue (states New, Open, StageRollOut, Resolved). Verification: accept or reject. The Staged Rollout Manager selects EA teams (EA 1 ... EA n), who are notified and run the Staged Rollout test: accept or reject. Reports (report ID) are stored in the EGI Document server DB; the Staged Rollout managers decide the outcome: accept or reject. On reject a GGUS ticket is assigned to the DMSU and Technology Providers. Other labels: EGI Repository; EGI Mail manager; Staged rollout done]
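The accept/reject decisions in this workflow can be sketched as a small decision function. The diagram does not say how individual Early Adopter reports are combined, so the rule below (verification must pass, at least one report is required, and every report must accept) is an assumption, as are all names and the wording of the follow-up actions.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


def staged_rollout_outcome(verification: Outcome, ea_reports: List[Outcome]) -> Outcome:
    """Combine the verification result and the Early Adopter reports.

    Assumed rule: a release rejected at verification never reaches the
    staged rollout test; otherwise every EA report must be an accept.
    """
    if verification is Outcome.REJECT:
        return Outcome.REJECT
    if ea_reports and all(report is Outcome.ACCEPT for report in ea_reports):
        return Outcome.ACCEPT
    return Outcome.REJECT


def next_action(outcome: Outcome) -> str:
    """Follow-up step suggested by the diagram for each outcome."""
    if outcome is Outcome.ACCEPT:
        return "mark the RT ticket Resolved: staged rollout done"
    return "open a GGUS ticket for the DMSU and the Technology Provider"


if __name__ == "__main__":
    result = staged_rollout_outcome(Outcome.ACCEPT, [Outcome.ACCEPT, Outcome.REJECT])
    print(result.value, "->", next_action(result))
```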