
1 Federated Operations Solution
Małgorzata Krakowian, EGI.eu, Senior Operations Officer
Diego Scardaci, EGI.eu/INFN, JRA2 Activity Manager
13/02/2015
www.egi.eu, EGI-InSPIRE RI-261323

2 Outline
PART I
– Federated Operations solution: target groups, challenges, components
– Operations in PY5
– EGI-InSPIRE summary, future plans
PART II
– Operations tools
– JRA2 overview: effort, tasks and objectives
– Operations tools in PY5, analysis
– EGI-InSPIRE summary, future plans

3 Outline: PART I
– Federated Operations solution
  – Target groups
  – Challenges
  – Components
– Operations in PY5
– EGI-InSPIRE summary
– Future plans

4 Federated Operations solution
The technologies, processes and people required to provide a cost-efficient framework to manage operations within a federated environment, while retaining responsibility for local infrastructure.
Target groups:
– Primarily Research Infrastructures and Resource Centres already within the EGI community, or those wishing to become part of it
– Other IT service providers that are geographically and/or structurally dispersed but plan to organise themselves for federated service provision

5 Federated Operations solution: EGI community
Metrics (January 2015)    EGI-InSPIRE Partners    Including integrated RPs
Countries                 39                      54
Operations Centres        3                       37
Resource Centers          310                     352
Map legend: Integrated EGI-InSPIRE Partners and EGI Council Participants; Internal/External RPs being integrated; External RP; Peer RP

6 Federated Operations solution: Challenges and solutions
Challenge: Lack of integration
– A common core infrastructure platform based on standards, common interfaces and protocols, communication, planning and coordination
Challenge: Lack of expertise and specific knowledge in integration or coordination, which leads to duplication of services or inefficient use of effort
– Centrally provided expertise and streamlined best practices on how to set up and manage a federation
– Beta-testing of applications and services in production
– Federated service management best practices, cost-effective sharing of services, community expertise and re-use of tools/output from publicly funded projects
Challenge: Loss of efficiency resulting from the diversion of resources to implement integration, duplication of services
– Existing technical solutions that can be adapted or re-used

7 Federated Operations solution: Components
Federated Operations services bring together the tools, processes and people necessary to guarantee standard operation of heterogeneous infrastructures from multiple independent providers, with lightweight central coordination.
– Technology Coordination ensures continuous technological innovation through sourcing of software components from diverse technology providers to meet the current and emerging needs of both researchers and Resource Centres.
– Security Coordination ensures a secure and stable infrastructure to mitigate threats, enhance services, and give users the protection and confidence they demand from a service. A secure infrastructure is naturally a top priority.
– Helpdesk Support provides professional, reliable and efficient technical support to guarantee a well-run infrastructure with improved productivity and usability for the customers. It requires certification, so it is only provided to Resource Centres that are within the EGI community.
– Specialised Consultancy offers tailored technical and management advice to help partners and clients make the most of e-Infrastructure technologies.
– Operations Coordination is a set of management and coordinating activities ensuring that operational activities across the federated infrastructure work seamlessly, without fragmentation. The coordination binds the infrastructure so that the services are delivered at an agreed service level.

8 Federated Operations solution: EGI.eu Core Services
The core services that enable the EGI federation: support activities, and delivery and maintenance of the operations tools. Supported by council fees (40%) and in-kind contributions of the partners.
Operational tools
1. Message Broker Network
2. Operations Portal
3. Accounting Repository
4. Accounting and Metric Portal
5. SAM central services
6. Monitoring central services
7. Security monitoring and related support tools
8. Service registry (GOCDB)
9. Catchall services
10. Incident management helpdesk
11. Collaboration tools/IT support
12. Software provisioning infrastructure
Human activities
1. Operations support
2. Security coordination
3. Acceptance criteria
4. Staged Rollout
5. 1st and 2nd level support

9 Federated Operations solution: EGI.eu Core Services
The provisioning of all core services is regulated by dedicated agreements:
– Operational Level Agreements (OLAs) between the providers and EGI.eu
– Service Level Agreement (SLA) between EGI.eu and the consumers (NGIs) of the services
All core services continued without interruption during the transition from PY4 to PY5.
Performance in PY5 was well within the agreed service level targets (availability, reliability, quality of support).
Maintenance releases were delivered and the tools were enhanced with some additional features.
FitSM recommendations (common process) concerning change, release and deployment management are being put in place.
Active support for operations campaigns and other activities.

10 Federated Operations solution: EGI Service Management
An external FitSM audit of the Federated Operations service, carried out in November 2014, identified the areas for improvement.
A plan for the full implementation of the service management processes required by FitSM is being defined, in order to increase the quality of the services delivered through repeatable and reliable processes.
EGI.eu FitSM in numbers:
– Certification, Advanced: 4 (soon more)
– Certification, Foundation: 10 (50+ across the EGI community)
– Trainers: 2

11 Operations in PY5: NA4.1 Operations coordination
Effort: 99.6 PMs (1123 PMs in PY4); 90% use of resources achieved.
The operational activities at NGI level, no longer funded by the project, have been sustained by the NGIs.
The communication channels between the EGI partners, set up at the beginning of the project, were provided seamlessly during PY5.
Coordination activities include the supervision of the core services, now funded by council fees and in-kind contributions.
Operations campaigns
– dCache storage element: version 2.2.x had a longer support period than the other UMD-2 components
– APEL clients: migration from UMD-2 to UMD-3 client versions was not transparent
– VOMS for ops VO (used in monitoring): ops VOMS servers must be configured in every user-facing monitored service
Further integration with OSG
– CVMFS: enabled common application software distribution for the VOs
– Accounting: import of the usage of US resources by EGI VOs into the EGI accounting repositories

12 Operations in PY5: NA4.1 Operations coordination
Service Level Agreement and Operational Level Agreement framework
Definition of a clear and complete framework:
Federated Operations Service
– EGI.eu OLA, agreed with each EGI partner responsible for a given service component: in place
– EGI.eu SLA, agreed with EGI members: in place
High-Throughput Computing and Cloud Computing Platform
– VO SLA: being implemented in the Resource Allocation tool
– VO OLA: being implemented in the Resource Allocation tool
EGI Infrastructure
– Resource Center OLA: in place
– Resource infrastructure Provider OLA: in place
– Technology Provider UA: final changes

13 Operations in PY5: Security
Incident prevention, including vulnerability handling (security monitoring, security intelligence, assessment of known vulnerabilities with the support of SVG, preparation of advisories)
– 29 vulnerabilities handled by SVG
– 19 Security Groups advisories: 3 critical, 8 high risk
– 70 CSIRT tickets tracking site response to advisories
Incident response (incident handling including investigation, heads up, coordination with site CSIRTs, forensics, technical support, advisories, reports)
– 15 security incidents in PY5 (2 in PY1, 10 in PY2, 3 in PY3, 10 in PY4)

14 Operations in PY5: Security
EGI CSIRT
– New members: 2 from Fed Cloud, 1 from WLCG/CERN, 1 other
– In the last few months fewer vulnerabilities concerned Grid middleware, and more concerned VO software and commercial software
– EGI CSIRT was successfully certified by the TF-CSIRT Steering Committee (Oct 2014)
– Leading a new activity (Sirtfi) building a Trust Framework for security operations in the national identity federations and eduGAIN
– Made good progress on understanding issues for Security Policy and Operations in the EGI Federated Cloud

15 Operations in PY5: Authentication and Authorization Infrastructure
Currently, EGI users mostly use X.509 credentials to access the production services.
User communities who need to access EGI with username/password use science gateways, which hide the need to deal with certificates.
The medium-term strategy is to replicate the current architecture for managing user communities in the other authentication technologies already used by the users. This will allow users to use their existing institutional credentials.
– Deployment of a pilot on cloud services started during PY5
– Study of the feasibility of direct integration of federated identities in EGI services

16 Operations in PY5: SA5.1 Operating a reliable federated institutional IaaS Cloud service
Effort: 53.3 PMs; 108% use of resources achieved.
Improvement and formalisation of the operational activities and processes for the federated Cloud infrastructure, including support for new sites joining the federated cloud.
Site certification procedure improvements
– One procedure for HTC and Cloud
– Manual check instructions
– Information security checks
Campaigns
– Accounting publishing and including VO information in records
– Information system publishing
– GOCDB: VM image descriptions
– Supporting the dteam VO (infrastructure VO meant for testing and troubleshooting)

17 Operations in PY5: SA5.1 Operating a reliable federated institutional IaaS Cloud service
Service support & improvement activities:
– GGUS support units (SU): a set of dedicated support units was set up in GGUS to track operational incidents in the federated Cloud infrastructure
– Availability & Reliability monitoring: A/R metrics are generated and collected on a monthly basis, together with other OLA service level targets
– CMF release and deployment management: preparation for Cloud Management Framework (CMF) integration code to be released and deployed using the UMD process
– Resource provisioning: support for cloud sites in the Resource Allocation process; cloud resources can be offered through e-Grant
– CMF production infrastructure integration: a new procedure for the integration of new Cloud Management Frameworks and Grid middleware in the EGI production infrastructure
User related activities:
– VO management: updating the VO creation and VO decommission procedures to include the creation and support of cloud VOs
– User SLA: working on the first EGI User SLA document, based on a Cloud use case with the BioVeL community

18 Operations: Summary EGI-InSPIRE
– Expansion: increase of usage and resource infrastructure (RP, RC)
– Federation: NGI structure, regionalisation solutions, engagement of partners
– Service Management: processes, agreements' framework (OLA, SLA, UA, MoU), knowledge (FitSM)
– Sustainability: EGI.eu Core services
– Integration: technology, processes, procedures, tools, activities

19 Operations: Future plans
– Coordinate the operational activities of the EGI production infrastructure, ensuring a secure and reliable provisioning of grid, cloud and storage resources, harmonised between resource providers and peer e-Infrastructures.
– Evolve the security activities in EGI to support the new technologies and resource provisioning paradigms, maintaining a secure and trustworthy infrastructure while supporting new use cases and new ways to access the resources.
– Integrate and deploy platforms on cloud and grid resources to support new use cases for existing and new EGI users. These platforms will include services for the long tail of science that will lower the barriers for new users to access EGI resources and shorten the learning curve to use them efficiently.

20 Outline: PART II
– Operations tools
– JRA2 overview: effort, tasks and objectives
– Operations tools in PY5
  – Use of resources
  – Issues
– EGI-InSPIRE summary
– Future plans

21 JRA2 Overview
WP       Task      Beneficiary   Total PMs
WP7-E    TJRA1.1   INFN          24
WP7-E    TJRA1.2   KIT-G         47
WP7-E    TJRA1.2   CSIC          12
WP7-E    TJRA1.2   CNRS          12
WP7-E    TJRA1.2   GRNET         12
WP7-E    TJRA1.2   SRCE          12
WP7-E    TJRA1.2   STFC          24
WP7-E    TJRA1.2   CERN          12
WP7-G    TJRA1.3   CSIC          3
WP7-G    TJRA1.3   CNRS          3
WP7-G    TJRA1.3   SRCE          3
WP7-G    TJRA1.3   STFC          3
WP7-G    TJRA1.3   CERN          6
WP7-G    TJRA1.4   KIT-G         18
WP7-G    TJRA1.4   CSIC          18
WP7-G    TJRA1.4   INFN          26
WP7-G    TJRA1.4   STFC          27
WP7-G    TJRA1.5   CNRS          53
7 countries, 8 beneficiaries
PY4 effort: 70 PMs (6 FTEs)
Total effort: 315 PMs (26 FTEs)
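
As a quick cross-check of the effort table on this slide, the per-beneficiary PMs can be summed in a few lines of Python. The row values are copied from the slide; the conversion of 12 PMs per FTE is an assumption made here for illustration, since the slide does not state how FTEs were derived.

```python
# Effort rows as listed on the JRA2 Overview slide: (WP, task, beneficiary, PMs).
ROWS = [
    ("WP7-E", "TJRA1.1", "INFN", 24),
    ("WP7-E", "TJRA1.2", "KIT-G", 47),
    ("WP7-E", "TJRA1.2", "CSIC", 12),
    ("WP7-E", "TJRA1.2", "CNRS", 12),
    ("WP7-E", "TJRA1.2", "GRNET", 12),
    ("WP7-E", "TJRA1.2", "SRCE", 12),
    ("WP7-E", "TJRA1.2", "STFC", 24),
    ("WP7-E", "TJRA1.2", "CERN", 12),
    ("WP7-G", "TJRA1.3", "CSIC", 3),
    ("WP7-G", "TJRA1.3", "CNRS", 3),
    ("WP7-G", "TJRA1.3", "SRCE", 3),
    ("WP7-G", "TJRA1.3", "STFC", 3),
    ("WP7-G", "TJRA1.3", "CERN", 6),
    ("WP7-G", "TJRA1.4", "KIT-G", 18),
    ("WP7-G", "TJRA1.4", "CSIC", 18),
    ("WP7-G", "TJRA1.4", "INFN", 26),
    ("WP7-G", "TJRA1.4", "STFC", 27),
    ("WP7-G", "TJRA1.5", "CNRS", 53),
]

# Assumption (not stated on the slide): one FTE corresponds to 12 person-months.
PMS_PER_FTE = 12


def total_pms(rows):
    """Sum the person-months over all table rows."""
    return sum(pm for _wp, _task, _beneficiary, pm in rows)


def beneficiaries(rows):
    """Return the distinct beneficiaries, sorted by name."""
    return sorted({beneficiary for _wp, _task, beneficiary, _pm in rows})


if __name__ == "__main__":
    total = total_pms(ROWS)
    print("Total PMs:", total)
    print("Beneficiaries:", len(beneficiaries(ROWS)))
    print("Approximate FTEs:", total / PMS_PER_FTE)
```

Under that assumption the rows add up to the 315 PMs quoted on the slide, spread over 8 beneficiaries, and 315 / 12 rounds to the quoted 26 FTEs.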

22 JRA2 Tasks & Objectives
Continuation of software development for a subset of tools that require further development to support pay-per-use proofs of concept and the operation of the Federated Cloud: monitoring, accounting, application database, and resource allocation system.
TJRA2.1 Service Availability Monitoring (39%), leader: Christos Kanellopoulos / GRNET
– Evolve the current Service Availability Monitoring framework towards a more lightweight and customisable solution
TJRA2.2 Accounting (37%), leader: Stuart Pullinger / STFC
– Cloud accounting towards a production system
– Improvement of CPU, parallel job and storage accounting
– Support of new resource types such as GPGPU
TJRA2.3 Application DB (17%), leader: Marios Chatziangelou / IASA
– Developments of the Application DB to support use cases for the EGI Federated Cloud
TJRA2.4 e-Grant (7%), leader: T. Szepieniec / CYFRONET
– Evolve e-Grant to improve the resource allocation procedures

23 Operations Tools in PY5: Service Availability Monitoring
Objective: evolve the current Service Availability Monitoring framework towards a more lightweight and customisable solution.
ARGO [1]
– Full SAM refactor
– Based on the outcome of TSA4.10 (mini-project)
  – New extensible and scalable A/R engine
  – Easy inclusion of new middleware
  – Removed dependency on commercial products (Oracle)
  – Hadoop and MongoDB to store large datasets
  – Standalone instance
[1] http://en.wikipedia.org/wiki/Argo

24 Operations Tools in PY5: Service Availability Monitoring
ARGO A/R framework
– A/R compute engine
  – metrics for Services, Sites, NGIs and VOs
  – based on Hadoop & MongoDB: able to store/manage large datasets
  – supports multiple/custom topologies, result sets and availability profiles
– REST API
  – retrieve the A/R results for a specific Availability Profile, time period, Service Flavour, Site, NGI or VO
  – manage Availability Profiles
  – request re-computations
– Web UI: visualization & management interfaces
  – replacement for MyEGI (no Oracle dependency)
  – uniform look & feel with the Operations Portal
ARGO monitoring engine
– Replacement for SAM Nagios
– Simplified thanks to the usage of the REST API
– Removed Oracle dependency
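
To make the A/R metrics mentioned on this slide more concrete, the sketch below computes availability and reliability from a series of per-interval status samples. It is a minimal illustration, not ARGO's engine: the slides do not give the formulas, so it assumes the commonly used definitions, availability = up / (total - unknown) and reliability = up / (total - unknown - scheduled downtime). The status names and the function `availability_reliability` are hypothetical.

```python
from collections import Counter

# Hypothetical status labels for equally long monitoring intervals.
UP = "UP"
DOWN = "DOWN"
SCHEDULED = "SCHEDULED_DOWNTIME"
UNKNOWN = "UNKNOWN"


def availability_reliability(statuses):
    """Return (availability, reliability) as fractions in [0, 1].

    Assumed definitions (not taken from the slides):
      availability = up / (total - unknown)
      reliability  = up / (total - unknown - scheduled downtime)
    A metric is None when its denominator is zero.
    """
    counts = Counter(statuses)
    up = counts[UP]
    known = sum(counts.values()) - counts[UNKNOWN]
    unscheduled = known - counts[SCHEDULED]
    availability = up / known if known else None
    reliability = up / unscheduled if unscheduled else None
    return availability, reliability


if __name__ == "__main__":
    # One day sampled hourly: 20 h up, 2 h down, 1 h scheduled downtime, 1 h unknown.
    day = [UP] * 20 + [DOWN] * 2 + [SCHEDULED] * 1 + [UNKNOWN] * 1
    availability, reliability = availability_reliability(day)
    print(f"availability={availability:.3f} reliability={reliability:.3f}")
```

Excluding scheduled downtime only from the reliability denominator is what makes reliability at least as high as availability for the same samples.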

25 Operations Tools in PY5: Accounting
Objective: improve CPU, parallel job and storage accounting, and support new resource types such as GPGPU.
Parallel job accounting data available in the portal
– changes to the central repository to periodically send an MPI data summary to the portal
New view to show accounting data (official view soon)
– compliant with the CAR (EMI3) standard
– integrating the MPI accounting data
New versions of the APEL accounting software were released
– bug fixing
Tests on storage accounting
Feasibility study on GPGPU accounting

26 Operations Tools in PY5: Accounting
Objective: evolve cloud accounting towards a production system and support pay-per-use proofs of concept.
Cloud accounting
– New accounting probes developed in collaboration with the EGI Federated Cloud TF: number of VMs, CPU, RAM, disk, network traffic
– New usage record: improved information in the ImageId field, benchmarking
– Hybrid (grid and cloud) VO manager view in the portal
Pay per use
– Help in the definition of prices for grid, cloud and storage
– New views in the portal to allow the estimation of the average monetary cost of the used resources
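
The cost-estimation views mentioned on this slide can be pictured with a small sketch. The slides do not describe the record layout or the pricing model, so everything below is assumed for illustration: the `UsageRecord` fields, the site names and the per-site price per CPU hour are all hypothetical, and the real portal may compute costs differently.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical, simplified usage record: CPU hours consumed at one site."""
    site: str
    cpu_hours: float


def estimated_cost(records, price_per_cpu_hour):
    """Total cost: each record's CPU hours times the price at its site."""
    return sum(r.cpu_hours * price_per_cpu_hour[r.site] for r in records)


def average_price(records, price_per_cpu_hour):
    """Usage-weighted average price per CPU hour (0.0 if there is no usage)."""
    hours = sum(r.cpu_hours for r in records)
    if hours == 0:
        return 0.0
    return estimated_cost(records, price_per_cpu_hour) / hours


if __name__ == "__main__":
    # Hypothetical prices (in an arbitrary currency) and usage.
    prices = {"SITE-A": 0.04, "SITE-B": 0.02}
    usage = [UsageRecord("SITE-A", 100.0), UsageRecord("SITE-B", 300.0)]
    print("Estimated cost:", estimated_cost(usage, prices))
    print("Average price per CPU hour:", average_price(usage, prices))
```

Weighting by consumed hours, rather than averaging the site prices directly, keeps the average consistent with the total cost when sites contribute very different amounts of usage.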

27 Operations Tools in PY5: Application DB
Objective: support use cases for the EGI Federated Cloud.
– 6 major releases in PY5
  – Integration with Information System and GOCDB
– Support for the eduGAIN federated AAI
– Access to the AppDB entries through the EBI's IDP (ELIXIR)

28 Operations Tools in PY5: Application DB
Objective: support use cases for the EGI Federated Cloud.
– VO-wide image lists of VAs: list of images available in all the sites supporting a given VO
– Contextualisation support: software to customise/configure VAs
– EGI FedCloud sites view (OCCI + VAs)
– Integration with OpenAIRE: entities/people interconnections
– Software appliances
– Participation in the pilot "Integrating ELIXIR Reference Datasets into EGI"

29 Operations Tools in PY5: e-Grant
Objective: evolve e-Grant to improve the resource allocation procedures.
Resource allocation process
– approximate pool matching
– expiring pools, enabling/disabling pools
– implementation of e-mail notifications
– improvements to pool definition and metrics description
Support for the EGI Federated Cloud
– allowing customers to request EGI Federated Cloud resources
– allowing EGI Federated Cloud providers to create resource pools
– allocation of Federated Cloud resources

30 Operations Tools in PY5: e-Grant
Support for the pay-per-use Working Group
– team joined the pay-for-use PoC
– integration with EGI GOCDB to import data about resource prices
– creation of a Resource Pool with pay-for-use resources (HTC & Cloud)
– catalogue of the available pools

31 Analysis: Use of Resources/JRA2
112% PMs achieved (aggregated)
– TJRA2.1: 80% PMs achieved; delay in hiring the new personnel needed; work completed in January 2015
– TJRA2.2: 142% PMs
– TJRA2.3: 105% PMs
– TJRA2.4: 150% PMs
CNRS, FCTSG and CYFRONET allocated unfunded effort to properly complete their tasks and satisfy newly emerging requirements.
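
The aggregated figure on this slide can be related to the per-task figures with a short calculation. The sketch assumes that the percentages shown next to each task on the JRA2 Tasks & Objectives slide (39%, 37%, 17%, 7%) are the tasks' shares of planned effort; the slides do not say so explicitly, so this is an interpretation rather than the project's documented method.

```python
# Assumed planned-effort share per task (percentages from the Tasks & Objectives slide).
PLANNED_SHARE = {"TJRA2.1": 0.39, "TJRA2.2": 0.37, "TJRA2.3": 0.17, "TJRA2.4": 0.07}

# Reported use of resources per task (achieved PMs / planned PMs), from this slide.
ACHIEVED = {"TJRA2.1": 0.80, "TJRA2.2": 1.42, "TJRA2.3": 1.05, "TJRA2.4": 1.50}


def aggregated_use(shares, achieved):
    """Share-weighted average of the per-task use of resources."""
    total_share = sum(shares.values())
    weighted = sum(shares[task] * achieved[task] for task in shares)
    return weighted / total_share


if __name__ == "__main__":
    print(f"Aggregated use of resources: {aggregated_use(PLANNED_SHARE, ACHIEVED):.1%}")
```

Under that assumption the weighted average comes out at roughly 112%, which agrees with the aggregated value reported on the slide.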

32 Analysis: PY5 Issues/JRA2
AppDB
– Issue: large number of requirements from the FedCloud TF
– Mitigation:
  – Requirements prioritised in collaboration with the FedCloud TF
  – Roadmap defined according to these priorities
  – All of the most relevant requirements fully satisfied

33 Operations tools: Future plans
Service Registry and Marketplace
– Simplify access to the infrastructure services through new services in the area of Service Registry and Marketplace
Accounting
– Evolve the EGI accounting system to manage the data deluge expected over the next years
– Include new types of accounting metrics (e.g. data accounting)
– Redesign the portal to improve the user experience
Other tools
– Adapt the operations tools to new technologies and to newly emerging requirements
– Define interfaces to create a network of analogous tools, providing users with an integrated view of all the infrastructures involved
– ARGO in production

34 Operations tools: Summary EGI-InSPIRE
Transparent integration of other infrastructures
– Operational tools re-designed to make them technology agnostic
– Regionalisation solution offered by each tool
  – independent tool instance
  – regionalised view inside the central instance
– Offering views of integrated infrastructures
Integration of new technologies and resources
– 111 service types defined in the GOCDB: gLite, UNICORE, Globus, iRODS, ARC, QosCosGrid, BES, Cloud, Torque, Squid, XRootD, COMPSs, Dirac, etc.
– Monitoring framework able to monitor services from: gLite, UNICORE, Globus, ARC, QosCosGrid, Desktop Grids and Cloud
– Accounting Repository (SSM v2) able to account: Cloud (Virtual Machines), CPU, multi-thread jobs and storage

35 Questions
Members of the EGI-InSPIRE collaboration thank the EC for supporting EGI.

36 References: Operations
– EGI.eu Core services/Activities: https://wiki.egi.eu/wiki/Core_EGI_Activities
– EGI.eu Operation Level Agreement: https://documents.egi.eu/secure/ShowDocument?docid=2170
– EGI.eu Federated Operations Service Level Agreement: https://documents.egi.eu/public/ShowDocument?docid=2166
– Site certification: https://wiki.egi.eu/wiki/PROC19
– Site certification manual tests instruction: https://wiki.egi.eu/wiki/HOWTO04_Site_Certification_Manual_tests
– Site security certification: https://wiki.egi.eu/wiki/EGI_CSIRT:Security_Resource_Centre_Certification_Procedure
– Federated Cloud UMD survey: https://www.surveymonkey.com/r/FedCloud_UMD
– Integration of new cloud management framework and grid middleware in EGI Production Infrastructure procedure: https://wiki.egi.eu/wiki/PROC19

37 References: Operations tools
– EGI.eu Core services/Activities: https://wiki.egi.eu/wiki/Core_EGI_Activities

