Published by Dorthy Hopkins. Modified over 6 years ago.
1
SA1 Status Report EGEE Grid Operations & Management
Maite Barroso, SA1 Activity Leader
IT Department, CERN
Final EU Review of EGEE-II, CERN, 8-9 July 2008
2
SA1 in Numbers Manpower: 61 partners, 29 countries, 228 FTE
EGEE-II Budget SA1 – Maite Barroso - EGEE-II Final EU Review – 8-9 July 2008
3
The EGEE Infrastructure
Support Structures & Processes:
  Operations Coordination Centre
  Regional Operations Centres
  Global Grid User Support
  EGEE Network Operations Centre (SA2)
  Operational Security Coordination Team
Test-beds & Services:
  Production Service
  Pre-production service
  Certification test-beds (SA3)
Training infrastructure (NA4)
Training activities (NA3)
Security & Policy Groups:
  Operations Advisory Group (+NA4)
  Joint Security Policy Group
  EuGridPMA (& IGTF)
  Grid Security Vulnerability Group
4
Cores, Sites, ROCs
73709 cores
255 sites (145 partner sites)
48 countries (33 partner countries)

ROC      Partner - DoW   Partner - actual   Total    % non partner
CERN     1800            4856               6676     27%
France   1252            16203              16203    0%
De/CH    1852            8075               12536    36%
Italy    2280            6548               6571     0.4%
UK/I     2010            6618               12040    45%
CE       1163            2959               4711     37%
NE       1860            3207               4110     22%
SEE      1289            3606               3608     0.1%
SWE      898             1699               2280     25%
Russia   445             1378               1601     14%
A-P      801             1912               3373     43%
Total    15650           57061              73709    23%
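The "% non partner" column appears to be the share of each ROC's total cores that sit outside partner sites. A minimal Python sketch, assuming that reading; the function name is illustrative, and the core counts are copied from the table above:

```python
def non_partner_share(partner_actual: int, total: int) -> float:
    """Percentage of a ROC's cores contributed by non-partner sites."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - partner_actual) / total


# (partner actual, total) core counts copied from the table
rocs = {
    "CERN": (4856, 6676),
    "UK/I": (6618, 12040),
    "Total": (57061, 73709),
}

# Round to whole percent, as the slide does for most rows
shares = {name: round(non_partner_share(p, t)) for name, (p, t) in rocs.items()}
print(shares)
```

Under this assumption the rounded values agree with the percentages printed on the slide for these rows.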
5
Workload
[Figure: No. jobs / month] jobs/day (98000 jobs/day 1y ago)
54 million jobs in the 2nd year
150K per day sustained average
[Figure: No. jobs / month, exc. HEP, Infra] jobs/day (13000 jobs/day 1y ago)
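The "150K per day" figure is consistent with spreading the yearly total over a year. A quick check in Python, assuming a 365-day reporting year (the slide does not state the exact period length):

```python
JOBS_IN_SECOND_YEAR = 54_000_000  # "54 million jobs in the 2nd year"
DAYS_PER_YEAR = 365               # assumption: 365-day reporting year


def average_jobs_per_day(total_jobs: int, days: int = DAYS_PER_YEAR) -> float:
    """Sustained daily average over the whole period."""
    if days <= 0:
        raise ValueError("days must be positive")
    return total_jobs / days


avg = average_jobs_per_day(JOBS_IN_SECOND_YEAR)
print(f"{avg:,.0f} jobs/day")
```

The result is a little under 150,000, which matches the rounded figure quoted on the slide.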
6
CPU time delivered (CPU months)
exc. HEP, Infra: peak of 5700 CPU-month (3600 CPU-month)
7
WLCG Common Computing Readiness Challenges
CCRC experience
Full-scale dress rehearsal for the accelerator run
All experiments together
Very demanding requirements, more than needed for accelerator run in 2008
  Data transfers in excess of needed levels
  Workloads at scale needed for data taking
E.g. only one experiment, CMS, submitted jobs a day routinely, day peak without problem, to EGEE and OSG production grids
8
CCRC: Data transfer results
All experiments exceeded required rates for extended periods, and simultaneously
  1.3 GB/s target
  Well above 2 GB/s achievable
All Tier 1s achieved (or exceeded) their target acceptance rates
9
CCRC: Data transfers - CMS
10
WLCG Common Computing Readiness Challenges
CCRC experience
All this using EGEE production infrastructure and operations
  Reliable production service provided to WLCG
  Sustainable service model: people were not in panic mode
  Making use of interoperations with other grid infrastructures
  Site availability/reliability metrics, accounting, support, operations meetings
All this with no additional effort
  No impact on daily operations
11
Interoperations
Interoperation with OSG is day-to-day business
  Permanent EGEE/OSG Interoperability Platform operated by SEE region
  User support processes interconnected (GGUS and GOC)
  Accounting: data published from Gratia to EGEE APEL repository; visualization through EGEE Accounting portal
  Agreed site availability/reliability metrics, stored in EGEE repository and visualized with EGEE tools
NDGF has interoperated with EGEE since Y2
  Tests to probe the NDGF resources (arc-CEs) integrated in Service Availability Monitoring
  All other operations components are there: accounting, resource registration (GOCDB)
  Operation team from NDGF involved in EGEE grid Operator on Duty rota
Interoperation with NAREGI in progress
  Discussions started
12
Accounting
[Diagram: accounting systems feeding the EGEE accounting portal]
OSG Sites: GRATIA, OSG Accounting Database
INFN-Grid Sites: DGAS, INFN-Grid Accounting Database
NDGF Sites: SGAS, NDGF Accounting Database
EGEE Sites: Central Accounting Database, Summary Database, EGEE accounting portal
13
User support
GGUS: process and tool well established, accepted and used by the community
  “A problem is not a problem if a GGUS ticket was not open”
  Problem reporting, logging and traceability
VOs directly involved in shaping GGUS
Recent new features
  User is now involved in the final closure of a ticket
  New status to simplify the work of ROCs
  Extensive tests before the release
  New GGUS ticket submission form with help for problem description and other details
  Escalation reports
14
Grid Operations
Grid Operator on Duty
  Critical activity in maintaining usability and stability of sites
  NDGF operations team joined
  Portal for operations
  Regional dashboard concept: first level support for the sites in the region
  Continuous work on operations procedures
Contribute to establishment of regional grid infrastructures through related projects, now well beyond Europe
Solid set of operational tools provided for central operations teams
  Well suited to the present operational model, widely used
  Many are shared with other infrastructure projects
15
Job success rate
Present job success rate between 80% and 95%
Main job failure reason is site misconfiguration
Two aspects to improve this:
In operations:
  Provide sites with tools to monitor and detect the problems as soon as possible: grid monitoring and alarms at the sites
  Operations support and training to site managers, so they learn to solve the most common problems; quick involvement from experts to solve new ones
  Measure and publish site reliability
In applications:
  Application specific monitoring of the sites: application specific tests, application dashboards
  Select “good sites” (the ones that successfully pass the application tests) from the application point of view; experience shows that this gets reliability close to 100%
  This is done automatically in big VOs (e.g. LHC) and manually for small ones
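The "good sites" selection can be pictured as a filter over per-site test results. The sketch below is only an illustration: the site names, test names and data layout are hypothetical, and it does not reproduce how any VO actually implements this.

```python
from typing import Dict, List


def select_good_sites(test_results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return sites that passed every application-specific test, sorted by name."""
    return sorted(
        site
        for site, results in test_results.items()
        if results and all(results.values())
    )


# Hypothetical test outcomes for three made-up sites
results = {
    "site-a": {"software_installed": True, "storage_reachable": True},
    "site-b": {"software_installed": True, "storage_reachable": False},
    "site-c": {},  # no results yet: not treated as "good"
}

good_sites = select_good_sites(results)
print(good_sites)
```

Treating a site with no results as not good is a conservative choice made for this sketch; a VO could equally decide to keep such sites until a test fails.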
16
Site reliability: early experience
CERN + Tier 1s, monthly, Jan 08 to May 08:
  Target: 93
  Average, 8 best sites: 96 95 98
  Average, all sites: 90 85 91
  # above target (+ >90% target): 7 +3, 11 +1
Tier 2s:
  Formal reporting of Tier 2s since October 2007
  Number of sites reporting has increased from 89 to 116 in May 08
  Overall average: 75-80%, but top 50% (20%) of sites: 95% (98%)
  More than 70% of resources are at sites with >90% reliability
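The Tier 2 figures combine two views: an average over sites, and a share of resources weighted by site size. A small Python sketch of both calculations, using made-up numbers rather than data from the review; the function names are illustrative:

```python
from typing import List, Tuple


def resource_share_above(sites: List[Tuple[float, int]], threshold: float) -> float:
    """Fraction of all cores hosted at sites whose reliability exceeds threshold.

    sites: (reliability in percent, number of cores) for each site.
    """
    total = sum(cores for _, cores in sites)
    if total == 0:
        raise ValueError("no resources reported")
    reliable = sum(cores for rel, cores in sites if rel > threshold)
    return reliable / total


def top_fraction_average(reliabilities: List[float], fraction: float) -> float:
    """Mean reliability of the best `fraction` of sites (at least one site)."""
    ranked = sorted(reliabilities, reverse=True)
    count = max(1, int(len(ranked) * fraction))
    return sum(ranked[:count]) / count


# Made-up sample: (reliability %, cores) per site
sample = [(98.0, 4000), (95.0, 3000), (80.0, 2000), (60.0, 1000)]

share = resource_share_above(sample, 90.0)
top_half = top_fraction_average([rel for rel, _ in sample], 0.5)
print(share, top_half)
```

In this sample the plain average over sites is pulled down by the two small sites, while most cores still sit at sites above the threshold, which is the pattern the slide describes.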
17
Grid Operations Monitoring
Service Availability Monitoring (SAM):
  Provides monitoring of grid services from a user perspective
  Main source of monitoring information for site availability calculations
  All information stored and displayed centrally
Changes to move grid monitoring information to the sites
  As a part of standard site monitoring, so it can raise alarms, etc.
  First phase: feed grid monitoring results to sites
  Later: a standard set of sensors to be run at the sites, which will push the information to a central repository
Site status monitoring: after a survey, the most widely used tools are Nagios (open source) and Lemon
Prototype based on the Nagios fabric monitoring system, developed within the CE ROC
  Enables sites to receive instant notification in case of failures
  Provides them with results from global monitoring systems such as SAM and Network Monitoring
18
Site, regional, central monitoring
19
Central probes (SAM) Local probes Network monitoring
20
VO Monitoring
SAM widely used by LHC VOs, plugging in their own VO-specific SAM tests, to determine which sites are suitable
Experiment dashboards extensively used by the LHC community
  VLMED VO (biomed) using the dashboard for a year now, others interested
Dashboard framework also used in other areas:
  Experiment specific: e.g. ATLAS production, CMS site availability
  Interest for operational dashboards: SAM visualization
Evolution similar to grid operations monitoring:
  Feed VO monitoring results to the sites
  Common mechanism
21
GridMap
GridMap: high-level visualization of the grid availability
Collaboration with industry: unfunded collaboration with EDS via CERN’s openlab project
Display monitoring data in a way that operators can absorb, using advanced visualization techniques
  Visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
GridMap is a visualization tool for looking at Service Availability and Reliability
  Condenses all EGEE sites into a single view
  More important problems are visually more distinctive
  Used in production by grid and operators
Looking at other uses of the technique and technology
  E.g. showing #Jobs, data transfer rates between sites from a VO perspective
22
Gridmap
23
Service Level Agreements
ROC-Site Service Level Description modeled on the service management recommendations of ITIL
  ~10 draft iterations, constructive input from both parties (ROCs and Sites), latest version: April ’08
  Areas covered:
    Hardware and connectivity criteria
    Description of services covered
    Service hours
    Availability
    Support
    Service reporting and reviewing
  “SLAs relate to the measurement, reporting and reviewing of service quality as delivered by IT to the business”
Two ROCs have already signed SLDs with sites (South West Europe: 8, South East Europe: 2), others on-going.
EGEE site availability metrics published since start of 2008:
24
Example Report
Availability of a site over a given period is the fraction of time the site was UP
Reliability of a site over a given period is the fraction of time the site was UP (Availability), divided by the scheduled availability
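The two definitions can be written out as a short calculation. The sketch below assumes that "scheduled availability" means the fraction of the period not covered by scheduled downtime; the slide does not spell this out, and the function names and sample hours are illustrative.

```python
def availability(up_hours: float, period_hours: float) -> float:
    """Fraction of the period during which the site was UP."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    return up_hours / period_hours


def reliability(up_hours: float, period_hours: float,
                scheduled_down_hours: float) -> float:
    """Availability divided by the scheduled availability.

    Assumption: scheduled availability is the fraction of the period
    that is not scheduled downtime.
    """
    scheduled_availability = (period_hours - scheduled_down_hours) / period_hours
    if scheduled_availability <= 0:
        raise ValueError("site was scheduled down for the whole period")
    return availability(up_hours, period_hours) / scheduled_availability


# Hypothetical 30-day month: 720 h in total, 648 h up,
# 36 h scheduled downtime, 36 h unscheduled downtime
a = availability(648, 720)
r = reliability(648, 720, 36)
print(f"availability={a:.3f} reliability={r:.3f}")
```

With these definitions a site is not penalised in its reliability figure for announced maintenance, only for unscheduled outages.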
25
Operational Security
Operational Security Coordination Team (OSCT)
Incident response
  EGEE Incident Response procedure for the sites
  Security Service Challenge 3 “fire drills” (Tier1s)
    Procedures generally understood
    Difficult to apply restrictions (being followed up by the MWSG)
    Lack of logging and traceability (being followed up by the MWSG)
    A number of communication problems and site misconfigurations uncovered
Monitoring
  Central security tests (SAM) detected a number of insecure configurations
  Promote security tools usage to the sites (Nagios)
Training and dissemination
  Produced training material and recommendations (e.g. ISSEG project)
  Organised a security training event at EGEE 07 (successful)
26
Security Policies and CA
Grid Security Vulnerability Group (GSVG)
  Handling security vulnerabilities in gLite
  Assess the risks of discovered vulnerabilities and provide security advisories
  Published 25 advisories in the past year
JSPG
  New and reworked policies in the last year
  Aiming at making policies more generic and simple for wider adoption at other infrastructures
EUGridPMA and IGTF
  The European Policy Management Authority for Grid Authentication in e-Science
  Establish requirements and best practices for grid identity providers
  Enable a common trust domain applicable to authentication of end-entities
Mature and successful collaboration, distributed activity
27
Sustainability
EGEE SA1 results:
  Reliable, multi-VO, large scale production infrastructure
  Uninterrupted service
  Operational processes, tools and documentation
  Worldwide collaboration between ROCs and sites
Built together with other national and international grid infrastructures
  Cooperation ensures geographical growth
WLCG relies heavily on the present EGEE operations service and is dependent on its future continuation. This is an assurance for the durability of the EGEE operations results.
To become more sustainable, in EGEE-III we want to distribute the responsibility for daily operations and increase automation to reduce manpower
We are setting the groundwork for the migration to an NGI based model
28
Summary
Infrastructure has continued to increase in size, scale, usage and reliability
EGEE operations is able to cope with the increase without major changes in structure, processes or tools
  We have the right model
Interoperation is a fact, used in production
Distribution and automation are the keys to reducing the effort in the coming years
Setting the groundwork for the migration to an NGI based model
29
Key documents
DSA1.4: Assessment of production service status
DSA1.5: Grid Operations Cookbook
DSA1.6: Report on ROC progress and issues
DSA1.7: Assessment of production Grid infrastructure service status
Operations manual
EGEE ROC-Site SLD
EGEE Incident Response Procedure
Virtual Organisation Operations Policy
Grid Security Traceability and Logging Policy
Approval of Certification Authorities
Policy on Grid Multi-User Pilot Jobs