SA1 Status Report EGEE Grid Operations & Management

Presentation transcript:

SA1 Status Report
EGEE Grid Operations & Management
Maite Barroso, SA1 Activity Leader
IT Department, CERN
Final EU Review of EGEE-II, CERN, 8-9th July 2008

SA1 in Numbers
- Manpower: 61 partners, 29 countries, 228 FTE
- EGEE-II Budget
SA1 – Maite Barroso - EGEE-II Final EU Review – 8-9 July 2008

The EGEE Infrastructure
Support Structures & Processes:
- Operations Coordination Centre
- Regional Operations Centres
- Global Grid User Support
- EGEE Network Operations Centre (SA2)
- Operational Security Coordination Team
Test-beds & Services:
- Production Service
- Pre-production service
- Certification test-beds (SA3)
- Training infrastructure (NA4)
- Training activities (NA3)
Security & Policy Groups:
- Operations Advisory Group (+NA4)
- Joint Security Policy Group
- EuGridPMA (& IGTF)
- Grid Security Vulnerability Group

Cores, Sites, ROCs
- 73709 cores
- 255 sites (145 partner sites)
- 48 countries (33 partner countries)

ROC      Partner - DoW   Partner - actual   Total    % non partner
CERN              1800               4856    6676    27%
France            1252              16203       -    0%
De/CH             1852               8075   12536    36%
Italy             2280               6548    6571    0.4%
UK/I              2010               6618   12040    45%
CE                1163               2959    4711    37%
NE                1860               3207    4110    22%
SEE               1289               3606    3608    0.1%
SWE                898               1699       -    25%
Russia             445               1378    1601    14%
A-P                801               1912    3373    43%
Total            15650              57061   73709    23%
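The "% non partner" column on the slide above is consistent with (total − partner actual) / total. The following is a minimal sketch of that arithmetic, restricted to the rows where both figures survive in the transcript; the function name is illustrative and the formula is inferred from the numbers, not stated on the slide.

```python
def non_partner_share(partner_actual: int, total: int) -> float:
    """Percentage of cores located at non-partner sites (inferred formula)."""
    return 100.0 * (total - partner_actual) / total


# (partner actual, total) core counts copied from the slide.
complete_rows = {
    "CERN": (4856, 6676),
    "De/CH": (8075, 12536),
    "Italy": (6548, 6571),
    "UK/I": (6618, 12040),
    "CE": (2959, 4711),
    "NE": (3207, 4110),
    "SEE": (3606, 3608),
    "Russia": (1378, 1601),
    "A-P": (1912, 3373),
}
GRAND_TOTAL = (57061, 73709)

shares = {roc: non_partner_share(a, t) for roc, (a, t) in complete_rows.items()}
shares["All ROCs"] = non_partner_share(*GRAND_TOTAL)

for roc, pct in shares.items():
    print(f"{roc:8s} {pct:5.1f}%")
```

Rounded to whole percent, the computed values agree with the slide for these rows, which supports reading the column as the share of cores contributed by sites outside the project partnership.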

Workload
[Plot: No. jobs / month]
- 188,000 jobs/day (98,000 jobs/day one year ago)
- 54 million jobs in the 2nd year
- 150K per day sustained average
[Plot: No. jobs / month, excluding HEP and infrastructure]
- 17,000 jobs/day (13,000 jobs/day one year ago)
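As a quick consistency check on the workload figures above, the sketch below derives the daily average from the yearly total and the year-on-year growth factors. It assumes a 365-day project year, which the slide does not state.

```python
JOBS_SECOND_YEAR = 54_000_000
DAYS_PER_YEAR = 365  # assumption: the second project year is treated as 365 days

# Sustained daily average implied by the yearly total.
average_per_day = JOBS_SECOND_YEAR / DAYS_PER_YEAR


def growth_factor(now: float, year_ago: float) -> float:
    """Ratio of the current daily rate to the rate one year earlier."""
    return now / year_ago


all_jobs_growth = growth_factor(188_000, 98_000)  # all VOs
non_hep_growth = growth_factor(17_000, 13_000)    # excluding HEP and infrastructure

print(f"average per day: {average_per_day:,.0f}")
print(f"growth, all jobs: x{all_jobs_growth:.2f}")
print(f"growth, non-HEP jobs: x{non_hep_growth:.2f}")
```

The average comes out just under 150,000 jobs per day, in line with the "150K per day sustained average" on the slide.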

CPU time delivered (CPU months)
[Plot: CPU time delivered, excluding HEP and infrastructure]
- Peak of 5700 CPU-month (3600 CPU-month)

CCRC experience
WLCG Common Computing Readiness Challenges
- Full-scale dress rehearsal for the accelerator run
- All experiments together
- Very demanding requirements, more than needed for the accelerator run in 2008
  - Data transfers in excess of needed levels
  - Workloads at the scale needed for data taking
- E.g. a single experiment, CMS, routinely submitted 100,000 jobs a day, with a peak of 200,000 jobs a day, without problems, to the EGEE and OSG production grids

CCRC: Data transfer results
- All experiments exceeded required rates for extended periods, and simultaneously
  - 1.3 GB/s target
  - Well above 2 GB/s achievable
- All Tier 1s achieved (or exceeded) their target acceptance rates

CCRC: Data transfers - CMS

CCRC experience
WLCG Common Computing Readiness Challenges
- All this using EGEE production infrastructure and operations
  - Reliable production service provided to WLCG
  - Sustainable service model: people were not in panic mode
  - Making use of interoperations with other grid infrastructures
  - Site availability/reliability metrics, accounting, support, operations meetings
- All this with no additional effort
- No impact on daily operations

Interoperations
- Interoperation with OSG is day-to-day business
  - Permanent EGEE/OSG Interoperability Platform operated by SEE region
  - User support processes interconnected (GGUS and GOC)
  - Accounting:
    - Data published from Gratia to EGEE APEL repository
    - Visualization through EGEE Accounting portal
  - Agreed site availability/reliability metrics, stored in EGEE repository and visualized with EGEE tools
- NDGF has interoperated with EGEE since Y2
  - Tests to probe the NDGF resources (arc-CEs) integrated in Service Availability Monitoring
  - All other operations components are there: accounting, resource registration (GOCDB)
  - Operations team from NDGF involved in EGEE grid Operator on Duty rota
- Interoperation with Naregi in progress
  - Discussions started

Accounting
[Diagram components: OSG Sites, OSG Accounting Database (GRATIA); INFN-Grid Sites, INFN-Grid Accounting Database (DGAS); NDGF Sites, NDGF Accounting Database (SGAS); EGEE Sites; Central Accounting Database; Summary Database; EGEE accounting portal]

User support
- GGUS: process and tool well established, accepted and used by the community
  - “A problem is not a problem if a GGUS ticket was not open”
  - Problem reporting, logging and traceability
  - VOs directly involved in shaping GGUS
- Recent new features
  - User is now involved in the final closure of a ticket
  - New status to simplify the work of ROCs
  - Extensive tests before the release
  - New GGUS ticket submission form with help for problem description and other details
  - Escalation reports

Grid Operations
- Grid Operator on Duty
  - Critical activity in maintaining usability and stability of sites
  - NDGF operations team joined
  - Portal for operations: https://cic.gridops.org
  - Regional dashboard concept: first level support for the sites in the region
  - Continuous work on operations procedures
- Contribute to establishment of regional grid infrastructures through related projects, now well beyond Europe
- Solid set of operational tools provided for central operations teams
  - Well suited to the present operational model, widely used
  - Many are shared with other infrastructure projects

Job success rate
- Present job success rate is between 80% and 95%
- Main job failure reason is site misconfiguration
- Two aspects to improve this:
  - In operations:
    - Provide sites with tools to monitor and detect problems as soon as possible: grid monitoring and alarms at the sites
    - Operations support and training for site managers, so they learn to solve the most common problems; quick involvement of experts to solve new ones
    - Measure and publish site reliability
  - In applications:
    - Application specific monitoring of the sites: application specific tests, application dashboards
    - Select “good sites” (the ones that successfully pass the application tests) from the application point of view; experience shows that this gets reliability close to 100%
    - This is done automatically in big VOs (e.g. LHC) and manually for small ones
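The "good sites" selection described above can be sketched as a filter over application test outcomes. This is a minimal illustration only: the site names, the shape of the results dictionary and the rule "every test must pass" are assumptions, not the mechanism used by any particular VO.

```python
from typing import Dict, List


def select_good_sites(test_results: Dict[str, List[bool]]) -> List[str]:
    """Return the sites whose application tests all passed.

    A site with no recorded test outcome is treated as not good, since
    there is no evidence that the application would run there.
    """
    return sorted(
        site
        for site, outcomes in test_results.items()
        if outcomes and all(outcomes)
    )


# Hypothetical outcomes of application-specific tests, one list per site.
results = {
    "site-a": [True, True, True],
    "site-b": [True, False, True],
    "site-c": [],
    "site-d": [True, True],
}

good_sites = select_good_sites(results)
print(good_sites)
```

A VO would then restrict job submission to the returned list and refresh it each time the application tests run.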

Site reliability: early experience
[Table: CERN + Tier 1s, columns Jan 08, Feb 08, Mar 08, Apr 08, May 08; only the values below survive in the transcript]
- Target: 93
- Average, 8 best sites: 96, 95, 98
- Average, all sites: 90, 85, 91
- # above target (+ >90% target): 7 +3, 11 +1
Formal reporting of Tier 2s since October 2007
- Number of sites reporting has increased from 89 to 116 in May 08
- Overall average: 75-80%, but top 50% (20%) of sites: 95% (98%)
- More than 70% of resources are at sites with >90% reliability

Grid Operations Monitoring
- Service Availability Monitoring (SAM):
  - Provides monitoring of grid services from a user perspective
  - Main source of monitoring information for site availability calculations
  - All information stored and displayed centrally
- Changes to move grid monitoring information to the sites
  - As a part of standard site monitoring, so it can raise alarms, etc.
  - First phase: feed grid monitoring results to sites
  - Later, a standard set of sensors to be run at the sites; they will push the information to a central repository
- Site status monitoring: after a survey, the most widely used tools are Nagios (open source) and Lemon
- Prototype based on the Nagios fabric monitoring system, developed within the CE ROC
  - Enables sites to receive instant notification in case of failures
  - Provides them with results from global monitoring systems such as SAM and Network Monitoring
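One way to picture "feed grid monitoring results to sites" is a small adapter that turns centrally produced test results into a Nagios check result. The sketch below relies only on the standard Nagios plugin convention (return codes 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN, plus one line of status text); the input format, the status strings and the test names are assumptions and do not describe the CE ROC prototype.

```python
from typing import Dict, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def sam_to_nagios(results: Dict[str, str]) -> Tuple[int, str]:
    """Map per-test statuses ("ok", "warn", "error") to a Nagios result."""
    if not results:
        return UNKNOWN, "SAM UNKNOWN - no test results received"
    failed = sorted(test for test, status in results.items() if status == "error")
    warned = sorted(test for test, status in results.items() if status == "warn")
    if failed:
        return CRITICAL, "SAM CRITICAL - failed: " + ", ".join(failed)
    if warned:
        return WARNING, "SAM WARNING - degraded: " + ", ".join(warned)
    return OK, "SAM OK - %d tests passed" % len(results)


# Hypothetical results for one site, as they might arrive from a central service.
example_code, example_line = sam_to_nagios(
    {"job-submit": "ok", "replica-mgmt": "error", "ca-version": "warn"}
)
print(example_code, example_line)
```

A real plugin would print the status line and exit with the returned code, so that the site's own Nagios instance can raise its usual alarms.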

Site, regional, central monitoring

[Diagram labels: Central probes (SAM); Local probes; Network monitoring]

VO Monitoring
- SAM widely used by LHC VOs, plugging in their own VO-specific SAM tests, to determine which sites are suitable
- Experiment dashboards extensively used by the LHC community
  - VLMED VO (biomed) using the dashboard for a year now, others interested
- Dashboard framework also used in other areas:
  - Experiment specific: e.g. ATLAS production, CMS site availability
  - Interest in operational dashboards
  - SAM visualization
- Evolution similar to grid operations monitoring:
  - Feed VO monitoring results to the sites
  - Common mechanism

GridMap
- GridMap: high-level visualization of the grid availability
- Collaboration with industry: unfunded collaboration with EDS via CERN’s openlab project
  - http://gridmap.cern.ch/gm
- Display monitoring data in a way that operators can absorb, using advanced visualization techniques
  - Visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
- GridMap is a visualization tool for looking at Service Availability and Reliability
  - Condenses all EGEE sites into a single view
  - More important problems are visually more distinctive
- Used in production by grid and operators
- Looking at other uses of the technique and technology
  - E.g. showing #Jobs, data transfer rates between sites from a VO perspective
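To illustrate the treemap idea behind GridMap, here is a minimal two-level "slice" layout: the canvas is split into one strip per region, and each strip into one cell per site, with cell area proportional to a size measure such as core count. The slides do not describe GridMap's actual layout algorithm, so the algorithm choice, the region and site names, and the weights are all assumptions.

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def slice_rect(weights: Dict[str, float], rect: Rect, split_width: bool) -> Dict[str, Rect]:
    """Cut rect into adjacent cells whose areas are proportional to the weights."""
    x, y, w, h = rect
    total = float(sum(weights.values()))
    cells: Dict[str, Rect] = {}
    offset = 0.0
    for name, weight in weights.items():
        share = weight / total
        if split_width:
            cells[name] = (x + offset, y, w * share, h)
            offset += w * share
        else:
            cells[name] = (x, y + offset, w, h * share)
            offset += h * share
    return cells


def grid_treemap(regions: Dict[str, Dict[str, float]], canvas: Rect) -> Dict[str, Rect]:
    """Lay out regions side by side, then stack each region's sites inside it."""
    region_weights = {name: sum(sites.values()) for name, sites in regions.items()}
    region_cells = slice_rect(region_weights, canvas, split_width=True)
    layout: Dict[str, Rect] = {}
    for name, sites in regions.items():
        layout.update(slice_rect(sites, region_cells[name], split_width=False))
    return layout


# Hypothetical regions and site sizes (e.g. number of cores).
regions = {
    "region-1": {"site-a": 300, "site-b": 100},
    "region-2": {"site-c": 600},
}
layout = grid_treemap(regions, (0.0, 0.0, 100.0, 50.0))
for site, cell in layout.items():
    print(site, cell)
```

Colouring each cell by its availability status would then make a failing large site stand out more than a failing small one, which matches the "more important problems are visually more distinctive" point above.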

Gridmap

Service Level Agreements
- ROC-Site Service Level Description (SLD) modeled on the service management recommendations of ITIL
- ~10 draft iterations, constructive input from both parties (ROCs and Sites), latest version: April ’08
- Areas covered:
  - Hardware and connectivity criteria
  - Description of services covered
  - Service hours
  - Availability
  - Support
  - Service reporting and reviewing
- “SLAs relate to the measurement, reporting and reviewing of service quality as delivered by IT to the business”
- Two ROCs have already signed SLDs with sites (South West Europe: 8, South East Europe: 2), others on-going
- EGEE site availability metrics published since start of 2008

Example Report
- Availability of a site over a given period is the fraction of time the site was UP
- Reliability of a site over a given period is the fraction of time the site was UP (Availability), divided by the scheduled availability
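The two definitions above can be written out as a short calculation. The sketch assumes that "scheduled availability" means the fraction of the period not covered by scheduled downtime; the slide does not spell this out, and the hour figures are invented for illustration.

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the period during which the site was UP."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float, scheduled_down_hours: float) -> float:
    """Availability divided by the scheduled availability.

    Assumption: scheduled availability is the fraction of the period that
    was not declared as scheduled downtime.
    """
    if not 0 <= scheduled_down_hours < total_hours:
        raise ValueError("scheduled downtime must be shorter than the period")
    scheduled_availability = (total_hours - scheduled_down_hours) / total_hours
    return availability(up_hours, total_hours) / scheduled_availability


# Hypothetical 30-day month: 720 hours, 36 of them scheduled downtime, 648 UP.
month_availability = availability(648, 720)
month_reliability = reliability(648, 720, 36)
print(f"availability {month_availability:.1%}, reliability {month_reliability:.1%}")
```

With these numbers the site is 90% available but about 94.7% reliable, because announced maintenance does not count against reliability.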

Operational Security
Operational Security Coordination Team (OSCT)
- Incident response
  - EGEE Incident Response procedure for the sites
  - Security Service Challenge 3 “fire drills” (Tier1s)
    - Procedures generally understood
    - Difficult to apply restrictions (being followed up by the MWSG)
    - Lack of logging and traceability (being followed up by the MWSG)
    - A number of communication problems and site misconfigurations uncovered
- Monitoring
  - Central security tests (SAM) detected a number of insecure configurations
  - Promote security tools usage to the sites (Nagios)
- Training and dissemination
  - Produced training material and recommendations (e.g. ISSEG project)
  - Organised a security training event at EGEE 07 (successful)

Security Policies and CA
- Grid Security Vulnerability Group (GSVG)
  - Handling security vulnerabilities in gLite
  - Assess the risks of discovered vulnerabilities and provide security advisories
  - Published 25 advisories in the past year
- JSPG
  - New and reworked policies in the last year
  - Aiming at making policies more generic and simple for wider adoption at other infrastructures
- EUGridPMA and IGTF
  - The European Policy Management Authority for Grid Authentication in e-Science
  - Establish requirements and best practices for grid identity providers
  - Enable a common trust domain applicable to authentication of end-entities
  - Mature and successful collaboration, distributed activity

Sustainability
- EGEE SA1 results:
  - Reliable, multi-VO, large scale production infrastructure
  - Uninterrupted service
  - Operational processes, tools and documentation
  - Worldwide collaboration between ROCs and sites
- Built together with other national and international grid infrastructures
  - Cooperation ensures geographical growth
- WLCG relies heavily on the present EGEE operations service and is dependent on its future continuation. This is an assurance for the durability of the EGEE operations results.
- To become more sustainable, in EGEE III we want to distribute the responsibility for daily operations and increase automation to reduce manpower
- We are setting the groundwork for the migration to an NGI based model

Summary
- Infrastructure has continued to increase in size, scale, usage and reliability
- EGEE operations is able to cope with the increase without major changes in structure, processes or tools
  - We have the right model
- Interoperation is a fact, used in production
- Distribution and automation are the keys to reducing the effort in the coming years
- Setting the groundwork for the migration to an NGI based model

Key documents
- DSA1.4: Assessment of production service status: https://edms.cern.ch/document/726140
- DSA1.5: Grid Operations Cookbook: https://edms.cern.ch/document/726257
- DSA1.6: Report on ROC progress and issues: https://edms.cern.ch/document/726261
- DSA1.7: Assessment of production Grid infrastructure service status: https://edms.cern.ch/document/726263
- Operations manual: https://edms.cern.ch/document/840932
- EGEE ROC-Site SLD: https://edms.cern.ch/document/860386
- EGEE Incident Response Procedure: https://edms.cern.ch/document/867454
- Virtual Organisation Operations Policy: https://edms.cern.ch/document/853968
- Grid Security Traceability and Logging Policy: https://edms.cern.ch/document/428037
- Approval of Certification Authorities: https://edms.cern.ch/document/428038
- Policy on Grid Multi-User Pilot Jobs: https://edms.cern.ch/document/855383