EGEE Operations
Ian Bird, CERN IT/GD
LHCC Comprehensive Review of LCG, 19th-20th November 2007
Enabling Grids for E-sciencE, EGEE-II INFSO-RI
EGEE and gLite are registered trademarks
Outline
Overview of infrastructure and usage
Operations Organization & Management
  – ROCs etc.
  – Support
  – Security aspects
Monitoring
  – Important for improving reliability of sites
Plan for operations in EGEE-III
Summary
The EGEE Infrastructure (EGEE'07; 2nd October)
Test-beds & Services
  – Production Service
  – Pre-production service
  – Certification test-beds (SA3)
Support Structures & Processes
  – Operations Coordination Centre
  – Regional Operations Centres
  – Global Grid User Support
  – EGEE Network Operations Centre (SA2)
  – Operational Security Coordination Team
  – Training infrastructure (NA4)
  – Training activities (NA3)
Security & Policy Groups
  – Operations Advisory Group (+NA4)
  – Joint Security Policy Group
  – EuGridPMA (& IGTF)
  – Grid Security Vulnerability Group
EIROforum DG Assembly, CERN, 15th November
Sites in 45 countries
45,000 CPUs
12 PetaBytes
> 5000 users
> 100 VOs
> 100,000 jobs/day
Application domains: Archeology, Astronomy, Astrophysics, Civil Protection, Comp. Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, …
Grid infrastructure project co-funded by the European Commission, now in its 2nd phase with 91 partners in 32 countries
EGEE infrastructure use
[Plot: jobs per day] > 90k jobs/day for LCG; > 143k jobs/day in total
Data from EGEE accounting system
EGEE Operations
Regional Operations Centres (ROC)
  – Core operations teams that provide operational oversight
  – One in each EGEE region
Grid Operator on Duty
  – Teams from 10 of 11 ROCs participate; in addition, NDGF wants to participate
  – 5-weekly rotations: each week 1 team is primary and 1 team is backup
  – Critical activity in maintaining usability and stability of sites
  – Important tools:
      Site Availability Tests (SAM)
      Information system monitoring
      GGUS system for trouble ticket management
      Operations portal (access to all operational tools and …)
EGEE Network Operations Centre (ENOC)
  – Provides link between EGEE grid operations and GEANT/NRENs
  – For LHCOPN, a process is underway to define operations procedures and interfaces to grid operations (via ENOC for EGEE)
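The operator-on-duty rotation can be illustrated with a small sketch. The slide only states that 10 teams take part, that the rotation is 5-weekly, and that each week has one primary and one backup team. Everything else here is an assumption made for illustration: the team names (`ROC-01` … `ROC-10`), the fixed pairing of teams, and the swapping of primary and backup roles on alternate cycles.

```python
def duty_rota(teams, weeks):
    """Build a weekly rota of (week, primary, backup) tuples.

    Assumption (not from the slides): teams are paired in list order and a
    pair swaps its primary/backup roles every other cycle.
    """
    if len(teams) % 2 != 0:
        raise ValueError("need an even number of teams to form primary/backup pairs")
    # 10 teams -> 5 pairs, so one full cycle lasts 5 weeks.
    pairs = [(teams[i], teams[i + 1]) for i in range(0, len(teams), 2)]
    rota = []
    for week in range(weeks):
        cycle, slot = divmod(week, len(pairs))
        first, second = pairs[slot]
        # Alternate roles between cycles so both teams act as primary.
        primary, backup = (first, second) if cycle % 2 == 0 else (second, first)
        rota.append((week + 1, primary, backup))
    return rota


# Hypothetical team names; the real ROC names are not used here.
teams = [f"ROC-{n:02d}" for n in range(1, 11)]
rota = duty_rota(teams, weeks=10)

for week, primary, backup in rota:
    print(f"week {week:2d}: primary={primary} backup={backup}")
```

With 10 teams and two teams on duty per week, every team is on duty exactly once in each 5-week cycle, which matches the "5-weekly" wording.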
Operations Progress
Successful releases of major updates to many central operations services (GOCDB, CIC Portal, GGUS)
  – CIC Portal new features include raising of alarms and masking of unnecessary alarms
  – RSS feed for CIC Portal alarms so that site administrators can monitor their own sites
  – Major update to GOCDB which included many new, useful features
Implementation of failover for most central operations services
  – Still needed for GOC database
  – Improvements still needed for other operations services (for example CIC Portal)
Operations Progress
Implementation of a formalized grid middleware release process
  – Moved from “big bang” releases to incremental updates
  – Formal, documented process now in place, handled by teams rather than single-point-of-failure individuals
Process implemented by the ROCs to track the most urgent/important grid issues
  – These are passed to the TCG where appropriate and have resulted in significant improvements, for example standardization and improvement of middleware logging
Interoperability with OSG in production
  – CMS now submits jobs to both grids (EGEE and OSG) through a single WMS
Pre-production service
The pre-production service (PPS) now comprises ~27 sites in 16 countries
Provides access to some 3000 CPUs
  – Some sites allow access to their full production batch systems for scale tests
Sites install and test different configurations and sets of services
Weekly update cycle
Aim is to get a good feeling for the quality of a release or update before general release to production
Larger sites gain experience on PPS before going to production
The service is not used at the level that was foreseen
Many issues for LHC experiments
  – Lack of effort
  – Difficulty of testing complex software stacks outside the production environment
Discussing how best to use it for various needs
User Support
GGUS (Grid User Support)
  – Provides infrastructure and staff to follow problems
  – Distributed ticketing system with links to ROC and other ticket systems
Several improvements as requested by users:
  – New search engine
  – Ticket linking
  – Subscription to tickets
  – Local helpdesks
  – Reporting tools
Bidirectional interface with OSG user support
TPM first-line support works smoothly now
Clear distinction between Services and Software Support Units
Still responsiveness issues when problems leave the sphere of influence of grid operations
User Support
Communication:
  – With all Grid sites, including OSG, weekly at the Operations meeting
  – Monthly meeting with ROCs, VOs and GGUS developers
  – Discussions at CHEP’07 on better integration of VOs in the overall support effort
  – Workshops to establish and improve connection between grid and VO user support
Operational Security (OSCT)
Successes:
  – OSCT now well established, with a member per ROC
  – Duty Coordinator on weekly shifts
  – OSCT provided its first security training event during EGEE’07
Issues:
  – OSCT is looking for additional experts to contribute to its activities
      In EGEE-III all ROCs contribute 1 FTE
Progress:
  – OSCT is gradually introducing SAM Security tests to check for known security issues at the sites
      These use special tests, securely transported and visible only to OSCT
  – Ongoing security service challenges: phase 3 underway
  – Bilateral contact between EGEE and OSG security operations recently established
Security Policies
[Diagram: the top-level Security Policy and its related policy documents]
  – Site & VO Policies
  – Certification Authorities
  – Audit Requirements
  – Incident Response
  – Accounting Data Privacy
  – Grid Services including pilot jobs
  – Grid & VO AUPs
Legend: New in 2007
Security Policy (JSPG)
Updated the top-level Security Policy to make it simpler and more general
  – Generalisation and simplification of the policies was needed to achieve interoperable (identical) policies between EGEE, OSG, NDGF and others
New policies: Site Operations, VO Operations, Pilot Jobs
  – Sites have to accept and sign the Site Operations policy
  – VOs have to accept and sign the VO Operations policy
Also working on:
  – Accounting Data Privacy
  – Logging and Traceability (to replace Audit)
  – Portals
Vulnerability handling (GSVG)
Policy and procedures have been approved by EGEE
  – This allows the disclosure of issues concerning EGEE middleware when they reach the Target Date for resolution
The Risk Assessment team handles Security Vulnerability issues and carries out Risk Assessments
  – Target Date for publication set according to risk
Status at end of Sep 2007, since GSVG started (end 2005):
  – 122 issues analysed (1 – 2 per week)
      62 open (42 are software bugs)
      60 closed (25 bug fixes, 7 operational)
      1 extremely critical, 9 high risk (2 open)
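As a quick consistency check on the GSVG figures, the sketch below confirms that the open and closed counts add up to the total and that the quoted rate of 1 to 2 issues per week is plausible. The exact start and end dates are assumptions: the slide only says "end 2005" and "end of Sep 2007", so 31 December 2005 and 30 September 2007 are used for illustration.

```python
from datetime import date

# Figures quoted on the slide.
total_issues = 122
open_issues = 62
closed_issues = 60

# The open and closed counts should account for every analysed issue.
counts_consistent = (open_issues + closed_issues == total_issues)

# Assumed dates (the slide gives only "end 2005" and "end of Sep 2007").
start = date(2005, 12, 31)
end = date(2007, 9, 30)

elapsed_weeks = (end - start).days / 7
issues_per_week = total_issues / elapsed_weeks

print(f"counts consistent: {counts_consistent}")
print(f"elapsed weeks: {elapsed_weeks:.1f}")
print(f"issues per week: {issues_per_week:.2f}")
```

Under these assumed dates the rate falls inside the 1 to 2 per week range given on the slide.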
Coordination & Communication
Weekly operations meeting
  – WLCG, EGEE, OSG: reviews all operational issues
Operations workshops and WLCG workshops
  – December 2006: WLCG Tier 2 workshop (India)
  – January 2007: WLCG Collaboration workshop
  – June 2007: EGEE/OSG operations workshop
  – September 2007: WLCG Collaboration workshop
  – November 2007: WLCG Service Reliability workshop
Bi-weekly service coordination meeting (CERN)
EGEE internal coordination:
  – Bi-weekly ROC managers’ meeting
Monitoring & Tools
Monitoring working groups:
  – Grid service monitoring, and fabric monitoring for small sites
  – Experiment Dashboards
  – Fabric management (best practices and tools)
Set up via HEPiX to pool resources from system managers
Overall coordination to ensure commonality where possible and avoid duplication
SAM
Migrated from SFT to SAM
  – Infrastructure to run tests against grid services at a site or central services
  – Test results accumulated in a database
  – Big improvements in standardizing the framework:
      Anyone can now easily contribute tests
      It is now easier for people to run their own instance of the service
  – SAM now used in one way or another by all the LHC experiments
  – Can equally be used to run experiment-specific services
Derives:
  – Site and service availability and reliability metrics
      For an agreed set of services, as a general measure
      For an experiment-specified set (including experiment-specific services), as a measure per experiment
Allows:
  – Experiments to dynamically select sites:
      Select sites that fulfill their specific availability criteria
      Blacklist, whitelist sites
      Raise alarms against sites (soon …); currently a ticket gets opened and the ROC follows up on a failing site
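To make the availability and reliability metrics and the site selection concrete, here is a minimal sketch. It is not the SAM implementation. It assumes the commonly used definitions (availability is the fraction of all samples in which the site passed; reliability excludes scheduled downtime from the denominator), and the status strings, site names and function names are all hypothetical.

```python
def availability(samples):
    """Fraction of all samples in which the tests passed."""
    if not samples:
        return 0.0
    return samples.count("OK") / len(samples)


def reliability(samples):
    """Like availability, but scheduled downtime is not counted against the site."""
    counted = len(samples) - samples.count("SCHED_DOWN")
    if counted == 0:
        return 0.0
    return samples.count("OK") / counted


def select_sites(results, min_availability, blacklist=(), whitelist=()):
    """Pick sites meeting an experiment's availability criterion.

    Blacklisted sites are always dropped; whitelisted sites are always kept
    (unless they are also blacklisted).
    """
    selected = []
    for site, samples in results.items():
        if site in blacklist:
            continue
        if site in whitelist or availability(samples) >= min_availability:
            selected.append(site)
    return sorted(selected)


# Hypothetical test history: one status per sampling interval.
results = {
    "SITE-A": ["OK"] * 9 + ["FAIL"],
    "SITE-B": ["OK"] * 6 + ["SCHED_DOWN"] * 2 + ["FAIL"] * 2,
    "SITE-C": ["OK"] * 5 + ["FAIL"] * 5,
}

for site, samples in results.items():
    print(site, f"availability={availability(samples):.2f}",
          f"reliability={reliability(samples):.2f}")

print(select_sites(results, min_availability=0.8))
print(select_sites(results, min_availability=0.8, whitelist=["SITE-C"]))
```

Keeping the blacklist check ahead of the whitelist check is a design choice made here so that an explicit exclusion always wins; an experiment could reasonably choose the opposite precedence.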
WLCG Grid Monitoring Landscape
[Diagram: monitoring domains and the tools in use for each]
  – Local resources (site): local monitoring; tools in use: Lemon/SLS, Nagios, Ganglia, ...
  – Grid Middleware (site services, central services): Grid Services monitoring; tools in use: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
  – Grid Applications: Application monitoring; tools in use: Experiment Dashboards, ...
High Level model
[Diagram: components and protocols shown are LEMON, Nagios, SAM, R-GMA, SAME, GridView, Experiment Dashboard, GridICE, GOCDB, Dashboard, GridMap, HTTP and LDAP]
See for details
Grid Site Monitoring principles
Provide an easily extensible site monitoring system
  – Or be able to plug grid features into existing site monitoring
Should be able to provide (or augment) alarms at the site for the grid services
Don’t force a solution on the site administrators
  – Should work with any fabric monitoring system that provides basic functionality
Provide the specific plugins to deal with the Grid
Enable export of the data from the site into standard grid monitoring systems, e.g. SAM, GridView, GridICE, …
  – Avoid duplicate running of probes
Goals
Bring data from existing monitoring systems into the site monitoring tools
  – Service Availability Monitoring (SAM)
  – Network performance monitoring (NPM)
  – Experiment site blacklists (FCR tool)
  – Experiment dashboards, …
Prototype available based on Nagios
  – Nagios widely used in the community
Now integrate with LEMON
  – As next most common solution
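One way to picture bringing SAM data into a Nagios-based site monitor is a thin mapping from a test result to the Nagios plugin convention, in which a check returns exit code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN) together with one line of status text. The sketch below is not the prototype mentioned on the slide: the SAM status strings, the service name and the function `to_nagios` are assumptions chosen for illustration.

```python
# Nagios plugin return codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical mapping from an externally fetched test status to a Nagios state.
STATUS_MAP = {
    "ok": OK,
    "info": OK,
    "warn": WARNING,
    "error": CRITICAL,
    "crit": CRITICAL,
}


def to_nagios(service, sam_status):
    """Translate one test result into (exit_code, status_line).

    Anything not in STATUS_MAP is reported as UNKNOWN rather than guessed.
    """
    code = STATUS_MAP.get(sam_status.lower(), UNKNOWN)
    return code, f"{service} {LABELS[code]} - reported status: {sam_status}"


# Example: a passing and a failing result for a hypothetical service name.
for status in ("ok", "error", "maintenance"):
    code, line = to_nagios("CE-job-submit", status)
    print(code, line)
```

In an actual plugin the status line would be printed and the code passed to `sys.exit`, so the local Nagios instance can raise the alarm at the site without running the probe a second time.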
GridMap Prototype Visualization
[Screenshot with callouts:]
  – Grid topology view (grouping)
  – Metric selection for size of rectangles
  – Metric selection for colour of rectangles
      Show SAM status
      Show GridView availability data
  – VO selection
  – Overall Site or Site Service selection
  – Drilldown into region by clicking on the title
  – Context sensitive information
  – Colour Key
  – Description of current view
Link:
GridMap: Link to Existing Tools
Clicking on a site opens a page with details in GridView/SAM
[Screenshots: Site Detail, Availability, SAM Test Results]
Plans for EGEE-III
No major change in operational procedures
All 11 ROCs will participate in:
  – Operator on duty
  – GGUS ticket processing
  – Operational security team
Effective level of funding for SA1 (grid operations)
  – Is likely to be ~25% less than in EGEE-II
  – Implies operations must be more effective with less operational staff
  – NB: Tier 1s and many Tier 2s rely on this effort for grid operations support
During EGEE-III must plan for improved efficiency of operations
  – Emphasis on continually improving monitoring and automation of alarms
  – Move towards sites monitoring and generating alarms rather than central “oversight”
Summary
EGEE Grid operations are becoming mature
  – Well established procedures, evolving with experience
  – Good collaboration with OSG on interoperations
      Operations and user support, monitoring, etc.
Monitoring tools are critical elements
  – SAM is widely used by operations teams and experiments
  – Availability/reliability of Tier 1s and now Tier 2s monitored for MoU compliance
  – SAM tests can now raise alarms at sites, thanks to work done on the monitoring infrastructure
Focus on improving stability and reliability of services in preparation for CCRC’08 and LHC start-up