EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks EGEE Operations Ian Bird, CERN IT/GD LHCC.

Presentation transcript:

EGEE Operations
Ian Bird, CERN IT/GD
LHCC Comprehensive Review of LCG, 19th–20th November 2007
EGEE-II INFSO-RI, Enabling Grids for E-sciencE
EGEE and gLite are registered trademarks

Outline
Overview of infrastructure and usage
Operations Organization & Management
–ROCs etc
–Support
–Security aspects
Monitoring
–Important for improving reliability of sites
Plan for operations in EGEE-III
Summary

The EGEE Infrastructure (slide from EGEE'07; 2nd October)
Test-beds & Services: Production Service; Pre-production service; Certification test-beds (SA3)
Support Structures & Processes: Operations Coordination Centre; Regional Operations Centres; Global Grid User Support; EGEE Network Operations Centre (SA2); Operational Security Coordination Team; Training infrastructure (NA4); Training activities (NA3)
Security & Policy Groups: Operations Advisory Group (+NA4); Joint Security Policy Group; EuGridPMA (& IGTF); Grid Security Vulnerability Group

EGEE infrastructure at a glance (slide from EIROforum DG Assembly, CERN, 15th November)
Grid infrastructure project co-funded by the European Commission, now in its 2nd phase with 91 partners in 32 countries
Sites in 45 countries
45,000 CPUs
12 PetaBytes
> 5000 users
> 100 VOs
> 100,000 jobs/day
Application domains: Archeology, Astronomy, Astrophysics, Civil Protection, Comp. Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, …

EGEE infrastructure use
[Plot: jobs per day; data from EGEE accounting system]
> 90k jobs/day LCG
> 143k jobs/day total

EGEE Operations
Regional Operations Centres (ROC)
–Core operations teams that provide operational oversight
–One in each EGEE region
Grid Operator on Duty
–Teams from 10 of 11 ROCs participate
  ▪ In addition, NDGF wants to participate
–5-weekly rotations: each week 1 team primary and 1 team backup
–Critical activity in maintaining usability and stability of sites
–Important tools
  ▪ Site Availability Tests (SAM)
  ▪ Information system monitoring
  ▪ GGUS system for trouble ticket management
Operations portal (www.gridops.org): access to all operational tools and information
EGEE Network Operations Centre (ENOC)
–Provides link between EGEE grid operations and GEANT/NRENs
–For LHCOPN, a process is underway to define operations procedures and interfaces to grid operations (via ENOC for EGEE)
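The Grid Operator on Duty rotation described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the team names are hypothetical, "5-weekly" is read as each pair of teams coming on duty once every five weeks, and swapping primary and backup on alternate cycles is an invented fairness rule, not something the slide states.

```python
def rotation_schedule(teams, weeks):
    """Pair teams so that each week has one primary and one backup team."""
    if len(teams) % 2:
        raise ValueError("need an even number of teams to pair them")
    # Ten teams give five pairs, so every pair is on duty once per five weeks.
    pairs = [(teams[i], teams[i + 1]) for i in range(0, len(teams), 2)]
    schedule = []
    for week in range(weeks):
        cycle, slot = divmod(week, len(pairs))
        first, second = pairs[slot]
        # Assumption: primary and backup swap roles on alternate cycles.
        primary, backup = (first, second) if cycle % 2 == 0 else (second, first)
        schedule.append({"week": week + 1, "primary": primary, "backup": backup})
    return schedule


teams = [f"ROC-{n}" for n in range(1, 11)]  # hypothetical team names
schedule = rotation_schedule(teams, weeks=10)
for entry in schedule:
    print(entry["week"], entry["primary"], entry["backup"])
```

With ten participating teams, two teams per week covers every team in exactly five weeks, which is consistent with the figures on the slide.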

Operations Progress
Successful releases of major updates to many central operations services (GOCDB, CIC Portal, GGUS)
–CIC Portal new features include raising of alarms and masking of unnecessary alarms
–RSS feed for CIC Portal alarms so that site administrators can monitor their own sites
–Major update to GOCDB which included many new, useful features
Implementation of failover for most central operations services
–Still needed for GOC database
–Improvements still needed for other operations services (for example CIC Portal)

Operations Progress (continued)
Implementation of a formalized grid middleware release process
–Moved from “big bang” releases to incremental updates
–Formal, documented process now in place, handled by teams rather than single-point-of-failure individuals
Process implemented by the ROCs to track the most urgent/important grid issues
–These are passed to the TCG where appropriate and have resulted in significant improvements, for example standardization and improvement of middleware logging
Interoperability with OSG in production
–CMS now submits jobs to both grids (EGEE and OSG) through a single WMS

Pre-production service
Pre-production service (PPS) is now ~27 sites in 16 countries
Provides access to some 3000 CPUs
–Some sites allow access to their full production batch systems for scale tests
Sites install and test different configurations and sets of services
Weekly update cycle
Aim is to get a good feeling for the quality of a release or update before general release to production
Larger sites gain experience on PPS before going to production
The service is not used at the level that was foreseen
Many issues for LHC experiments
–Lack of effort
–Difficulty of testing complex software stacks outside the production environment
Discussing how best to use it for various needs

User Support
GGUS – Grid User Support
–Provides infrastructure and staff to follow problems
–Distributed ticketing system with links to ROC and other ticket systems
Several improvements as requested by users:
–New search engine
–Ticket linking
–Subscription to tickets
–Local helpdesks
–Reporting tools
Bidirectional interface with OSG user support
TPM first line support works smoothly now
Clear distinction between Services and Software Support Units
Still responsiveness issues when problems leave the influence sphere of grid operations

User Support: Communication
With all Grid Sites, including OSG, weekly at the Operations meeting
Monthly meeting with ROCs, VOs and GGUS developers
Discussions at CHEP’07 on better integration of VOs in the overall support effort
Workshops to establish and improve connection between grid and VO user support

Operational Security (OSCT)
Successes:
–OSCT now well established, with a member per ROC
–Duty Coordinator on weekly shifts
–OSCT provided its first security training event during EGEE’07
Issues:
–OSCT is looking for additional experts to contribute to its activities
  ▪ In EGEE-III all ROCs contribute 1 FTE
Progress:
–OSCT is gradually introducing SAM Security tests to check for known security issues at the sites
  ▪ Uses special tests, securely transported and visible only to OSCT
–Ongoing security service challenges: phase 3 underway
–Bi-lateral contact between EGEE and OSG security operations recently established

Security Policies
[Diagram of the policy set; some items are marked "New in 2007"]
Security Policy
Site & VO Policies
Certification Authorities
Audit Requirements
Incident Response
Accounting Data Privacy
Grid Services including pilot jobs
Grid & VO AUPs

Security Policy (JSPG)
Updated the top-level Security Policy to make it simpler and more general
–Generalisation and simplification of the policies was needed to achieve interoperable (identical) policies between EGEE, OSG, NDGF and others
New policies: Site Operations, VO Operations, Pilot Jobs
–Sites have to accept and sign the Site Operations policy
–VOs have to accept and sign the VO Operations policy
Also working on
–Accounting Data Privacy
–Logging and Traceability (to replace Audit)
–Portals

Vulnerability handling (GSVG)
Policy and procedures have been approved by EGEE
–This allows the disclosure of issues concerning EGEE middleware when they reach the Target Date for resolution
The Risk Assessment team handles Security Vulnerability issues and carries out Risk Assessments
–Target Date for publication set according to risk
Status at end of Sep 2007, since GSVG started (end 2005):
–122 issues analysed (1 – 2 per week)
  ▪ 62 open (42 are software bugs); 60 closed (25 bug fixes, 7 operational)
  ▪ 1 extremely critical, 9 high risk (2 open)
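As a quick consistency check of the GSVG figures, the short Python sketch below recomputes the open/closed split and the issue rate. The exact start and end dates are assumptions chosen to match "end 2005" and "end of Sep 2007"; the slide gives no precise dates.

```python
from datetime import date

issues_total = 122
open_issues, closed_issues = 62, 60
# The open and closed counts quoted on the slide add up to the total.
assert open_issues + closed_issues == issues_total

start = date(2005, 12, 31)  # assumed date for "end 2005"
end = date(2007, 9, 30)     # assumed date for "end of Sep 2007"
weeks = (end - start).days / 7
rate = issues_total / weeks
print(f"{weeks:.0f} weeks, {rate:.2f} issues per week")
```

Under these assumptions the rate comes out at roughly 1.3 issues per week, in line with the "1 – 2 per week" quoted on the slide.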

Coordination & Communication
Weekly operations meeting
–WLCG, EGEE, OSG: reviews all operational issues
Operations workshops and WLCG workshops
–December 2006 – WLCG Tier 2 workshop (India)
–January 2007 – WLCG Collaboration workshop
–June 2007 – EGEE/OSG operations workshop
–September 2007 – WLCG Collaboration workshop
–November 2007 – WLCG Service Reliability workshop
Bi-weekly service coordination meeting (CERN)
EGEE internal coordination:
–Bi-weekly ROC managers’ meeting

Monitoring & Tools
Monitoring working groups:
  ▪ Grid service monitoring, and fabric monitoring for small sites
  ▪ Experiment Dashboards
  ▪ Fabric management (best practices and tools)
  ▪ Set up via HEPiX to pool resources from system managers
  ▪ Overall coordination to ensure commonality where possible and avoid duplication

SAM
Migrated from SFT to SAM
–Infrastructure to run tests against grid services at a site or central services
–Test results accumulated in a database
–Big improvements in standardizing the framework
  ▪ Anyone can now easily contribute tests
  ▪ Now easier for people to run their own instance of the service
–SAM now used in one way or another by all the LHC experiments
–Can equally be used to run experiment-specific services
Derives:
–Site and service availability and reliability metrics
  ▪ For agreed set of services as a general measure
  ▪ For experiment-specified set (including experiment-specific) as a measure per experiment
Allows:
–Experiments to dynamically select sites that:
  ▪ Fulfill their specific availability criteria
  ▪ Blacklist, whitelist sites
  ▪ Raise alarms against sites (soon …); currently a ticket gets opened and the ROC follows up on a failing site
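The availability and reliability metrics, and the site selection they enable, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the slide does not give SAM's formulas, so availability is taken as up-time over total time and reliability as up-time over total time minus scheduled downtime; the site names, hours and the `select_sites` helper are hypothetical and are not SAM's actual interface.

```python
from dataclasses import dataclass


@dataclass
class SiteHours:
    name: str
    total: float           # hours in the reporting period
    up: float              # hours in which the critical tests passed
    scheduled_down: float  # hours of announced downtime

    @property
    def availability(self) -> float:
        return self.up / self.total

    @property
    def reliability(self) -> float:
        # Assumed definition: scheduled downtime does not count against the site.
        return self.up / (self.total - self.scheduled_down)


def select_sites(sites, min_availability, whitelist=(), blacklist=()):
    """Return the names of sites an experiment would accept."""
    chosen = []
    for site in sites:
        if site.name in blacklist:
            continue  # blacklisted sites are never used
        if site.name in whitelist or site.availability >= min_availability:
            chosen.append(site.name)
    return chosen


sites = [
    SiteHours("SITE-A", total=720, up=684, scheduled_down=24),
    SiteHours("SITE-B", total=720, up=540, scheduled_down=0),
    SiteHours("SITE-C", total=720, up=700, scheduled_down=0),
]
selected = select_sites(sites, 0.90, whitelist={"SITE-B"}, blacklist={"SITE-C"})
print(selected)
```

In this sketch the blacklist takes precedence over the whitelist, and the whitelist overrides the availability threshold; a real experiment framework may order these rules differently.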

WLCG Grid Monitoring Landscape
[Diagram with columns: Domain, Monitoring, Tools in use]
Local resources (site): Local monitoring: Lemon/SLS, Nagios, Ganglia, ...
Grid Middleware (site services, central services): Grid Services monitoring: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
Grid Applications: Application monitoring: Experiment Dashboards, ...

High Level model
[Diagram; labels: LEMON, Nagios, SAM, R-GMA, SAME, GridView, Experiment Dashboard, GridIce, HTTP, LDAP, GOCDB, Dashboard, GridView, GridMap]
For details see https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf

Grid Site Monitoring principles
Provide an easily extensible site monitoring system
–Or be able to plug grid features into existing site monitoring
Should be able to provide (or augment) alarms at the site for the grid services
Don’t force a solution on the site administrators
–Should work with any fabric monitoring system that provides basic functionality
Provide the specific plugins to deal with the Grid
Enable export of the data from the site into standard grid monitoring systems, e.g. SAM, GridView, GridICE, …
–Avoid duplicate running of probes

Goals
Bring in data from existing monitoring systems inside the site monitoring tools
–Service Availability Monitoring (SAM)
–Network performance monitoring (NPM)
–Experiment site blacklists (FCR tool)
–Experiment dashboards, …
Prototype available based on Nagios
–Nagios widely used in the community
Now integrate with LEMON
–As next most common solution
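To make the idea of feeding external test results into a site's Nagios concrete, here is a small Python sketch of a Nagios-style check. It relies only on the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) with a one-line status message. The external status strings, the service name and the `to_nagios` helper are hypothetical; this is not the prototype mentioned on the slide.

```python
# Standard Nagios plugin exit codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Hypothetical mapping from an external test status to a Nagios state.
STATUS_MAP = {"ok": "OK", "warn": "WARNING", "error": "CRITICAL"}


def to_nagios(service, external_status, detail=""):
    """Translate an externally produced test result into (exit code, message)."""
    # Anything the mapping does not recognise is reported as UNKNOWN.
    state = STATUS_MAP.get(external_status.lower(), "UNKNOWN")
    message = f"{service} {state}"
    if detail:
        message += f" - {detail}"
    return NAGIOS_CODES[state], message


code, message = to_nagios("CE-job-submit", "error", "test timed out")
print(code, message)
# A real plugin would print only the message and then call sys.exit(code).
```

Because the translation happens at the plugin boundary, the same external result can be shown in the site's own monitoring without running the probe a second time.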

GridMap Prototype Visualization
[Screenshot; callouts:]
Grid topology view (grouping)
Metric selection for size of rectangles
Metric selection for colour of rectangles
–Show SAM status
–Show GridView availability data
VO selection
Overall Site or Site Service selection
Drilldown into region by clicking on the title
Context sensitive information
Colour Key
Description of current view
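The GridMap idea of using one metric for rectangle size and another for rectangle colour can be sketched as follows in Python. This is a toy data preparation step only, not the GridMap implementation: the colour thresholds, field names and site data are invented for illustration, and no treemap layout is performed.

```python
def colour_for(availability):
    """Map an availability fraction to a colour; thresholds are assumed."""
    if availability is None:
        return "grey"  # no data for this site
    if availability >= 0.9:
        return "green"
    if availability >= 0.5:
        return "orange"
    return "red"


def gridmap_cells(sites, size_metric, colour_metric):
    """Compute the relative area and colour of one rectangle per site."""
    total = sum(site[size_metric] for site in sites)
    cells = []
    for site in sorted(sites, key=lambda s: s[size_metric], reverse=True):
        cells.append({
            "site": site["name"],
            "region": site["region"],
            "area": site[size_metric] / total,  # share of the whole map
            "colour": colour_for(site.get(colour_metric)),
        })
    return cells


# Hypothetical sites and metric values.
sites = [
    {"name": "SITE-A", "region": "R1", "cpus": 600, "availability": 0.97},
    {"name": "SITE-B", "region": "R1", "cpus": 300, "availability": 0.62},
    {"name": "SITE-C", "region": "R2", "cpus": 100, "availability": None},
]
cells = gridmap_cells(sites, size_metric="cpus", colour_metric="availability")
for cell in cells:
    print(cell)
```

Keeping size and colour as independent, selectable metrics is what lets one view answer both "how big is this site" and "how healthy is it" at a glance.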

GridMap: Link to Existing Tools
Clicking on a site opens a page with details in GridView/SAM
[Screenshots: Site Detail; Availability; SAM Test Results]

Plans for EGEE-III
No major change in operational procedures
All 11 ROCs will participate in
–Operator on duty
–GGUS ticket processing
–Operational security team
Effective level of funding for SA1 (grid operations)
–Is likely to be ~25% less than in EGEE-II
–Implies operations must be more effective with fewer operational staff
–NB: Tier 1s and many Tier 2s rely on this effort for grid operations support
During EGEE-III must plan for improved efficiency of operations
–Emphasis on continually improving monitoring
  ▪ Automation of alarms
–Move towards sites monitoring and generating alarms rather than central “oversight”

Summary
EGEE Grid operations are becoming mature
–Well established procedures, evolving with experience
–Good collaboration with OSG on interoperations
  ▪ Operations and user support, monitoring, etc.
Monitoring tools are critical elements
–SAM: widely used by operations teams and experiments
–Availability/reliability of Tier 1s and now Tier 2s monitored for MoU compliance
–SAM tests can now raise alarms at sites with work done on monitoring infrastructure
Focus on improving stability and reliability of services in preparation for CCRC’08 and LHC start-up