6th CIC on Duty meeting, Lyon, 27-29/03/2006
Enabling Grids for E-sciencE
Grid INTER-Operations
Hélène Cordier, EGEE/WLCG Operations
IN2P3 Computing Centre, Lyon (France)
Contents
Existing common interests, so far mainly in solving 2 issues:
– Security and accounting issues; monitoring workflow efforts are diverse.
Existing efforts at inter-project level involving:
– Grid Interoperability Now (GIN, a working group of OGF)
Existing efforts at project level involving:
– EGEE, WLCG and OSG
– NDGF, PRAGMA, TERAGRID and NAREGI
Existing efforts at IN2P3-CC:
– IGTMD
Concerns and Updates
Security & Policy
Joint Security Policy Group
Certification Authorities
– EUGridPMA, IGTF and so on.
Grid Acceptable Use Policy (AUP)
– common, general and simple AUP
– for all VO members using many Grid infrastructures, e.g. EGEE, OSG, SEE-GRID, DEISA, national Grids…
Incident Handling and Response
– defines basic communication paths
– defines requirements (MUSTs) for IR
– not to replace or interfere with local response plans
Policy set: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security
Policy documents: Grid Security Policy (v5.7); Grid Site Operations Policy (v1.4); Virtual Organisation Operations Policy (v1.0)
Usage record working group
Mandate: In order for resources to be shared, sites must be able to exchange basic accounting and usage data in a common format. This working group proposes to define a common usage record based on those in current practice. The record format will be specific enough to facilitate information sharing among grid sites, yet general enough that the usage data can be used for a variety of purposes: traditional usage accounting, service usage monitoring, performance tuning, etc. This group will therefore concentrate on collecting and disseminating resource consumption data. We will not be addressing how that data is to be collected by the resource sites, nor how it will be used by its recipients.
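To make the idea of a common usage record concrete, here is a minimal sketch of a record that one site could serialise and another could read back. Every field name (`record_id`, `global_user_name`, `vo`, `machine_name`, `cpu_duration_s`, and so on) and the choice of JSON are illustrative assumptions; the slides do not give the format defined by the working group.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical common usage record; all field names are assumptions."""
    record_id: str
    global_user_name: str
    vo: str
    machine_name: str
    start_time: datetime
    end_time: datetime
    cpu_duration_s: int
    wall_duration_s: int

    def to_json(self) -> str:
        # Timestamps travel as ISO 8601 strings so any site can parse them.
        payload = asdict(self)
        payload["start_time"] = self.start_time.isoformat()
        payload["end_time"] = self.end_time.isoformat()
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "UsageRecord":
        payload = json.loads(text)
        payload["start_time"] = datetime.fromisoformat(payload["start_time"])
        payload["end_time"] = datetime.fromisoformat(payload["end_time"])
        return cls(**payload)


# Example values are made up for illustration.
record = UsageRecord(
    record_id="site-a/0001",
    global_user_name="CN=Jane Doe",
    vo="cms",
    machine_name="ce.example.org",
    start_time=datetime(2007, 6, 27, 8, 0, tzinfo=timezone.utc),
    end_time=datetime(2007, 6, 27, 9, 0, tzinfo=timezone.utc),
    cpu_duration_s=3000,
    wall_duration_s=3600,
)
wire = record.to_json()
restored = UsageRecord.from_json(wire)
```

The record only describes what was consumed; how a site gathers the numbers and what a recipient does with them stay outside the format, in line with the mandate above.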
Accounting
Tools needed to collect and report information on resource utilization
– Intended audience: site managers, virtual organization managers, grid operators, funding agencies,…
– Need to define common ways of measuring resource consumption, including use of the same units
LCG/EGEE
– CPU usage information (per user or per VO) provided by each site and stored in a central repository. Reports (charts and numeric data) available through a web interface
– Next step: collect information on storage utilization.
– Developed and operated by the Grid Operations Centre (UK) and CESGA (SWE).
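The need for common units can be illustrated with a small aggregation sketch: per-site CPU reports are converted to one unit before being summed per VO, as a central repository might do. The tuple layout, the unit names and the function `cpu_hours_per_vo` are hypothetical and are not taken from the LCG/EGEE accounting tools.

```python
from collections import defaultdict

# Assumed unit table: how many seconds one reported unit represents.
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600}


def cpu_hours_per_vo(reports):
    """Sum CPU usage per VO after converting every report to CPU hours.

    `reports` is an iterable of (site, vo, amount, unit) tuples.
    """
    totals = defaultdict(float)
    for site, vo, amount, unit in reports:
        if unit not in SECONDS_PER_UNIT:
            # Refuse to mix in a value whose unit is not agreed upon.
            raise ValueError(f"unknown unit {unit!r} reported by {site}")
        totals[vo] += amount * SECONDS_PER_UNIT[unit] / 3600
    return dict(totals)


# Made-up reports from two sites using different units.
reports = [
    ("site-a", "cms", 7200, "seconds"),
    ("site-b", "cms", 90, "minutes"),
    ("site-b", "atlas", 2, "hours"),
]
totals = cpu_hours_per_vo(reports)
```

Rejecting unknown units outright, rather than guessing, keeps a single misconfigured site from silently skewing the per-VO totals.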
Accounting (cont'd)
Site monitoring: High-Level Model
Site monitoring (cont’d)
We can’t/won’t impose a solution on sites, as they might/should have something already.
Specification-based approach allows our probes to fit into any fabric monitoring system.
Data exchange format allows higher-level services to consume the data regardless of the fabric monitoring system.
WLCG Monitoring Working Groups since January 23rd 2007:
– System Management Working Group – SMWG / J. Casey, I. Neilson
– Grid Service Monitoring Working Group – GSMWG / A. Forti, M. Jouvin
– System Analysis Working Group – SAWG / J. Andreeva, P. Saiz
[Rob Quick, Workshop on Grid services Monitoring, HPDC’07 – June 27th 2007]
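A rough sketch of the specification-based idea: a probe produces a system-neutral result record, and thin adapters turn that record into whatever a given fabric monitoring system or higher-level service expects. The record fields, the function names and the reuse of the Nagios plugin status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are assumptions made for illustration; the slides do not specify the actual probe or exchange formats.

```python
import json
from datetime import datetime, timezone
from enum import IntEnum


class Status(IntEnum):
    # Values follow the common Nagios plugin convention; using it here
    # is an assumption, not something the slides prescribe.
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def run_probe(service, host, check, now=None):
    """Run `check()` and wrap its outcome in a system-neutral record."""
    now = now or datetime.now(timezone.utc)
    try:
        status, detail = check()
    except Exception as exc:  # a crashing probe must still report something
        status, detail = Status.UNKNOWN, f"probe error: {exc}"
    return {
        "service": service,
        "host": host,
        "status": Status(status).name,
        "detail": detail,
        "timestamp": now.isoformat(),
    }


def to_exchange_format(result):
    """Serialise the record for higher-level consumers (JSON is assumed)."""
    return json.dumps(result, sort_keys=True)


def to_plugin_output(result):
    """Adapter for a fabric monitor expecting (exit_code, one-line text)."""
    code = int(Status[result["status"]])
    return code, f"{result['status']}: {result['detail']}"


def disk_check():
    # Hypothetical check; a real probe would inspect the service itself.
    return Status.WARNING, "disk 85% full"


result = run_probe("SRM", "se.example.org", disk_check)
exit_code, line = to_plugin_output(result)
```

Because the probe only knows about the neutral record, supporting another fabric monitoring system means writing one more adapter rather than another probe.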
CMS Dashboard 1/2
CMS Dashboard 2/2
CIC Operations Portal
Web portal for integrating all the tools and sources of operations-related information into one single place
Developed and operated by CC-IN2P3, failover instance at CNAF
– Provides and maintains an integrated operations dashboard for the grid operator on duty
– Provides mechanisms for keeping the information needed for an appropriate handover between operators on duty
– Easy access to appropriate contact information on every actor involved in the operations of the grid
– Provides communication tools
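The primary/failover arrangement mentioned above can be sketched as a priority-ordered selection: use the first instance that passes a health check. The function `pick_instance` and the health-check callable are hypothetical; how the portal actually switches between CC-IN2P3 and CNAF is not described in the slides.

```python
def pick_instance(instances, is_healthy):
    """Return the first instance, in priority order, that passes the check."""
    for name in instances:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A health check that itself fails counts as "not healthy".
            continue
    raise RuntimeError("no healthy instance available")


# Primary first, failover second (site names taken from the slide).
INSTANCES = ["CC-IN2P3", "CNAF"]

# Simulated health state, purely for illustration.
health = {"CC-IN2P3": False, "CNAF": True}
active = pick_instance(INSTANCES, lambda name: health[name])
```

In this sketch the caller decides what "healthy" means, so the same selection logic works whether the check is an HTTP request, a database ping or a manual flag.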
Alarms Dashboard
Opening tickets
Tracking incidents via GGUS
Incident tracking model
– Unique channel for opening tickets
  – End-users: e.g. job submission failures, failed data transfers
  – Operators: e.g. job submission failures
– Classification and first assignment done by the ticket process manager
– Tickets are assigned to support units, one per domain of expertise
  – Grid operators, applications, federations, m/w experts, ...
OSG: automatic helpdesk / XML format exchange
– 4 tickets created by CMS users since June 27th
WLCG/EGEE
– Central incident tracking tool
– Same tool used by grid operators and end users, via e-mail and web interface
– Sites failing the tests are assigned a ticket
  – Escalation procedure for solving site-related problems
  – Involves the regional operator and the site operator
  – Interface with ticket handling tools used by sites/federations (if needed)
  – Tools for collecting metrics on the responsiveness of support units
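The incident tracking model can be sketched as two small routines: a first assignment that routes a ticket to a support unit by category, and an escalation step for site tickets. The category names, the support-unit mapping and the order of the escalation chain (site operator first, then regional operator) are assumptions for illustration and do not reflect the actual GGUS configuration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical mapping from problem category to support unit.
SUPPORT_UNITS = {
    "job_submission": "grid-operators",
    "data_transfer": "middleware-experts",
    "network": "ENOC",
}
# Assumed escalation order for site-related problems.
ESCALATION_CHAIN = ["site-operator", "regional-operator"]
DEFAULT_UNIT = "ticket-process-manager"


@dataclass
class Ticket:
    ticket_id: int
    submitter: str
    category: str
    site: Optional[str] = None
    assigned_to: str = DEFAULT_UNIT
    history: List[str] = field(default_factory=list)


def _reassign(ticket, unit):
    # Keep the previous assignee so the hand-over trail stays visible.
    ticket.history.append(ticket.assigned_to)
    ticket.assigned_to = unit


def classify_and_assign(ticket):
    """First assignment; unknown categories stay with the process manager."""
    _reassign(ticket, SUPPORT_UNITS.get(ticket.category, DEFAULT_UNIT))
    return ticket


def open_site_ticket(ticket_id, site, failed_test):
    """A site failing a test gets a ticket at the bottom of the chain."""
    return Ticket(ticket_id, "monitoring", failed_test,
                  site=site, assigned_to=ESCALATION_CHAIN[0])


def escalate(ticket):
    """Move a site ticket one step up the chain; no-op at the top."""
    if ticket.assigned_to in ESCALATION_CHAIN:
        step = ESCALATION_CHAIN.index(ticket.assigned_to)
        if step + 1 < len(ESCALATION_CHAIN):
            _reassign(ticket, ESCALATION_CHAIN[step + 1])
    return ticket


user_ticket = classify_and_assign(Ticket(1, "end-user", "data_transfer"))
site_ticket = escalate(open_site_ticket(2, "site-a", "job_submission"))
```

Recording each previous assignee in `history` is one simple way to obtain the raw data for responsiveness metrics per support unit.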
The ENOC
The EGEE Network Operations Centre (ENOC):
– Single point of contact between EGEE and the NRENs
– Where EGEE and the network can exchange operational information
– Network support unit in GGUS
IGTMD
Grid Interoperability and Massive Data Transfer
3 years, started in Feb 2006
Renater, ENS, CC-IN2P3, FNAL (unfunded)
Goals
1. Disk-to-disk bulk data transfer
2. Replication and referring mechanisms
3. Information System and job management interoperability
4. Grid control and monitoring
5. Usage of statistics and accounting data
IGTMD Roadmap
Network: items 1 and 2
– 2 × 1 Gb/s CC-IN2P3/FNAL on October 16th 2006 – LCG/EGEE
– Tests on massive data transfer – CC-IN2P3/FNAL
Interoperability: item 3
– Access to grid resources through standard APIs – LCG/EGEE
– State of the art, cf. JTR – October 17th
– Roadmap at the IGTMD face-to-face meeting, May 4th
Inter-operations: items 4 to 5
– Relevance of the test suite to US sites – EGEE
– Operations and daily monitoring of services – EGEE
– Usage Records and accounting – OGF
Concerns and updates
Achieve a real 24x7 production-quality service:
– Failover mechanisms
– Increase automation of daily monitoring tools and alarm handling
OGF20 GIN:
– GIN-JOBS: EGEE/TERAGRID/OSG/NORDUGRID/DEISA
  https://forge.ogf.org/sf/wiki/do/viewPage/projects.gin/wiki/WorkerNodeEnvironmentOGF20
  29/08/ /03/2007 mail from Laurence Field on GIN-JOB
– GIN-OPS: Savannah and Ninf-G
– GIN-IS: EGEE-NDGF and EGEE-OSG not updated since 17 August 2006
– GIN-data: idem
– GIN-auth: AUP for the gin.gg.org VO since 12/06.
Credits and References
– Gstat
– GGUS
– GOC-DB
– SAM
– CMS Dashboard
– GridIce
– Lavoisier
– CIC Operations Portal
– EGEE
– WLCG
Slides from:
– Ian Bird, OGF/EGEE User Forum – May 9th 2007
– Rob Quick, Workshop on Grid services Monitoring, HPDC’07 – June 27th 2007
Links
– SAM/GridView: Monitoring Portal, TWiki
– SAM OSG Probe Dev (Service Availability Monitor): Homepage, Test Page, TWiki
– GridICE: Monitoring Portal, Documentation
– Experiment Dashboard: Portal, TWiki
– GridPP Real Time Monitor (2D map and 3D globe visualizations): Homepage
– GStat: Portal, TWiki
– Lemon: Portal (CERN Compute Center), Documentation