Published by Nelson Lewis. Modified over 8 years ago.
1
Daily Operations for EGEE/LCG infrastructure
Hélène Cordier, EGEE/LCG Operations
IN2P3 Computing Centre, Lyon (France) - helene.cordier@in2p3.fr
2
IHEP - Beijing - 14 Dec. 2006
LCG: LHC Computing Grid
Set up the global infrastructure for simulation and processing of data for the LHC (Large Hadron Collider) experiments
– Prepare, deploy and operate the computing environment for experiments to analyze the data from the LHC detectors
Strategy
– Integrate thousands of computers at dozens of participating institutes worldwide into a global computing resource
– Rely on software developed in advanced grid technology projects, EU and US
LCG: a data handling problem
– 40 million collisions per second
– After filtering, ~100 collisions per second
– 1 to 10 MB of digitized data per collision, giving a data rate of 0.1 to 1 GB/sec
– 10^10 recorded collisions per year
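The data-rate figures above follow from simple arithmetic, sketched below. The constants are the ones quoted on the slide; the function names and the decimal unit conversions (1 GB = 1000 MB, 1 PB = 10^9 MB) are assumptions made for illustration only.

```python
# Figures quoted on the slide.
FILTERED_COLLISIONS_PER_SEC = 100      # collisions kept after filtering
MB_PER_COLLISION_RANGE = (1, 10)       # digitized data per collision
RECORDED_COLLISIONS_PER_YEAR = 10**10


def data_rate_gb_per_sec(mb_per_collision):
    """Sustained data rate in GB/s, assuming 1 GB = 1000 MB."""
    return FILTERED_COLLISIONS_PER_SEC * mb_per_collision / 1000


def yearly_volume_pb(mb_per_collision):
    """Recorded volume per year in PB, assuming 1 PB = 10**9 MB."""
    return RECORDED_COLLISIONS_PER_YEAR * mb_per_collision / 10**9


for mb in MB_PER_COLLISION_RANGE:
    print(f"{mb} MB/collision: {data_rate_gb_per_sec(mb)} GB/s, "
          f"{yearly_volume_pb(mb)} PB/year")
```

With these inputs the rate comes out at 0.1 to 1 GB/s, matching the slide. The yearly volume of 10 to 100 PB is derived here from the slide's numbers and is not itself stated on the slide.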
3
LCG (organization)
[Diagram labels: Trigger and Data Acquisition System; Tier-0; Tier-1; Tier-2; 10 Gbps links; General Purpose/Academic/Research Network]
– Any Tier-2 may access data on any Tier-1
– Tier-1s exchange data between them
4
LCG/EGEE status
5
EGEE (users)
Target applications
– Both academic (mainly) and industrial
Pilot applications
– Physics and biomedical
– Selected to guide the implementation and certify the performance and functionality of the evolving infrastructure
Users
– 5000 users (3000 at the end of year 2) from at least 5 disciplines
6
EGEE (organization)
Partners organized in federations
7
Contents
Emphasis on daily operations:
– Monitoring dynamic components
– Follow-up of incidents
– Support
Consolidation and evolution
Future work
8
[Diagram labels: Middleware providers; VDT/OSG; OMII-Europe; JRA1; SA3; …; Integration; Packaging; Testing & Certification; Certification activities; SA3+SA1; Deployment; Pre-production service; Production service; Operations; Support, analysis, debugging]
9
Repository for site information
Keep a central repository of information on the components of the grid
– Site registry (name, location, contact information, administrator contact, security contact, …)
– Site status (candidate, uncertified, production, suspended, …)
– History of scheduled unavailability of the site
– Grid services operated by the site: computing elements, storage elements, file catalogue services, virtual organization management services, resource brokers, etc.
– Services that sites want to be monitored by the grid operators
Updating this information is a shared responsibility between the site operator and the regional operator manager
LCG/EGEE
– Central repository of site information (a.k.a. Grid Operations Centre) developed and operated by Rutherford Appleton Laboratory (RAL), UK
– http://goc.grid-support.ac.uk/gridsite/gocdb
This repository is used by the grid monitoring services (more on this later)
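The kind of record such a repository holds can be sketched as a small data structure. This is a minimal illustration, not the GOC DB schema: the class, field and function names are hypothetical, and only the status values and field categories come from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum


class SiteStatus(Enum):
    # Status values listed on the slide (the slide's list is open-ended).
    CANDIDATE = "candidate"
    UNCERTIFIED = "uncertified"
    PRODUCTION = "production"
    SUSPENDED = "suspended"


@dataclass
class SiteRecord:
    # Hypothetical field names covering the categories named on the slide.
    name: str
    location: str
    admin_contact: str
    security_contact: str
    status: SiteStatus = SiteStatus.CANDIDATE
    services: list = field(default_factory=list)            # e.g. CE, SE
    monitored_services: list = field(default_factory=list)  # subset to monitor
    scheduled_downtimes: list = field(default_factory=list) # (start, end) pairs


def sites_to_monitor(registry):
    """Names of production sites that asked for at least one service to be monitored."""
    return [
        site.name
        for site in registry
        if site.status is SiteStatus.PRODUCTION and site.monitored_services
    ]
```

A monitoring service could call `sites_to_monitor` on the registry to decide where to send probes; that selection rule is an assumption, not something the slide specifies.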
10
Monitoring
Grid operators need to have a global view of the status of the infrastructure
– Grid information is highly dynamic
Tools required to collect information on the grid component state
– Availability of resources and services, based on the static information stored in the central site repository
– Collection of metrics on availability of resources and services
LCG/EGEE
– Service of probes sent to every site to check it on a regular basis
– Service for regularly testing the consistency of the dynamic information published by the site in the grid information system
– Information on the result of those tests is available to grid operators, site managers and end-users
– Virtual Organization managers can use this information to select a set of sites they intend to use
– Monitoring services developed and operated by CERN, Academia Sinica (Taiwan), GridPP (UK) and INFN (Italy)
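One way to turn probe results into an availability metric, and into a site selection for a VO manager, is sketched below. The input format, the function names and the 0.8 threshold are assumptions for illustration; they are not taken from the LCG/EGEE monitoring tools.

```python
from collections import defaultdict


def pass_fraction_per_site(results):
    """results: iterable of (site_name, passed) pairs, one per probe run."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for site, passed in results:
        totals[site] += 1
        if passed:
            passes[site] += 1
    return {site: passes[site] / totals[site] for site in totals}


def usable_sites(results, threshold=0.8):
    """Sites whose pass fraction meets a (hypothetical) threshold, sorted by name."""
    fractions = pass_fraction_per_site(results)
    return sorted(site for site, frac in fractions.items() if frac >= threshold)
```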
11
Tracking incidents
Incident tracking model
– Unique channel for opening tickets
– Classification and assignment done by the ticket process manager
– Tickets are assigned to support units
– One support unit per domain of expertise: grid operators, virtual organization, regional operations centre, m/w experts, …
LCG/EGEE
– Central incident tracking tool developed/operated by Forschungszentrum Karlsruhe (DE): https://gus.fzk.de/
– Same tool used by grid operators and end users, with e-mail and web interface
– Sites failing the tests receive an opened ticket
– Escalation procedure for solving site-related problems; involves the regional operator and the site operator
– Interface with ticket handling tools used by sites/federations (if needed)
– Tools for collecting metrics on the responsiveness of support units
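The classification-and-assignment step of this model can be sketched as a lookup from domain of expertise to support unit. The domain keys, the `Ticket` fields and the `assign` function are hypothetical and do not reflect the actual GGUS data model; only the support-unit names come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

# One support unit per domain of expertise (unit names from the slide,
# domain keys invented for this sketch).
SUPPORT_UNITS = {
    "operations": "Grid operators",
    "vo": "Virtual organization",
    "site": "Regional operations centre",
    "middleware": "Middleware experts",
}


@dataclass
class Ticket:
    ticket_id: int
    summary: str
    domain: str
    assigned_to: Optional[str] = None


def assign(ticket):
    """Ticket process manager step: pick the single unit for the ticket's domain."""
    unit = SUPPORT_UNITS.get(ticket.domain)
    if unit is None:
        raise ValueError(f"no support unit for domain {ticket.domain!r}")
    ticket.assigned_to = unit
    return ticket
```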
12
Operations - GGUS
13
Support Model
The Regional Operations Centers, VOs and other project-wide groups such as the middleware groups (JRA), network groups (NA) and service groups (SA) are connected via a central integration platform provided by GGUS.
[Diagram labels: Central Application (GGUS); Interface; Webportal; TPM; Deployment Support; Middleware Support; Network Support; Operations Support; VO Support; ROC 1; ROC…; ROC 10; Regional Support units; User Support units; Technical Support units]
14
Use case: daily operations
“Operators on duty”
– Procedures
– Simple workflows overview
Tools and daily work
– Dashboard concept and mechanisms
Statistics and results
– 18 months ago, 12 months ago, now
How do operators perform this in their daily work?
– Checking monitoring results: functional tests, information system monitoring, services sensors
– Accessing the ticketing system: create, browse, update, close
– Checking sites and contacts information
Tools are in different places: use of a “dashboard”!
15
Dashboard concept
[Diagram, MANY ENTRY POINTS: Operator; Ticketing system; Sites info; Monitoring tool #1; Monitoring tool #2; Monitoring tool #n; Mail client]
[Diagram, SINGLE ENTRY POINT: Operator; Dashboard; Ticketing system; Sites info; Monitoring tool #1; Monitoring tool #2; Monitoring tool #n; Mail sender]
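The single-entry-point idea can be sketched as a thin facade that collects the view of each underlying tool in one call. This is an illustrative assumption about the design, not the Operations Portal's code; the `Dashboard` class and the source names are hypothetical.

```python
class Dashboard:
    """Single entry point in front of several independent tools."""

    def __init__(self, **sources):
        # Each source is a zero-argument callable returning that tool's current view.
        self.sources = sources

    def overview(self):
        """Query every registered tool once and return the results keyed by tool name."""
        return {name: fetch() for name, fetch in self.sources.items()}


# Usage with stand-in tools; real sources would query the ticketing system,
# the site repository and the monitoring tools.
dashboard = Dashboard(
    tickets=lambda: ["ticket-1"],
    sites=lambda: {"SITE-A": "production"},
    monitoring=lambda: {"SITE-A": "ok"},
)
print(dashboard.overview())
```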
16
Dashboard "look"
17
Tickets workflow
[Diagram labels: OPERATIONS PORTAL dashboard; IN2P3-CC, Lyon, France; GGUS; FZK, Karlsruhe, Germany; WSDL; Operator on duty; Ticket; Problem detection & reporting; Ticket follow-up; Regional Support Units: UK, FR, GER, IT, …]
18
Follow-up and escalation
[Flowchart labels: Operational Procedure Manual; Operator; Browse ticket; When deadline reached; Problem solved? (yes/no); Close ticket; Last escalation? (yes/no); Extend deadline; Escalate; Suspend site; mail]
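One plausible reading of this flowchart is sketched below as a decision function. The wiring of the branches is an assumption, since the transcript keeps only the box labels: a solved problem closes the ticket, an unsolved problem at the last escalation step suspends the site, and any other unsolved problem is escalated with an extended deadline. All names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CLOSE_TICKET = "close ticket"
    ESCALATE = "escalate and extend deadline"
    SUSPEND_SITE = "suspend site"


def on_deadline(problem_solved, escalation_step, last_step):
    """Decide what the operator does when a ticket's deadline is reached.

    escalation_step and last_step are integers; last_step is the final
    escalation level defined by the procedure (value assumed, not from the slide).
    """
    if problem_solved:
        return Action.CLOSE_TICKET
    if escalation_step >= last_step:
        return Action.SUSPEND_SITE
    return Action.ESCALATE
```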
19
Integration
[Diagram labels: Operations Portal cic.in2p3.fr; Tools: IN2P3 DB, GOC DB, SFT, G-stat, GGUS, BROADCAST; Daily operations; User Support & Ticketing system; Monitoring tools; Communication tools; Information on sites; Information on VOs; Actors: Site, User, Regional Center, Operator]
20
Putting all together
Web portal for integrating all the tools and sources of operations-related information into one single place
Developed and operated by CC-IN2P3
– http://cic.in2p3.fr/
– Provides and maintains an integrated operations dashboard for the grid operator on duty
– Provides mechanisms for keeping the information needed for an appropriate handover between operators on duty
– Easy access to appropriate contact information on every actor involved in the operations of the grid
– Provides communication tools
21
Putting all together (cont.)
Central communication interface
One “view” per type of actor:
- Users
- VO managers
- Site managers
- Regional Operation Centers
- Infrastructure & Daily operations
- Centralized VO handling
Example 1: Available resources and services per region
Gives information on:
- Resources (CEs, SEs)
- Services (RBs, RLS, LFC…)
Data obtained from:
- Site information DB (GOC-DB)
- EGEE Information System (BDIIs)
- Monitoring tools (SFT, gstat)
Example 2: Information on Virtual Organizations
- VO general information (name, discipline, …)
- Contacts and administrative data (manager, experts, mailing lists, …)
- Technical information (VO server, software and hardware requirements, preferred core services…)
- Available resources and services
22
Communication
EGEE BROADCAST
Allows sending mail to various mailing lists and contacts with a valid registered e-mail address
Efficient way to quickly contact relevant targets for any kind of announcement:
- Failures
- Maintenance
- Middleware releases
- …
ROC weekly reports
Allow site managers and regional managers to:
- Give details on the failures they encountered during the week
- Submit issues to be raised during the meeting
Allow Operations managers to:
- Have an overview of the general state of Operations
- See a summary of all issues to be addressed on a single page
Additionally, allow producing metrics on resource availability and efficiency of monitoring tools
23
Evolution …
Once upon a time… 18 months ago:
– No shared responsibility for operations (CERN was doing everything)
– No dashboard/integration tools
– Fewer than 100 sites to monitor, 45% of them passing functional tests successfully
Now:
– Shared responsibility (8 federations, 10 in Sept 2006)
– Dashboard/integration tools
– More than 200 sites to monitor, and 80% of them pass
24
More Metrics…
Number of tickets stable, even though the number of sites increased
Number of connections to the operations portal multiplied by 14 in 16 months
[Chart values: 42; 679; 170 / month average]
25
Interoperability
How to cope with operations problems when users simultaneously use cross-grid services?
– Need to understand what and where the problems are
– Who is responsible for handling cross-grid incidents, and how should they be handled?
Grid operators have already defined common procedures for handling operations problems
– Through ticket handling from EGEE to OSG
Interoperability issues are addressed in the framework of EGEE-II; this work started in June 2006. Several grid projects are concerned, including OSG, ARC and NAREGI.
26
Accounting
Tools needed to collect and report information on resource utilization
– Intended audience: site managers, virtual organization managers, grid operators, funding agencies, …
– Need to define common ways of measuring resource consumption, including usage of the same units
LCG/EGEE
– CPU usage information (per user or per VO) provided by each site and stored in a central repository
– Reports (charts and numeric data) available through a web interface
– Next step: collect information on storage utilization
– Developed and operated by Grid Operations Centre (UK)
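Aggregating site-provided CPU usage into a per-VO report in a common unit can be sketched as follows. The record layout, the choice of CPU hours as the shared unit and the VO names in the usage example are assumptions for illustration, not the format used by the LCG/EGEE accounting service.

```python
from collections import defaultdict


def cpu_hours_per_vo(records):
    """records: iterable of (site, vo, cpu_seconds) tuples reported by sites.

    Returns total CPU hours per VO, converting every record to the same unit.
    """
    totals = defaultdict(float)
    for _site, vo, cpu_seconds in records:
        totals[vo] += cpu_seconds / 3600
    return dict(totals)


# Usage with made-up numbers.
report = cpu_hours_per_vo([
    ("SITE-A", "vo-physics", 3600),
    ("SITE-B", "vo-physics", 7200),
    ("SITE-A", "vo-biomed", 1800),
])
print(report)
```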
27
Accounting (cont.)
28
Security & Policy
Joint Security Policy Group
Certification Authorities
– EUGridPMA, IGTF, etc.
Grid Acceptable Use Policy (AUP)
– Common, general and simple AUP
– For all VO members using many Grid infrastructures, e.g. EGEE, OSG, SEE-GRID, DEISA, national Grids…
Incident Handling and Response
– Defines basic communication paths
– Defines requirements (MUSTs) for IR
– Not to replace or interfere with local response plans
[Diagram labels: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security]
29
LCG targets and LHC specifics
LHC schedule
– Data acquisition starts in 2007
Global needs
– 37 PB/year (disk)
– 43 PB/year (mass storage)
– 105 M SpecInt2000, i.e. ~70,000 of today’s fastest CPUs
VO environment: specific tools and software
High availability of resources and services
– Data transfers
– Quasi-real-time computing for feedback to data acquisition
Mass data transfers and SLAs: file transfer rate, storage
30
Network Monitoring
[Diagram labels: GGUS; ENOC; NREN NOC; NREN ticket; enoc.support@cern.ch; Filtering tool; ENOC DB: update to associate NREN ticket id and GGUS ticket id; ENOC ML (human): project-eu-egee-sa2-enoc-ticket@cern.ch; enoc-support@ggus.org: on-purpose address for ticket creation; GGUS: creation of the ticket; Request for ticket creation; Creation notification; Actions]
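The association kept in the ENOC DB between an NREN ticket id and a GGUS ticket id can be sketched as a simple lookup table. The class, its methods and the id formats in the usage example are hypothetical; the slide only states that the two ids are associated.

```python
class TicketLinkTable:
    """In-memory stand-in for a table linking NREN ticket ids to GGUS ticket ids."""

    def __init__(self):
        self._nren_to_ggus = {}

    def link(self, nren_ticket_id, ggus_ticket_id):
        """Record the association once the GGUS creation notification arrives."""
        self._nren_to_ggus[nren_ticket_id] = ggus_ticket_id

    def ggus_for(self, nren_ticket_id):
        """Return the linked GGUS ticket id, or None if no ticket was created yet."""
        return self._nren_to_ggus.get(nren_ticket_id)


# Usage with invented ids.
links = TicketLinkTable()
links.link("NREN-0001", 12345)
print(links.ggus_for("NREN-0001"))
```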
31
Common Operations model
– Monitoring shows a problem
– Operator submits a GGUS ticket against the Tier1/ROC and CCs the site (when known)
– Tier1/ROC and Tier2/RC work to resolve the problem
– If the Tier1/ROC + Tier2/RC cannot resolve the problem, the Tier1/ROC contacts the relevant Support Unit for assistance
[Diagram labels: Operator-on-duty; Tier1/ROC; Tier2/RC (Site); Support Unit (experts); 1st level support; 2nd level support; 3rd level support]
32
SAM - Service Availability Monitoring
33
Freedom of Choice for Resources
34
Current Work
Achieve a real 24x7 production quality service
Improve monitoring of core services for reaching target levels for LHC production
– Will benefit other scientific domains
Increase diversity in applications and scientific domains
Address the grid interoperability issues
Interact with other regions
– Latin America, south-east Europe, Baltic countries, Mediterranean countries… China
35
Credits and References
SFT
– https://lcg-sft.cern.ch:9443/sft/lastreport.cgi
Gstat
– http://goc.grid.sinica.edu.tw/gstat/
GGUS
– http://gus.fzk.de/
GOC-DB
– http://goc.grid-support.ac.uk/
SAM
– http://goc.grid.sinica.edu.tw/gocwiki/Service_Availability_Monitoring_Environment
– https://lcg-sam.cern.ch:8443/sam/sam.cgi
GridIce
– http://grid.infn.it/gridice
Lavoisier
– http://grid.in2p3.fr/lavoisier
Operations Portal: http://cic.in2p3.fr/index.php?id=home
CC-IN2P3: http://cc.in2p3.fr
EGEE: http://www.eu-egee.org
LCG: http://www.cern.ch/lcg
36
Acknowledgements
This presentation uses some slides from:
– LHC Computing Grid, by Fabio Hernandez. CNES Workshop on Grid Utilization, Toulouse (France) - October 2005
– Operating a global grid, by Fabio Hernandez. Grid’5000 School, Grenoble (France) - March 2006
– EGEE/LCG: Evolution of the Operational model over the first year, by H. Cordier, G. Mathieu, F. Schaer, P. Nyczyk, J. Novak, M. Tsai. CHEP’2006, Mumbai (India) - February 2006
– Lessons learnt from grid operations on EGEE/LCG, by G. Mathieu. Primer Taller Latino-Americano de EELA, Mérida, Venezuela - April 2006
– Monitoring Tools and Procedures, by P. Nyczyk and J. Novak. WLCG T2 workshop - June 2006
– EGEE Operations, by M. Barroso. WLCG T2 workshop - June 2006
37
Thanks
Thanks for hosting us