INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org

Operation and management issues in the EGEE/SWE grid infrastructure

G. Barreira, G. Borges, M. David, N. Dias, J. Gomes, J. P. Martins
LIP: Laboratório de Instrumentação em Física Experimental de Partículas

C. Borrego, M. Delfino, G. Merino, K. Neuffer, A. Pacheco
PIC: Port d’Informació Científica

F. Bernabé, J. Fontán, J. Lopez, P. Rey
CESGA: Fundación Centro Tecnológico de Supercomputación de Galicia

R. Marco
IFCA/CSIC: Instituto de Física de Cantabria / Consejo Superior de Investigaciones Científicas

J. Palacios
IFIC/CSIC: Instituto de Física Corpuscular / Consejo Superior de Investigaciones Científicas

Operation and management issues in the EGEE/SWE grid infrastructure, CGW’06

Outline
o The EGEE grid project.
o Main operation activities inside the EGEE South-West grid infrastructure:
  – Resources;
  – Activities coordination:
    • Certification: sites and middleware certification;
    • Accounting: EGEE View; participation in the Accounting Enforcement task;
    • Monitoring: interaction with the Grid Operations Centre (GOC); participation in COD;
    • Support: interaction with the Global Grid User Support (GGUS);
    • Authentication and security: activities in the EUGridPMA framework;
    • Middleware tests and integration.

EGEE project
o The Enabling Grids for E-sciencE project:
  – A European-funded grid project;
  – The biggest worldwide grid for multi-disciplinary sciences:
    • Integrates several national and regional grids;
    • More than 90 partners distributed over 32 countries;
  – Built on top of the infrastructure and software developed in the EDG and LCG grid projects.
o The LHC Computing Grid project:
  – The LHC will be the world's most powerful particle accelerator:
    • Built at CERN and expected to start operating in 2007;
  – LCG aims to build and maintain a data storage and analysis infrastructure for the large LHC physics community:
    • 15 petabytes of experimental data annually;
    • Available during the 15-year lifetime of the LHC machine;
    • Fully accessible to ~5000 scientists from more than 500 institutes.
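
To give a sense of scale, the LCG figures above (15 PB per year over a 15-year lifetime) can be turned into a total volume and an average sustained rate. This is a back-of-the-envelope sketch, assuming decimal units (1 PB = 10^15 bytes) and data spread uniformly over a 365-day year.

```python
# Back-of-the-envelope LCG data volumes.
# Assumptions: decimal units (1 PB = 10**15 bytes), uniform 365-day year.
PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600

def lifetime_volume_pb(pb_per_year: float, years: int) -> float:
    """Total data volume accumulated over the machine lifetime, in PB."""
    return pb_per_year * years

def average_rate_gb_per_s(pb_per_year: float) -> float:
    """Average sustained rate needed to store pb_per_year, in GB/s."""
    return pb_per_year * PB / SECONDS_PER_YEAR / 10**9

if __name__ == "__main__":
    print(f"Total over lifetime: {lifetime_volume_pb(15, 15):.0f} PB")
    print(f"Average rate: {average_rate_gb_per_s(15):.2f} GB/s")
```

Under these assumptions the total is 225 PB and the average rate is roughly 0.48 GB/s; since data taking does not run all year, the rate while the machine is running would be correspondingly higher.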

EGEE project
o EGEE concentrates on three core areas:
  – Improve and maintain the middleware:
    • Provide a reliable service;
  – Attract new users from industry as well as from science:
    • Ensure they receive a high standard of training and support;
  – Combine national, regional and thematic grid efforts:
    • Build a seamless grid infrastructure for scientific research and a sustainable grid for business research and industry.
o EGEE has expanded from the original two scientific fields (high energy physics and life sciences) and now integrates applications from other scientific fields:
  – Astrophysics; biomedical and bioinformatics applications;
  – Computational chemistry; Earth sciences;
  – Finance; fusion; geophysics;
  – (...)
o EGEE supports more than 100 virtual organizations.

EGEE Operations: The GOC
o The Grid Operations Centre is responsible for coordinating the overall operation of the EGEE Grid:
  – Devises and manages mechanisms and procedures which encourage optimal operation of the Grid;
  – Acts as a central point of operational information such as:
    • Site local and central services;
    • Site resource configuration;
    • Contact details.
  – Monitors the operation of the Grid infrastructure as a whole:
    • The GOC works with the federations' local support groups to help them provide the best possible service while their infrastructure is connected to the Grid.

EGEE Operations: The ROCs
o The fulfillment of each federation's key objectives is supervised by its Regional Operations Centre (ROC), which:
  – Operates essential core services:
    • RBs, data management services, information services, VOMS servers;
  – Interfaces between VO requests and site resources;
  – Provides monitoring and operational troubleshooting services;
  – Receives, responds to and coordinates the resolution of grid operation problems from the sites' and users' point of view.
o EGEE ROCs:
  – South-Western Europe
  – France
  – UK/Ireland
  – Northern Europe
  – Germany/Switzerland
  – CERN
  – Italy
  – Central Europe
  – South Eastern Europe
  – Russia
  – Asia/Pacific

South-West federation
o The EGEE South-West federation is part of the European Grid Operation, Support and Management activity (SA1).
o It is responsible for maintaining high-quality grid infrastructure services inside the South-West region:
  – Portugal: LIP;
  – Spain: CESGA, CSIC, PIC, CIEMAT, BIFI;
  – PIC is the “Tier 1” centre of the SWE federation.
o The EGEE SWE ROC is shared among the different institutes:
  – This requires a higher coordination effort:
    • All operations/management questions are reported weekly to the ROC manager during a VRVS meeting;
    • This promotes communication between the different site managers;
    • It also promotes the knowledge exchange necessary for faster problem resolution.

South-West federation resources
o The EGEE South-West federation is presently offering:
  – Core services for the production testbed (13/10/2006):
    • 8 Resource Brokers;
    • 8 top-level BDII machines;
    • 3 LFC central catalogs;
    • 1 FTS service.
  – Local services for the production infrastructure:
    • 18 Computing Elements: 1052 CPUs, counted as normalized CPUs (1 normalized CPU = 1000 SpecInt2000, equivalent to a 2.8 GHz Pentium);
    • 18 Storage Elements: 35.4 TB of online storage (disk); 1.5 PB of nearline storage (tape backend).
  – These resources are currently shared by more than 20 virtual organizations, according to the federation's internal policies.
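
The normalization above expresses computing power in units of 1000 SpecInt2000, roughly one 2.8 GHz Pentium. A minimal sketch of that conversion, assuming each Computing Element reports a CPU count and a single per-CPU SpecInt2000 rating; the CE names and ratings in the example are hypothetical, not actual SWE figures.

```python
from dataclasses import dataclass

# One normalized CPU = 1000 SpecInt2000, as defined on the slide.
NORMALIZATION_SI2K = 1000

@dataclass
class ComputingElement:
    name: str            # hypothetical CE identifier
    cpus: int            # number of physical CPUs behind the CE
    si2k_per_cpu: float  # SpecInt2000 rating of one CPU (assumed uniform per CE)

def normalized_cpus(ce: ComputingElement) -> float:
    """Convert a CE's raw CPU count into normalized CPUs."""
    return ce.cpus * ce.si2k_per_cpu / NORMALIZATION_SI2K

def total_normalized(ces: list[ComputingElement]) -> float:
    """Sum of normalized CPUs over several CEs."""
    return sum(normalized_cpus(ce) for ce in ces)

if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    ces = [
        ComputingElement("ce-a.example.org", 100, 1000.0),
        ComputingElement("ce-b.example.org", 40, 1500.0),
    ]
    print(total_normalized(ces))
```

In the example, 100 CPUs rated at 1000 SpecInt2000 and 40 CPUs rated at 1500 SpecInt2000 add up to 160 normalized CPUs: faster CPUs count for more than one unit each.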

SWE ROC tasks: Site certification
o The SWE ROC is responsible for certifying whether a site fulfills the requirements to join the production grid infrastructure:
  – Performed by LIP in Portugal;
  – Performed by PIC in Spain;
  – The certification process consists of a set of demanding tests covering:
    • The information system;
    • Site configuration;
    • Interactions with the central core services.
  – The ROC negotiates service level agreements (SLAs):
    • These settle the level of service each Resource Centre (RC) should provide to the infrastructure.
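
The certification step can be pictured as a checklist in which a site is certified only when every test group passes. A minimal sketch, assuming each of the three test groups on the slide reduces to pass/fail checks on a description of the site; the individual check names are hypothetical, and the real procedure runs tests against a live site.

```python
from typing import Callable

# Hypothetical site description; real certification inspects a live site.
Site = dict

# Test groups taken from the slide; the checks themselves are illustrative only.
TEST_GROUPS: dict[str, list[Callable[[Site], bool]]] = {
    "information system": [lambda s: s.get("publishes_bdii", False)],
    "site configuration": [lambda s: s.get("config_valid", False)],
    "core service interaction": [
        lambda s: s.get("job_submission_ok", False),
        lambda s: s.get("data_management_ok", False),
    ],
}

def certify(site: Site) -> tuple[bool, list[str]]:
    """Return (certified?, names of the failing test groups)."""
    failed = [group for group, checks in TEST_GROUPS.items()
              if not all(check(site) for check in checks)]
    return (not failed, failed)

if __name__ == "__main__":
    candidate = {"publishes_bdii": True, "config_valid": True,
                 "job_submission_ok": True, "data_management_ok": False}
    print(certify(candidate))
```

Returning the failing groups, rather than a bare boolean, mirrors what a certifier needs to report back to the site.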

SWE ROC tasks: Accounting
o The EGEE South-West federation was one of the first to widely deploy grid accounting tools:
  – CESGA is the entity responsible, inside the South-West federation, for maintaining the accounting portal;
  – The most relevant information is compiled monthly and reported to the ROC and the federation members.
o Due to its expertise, CESGA was proposed as the entity to handle the “Accounting enforcement task”:
  – Monitor the whole EGEE infrastructure;
  – Check whether all Resource Centres are publishing correct accounting information, and open tickets if they are not;
  – Help the Resource Centres deploy the necessary accounting tools.
o CESGA was also proposed to take charge of the “EGEE View”:
  – A portal with accounting information from all EGEE sites.
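
The enforcement check described above compares what each Resource Centre publishes with what is expected, and opens a ticket otherwise. A minimal sketch, assuming "correct" means that a site has published records within a recent window with a non-negative job count; the record layout, the seven-day window, the site names and the ticket texts are all hypothetical, not the actual enforcement criteria.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class AccountingRecord:
    site: str           # Resource Centre name
    last_publish: date  # date of the most recent published record
    jobs: int           # number of jobs reported in that record

MAX_AGE = timedelta(days=7)  # hypothetical freshness window

def problems(rec: AccountingRecord, today: date) -> list[str]:
    """Issues found in one site's published accounting data."""
    issues = []
    if today - rec.last_publish > MAX_AGE:
        issues.append("accounting data not published recently")
    if rec.jobs < 0:
        issues.append("invalid job count")
    return issues

def tickets_to_open(expected_sites: list[str], records: list[AccountingRecord],
                    today: date) -> dict[str, list[str]]:
    """Map each misbehaving site to the issues its ticket should describe."""
    by_site = {r.site: r for r in records}
    tickets: dict[str, list[str]] = {}
    for site in expected_sites:
        rec = by_site.get(site)
        if rec is None:
            tickets[site] = ["no accounting records published"]
        else:
            issues = problems(rec, today)
            if issues:
                tickets[site] = issues
    return tickets
```

Iterating over the expected sites, not just the published records, is what catches Resource Centres that publish nothing at all.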

SWE ROC tasks: Accounting
[Figure: some SWE accounting charts (jobs, hours)]

SWE ROC tasks: Accounting
[Figure: some “EGEE View” charts]

SWE ROC tasks: Monitoring
o CIC-on-Duty (COD) is done by Telefonica I+D, helped by PIC;
o CODs are teams of grid experts which manage the day-to-day operation of the grid:
  – Active monitoring of the infrastructure;
  – Taking appropriate action to protect the grid from the effects of failing components and to recover from operational problems. For example:
    • A Resource Centre is causing problems by generating invalid information;
    • The COD team opens a ticket to the Resource Centre;
    • The COD team contacts the corresponding ROC operations support line;
    • The COD team informs a network operations centre of suspected failures;
    • COD may remove the RC from the grid if the RC is unresponsive, until the problem has been fixed;
  – Many of these support and troubleshooting roles are undertaken in conjunction with the Regional Operations Centres:
    • Tools are intended to be developed to automate much of this work.
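
The COD example above behaves like an escalation ladder. A minimal sketch, assuming the ticket, ROC contact and site removal steps are applied in the order listed on the slide, with one further step for each escalation the Resource Centre leaves unanswered, and that the network operations centre is notified separately whenever a network failure is suspected; this ordering is a simplifying assumption, not the formal COD procedure.

```python
from enum import Enum

class Action(Enum):
    OPEN_TICKET = "open a ticket to the Resource Centre"
    CONTACT_ROC = "contact the ROC operations support line"
    SUSPEND_SITE = "remove the RC from the grid until the problem is fixed"
    NOTIFY_NOC = "inform a network operations centre of suspected failures"

# Escalation order assumed from the order of the slide's example.
LADDER = [Action.OPEN_TICKET, Action.CONTACT_ROC, Action.SUSPEND_SITE]

def next_actions(problem_fixed: bool, unanswered_steps: int,
                 network_suspected: bool = False) -> list[Action]:
    """Pick the next COD actions for a failing Resource Centre.

    unanswered_steps counts the escalation steps already taken
    without a response from the RC.
    """
    if problem_fixed:
        return []  # nothing to do; a removed RC can be re-admitted
    # Stay on the last rung (removal) once the ladder is exhausted.
    step = min(max(unanswered_steps, 0), len(LADDER) - 1)
    actions = [LADDER[step]]
    if network_suspected:
        actions.append(Action.NOTIFY_NOC)
    return actions
```

Keeping the network notification outside the ladder reflects that it answers a different trigger (a suspected network fault) rather than an unresponsive site.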

SWE ROC tasks: Monitoring
o CESGA maintains a GridICE portal for all the SWE RCs:
  – The GridICE server collects information through specific sensors included in the EGEE middleware:
    • Job information, grid services, fabric monitoring data.
  – It is based on plugins for Nagios, which:
    • Collect the data published by the sites;
    • Keep them in a PostgreSQL database;
    • Show them in a web page.
  – GridICE also issues notifications about changes in the status of the sites (hosts, important processes, etc.).
o CESGA is also responsible for the SWE monitoring alert system based on SFT/SAM results and GStat:
  – Site Availability Monitoring (SAM):
    • A collection of comprehensive tests that are run daily on each certified site;
  – GStat monitor:
    • A snapshot of the Grid Information System.
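
The GridICE pattern above (collect the data published by sites, keep it in a database, notify on status changes) can be sketched in a few lines. This sketch swaps PostgreSQL for Python's built-in sqlite3 module so that it runs standalone; the table layout, site names and status values are hypothetical, and it is not GridICE's actual schema or code.

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: latest known status per (site, host).
    conn.execute("""CREATE TABLE IF NOT EXISTS host_status (
                        site TEXT, host TEXT, status TEXT,
                        PRIMARY KEY (site, host))""")

def record_and_diff(conn: sqlite3.Connection, site: str,
                    snapshot: dict[str, str]) -> list[str]:
    """Store a new snapshot for one site and return change notifications."""
    notes = []
    for host, status in sorted(snapshot.items()):
        row = conn.execute(
            "SELECT status FROM host_status WHERE site = ? AND host = ?",
            (site, host)).fetchone()
        if row is not None and row[0] != status:
            notes.append(f"{site}/{host}: {row[0]} -> {status}")
        conn.execute(
            "INSERT OR REPLACE INTO host_status (site, host, status) VALUES (?, ?, ?)",
            (site, host, status))
    conn.commit()
    return notes
```

The first snapshot of a host only seeds the table; notifications start from the second collection onwards, when there is a previous status to compare against.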

SWE ROC tasks: Support
o The regional EGEE South-West federation help desk portal is maintained by CSIC-IFIC:
  – Users/admins from the SWE federation can open tickets;
o The coordination of user support services inside the federation is handled by LIP:
  – LIP is responsible for following all tickets assigned to the SWE federation;
  – Making sure that they are routed to the correct RC and solved in time;
  – The SWE ROC is automatically warned (and acts accordingly) when:
    • Tickets are opened by users or COD staff on federation sites;
    • SAM or any other monitoring tool reports failures.
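
The coordination duties described above can be sketched in three parts: route each ticket to the Resource Centre that operates the affected site, warn the ROC about it, and flag tickets that are still open past a deadline. This assumes a simple site-to-RC mapping and a fixed three-day resolution deadline; the mapping, the deadline, the fallback to the ROC and the ticket fields are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Ticket:
    ticket_id: int
    site: str
    opened: datetime
    solved: bool = False
    assigned_rc: Optional[str] = None

# Hypothetical deadline; real target times would be set by the ROC.
DEADLINE = timedelta(days=3)

def route(ticket: Ticket, site_to_rc: dict[str, str], roc_log: list[str]) -> None:
    """Assign the ticket to the RC operating the site and warn the ROC."""
    # Unknown site: the ROC keeps the ticket (hypothetical fallback).
    ticket.assigned_rc = site_to_rc.get(ticket.site, "ROC")
    roc_log.append(f"ticket {ticket.ticket_id} for {ticket.site} -> {ticket.assigned_rc}")

def overdue(tickets: list[Ticket], now: datetime) -> list[int]:
    """IDs of unsolved tickets that are older than the deadline."""
    return [t.ticket_id for t in tickets
            if not t.solved and now - t.opened > DEADLINE]
```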

SWE ROC tasks: Support
o The SWE help desk portal interacts with the EGEE Global Grid User Support (GGUS);
o GGUS is a trouble ticketing system:
  – Grid users and administrators can open tickets asking for help:
    • Users can start a ticket using independent regional portals; local experts can try to solve the problem or assign it to the central GGUS service;
    • A ticket can also be opened directly in the GGUS service via a web form or (...);
  – First-line support is provided by “Ticket Processing Managers” (TPMs):
    • TPM teams, each composed of 3 grid experts, rotate on a weekly basis;
    • TPMs either provide a solution to a given grid operation problem or assign the issue to a more specialized support unit.
  – Support is assured 5 days a week, 9 hours a day;
  – GGUS is used to open COD trouble tickets when the monitoring jobs fail;
o LIP contributes one “Ticket Processing Manager” team for the general GGUS tasks.
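
Two scheduling rules on this slide, the weekly TPM rotation and support 5 days a week for 9 hours a day, can be stated precisely. A minimal sketch, assuming support runs Monday to Friday from 08:00 to 17:00 (the slide gives only the 9-hour span, not the start time or time zone) and that teams rotate round-robin by ISO week number; the team names are hypothetical.

```python
from datetime import datetime

SUPPORT_START_HOUR = 8  # assumed start of the 9-hour support window
SUPPORT_HOURS = 9

def in_support_hours(t: datetime) -> bool:
    """True if t falls Monday to Friday inside the assumed 9-hour window."""
    is_weekday = t.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and SUPPORT_START_HOUR <= t.hour < SUPPORT_START_HOUR + SUPPORT_HOURS

def tpm_team_on_duty(t: datetime, teams: list[str]) -> str:
    """Round-robin weekly rotation keyed on the ISO week number."""
    week = t.isocalendar()[1]
    return teams[week % len(teams)]

if __name__ == "__main__":
    teams = ["TPM-team-1", "TPM-team-2", "TPM-team-3"]  # hypothetical names
    now = datetime(2024, 1, 1, 10, 0)
    print(in_support_hours(now), tpm_team_on_duty(now, teams))
```

Keying the rotation on the ISO week number keeps the rule stateless; a real rota would also need to handle holidays and the 52/53-week year boundary.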

SWE ROC tasks: Support
[Figure: regional SWE help-desk]

SWE ROC tasks: Authentication and Security
o Valid EGEE certificates for the SWE region are issued by:
  – LIP, through the LIP Certification Authority (LIPCA), in Portugal;
  – CSIC-IFCA and PK-IRISGRID in Spain.
o These CAs are members of the European Policy Management Authority for Grid Authentication in e-Science (EUGridPMA):
  – The EUGridPMA coordinates the Public Key Infrastructure (PKI) used to issue X.509 certificates;
o The SWE CAs participate in the EUGridPMA body and in the revision of the CP/CPS (Certificate Policy/Certification Practice Statement).
o LIP (in Portugal) and RED.ES (in Spain) are responsible for security coordination and for handling security incidents.
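
In a PKI such as the one coordinated by the EUGridPMA, a relying service accepts a user certificate only if it was issued by an accredited CA and is within its validity period. A minimal sketch of that decision on already-parsed certificate fields, assuming a hypothetical set of trusted issuer DNs; real services also verify the signature chain and check revocation lists, which this sketch omits.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class CertInfo:
    subject: str
    issuer: str
    not_before: datetime
    not_after: datetime

# Hypothetical issuer DNs standing in for accredited CAs.
TRUSTED_ISSUERS = {
    "/C=XX/O=ExampleGrid/CN=Example CA 1",
    "/C=YY/O=ExampleGrid/CN=Example CA 2",
}

def accept(cert: CertInfo, now: datetime,
           trusted: set[str] = TRUSTED_ISSUERS) -> bool:
    """Accept only certificates from a trusted issuer that are currently valid.

    Signature and revocation checks are deliberately left out of this sketch.
    """
    return cert.issuer in trusted and cert.not_before <= now <= cert.not_after
```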

SWE ROC tasks: Middleware integration
o gLite is the middleware layer developed by EGEE:
  – Extends the use of the grid infrastructure to all fields of science;
  – Follows a Service Oriented Architecture (SOA):
    • Decreases the dependence between the middleware, the users' applications and their interactions with the different services.
o gLite middleware doesn't support all LRMS systems:
  – Only the LSF and Torque/Maui batch schedulers are supported by default;
  – LIP and CESGA, together with IC, are involved in an EGEE task force to provide gLite support for the SGE batch system:
    • New jobmanager implementation;
    • New information provider scripts;
    • Upgrade of the YAIM installation procedure.
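
An information provider's job is to translate the batch system's state into schema attributes that the information system publishes. A minimal sketch of that translation for SGE-like queue counts, assuming the counts have already been obtained from the batch system (parsing SGE's own output is not shown) and using GLUE 1.x CE state attribute names and DN layout as we understand them, which should be checked against the GLUE schema; the CE identifier is hypothetical, and this is not the task force's actual script.

```python
from dataclasses import dataclass

@dataclass
class QueueState:
    running: int      # jobs currently running in the queue
    waiting: int      # jobs pending in the queue
    total_slots: int  # job slots configured for the queue

def glue_ldif(ce_unique_id: str, q: QueueState) -> str:
    """Render the dynamic CE state as an LDIF fragment (GLUE 1.x-style names)."""
    free = max(q.total_slots - q.running, 0)  # never publish negative free CPUs
    lines = [
        f"dn: GlueCEUniqueID={ce_unique_id},mds-vo-name=resource,o=grid",
        f"GlueCEStateRunningJobs: {q.running}",
        f"GlueCEStateWaitingJobs: {q.waiting}",
        f"GlueCEStateTotalJobs: {q.running + q.waiting}",
        f"GlueCEStateFreeCPUs: {free}",
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical CE identifier and queue counts.
    print(glue_ldif("ce.example.org:2119/jobmanager-sge-short", QueueState(3, 2, 8)))
```

Only the dynamic state attributes are shown; static CE attributes would normally come from a separately generated LDIF file.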

SWE pre-production testbed
o In parallel with the EGEE production testbed, some SWE sites also participate in a pre-production testbed:
  – CESGA, CSIC-IFIC, LIP and PIC;
o Objectives of the pre-production testbed:
  – Test new middleware releases:
    • First contact with new services;
    • Test all service interactions/interconnections;
    • Report bugs to the developers;
    • Test bug fixes;
  – Release to the production testbed the middleware packages/patches which were correctly validated;
o The SWE ROC participates in the validation process of middleware components and helps their deployment in the RCs.
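
The release rule above (only correctly validated packages/patches move on to production) can be expressed as a filter over validation results. A minimal sketch, assuming every pre-production site reports pass/fail per patch and that a patch is promoted only when all of the listed sites passed it; the patch identifiers are hypothetical, and the real process also includes bug reports and retesting of fixes.

```python
PPS_SITES = ("CESGA", "CSIC-IFIC", "LIP", "PIC")  # pre-production sites from the slide

def promotable(results: dict[str, dict[str, bool]],
               sites: tuple[str, ...] = PPS_SITES) -> list[str]:
    """Patches that every pre-production site has validated successfully.

    results maps patch id -> {site: passed?}; a missing site counts as not validated.
    """
    return sorted(patch for patch, by_site in results.items()
                  if all(by_site.get(site, False) for site in sites))
```

Treating a missing report as a failure keeps the rule conservative: a patch is never promoted just because a site has not tested it yet.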

Summary & Conclusions
o We have presented the main EGEE SWE federation activities:
  – Its resources for the production testbed;
  – Its operation and regional management procedures;
  – Its responsibilities in some general EGEE tasks:
    • Certification;
    • Accounting;
    • Support;
    • Monitoring;
    • Authentication;
    • Middleware tests and integration;
  – Further details regarding EGEE SWE federation activities can be obtained by consulting the SWE portal maintained by CSIC-IFCA.
o This presentation aims at a better understanding of the EGEE project and its fundamental organization, and at showing how the different resources work together to deliver high-quality services to the users.