INFSO-RI Enabling Grids for E-sciencE
Operation and management issues in the EGEE/SWE grid infrastructure
G. Barreira, G. Borges, M. David, N. Dias, J. Gomes, J. P. Martins
LIP: Laboratório de Instrumentação em Física Experimental de Partículas
C. Borrego, M. Delfino, G. Merino, K. Neuffer, A. Pacheco
PIC: Port d’Informació Científica
F. Bernabé, J. Fontán, J. Lopez, P. Rey
CESGA: Fundación Centro Tecnológico de Supercomputación de Galicia
R. Marco
IFCA/CSIC: Instituto de Física de Cantabria / Consejo Superior de Investigaciones Científicas
J. Palacios
IFIC/CSIC: Instituto de Física Corpuscular / Consejo Superior de Investigaciones Científicas
Operation and management issues in the EGEE/SWE grid infrastructure CGW’06 Enabling Grids for E-sciencE INFSO-RI
Outline
o The EGEE grid project.
o Main operation activities inside the EGEE South-West grid infrastructure:
  – Resources;
  – Activities coordination:
      Certification: sites and middleware certification;
      Accounting: EGEE View; participation in the Accounting Enforcement task;
      Monitoring: interaction with the Grid Operation Centre (GOC); participation in COD;
      Support: interaction with the Global Grid User Support (GGUS);
      Authentication and Security: activities in the EUGridPMA framework;
      Middleware tests and integration.
EGEE project
o The Enabling Grids for E-sciencE project:
  – A European-funded grid project;
  – The biggest worldwide grid for multi-disciplinary sciences;
      Integrates several national and regional grids;
      More than 90 partners distributed over 32 countries;
  – Developed on top of the infrastructures and software built in the EDG and LCG grid projects.
o The LHC Computing Grid project:
  – The LHC will be the world's most powerful particle accelerator;
      Built at CERN and expected to start operating in 2007;
  – LCG aims to build and maintain a data storage and analysis infrastructure for the large LHC physics community:
      15 Petabytes of experimental data annually;
      Available during the 15-year lifetime of the LHC machine;
      Fully accessible to ~5000 scientists from more than 500 institutes.
EGEE project
o EGEE concentrates on three core areas:
  – Improve and maintain the middleware;
      Provide a reliable service;
  – Attract new users from industry as well as from science;
      Ensure they receive a high standard of training and support;
  – Combine national, regional and thematic Grid efforts;
      For a seamless Grid infrastructure for scientific research and to build a sustainable Grid for business research and industry.
o EGEE has expanded from the original two scientific fields (high energy physics and life sciences) and now integrates applications from other fields:
  – Astrophysics; biomedical and bioinformatics applications;
  – Computational chemistry; Earth sciences;
  – Finance; fusion; geophysics;
  – (...)
o EGEE supports more than 100 virtual organizations.
EGEE Operations: The GOC
o The Grid Operations Centre (GOC) is responsible for coordinating the overall operation of the EGEE Grid:
  – Devises and manages mechanisms and procedures which encourage optimal operation of the Grid;
  – Acts as a central point of operational information such as:
      Site local and central services;
      Site resources configuration;
      Contact details.
  – Monitors the operation of the Grid infrastructure as a whole;
      The GOC works with the federations' local support groups to assist them in providing the best possible service while their infrastructure is connected to the Grid.
EGEE Operations: The ROCs
o The fulfillment of each federation's key objectives is supervised by its Regional Operation Centre (ROC), which:
  – Operates essential core services;
      RBs, data management services, information services, VOMS servers;
  – Acts as the interface between VO requests and site resources;
  – Provides monitoring and operational troubleshooting services;
  – Receives, responds to and coordinates the resolution of grid operation problems, from both the sites' and the users' points of view.
o Regional federations:
  – South-Western Europe
  – France
  – UK/Ireland
  – Northern Europe
  – Germany/Switzerland
  – CERN
  – Italy
  – Central Europe
  – South Eastern Europe
  – Russia
  – Asia/Pacific
South-West federation
o The EGEE South-West federation is part of the European Grid Operation, Support and Management activity (SA1).
o It is responsible for maintaining high quality grid infrastructure services inside the South-West region:
  – Portugal: LIP;
  – Spain: CESGA, CSIC, PIC, CIEMAT, BIFI;
  – PIC is the “Tier 1” centre of the SWE federation.
o The EGEE SWE ROC is shared among the different institutes:
  – This requires a higher coordination effort;
      All operations/management questions are reported weekly to the ROC manager during a VRVS meeting;
      This promotes communication between the different site managers;
      It also promotes the knowledge exchange necessary for a faster resolution of problems.
South-West federation resources
o The EGEE South-West federation is presently offering…
  – Core services for the production testbed (13/10/2006):
      8 Resource Brokers;
      8 top BDII machines;
      3 LFC central catalogs;
      1 FTS service.
  – Local services for the production infrastructure:
      18 Computing Elements;
      1052 CPUs = Normalized CPUs (Norm = 1000 SpecInt2000 = Pentium 2.8 GHz);
      18 Storage Elements;
      35.4 TB of online storage (disk);
      1.5 PB of nearline storage (tape backend).
  – These resources are currently shared, according to the federation's internal policies, by more than 20 virtual organizations.
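The normalization rule quoted on this slide (one normalized CPU = 1000 SpecInt2000) can be illustrated with a minimal sketch. The per-CPU ratings below are hypothetical and are not the federation's real figures; the function name `normalized_cpus` is likewise only illustrative.

```python
# One normalized CPU corresponds to 1000 SpecInt2000 (value taken from the slide).
NORM_SI2K = 1000.0


def normalized_cpus(cpu_ratings_si2k):
    """Express a list of per-CPU SpecInt2000 ratings as a normalized CPU count."""
    return sum(cpu_ratings_si2k) / NORM_SI2K


# Hypothetical site: 10 CPUs rated 1000 SI2K and 4 CPUs rated 1500 SI2K.
example_ratings = [1000] * 10 + [1500] * 4
print(normalized_cpus(example_ratings))  # 16.0
```

With this convention, a site with faster processors contributes more normalized CPUs than its physical CPU count suggests.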
SWE ROC tasks: Site certification
o The SWE ROC is responsible for certifying that a site fulfills the requirements to join the grid production infrastructure:
  – Performed by LIP in Portugal;
  – Performed by PIC in Spain;
  – The certification process consists of a set of demanding tests:
      Information system;
      Site configuration;
      Interactions with the central core services.
  – The ROC negotiates service level agreements (SLAs):
      These settle the level of service each Resource Centre (RC) should provide to the infrastructure.
SWE ROC tasks: Accounting
o The EGEE South-West federation was one of the first to widely deploy grid accounting tools;
  – CESGA is the entity responsible, inside the South-West federation, for maintaining the accounting portal;
  – The most relevant information is compiled monthly and reported to the ROC and federation members.
o Due to its expertise, CESGA was proposed as the entity responsible for the “Accounting enforcement task”…
  – Monitor the whole EGEE infrastructure;
  – Check whether all the Resource Centres are publishing correct accounting information, and open tickets if they are not;
  – Help the Resource Centres to deploy the necessary accounting tools;
o … and to take charge of the “EGEE View”:
  – A portal with accounting information from all EGEE sites.
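The enforcement check described above (find Resource Centres that are not publishing accounting data and open tickets for them) could look roughly like the sketch below. It is not CESGA's implementation: the site names, the 7-day threshold and the data layout are all assumptions made for illustration.

```python
from datetime import date, timedelta


def sites_needing_ticket(last_published, today, max_age_days=7):
    """Return sites whose last accounting record is missing or too old.

    last_published maps a site name to the date of its most recent
    accounting record, or None if it never published (hypothetical layout).
    max_age_days is an assumed threshold, not an EGEE policy value.
    """
    limit = today - timedelta(days=max_age_days)
    return sorted(
        site
        for site, published in last_published.items()
        if published is None or published < limit
    )


# Hypothetical snapshot of three Resource Centres.
snapshot = {
    "SITE-A": date(2006, 10, 12),
    "SITE-B": date(2006, 9, 1),
    "SITE-C": None,
}
print(sites_needing_ticket(snapshot, today=date(2006, 10, 13)))
```

Each site in the returned list would then get a ticket through the support system.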
SWE ROC tasks: Accounting
[Figures: some SWE accounting charts (jobs; hours)]
SWE ROC tasks: Accounting
[Figures: some “EGEE View” charts]
SWE ROC tasks: Monitoring
o CIC on Duty (COD) is done by Telefonica I+D, helped by PIC;
o CODs are grid expert teams which manage the day-to-day operation of the grid:
  – Active monitoring of the infrastructure;
  – Take appropriate action to protect the grid from the effects of failing components and to recover from operational problems.
      Example: a Resource Centre is causing problems by generating invalid information;
      The COD team opens a ticket to the Resource Centre;
      The COD team contacts the corresponding ROC operations support line;
      The COD team informs a network operations centre of suspected failures;
      COD may remove the RC from the grid if the RC is unresponsive, until the problem has been fixed;
  – Many of these support and troubleshooting roles are undertaken in conjunction with the Regional Operation Centres;
      It is intended that tools will be developed to automate much of this work.
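The COD example above reads as an escalation chain, which can be sketched as follows. The strict ordering of the steps and the `fixed_after_step` parameter are assumptions for illustration; the slide lists the actions but does not define a formal procedure.

```python
# Escalation steps paraphrased from the slide; their strict order is an assumption.
ESCALATION_STEPS = [
    "open ticket to the Resource Centre",
    "contact the ROC operations support line",
    "remove the Resource Centre from the grid",
]


def actions_taken(fixed_after_step=None):
    """Return the escalation steps performed for one failing Resource Centre.

    fixed_after_step is the index of the step after which the problem was
    fixed, or None if the Resource Centre never responded (hypothetical input).
    """
    taken = []
    for index, step in enumerate(ESCALATION_STEPS):
        taken.append(step)
        if fixed_after_step is not None and index >= fixed_after_step:
            break
    return taken


print(actions_taken(fixed_after_step=0))  # problem fixed after the first ticket
print(actions_taken())                    # unresponsive site: full escalation
```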
SWE ROC tasks: Monitoring
o CESGA maintains a GridICE portal for all the SWE RCs.
  – The GridICE server collects information through specific sensors included in the EGEE middleware: job information, grid service and fabric monitoring data.
  – It is based on plugins for Nagios that:
      Collect the data published by the sites;
      Keep them in a PostgreSQL database;
      Show them in a web page.
  – GridICE also includes notifications about changes in the status of the sites (hosts, important processes, etc.).
o CESGA is also responsible for the SWE monitoring alert system based on SFT/SAM results and GStat:
  – Site Availability Monitoring (SAM): a collection of comprehensive tests that are run daily on each certified site;
  – GStat Monitor: a snapshot of the Grid Information System.
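An alert system driven by per-site test results, as described above, can be sketched in a few lines. This is not the CESGA tool: the test names, site names and the "ok" status string are hypothetical, and a missing result is treated as a failure by assumption.

```python
def sites_to_alert(results, critical_tests):
    """Map each site to the sorted list of critical tests it did not pass.

    results maps site name -> {test name: status}; any status other than
    "ok", including a missing result, counts as a failure (assumption).
    """
    alerts = {}
    for site, tests in results.items():
        failed = sorted(t for t in critical_tests if tests.get(t) != "ok")
        if failed:
            alerts[site] = failed
    return alerts


# Hypothetical daily results for two sites and two critical tests.
daily_results = {
    "SITE-A": {"job-submit": "ok", "replica-mgmt": "ok"},
    "SITE-B": {"job-submit": "error"},
}
print(sites_to_alert(daily_results, ["job-submit", "replica-mgmt"]))
```

Only sites with at least one failing critical test appear in the output, so an empty result means no alert is needed.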
SWE ROC tasks: Support
o The regional EGEE South-West federation help desk portal is maintained by CSIC-IFIC:
  – Users and administrators from the SWE federation can open tickets;
o The coordination of the user support services inside the federation is handled by LIP:
  – It is LIP's responsibility to follow all tickets assigned to the SWE federation;
  – And to make sure that they are routed to the correct RC and solved in time;
  – The SWE ROC is automatically warned (and acts accordingly) when:
      Tickets are opened by users or COD staff on federation sites;
      SAM or any other monitoring tool reports failures…
SWE ROC tasks: Support
o The SWE help desk portal interacts with the EGEE Global Grid User Support (GGUS);
o GGUS is a trouble ticketing system:
  – Grid users and administrators can open tickets asking for help;
      Users can start a ticket using independent regional portals.
      Local experts can try to solve the problem or assign it to the central GGUS service;
      A ticket can also be opened directly in the GGUS services via a web form or ;
  – The first line of support is provided by “Ticket Processing Managers” (TPMs):
      TPM teams are composed of 3 Grid experts, who change on a weekly basis;
      TPMs are able to provide a solution to a given grid operation problem or assign the issue to a more specialized support unit.
  – Support is assured 5 days a week, 9 hours a day;
  – GGUS is used to start COD trouble tickets when the monitoring jobs fail;
o LIP contributes one “Ticket Processing Manager” team for the general GGUS tasks.
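The weekly change of TPM teams mentioned above can be modelled as a simple round-robin over ISO calendar weeks. The team names, the number of teams and the round-robin rule itself are assumptions for illustration; the slide only states that teams change weekly.

```python
from datetime import date


def tpm_team_on_duty(day, teams):
    """Pick the team on duty for the ISO week containing `day`.

    Round-robin over the ISO week number is an assumed scheme; it restarts
    at year boundaries, which a real rota would have to handle.
    """
    iso_week = day.isocalendar()[1]
    return teams[iso_week % len(teams)]


# Hypothetical rota with three teams.
rota = ["team-1", "team-2", "team-3"]
print(tpm_team_on_duty(date(2006, 10, 16), rota))
print(tpm_team_on_duty(date(2006, 10, 23), rota))
```

Two dates in the same ISO week always map to the same team, and consecutive weeks map to consecutive teams in the list.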
SWE ROC tasks: Support
[Figure: regional SWE help desk]
SWE ROC tasks: Authentication and Security
o The issuing of valid certificates for EGEE in the SWE region is handled by:
  – LIP, through the LIP Certification Authority (LIPCA), in Portugal;
  – CSIC-IFCA and PK-IRISGRID in Spain.
o These CAs are members of the European Policy Management Authority for Grid Authentication in e-Science (EUGridPMA).
  – EUGridPMA coordinates a Public Key Infrastructure (PKI) used for issuing X.509 certificates;
o SWE CAs participate in the EUGridPMA body and in the revision of the CP/CPS (Certificate Policy/Certification Practice Statement).
o LIP (in Portugal) and RED.ES (in Spain) are responsible for security coordination and for handling security incidents.
SWE ROC tasks: Middleware integration
o gLite is the middleware layer developed by EGEE.
  – Extends the use of the grid infrastructure to all fields of science;
  – Follows a Service Oriented Architecture (SOA):
      Decreases the middleware's dependence on the users' applications and on their interactions with the different services.
o The gLite middleware doesn't support all Local Resource Management Systems (LRMS):
  – Only the LSF and Torque/Maui batch schedulers are supported by default;
  – LIP and CESGA, together with IC, are involved in an EGEE task force to provide gLite support for the SGE batch system:
      New jobmanager implementation;
      New infoprovider scripts;
      Upgrade of the YAIM installation procedure.
SWE pre-production testbed
o In parallel with the EGEE production testbed, some SWE sites also participate in a pre-production testbed:
  – CESGA, CSIC-IFIC, LIP and PIC;
o Objectives of the pre-production testbed:
  – Test new middleware releases;
      First contact with new services;
      Test all service interactions/interconnections;
      Report bugs to the developers;
      Test bug fixes;
  – Release to the production testbed the middleware packages/patches that were correctly validated;
o The SWE ROC participates in the validation process of middleware components and helps with their deployment in the RCs.
Summary & Conclusions
o We have presented the main EGEE SWE federation activities:
  – Its resources for the production testbed;
  – Its operation and regional management procedures;
  – Its responsibilities in some general EGEE tasks:
      Certification; Accounting; Support; Monitoring; Authentication; Middleware tests and integration;
  – Further details regarding the EGEE SWE federation activities can be obtained by consulting the SWE portal maintained by CSIC-IFCA.
o This presentation aims to give a better understanding of the EGEE project and its fundamental organization, and to show how the different resources work together to deliver high quality services to the users.