EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Steven Newhouse (substituting for Maite.

Slides:



Advertisements
Similar presentations
LCG WLCG Operations John Gordon, CCLRC GridPP18 Glasgow 21 March 2007.
Advertisements

EGI: A European Distributed Computing Infrastructure Steven Newhouse Interim EGI.eu Director.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks From ROCs to NGIs The pole1 and pole 2 people.
EGEE-III INFSO-RI Enabling Grids for E-sciencE COD June 2009 COD-20 Hélène Cordier COD-20, CNRS-IN2P3, CSC.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Romanian SA1 report Alexandru Stanciu ICI.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks PoW for the second year Transition to EGI.
EGI: SA1 Operations John Gordon EGEE09 Barcelona September 2009.
INFSO-RI Enabling Grids for E-sciencE SA1: Cookbook (DSA1.7) Ian Bird CERN 18 January 2006.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Steven Newhouse EGEE’s plans for transition.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks ?? Athens, May 5-6th 2009 Community Support.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks ROD model assessment ROC UKI John Walsh.
GridPP Deployment & Operations GridPP has built a Computing Grid of more than 5,000 CPUs, with equipment based at many of the particle physics centres.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Angela Poschlad (PPS-FZK), Antonio Retico.
Enabling Grids for E-sciencE System Analysis Working Group and Experiment Dashboard Julia Andreeva CERN Grid Operations Workshop – June, Stockholm.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Steven Newhouse Technical Director CERN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations Automation Team James Casey EGEE’08.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Multi-level monitoring - an overview James.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Service Availability Monitoring – Status.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks EGEE-EGI Grid Operations Transition Maite.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Bob Jones EGEE project director CERN.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Report from GGUS BoF Session at the WLCG.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks SA1: Grid Operations Maite Barroso (CERN)
INFSO-RI Enabling Grids for E-sciencE EGEE SA1 in EGEE-II – Overview Ian Bird IT Department CERN, Switzerland EGEE.
EGEE-III-INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks EGEE-III All Activity Meeting Brussels,
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The EGEE User Support Infrastructure Torsten.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Gergely Sipos Activity Deputy Manager MTA.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Steven Newhouse Technical Director CERN.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks EGI Operations Tiziana Ferrari EGEE User.
EGI-InSPIRE Steven Newhouse Interim EGI.eu Director EGI-InSPIRE Project Director Technical Director EGEE-III 1GDB - December 2009.
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
INFSO-RI Enabling Grids for E-sciencE An overview of EGEE operations & support procedures Jules Wolfrat SARA.
Ian Bird LCG Project Leader On the transition to EGI – Requirements from WLCG WLCG Workshop 24 th April 2008.
Security Policy: From EGEE to EGI David Kelsey (STFC-RAL) 21 Sep 2009 EGEE’09, Barcelona david.kelsey at stfc.ac.uk.
WLCG Laura Perini1 EGI Operation Scenarios Introduction to panel discussion.
Julia Andreeva on behalf of the MND section MND review.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks APEL CPU Accounting in the EGEE/WLCG infrastructure.
PIC port d’informació científica EGEE – EGI Transition for WLCG in Spain M. Delfino, G. Merino, PIC Spanish Tier-1 WLCG CB 13-Nov-2009.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Monitoring of the LHC Computing Activities Key Results from the Services.
EGEE-II INFSO-RI Enabling Grids for E-sciencE Operations procedures: summary for round table Maite Barroso OCC, CERN
AEGIS Academic and Educational Grid Initiative of Serbia Antun Balaz (NGI_AEGIS Technical Manager) Dusan Vudragovic (NGI_AEGIS Deputy.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Deliverable DSA1.4 Jules Wolfrat ARM-9 –
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks User Support for Distributed Computing Infrastructures.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks COD-17
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations Automation Team Kickoff Meeting.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Ian Bird All Activity Meeting, Sofia
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks MSA3.4.1 “The process document” Oliver Keeble.
Operations model Maite Barroso, CERN On behalf of EGEE operations WLCG Service Workshop 11/02/2006.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks SA1 & SA2-ENOC Interactions status and plans.
INFSO-RI Enabling Grids for E-sciencE Operations Parallel Session Summary Markus Schulz CERN IT/GD Joint OSG and EGEE Operations.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Steven Newhouse Technical Director CERN.
CERN - IT Department CH-1211 Genève 23 Switzerland t IT-GD-OPS attendance to EGEE’09 IT/GD Group Meeting, 09 October 2009.
WLCG Status Report Ian Bird Austrian Tier 2 Workshop 22 nd June, 2010.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks EGEE Operations: Evolution of the Role of.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Operations Portal Development Update on Requirements Cyril L'Orphelin IN2P3/CNRS.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks What all NGIs need to do: Helpdesk / User.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Best Practices and Use cases David Bouvet,
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks EGEE Operational Procedures (Contacts, procedures,
NGI_TR Emrah Akkoyun TR-Grid Operational Center EGI-InSPIRE – SA1 Kickoff Meeting1.
EGI Process Assessment and Improvement Plan – EGI core services – Tiziana Ferrari FedSM project 1EGI Process Assessment and Improvement Plan (Core Services)
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks ROC model assessment AP ROC ShuTing Liao.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks The Dashboard for Operations Cyril L’Orphelin.
INFSO-RI Enabling Grids for E-sciencE EGEE general project update Fotis Karayannis EGEE South East Europe Project Management Board.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks CYFRONET site report Marcin Radecki CYFRONET.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks COD-16 (Transition to EGEE-III) Report to.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Operations automation team presentazione.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks IT ROC: Vision for EGEE III Tiziana Ferrari.
SA1 Status Report EGEE Grid Operations & Management
Ian Bird GDB Meeting CERN 9 September 2003
Maite Barroso, SA1 activity leader CERN 27th January 2009
Nordic ROC Organization
Presentation transcript:

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Steven Newhouse (substituting for Maite Barroso Lopez) EGEE-III Final Review, June, 2010 Grid Operations SA1 Status Report

Enabling Grids for E-sciencE EGEE-III INFSO-RI SA1 Activity Overview SA1 – Steven Newhouse- EGEE-III Final Review June countries, 175 FTE

Enabling Grids for E-sciencE EGEE-III INFSO-RI Grid Operations Operate the EGEE production infrastructure, providing a high quality service to all the application groups Enabled through support structures including: –Regional Operation Centres –Global Grid User Support (GGUS) –Release and deployment management –Grid security coordination –Operational procedures and tools Structural changes needed in order to transition to a sustainable model SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Size of the infrastructure SA1 – Steven Newhouse- EGEE-III Final Review June Number of EGEE-III certified sites (April 2010): 317 (in 52 countries) Computing resources, doubled since last year: 243,020 cores (139,000 last year) 500 MSI2k (212 MSI2k last year) ( MSI2k = Normalised CPU time to a reference value of Mega SPECint 2000 ) Storage resources: 40 PB disk, 61 PB tape (25 PB disk, 38PB tape last year)

Enabling Grids for E-sciencE EGEE-III INFSO-RI Usage of the infrastructure (I) SA1 – Steven Newhouse- EGEE-III Final Review June Usage from non-LHC communities stable LHC usage growing, real activity: Real data processing, simulations, analysis CPU utilization over time: Consistent doubling every months

Enabling Grids for E-sciencE EGEE-III INFSO-RI Usage of the infrastructure (II) SA1 – Steven Newhouse- EGEE-III Final Review June Number of jobs Average of 15 million jobs per month (14M last year) jobs/day ( jobs/day last year) Non-LHC VOs with stable workload LHC running increasingly high workloads anticipate millions / day soon

Enabling Grids for E-sciencE EGEE-III INFSO-RI Site availability and reliability (I) No changes in algorithm regarding last year –Service available if any of the service instances are available (logical OR) –Site available if all services are available (logical AND of all service types) EGEE League Tables (monthly reports) published and followed up with ROCs –Also available for top 10 VOs Underperforming sites requested to report justifications, collected in a wiki page – SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI EGEE League Tables SA1 – Steven Newhouse- EGEE-III Final Review June Most regions have a good performance well above targets (70% availability and 75% reliability) April had highest average availability in the last year: 91% New ROCs started with lower A/R levels than experienced ones, took a few months to achieve targets April 2010

Enabling Grids for E-sciencE EGEE-III INFSO-RI Site availability and reliability (II) Sites with availability <50% for three consecutive months are removed from the Production infrastructure –7 sites removed, from AsiaPacific and Russia Simplification of downtime declaration rules –To enforce scheduled downtimes –Unscheduled downtimes has a big impact in site availability! And to users! –Ratio of scheduled/unscheduled downtimes doubled between May-09 and Apr-10  Sites are doing a better job following the procedures and pro-actively informing their user communities of planned maintenances Most regions have a good performance well above targets –70% availability –75% reliability SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Site availability seen by VOs Views from WLCG dashboards Built on top of VO specific grid monitoring probe results Using same grid monitoring framework SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Site availability seen by VOs Views from the Complex Systems VO Monitoring Built on top of VO specific grid monitoring probe results Using same grid monitoring framework SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Release and deployment management Move to Staged rollout as transition to EGI model –Controlled deployment of middleware updates into production –Get early feedback on the release under a production load –Done by a few sites on behalf of the community (to avoid duplication) –All ROCs agree with the concept, however few sites involved - so far  Requires commitment to upgrade quickly in production  As it is in production, might affect availability and reliability Process for retiring old versions of middleware services –Prototyped with Resource Brokers –Being implemented with gLite 3.1 services (replaced by gLite 3.2)  Big/medium sites able to follow new versions  Small ones slow, different issues, e.g. Obsolete H/W not running SL5  Sites per Software version: 48% gLite 3.2, 52% gLite 3.1 List of gLite service versions supported by SA1 Operations SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Distributed Operational Security Strong expertise in incident handling –Defined timelines for response and resolution  Security drills to educate and enforce procedures. Main source of incidents: failure to apply security patches –First alert in August 2009, EGEE completely patched after 2.5 months  First time a security patch was enforced at the sites Escalated to EGEE-III PMB and WLCG Management Board  60+ sites warned about suspension within 7 days, only 4 suspended –Second alert in November 2009, EGEE patched after 1 month  30+ sites warned about suspension, none suspended Effective security patching monitoring –Security monitoring tools developed and deployed to prevent security incidents –Identify security weaknesses at sites to prevent incidents Extensive connections outside the project –Build links between ROC security contacts and local NRENs SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Operational Security Support Grid Security Vulnerability Group –Help identify security weaknesses inorder to prevent incidents in deployed Grid middleware –33 issues handled during the last year (60 since the start of EGEE-III) –Transition in progress to cover all the software distributed by EGI Joint Security Policy Group –5 existing policies revised and 2 new polices:  VO Registration, VO Membership Management, Security Incident Response, Grid AUP, Site Registration  User-level Job Accounting, VO Portals –Security Policy Framework developed to support move to EGI International Grid Trust Federation –Bridged Research & Education federated ID and grid certificates –New guidelines for key management and the use of automated clients published SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Global Grid User Support (GGUS) The EGEE helpdesk is a distributed system based on regional ticketing systems with central coordination and a central entry point for user support It is the only front-end exposed to end users –Developers & testers submit bugs directly to Savannah –GGUS tickets leading to Savannah bugs  Closed when Savannah bug reaches status “Ready for review”.  44 tickets ended in Savannah bugs since January 2010  28 of these for components under the responsibility of JRA1 Changes to support increased automation –Direct routing to sites (bypass 1st line support)  Reduction in tickets handled manually in the last year (2023 versus 3912)  ~80% of all 9419 tickets are now assigned directly –Additional automation options implemented when justified  Direct assignment to some VOs and GGUS Support Units  ALARM and TEAM tickets  Ticket escalation by submitter SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Global Grid User Support SA1 – Steven Newhouse- EGEE-III Final Review June None: tickets where the submitting user does not explicitly define the associated VO Atlas is the major user (31%), followed by LHCb (9%), CMS (3%) and biomed (3%) Each VO has a different support model Originating Community (Select VO from dropdown list) – Y2

Enabling Grids for E-sciencE EGEE-III INFSO-RI Global Grid User Support SA1 – Steven Newhouse- EGEE-III Final Review June st line support (TPM) also solving 4% of the tickets 2 nd line Specialised Software Support Units solving 8% of the tickets Most tickets solved at ROCs/sites, 87% User support tickets created in Y2 for each support unit type

Enabling Grids for E-sciencE EGEE-III INFSO-RI Grid Operator on duty Role of oversight and 1 st level support for grid production infrastructure –Teams of operators on duty (COD) opening tickets on sites in response to grid monitoring alarms New model, based on the devolution to regions, fully implemented Summer ‘09 –First-line support done by each region, plus common layer for procedures, tools, escalation (the R-COD) Improvements: –Regional teams closer to the sites  Aware of different site deployment scenarios  More incentivised to prevent incidents –Shorter response time: 5.75 days (ticket assignment  closing)  Before regionalisation average response time 8 days –Distribution of responsibilities to ROCs  NGIs  Concern: Skills at ROC level will get partitioned when ROCs become NGIs SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Grid Operator on duty SA1 – Steven Newhouse- EGEE-III Final Review June Tickets opened in the last yearTickets opened per month NAGIOS 2570 tickets opened by the R-COD in the last year with an average response time 5.75 days Stable number of tickets opened by month, also after migration from SAM to Nagios Move to new technology did not impact operations Expect reduction in regional alarms as more sites start deploying Nagios Sites see issues before they are reported to them by R-CODs

Enabling Grids for E-sciencE EGEE-III INFSO-RI Grid Operations Tools SA1 – Steven Newhouse- EGEE-III Final Review June Aims –Improve reliability and availability of sites via operational tools –Prepare operational tools for use in an EGI/NGI structure Achievements: –Better tools for site admins  faster problem detection & solution –Operations dashboard with regional views, to raise alarms and GGUS tickets, used by R-COD –MyEGEE visualization portal, collaboration with OSG –Availability/Reliability now calculated from regional Nagios results –Adoption and integration of mature open-source software –Site and Regional Monitoring via Nagios monitoring framework –Integration of operational tools via ActiveMQ messaging system

Enabling Grids for E-sciencE EGEE-III INFSO-RI Grid Operations Tools Last 2 years: Now: SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI EGI: NGI prototyping Integration of NGIs into the infrastructure has started –Prototyped with 2 NGIs from 2 different multi-country ROCs  Poland from Central Europe  Greece from South East Europe –Centrally: Registration in managerial and operational structures –Locally: Tool deployment and adaption of processes (as required) Process now understood and documented Now being followed by many other NGIs –Issues: Tight coordination needed between tools and teams to enable the new NGI –Issues: With these two examples NGI staff overlapped with ROC staff with many years cumulated knowledge/experience  Most NGIs will need close support and training SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Lessons learnt Regionalization is expensive due to extra layers –Move to national based infrastructures has ‘cheaper’ core Operation tools –Importance of agreeing on formats for information exchange between tools –Rely where possible on widely adopted public domain solutions (e.g. Nagios, ActiveMQ) –Need to have a more integrated operational tool development activity SA1 – Steven Newhouse- EGEE-III Final Review June

Enabling Grids for E-sciencE EGEE-III INFSO-RI Summary Provided a stable and reliable infrastructure –While doubling (approx.) in capacity –Significant operational changes Need to improve operational tools –Home-grown tools & infrastructures costly to maintain and support Good transition to EGI, no major issues left –Full Implementation of ROC  NGI ongoing Global scientific collaborations relying on it –WLCG main user group, solid base for ESFRI engagement Collaboration was essential, important not to break this with regionalization/NGIs SA1 – Steven Newhouse- EGEE-III Final Review June