Enabling Grids for E-sciencE EGEE-II INFSO-RI OSG-doc-498 Maite Barroso: Grid Operations LHCC review, CERN, 25th September
Operations EGEE and OSG
Maite Barroso, CERN
Ruth Pordes, Fermilab
LHCC Comprehensive Review, 25th September 2006
Outline
EGEE operations
OSG operations
EGEE – OSG interoperations
EGEE Infrastructure: size
EGEE: >190 sites, 40 countries
~155 sites certified and in production
>28,000 processors, ~26 PB storage
EGEE Infrastructure: usage
~6000 CPU-months/month
EGEE operation: Key objectives
Grid management
– ROCs, relations with resource providers through negotiation of service-level agreements (SLAs)
Middleware deployment and introducing new resources
Operate a set of essential core infrastructure services
Grid monitoring and control
Resource and user support
International collaboration
– to drive collaboration with peer organisations in the Americas and the Asia-Pacific region to ensure the interoperability of Grid infrastructures and services for the EGEE-II user communities
Capture and provide middleware requirements
Grid security and incident response
Long-term sustainability of the infrastructure
– to work both within the project and with the other related infrastructure projects and embryonic National Grid Infrastructures to put in place the structures and organisation needed for a long-term sustainable infrastructure
Grid management: structure
Operations Coordination Centre (OCC)
– responsible for the overall activity management, oversight of all operational and support activities
Regional Operations Centres (ROC)
– providing the core of the support infrastructure, each supporting a number of resource centres within its region
Resource centres
– providing resources (computing, storage, network, etc.)
Grid User Support (GGUS)
– coordination and management of user support activities, single point of contact (portal) for users
Operations coordination
ROC managers meeting
– Biweekly
– Discuss inter-ROC issues, general coordination, interfaces with other activities
WLCG-EGEE-OSG Operations meeting
– Weekly, Mondays at 16:00 (Swiss time)
– WLCG/OSG/EGEE
– Pre-reports from sites, ROCs and VOs through the CIC portal
– Discuss, track and solve operations-related issues from the previous week
Operations Workshops
– Twice per year; some are held jointly by WLCG/OSG/EGEE
– Last one: June
– Next one: Spring 2007
Middleware deployment
[Diagram: gLite middleware release flow. Labels: Development teams 1, 2, 3; Tagged RPMs; Integration; Build is ready; Certification APT repository; Certification; Software passes certification; PPS APT repository; Pre-production Service (PPS); Software OK in PPS; Production APT repository; Production service; Bugs; Savannah; EMT: steer next release; Technical Coordination Group (TCG): longer term strategy]
Grid monitoring and control
The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources.
[Diagram labels: Grid Operator on Duty (COD); OSCT; Regional Operations Centres, each with its Resource Centres; Monitoring shows a problem]
Grid Operator on Duty
Role:
– Watch the problems detected by the grid monitoring tools
– Diagnose problems
– Report these problems (GGUS tickets)
– Follow and escalate them if needed (well-defined procedure)
– Provide help, propose solutions
– Build and maintain a central knowledge database (wiki)
Who does it?
– 9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
– CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern Europe, Central Europe, Germany-Switzerland
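The weekly lead/backup rotation described on the Grid Operator on Duty slide can be sketched as follows. This is only an illustration: the slide does not say how pairs are formed, so the rule used here (the backup is the next team in the listed order) and the function name `cod_pair` are assumptions.

```python
from typing import List, Tuple

# The 9 ROC teams named on the slide, in the order listed there.
ROC_TEAMS: List[str] = [
    "CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
    "Southeastern Europe", "Central Europe", "Germany-Switzerland",
]


def cod_pair(week: int, teams: List[str] = ROC_TEAMS) -> Tuple[str, str]:
    """Return (lead, backup) for a given week number.

    Assumption (not from the slide): the backup is simply the team that
    follows the lead in the list, wrapping around at the end.
    """
    lead = teams[week % len(teams)]
    backup = teams[(week + 1) % len(teams)]
    return lead, backup


if __name__ == "__main__":
    for week in range(3):
        print(week, cod_pair(week))
```

With this pairing rule, each team is lead once and backup once in every 9-week cycle.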
Grid monitoring tools
Tools used by the Grid Operator on Duty team to detect problems
Distributed responsibility
CIC portal
– Single entry point
– Integrated view of monitoring tools
Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
Grid Operations Centre Core Database (GOCDB)
GIIS monitor (Gstat)
GOC certificate lifetime
GOC job monitor
Others
Site Functional Tests
Site Functional Tests (SFT)
– Framework to test (sample) services at all sites
– Shows results matrix
– Detailed test log available for troubleshooting and debugging
– History of individual tests is kept
– Can include VO-specific tests (e.g. software environment)
– Normally >80% of sites pass SFTs
   NB: of 180 sites, some are not well managed
Very important in stabilising sites:
– Apps use only good sites
– Bad sites are automatically excluded
– Sites work hard to fix problems
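A minimal sketch of how an SFT-style results matrix could be used to exclude bad sites and compute the pass rate quoted on the Site Functional Tests slide. The data layout, site names, test names and the rule "a site is good only if all its sampled tests pass" are assumptions for illustration, not the actual SFT implementation.

```python
from typing import Dict, List

# Hypothetical results matrix: site name -> {test name -> passed?}.
ResultsMatrix = Dict[str, Dict[str, bool]]


def good_sites(matrix: ResultsMatrix) -> List[str]:
    """Sites whose sampled tests all passed (assumed criterion), sorted."""
    return sorted(
        site for site, tests in matrix.items()
        if tests and all(tests.values())
    )


def pass_fraction(matrix: ResultsMatrix) -> float:
    """Fraction of sites that pass every sampled test."""
    if not matrix:
        return 0.0
    return len(good_sites(matrix)) / len(matrix)


# Invented example data; site and test names are placeholders.
example: ResultsMatrix = {
    "site-a": {"job-submit": True, "replica-copy": True},
    "site-b": {"job-submit": True, "replica-copy": False},
    "site-c": {"job-submit": True, "replica-copy": True},
    "site-d": {"job-submit": True, "replica-copy": True},
    "site-e": {"job-submit": True, "replica-copy": True},
}

if __name__ == "__main__":
    print(good_sites(example))
    print(pass_fraction(example))
```

An application would then submit only to the sites returned by `good_sites`, which is the automatic exclusion the slide refers to.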
Service Availability Monitoring
Service Availability Monitoring (SAM)
– Will cover all core grid services
– Will measure availability by service, site and VO
   each service has an associated service class defining its required availability (critical, highly available, etc.)
– Will be used to generate alarms
   to generate trouble tickets
   to call out support staff
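The idea on the Service Availability Monitoring slide (measure availability per service, site and VO, then raise an alarm when a service falls below what its service class requires) can be illustrated with a short sketch. The threshold values, the probe record layout and all names below are assumptions; the slide gives no numbers and this is not SAM's actual code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical required availability per service class (values invented).
REQUIRED: Dict[str, float] = {"critical": 0.99, "highly_available": 0.95}

# One probe result: (service, site, vo, passed).
Probe = Tuple[str, str, str, bool]
Key = Tuple[str, str, str]


def availability(probes: Iterable[Probe]) -> Dict[Key, float]:
    """Fraction of successful probes per (service, site, VO)."""
    ok: Dict[Key, int] = defaultdict(int)
    total: Dict[Key, int] = defaultdict(int)
    for service, site, vo, passed in probes:
        key = (service, site, vo)
        total[key] += 1
        ok[key] += int(passed)
    return {key: ok[key] / total[key] for key in total}


def alarms(avail: Dict[Key, float], service_class: Dict[str, str]) -> List[Key]:
    """Keys whose availability is below the level required by their class."""
    raised = []
    for key in sorted(avail):
        required = REQUIRED[service_class[key[0]]]
        if avail[key] < required:
            raised.append(key)
    return raised


# Invented probe history for two services at one site.
probes: List[Probe] = [
    ("CE", "site-a", "atlas", True),
    ("CE", "site-a", "atlas", True),
    ("CE", "site-a", "atlas", False),
    ("CE", "site-a", "atlas", True),
    ("SE", "site-a", "atlas", True),
    ("SE", "site-a", "atlas", True),
]
classes = {"CE": "critical", "SE": "highly_available"}

if __name__ == "__main__":
    measured = availability(probes)
    print(measured)
    print(alarms(measured, classes))
```

Each alarm key could then feed a trouble ticket or a call-out, the two follow-up actions named on the slide.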
Site availability
Operational procedures
Described in the operations manual:
Introducing new resources
Resource registration and contact information
– Stored in GOCDB
Site downtime scheduling
Broadcast of planned and unplanned interventions
– EGEE broadcast tool
Site suspension
– The site is then removed from the top-level BDII and monitoring is turned off
Escalation procedures
Operational security
From the EGEE Operational Security Coordination Team (OSCT)
Recent security incident:
– Many HEP sites affected by the recent incident
– Local root compromises (on up-to-date machines)
– Many compromised accounts (password sniffers)
– Not a Grid attack as such, but it involved many LCG sites
What went well?
– Many people worked very hard
– Collaboration was excellent
– Sharing of necessary information was good
– The Grid csirts list (and HEPIX security list) kept people informed
What did not go so well? (matters for OSCT)
– A UK site decided (on the basis of following guidance) not to inform the Grid csirts list
– No incident handling team was created (but CERN took the lead)
– Private information leaked out onto several public mailing lists, Google-searchable archives and web sites
– Discussion was supposed to happen on the “contacts” list, not the “csirts” list, yet there was much activity on the csirts list
– Concern that sites who said they were not involved had not looked carefully enough
– Need to strive for the correct balance between open and closed communication
– But must encourage sites to report
Open Science Grid and WLCG
The Open Science Grid contributes to the WLCG as the US distributed facility infrastructure.
OSG delivers accountable resources and cycles for LHC experiment production and analysis.
OSG federates with other infrastructures and interoperates with managerial, operational and technical activities.
OSG cooperates with the EGEE to ensure an effective and transparent system for the experiments.
Current OSG deployment
96 Resources across production & integration infrastructures
27 Virtual Organizations, including operations and monitoring groups
>15,000 CPUs
~6 PB MSS
~4 PB disk
August OSG Usage: 3 largest VOs
50K & 90K CPU Hours/day
[Figure legend: ATLAS, CDF, CMS]
Running Jobs of the Rest of the VOs
OSG jobs are “jobs submitted via OSG interfaces or services”
3 large VOs had ~3500 simultaneous jobs in the same period
[Figure label: 1000 jobs]
Software Release & Patches
OSG releases are subsets of the VDT, tailored to OSG.
2 major OSG releases a year.
>4 minor releases a year.
Development releases for testing.
Critical patches have a separate path.
Site and Service Validation
Validation services are being packaged for use by any VO.
Grid Operations also runs the validations:
– Site-Verify executed by Operations under the operations VO.
– Job execution and file transfer tests executed under the GridEx VO.
GridCat displays the results of validations as a “red”/“green” status display.
The Integration Grid provides a system for application validation of releases, patches to the software and new services.
Support Model in OSG
Distributed set of Support Centers covers all aspects of OSG
– VO, Resources, Services, Middleware, Community
– A support center may support multiple activities.
The goal of the OSG support model is to provide OSG users and resources with rapid responses to reported issues.
Each VO supports its own users and resources.
There is an OSG Grid Operations Center for coordination and routing of issues, along with critical infrastructure components.
OSG GOC has final responsibility for releases of the OSG software stack (including patches).
OSG Grid Operations Center
Supports Centralized Grid Services
– Monitoring Tools (MonALISA, GridCat)
– Resource Information Tools (VORS, BDII)
– Centralized Trouble Ticketing
– Interaction with Peering Grids (EGEE/TeraGrid)
– Communication Hub
– Software Packaging
– Documentation of Operations Information
– Security Response
– Keeps Definitive Contact Directory for VOs, Resources, and Support Centers
– Releasing Critical Patches/Upgrades to OSG
And supports the OSG VO
Support Mechanisms in OSG
Distributed set of Support Centers for all production activities in OSG
– VO, Resources, Services, Middleware, Community
– A support center may support multiple activities.
When VOs, Resources, or Services are registered, they identify a Support Center (may be Community Support).
All Support Centers participate in OSG Operations.
Examples of Support Services
Middleware
– VDT is the core-middleware support center. There are other direct middleware support contacts, e.g. MonALISA.
– VOs and other support centers are provided with a path to the middleware representatives
– VDT has weekly office hours and an independent trouble ticket system
Community Support
– Open support for Users and Resources not covered by a specific support center.
– Voluntary participation on mailing lists & Community Chat Room
User Support
– VO Users contact their VO support center to begin the troubleshooting process
– Problems are routed by the OSG-GOC to the responsible Support Center if the problem moves outside the VO
– Support Documents should be made available from the VO Support Center and recorded on the OSG Twiki along with VO policy
– Local Ticketing Systems for some VOs
Application Support
– Application questions go directly to the VO Support Center for routing/troubleshooting.
Security Operations
Security Officer plans and coordinates.
Integrated Security Management consisting of Risk Assessment of vulnerabilities, resulting in Management, Operations and Technical controls.
Equivalence of Site and VO responsibilities and procedures.
Incident Response includes identified security contacts of all OSG organizations.
EGEE – OSG interoperations
Coordination
– WLCG-EGEE-OSG operations meeting
– Operations workshop
   Focus of the last one was OSG-EGEE interoperations; much progress achieved
– Regular phone calls to make progress on specific areas
Operations tools: common and/or interoperable
– Global BDII extracted from EGEE and OSG registration DBs
– GGUS interfaced to OSG FootPrints
– Interfacing of site/service monitoring tools is being discussed
Security: work is underway to share security contact information and incident information
– Cross-population of mailing lists
– EGEE sites in the OSG lists, and vice versa
– Technical details still to be agreed
   Read access to GOC-DB etc.
– Ensure consistent (and often common) policies through joint working groups.
Problem Reports
3 WLCG ROCs in the US: US-ATLAS, US-CMS, OSG-GOC.
All tickets routed from WLCG through OSG-GOC.
OSG GOC and EGEE GGUS exchange and automatically route tickets.
OSG-GOC automatically routes tickets to US-CMS-ROC and, currently, manually routes tickets to US-ATLAS-ROC.
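The routing described on the Problem Reports slide can be written down as a small lookup. This is a sketch of the stated rules only; the function name, the hop representation and the error handling are assumptions and do not reflect how GGUS or the OSG ticketing system is implemented.

```python
from typing import Dict, List, Tuple

# How OSG-GOC forwards tickets to each US ROC, as stated on the slide.
FORWARDING_MODE: Dict[str, str] = {
    "US-CMS-ROC": "automatic",
    "US-ATLAS-ROC": "manual",  # "currently" manual according to the slide
}


def route_from_wlcg(target: str) -> List[Tuple[str, str]]:
    """Return the hops a WLCG ticket takes as (destination, mode) pairs.

    Every ticket first goes to OSG-GOC (automatic exchange with GGUS);
    OSG-GOC then forwards it to the target ROC if the target is not itself.
    """
    hops = [("OSG-GOC", "automatic")]
    if target == "OSG-GOC":
        return hops
    if target not in FORWARDING_MODE:
        raise ValueError(f"unknown US ROC: {target}")
    hops.append((target, FORWARDING_MODE[target]))
    return hops


if __name__ == "__main__":
    for roc in ("OSG-GOC", "US-CMS-ROC", "US-ATLAS-ROC"):
        print(roc, route_from_wlcg(roc))
```

Keeping the forwarding mode in a table means that automating the US-ATLAS-ROC path later would be a one-line change in this sketch.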
EGEE OSG Activities
Completed
– Interoperation of information published in BDII for use by WLCG Resource Brokers.
In progress
– Operations VO, “Ops”, on EGEE and OSG for common tests and validations.
– Programmatic interface to the trouble ticket system, which allows retrieval of EGEE - OSG resource scheduled downtimes.
To watch for
– How do we communicate and test interoperability of changes (interfaces and capabilities) before they get to production?
– How do we communicate about new s/w developments in time to have common approaches & avoid duplication & divergence?
– How do we manage ourselves so that we do not give in to “panic mode” responses & give ourselves enough time to avoid organizing “just in time”?
– How do we prioritize support for our non-WLCG stakeholders during data taking?
Summary
WLCG Operations is a focus of EGEE and OSG Operations.
The 2 grid infrastructures are working together to ensure smooth, scalable, and effective production support.