Enabling Grids for E-sciencE EGEE-II INFSO-RI-031688 OSG-doc-498 Maite Barroso: Grid Operations LHCC review, CERN, 25th September 2006. Operations EGEE and OSG.

Presentation transcript:

Operations EGEE and OSG
Maite Barroso, CERN
Ruth Pordes, Fermilab
LHCC Comprehensive Review, 25th September 2006

Outline
- EGEE operations
- OSG operations
- EGEE-OSG interoperations

EGEE Infrastructure: size
- EGEE: >190 sites, 40 countries
- ~155 sites certified and in production
- >28,000 processors, ~26 PB storage

EGEE Infrastructure: usage
- ~6000 CPU-months/month

EGEE operation: Key objectives
- Grid management
  - ROCs, relations with resource providers through negotiation of service-level agreements (SLAs)
- Middleware deployment and introducing new resources
- Operate a set of essential core infrastructure services
- Grid monitoring and control
- Resource and user support
- International collaboration
  - to drive collaboration with peer organisations in the Americas and the Asia-Pacific region to ensure the interoperability of Grid infrastructures and services so that the EGEE-II user communities
- Capture and provide middleware requirements
- Grid security and incident response
- Long term sustainability of the infrastructure
  - to work both within the project and with the other related infrastructure projects and embryonic National Grid Infrastructures to put in place the necessary structures and organisation to ensure a long term sustainable infrastructure

Grid management: structure
- Operations Coordination Centre (OCC)
  - responsible for the overall activity management, oversight of all operational and support activities
- Regional Operations Centres (ROC)
  - providing the core of the support infrastructure, each supporting a number of resource centres within its region
- Resource centres
  - providing resources (computing, storage, network, etc.)
- Grid User Support (GGUS)
  - coordination and management of user support activities, single point of contact (portal) for users

Operations coordination
- ROC managers meeting
  - Biweekly
  - Discuss inter-ROC issues, general coordination, interfaces with other activities
- WLCG-EGEE-OSG Operations meeting
  - Weekly, Mondays at 16:00 (Swiss time)
  - WLCG/OSG/EGEE
  - Pre-reports from sites, ROCs and VOs through CIC portal
  - Discuss, track and solve operation related issues from the previous week
- Operation Workshops
  - Twice per year; some held jointly between WLCG/OSG/EGEE
  - Last one: June
  - Next one: Spring 2007

Middleware deployment
[Diagram: gLite middleware release flow. Stages shown: Development teams 1-3, Integration, Certification, Pre-production Service (PPS), Production service. Repositories shown: Certification APT repository, PPS APT repository, Production APT repository. Transition labels: "Tagged RPMs", "Build is ready", "Software passes certification", "Software OK in PPS". Bugs from the pre-production and production services go to Savannah. The EMT steers the next release; the Technical Coordination Group (TCG) sets the longer term strategy.]
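
The gated release flow on this slide can be illustrated with a small sketch. It is not taken from the gLite tooling: the stage names follow the diagram labels, but the strictly linear ordering, the `Release` class and the `promote` function are assumptions made here for illustration only.

```python
from dataclasses import dataclass, field

# Stage names follow the diagram labels; treating them as a strictly
# linear sequence is an assumption of this sketch.
STAGES = ["integration", "certification", "pre-production", "production"]


@dataclass
class Release:
    """Hypothetical record of one middleware release moving through the stages."""
    tag: str
    stage: str = "integration"
    history: list = field(default_factory=list)


def promote(release: Release, gate_passed: bool) -> Release:
    """Move a release one stage forward only if the current stage's gate passed."""
    idx = STAGES.index(release.stage)
    if idx == len(STAGES) - 1:
        raise ValueError(f"{release.tag} is already in production")
    if not gate_passed:
        # A failed gate leaves the release where it is; in the slide's flow
        # the problem would be reported as a bug instead.
        release.history.append((release.stage, "rejected"))
        return release
    release.history.append((release.stage, "passed"))
    release.stage = STAGES[idx + 1]
    return release
```

Each call to `promote` corresponds to one transition label in the diagram ("Build is ready", "Software passes certification", "Software OK in PPS"); a release can never skip a stage.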

Grid monitoring and control
The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources.
[Diagram labels: "Monitoring shows a problem"; Grid Operator on-duty (COD); OSCT; Regional Operations Centres, each with several Resource Centres.]

Grid Operator on Duty
- Role:
  - Watch the problems detected by the grid monitoring tools
  - Problem diagnosis
  - Report these problems (GGUS tickets)
  - Follow and escalate them if needed (well defined procedure)
  - Provide help, propose solutions
  - Build and maintain a central knowledge database (wiki)
- Who does it?
  - 9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
  - CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern-Europe, Central-Europe, Germany-Switzerland
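
A weekly lead/backup rotation over the nine ROC teams could look like the sketch below. The slide only states that teams work in pairs on a weekly rotation; the pairing rule used here (the backup is the next team in the list and becomes lead the following week) and the function name `cod_schedule` are assumptions for illustration.

```python
from datetime import date, timedelta

# Team names as listed on the slide.
ROC_TEAMS = [
    "CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
    "Southeastern-Europe", "Central-Europe", "Germany-Switzerland",
]


def cod_schedule(start_monday: date, weeks: int, teams=ROC_TEAMS):
    """Return a list of (week_start, lead, backup) tuples.

    Assumed pairing rule: the backup is the next team in the list,
    so this week's backup leads next week.
    """
    n = len(teams)
    schedule = []
    for w in range(weeks):
        lead = teams[w % n]
        backup = teams[(w + 1) % n]
        schedule.append((start_monday + timedelta(weeks=w), lead, backup))
    return schedule
```

With this rule every team takes one lead shift and one backup shift in each nine-week cycle, and the backup has always seen the previous week's open problems before taking over.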

Grid monitoring tools
- Tools used by the Grid Operator on Duty team to detect problems
- Distributed responsibility
- CIC portal
  - single entry point
  - integrated view of monitoring tools
- Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
- Grid Operations Centre Core Database (GOCDB)
- GIIS monitor (Gstat)
- GOC certificate lifetime
- GOC job monitor
- Others

Site Functional Tests
- Site Functional Tests (SFT)
  - Framework to test (sample) services at all sites
  - Shows results matrix
  - Detailed test log available for troubleshooting and debugging
  - History of individual tests is kept
  - Can include VO-specific tests (e.g. software environment)
  - Normally >80% of sites pass SFTs
    - NB: of 180 sites, some are not well managed
- Very important in stabilising sites:
  - Apps use only good sites
  - Bad sites are automatically excluded
  - Sites work hard to fix problems
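
The idea of a results matrix from which bad sites are automatically excluded can be sketched as follows. This is not the SFT implementation: the matrix layout, the notion of a "critical test" list and the function names are hypothetical, as are the site and test names used in the usage note.

```python
def passing_sites(results, critical_tests):
    """Return the sorted list of sites that pass every critical test.

    `results` maps site name -> {test name: bool}. A test with no
    recorded result counts as a failure (an assumption of this sketch).
    """
    return sorted(
        site
        for site, tests in results.items()
        if all(tests.get(t, False) for t in critical_tests)
    )


def pass_fraction(results, critical_tests):
    """Fraction of sites in the matrix that pass all critical tests."""
    if not results:
        return 0.0
    return len(passing_sites(results, critical_tests)) / len(results)
```

A matrix such as `{"site-a": {"job-submit": True}, "site-b": {"job-submit": False}}` with `critical_tests=["job-submit"]` would leave only `site-a` available to applications, which is the "apps use only good sites" behaviour the slide describes.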

Service Availability Monitoring
- Service Availability Monitoring (SAM)
  - Will cover all core grid services
  - Measure availability by service, site, VO
    - each service has an associated service class defining required availability (critical, highly available, etc.)
  - Will be used to generate alarms
    - to generate trouble tickets
    - to call out support staff
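
Measuring availability per (site, service, VO) and raising alarms against a service class can be sketched as below. The slide names the service classes but gives no numbers, so the thresholds in `REQUIRED` are invented placeholders, and the function names and sample layout are likewise assumptions rather than SAM's actual interface.

```python
from collections import defaultdict

# Placeholder targets: the slide names the classes but not their values.
REQUIRED = {"critical": 0.99, "highly_available": 0.95}


def availability(samples):
    """Compute the pass fraction per (site, service, vo).

    `samples` is an iterable of (site, service, vo, passed) tuples,
    one per test execution.
    """
    ok = defaultdict(int)
    total = defaultdict(int)
    for site, service, vo, passed in samples:
        key = (site, service, vo)
        total[key] += 1
        if passed:
            ok[key] += 1
    return {key: ok[key] / total[key] for key in total}


def alarms(avail, service_class):
    """Return the (site, service, vo) keys below their class target.

    `service_class` maps a service name to a key of REQUIRED.
    """
    raised = []
    for (site, service, vo), value in sorted(avail.items()):
        if value < REQUIRED[service_class[service]]:
            raised.append((site, service, vo))
    return raised
```

In the scheme the slide outlines, each entry returned by `alarms` would then feed a trouble ticket or a call-out to support staff.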

Site availability

Operational procedures
Described in the operations manual:
- Introducing new resources
- Resource registration and contact information
  - Stored in GOCDB
- Site downtime scheduling
- Broadcast of planned and unplanned interventions
  - EGEE broadcast tool
- Site suspension
  - The site is then removed from the top-level BDII and monitoring is turned off
- Escalation procedures

Operational security
From the EGEE Operational Security Coordination Team (OSCT)
- Recent security incident:
  - Many HEP sites affected by the recent incident
  - Local root compromises (on up-to-date machines)
  - Many compromised accounts (password sniffers)
  - Not a Grid attack as such, but it involved many LCG sites
- What went well?
  - Many people worked very hard
  - Collaboration was excellent
  - Sharing of necessary information was good
  - The Grid csirts list (and HEPiX security list) kept people informed
- What did not go so well? (matters for OSCT)
  - A UK site decided (on the basis of following guidance) not to inform the Grid csirts
  - No incident handling team created (but CERN took the lead)
  - Private information leaked out onto several public mail lists and Google-searchable archives and web sites
  - Discussion was supposed to happen on the “contacts” list, not the “csirts” list; much activity took place on the csirts list
  - Concern that sites who said they were not involved had not looked carefully enough
  - Need to strive for the correct balance in open vs closed communication
  - But must encourage sites to report

Open Science Grid and WLCG
- The Open Science Grid contributes to the WLCG as the US distributed facility infrastructure.
- OSG delivers accountable resources and cycles for LHC experiment production and analysis.
- OSG federates with other infrastructures and interoperates with managerial, operational and technical activities.
- OSG cooperates with the EGEE to ensure an effective and transparent system for the experiments.

Current OSG deployment
- 96 resources across production & integration infrastructures
- 27 Virtual Organizations, including operations and monitoring groups
- >15,000 CPUs
- ~6 PB MSS
- ~4 PB disk

August OSG usage: 3 largest VOs
[Chart: CPU hours/day for ATLAS, CDF and CMS; labelled levels at 50K and 90K CPU hours/day.]

Running jobs of the rest of the VOs
- OSG jobs are “jobs submitted via OSG interfaces or services”
- The 3 large VOs had ~3500 simultaneous jobs in the same period
[Chart label: 1000 jobs]

Software Release & Patches
- These are subsets of the VDT, tailored to OSG.
- 2 OSG major releases a year.
- >4 minor releases a year.
- Development releases for testing.
- Critical patches have a separate path.

Site and Service Validation
- Validation services are being packaged for use by any VO.
- Grid Operations also runs the validations:
  - Site-Verify executed by Operations under the operations VO.
  - Job execution and file transfer tests executed under the GridEx VO.
- GridCat displays the results of validations in a “red”/“green” presentation.
- The Integration Grid provides a system for application validation of releases and patches to the software and of new services.

Support Model in OSG
- Distributed set of Support Centers covers all aspects of OSG
  - VO, Resources, Services, Middleware, Community
  - A support center may support multiple activities.
- The goal of the OSG support model is to provide OSG users and resources with rapid responses to reported issues.
- Each VO supports its own users and resources.
- There is an OSG Grid Operations Center for coordination and routing of issues along with critical infrastructure components.
- OSG GOC has final responsibility for releases of the OSG software stack (including patches).

OSG Grid Operations Center
- Supports Centralized Grid Services
  - Monitoring Tools (MonALISA, GridCat)
  - Resource Information Tools (VORS, BDII)
  - Centralized Trouble Ticketing
  - Interaction with Peering Grids (EGEE/TeraGrid)
  - Communication Hub
  - Software Packaging
  - Documentation of Operations Information
  - Security Response
  - Keeps Definitive Contact Directory for VOs, Resources, and Support Centers
  - Releasing Critical Patches/Upgrades to OSG
- And supports the OSG VO

Support Mechanisms in OSG
- Distributed set of Support Centers for all production activities in OSG
  - VO, Resources, Services, Middleware, Community
  - A support center may support multiple activities.
- When VOs, Resources, or Services are registered they identify a Support Center (may be Community Support).
- All Support Centers participate in OSG Operations.

Examples of Support Services
- Middleware
  - VDT is the core-middleware support center. Other direct middleware support contacts exist, e.g. MonALISA.
  - VOs and other support centers are provided with a path to the middleware representatives
  - VDT has weekly office hours and an independent trouble ticket system
- Community Support
  - Open support for Users and Resources not covered by a specific support center.
  - Voluntary participation on mail lists & Community Chat Room
- User Support
  - VO users contact their VO support center to begin the troubleshooting process
  - Problems are routed by the OSG-GOC to the responsible Support Center if the problem moves outside the VO
  - Support documents should be made available from the VO Support Center and recorded on the OSG Twiki along with VO policy
  - Local ticketing systems for some VOs
- Application Support
  - Application questions go directly to the VO Support Center for routing/troubleshooting.

Security Operations
- Security Officer plans and coordinates
- Integrated Security Management consisting of Risk Assessment of vulnerabilities resulting in Management, Operations and Technical controls.
- Equivalence of Site and VO responsibilities and procedures.
- Incident Response includes identified security contacts of all OSG organizations.

EGEE-OSG interoperations
- Coordination
  - WLCG-EGEE-OSG operations meeting
  - Operations workshop
    - Focus of the last one was OSG-EGEE interoperations; much progress achieved
  - Regular phone calls to make progress on specific areas
- Operations tools: common and/or interoperable
  - Global BDII extracted from EGEE and OSG registration DBs
  - GGUS interfaced to OSG FootPrints
  - Interfacing of site/service monitoring tools being discussed
- Security: work is underway to share security contact information and incident information
  - Cross population of mail lists
  - EGEE sites in the OSG lists
    - And vice-versa
  - Technical details still to be agreed
    - Read access to GOC-DB etc.
  - Ensure consistent (and many times common) policies through joint working groups

Problem Reports
- 3 WLCG ROCs in the US: US-ATLAS, US-CMS, OSG-GOC.
- All tickets routed from WLCG through OSG-GOC.
- OSG GOC and EGEE GGUS exchange and automatically route tickets.
- OSG-GOC automatically routes tickets to US-CMS-ROC and, currently, manually routes tickets to US-ATLAS-ROC.
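
The ticket path described on this slide can be written down as a small routing table. The hop names and the automatic/manual distinction come from the slide; the function `route_ticket`, the table name and the tuple layout are illustrative assumptions, not the interface of GGUS or of the OSG ticketing system.

```python
# How OSG-GOC forwards tickets to the other two US ROCs, per the slide.
OSG_GOC_FORWARDING = {
    "US-CMS-ROC": "automatic",
    "US-ATLAS-ROC": "manual",
}


def route_ticket(destination):
    """Return the hops for a WLCG ticket raised in GGUS for a US ROC.

    Each hop is (system, how the ticket arrived there). Every ticket
    passes through OSG-GOC first, reached automatically from GGUS.
    """
    path = [("GGUS", "origin"), ("OSG-GOC", "automatic")]
    if destination == "OSG-GOC":
        return path
    if destination not in OSG_GOC_FORWARDING:
        raise ValueError(f"unknown US ROC: {destination}")
    path.append((destination, OSG_GOC_FORWARDING[destination]))
    return path
```

Keeping the forwarding mode in one table makes the remaining manual step explicit: switching US-ATLAS-ROC to automatic routing would be a one-line change in this sketch.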

EGEE OSG Activities
- Completed
  - Interoperation of information published in BDII for use by WLCG Resource Brokers.
- In progress
  - Operations VO, “Ops”, on EGEE and OSG for common tests and validations.
  - Programmatic interface to the trouble ticket system which allows retrieval of EGEE-OSG resource scheduled downtimes.
- To watch for
  - How do we communicate and test interoperability of changes (interfaces and capabilities) before they get to production?
  - How do we communicate about new s/w developments in time to have common approaches & avoid duplication & divergence?
  - How do we manage ourselves to not give in to “panic mode” responses & give ourselves time to not organize “just in time”?
  - How do we prioritize support for our non-WLCG stakeholders during data taking?

Summary
- WLCG Operations is a focus of EGEE and OSG Operations.
- The 2 grid infrastructures are working together to ensure smooth, scalable, and effective production support.