
1 Grid Deployment & Operations: EGEE, LCG and GridPP. Jeremy Coles, GridPP Production Manager and UK&I Operations Manager for EGEE. J.Coles@rl.ac.uk, 20th September 2005

2 Overview
  1 Project background (to EGEE, LCG and GridPP)
  2 The middleware and its deployment
  3 Structures developed in response to operating a large grid
  4 How the infrastructure is being used
  5 Particular problems being faced
  6 Summary

3 A reminder of the Enabling Grids for E-sciencE project
  32 million euros of EU funding over 2 years, starting 1st April 2004
  48% service activities (Grid Operations, Support and Management, Network Resource Provision)
  24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
  28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
  The emphasis in EGEE is on operating a production grid and supporting the end-users
  (From Bob Jones's talk at AHM 2004)

4-7 The UK & Ireland contribution to SA1 (deployment & operations) consists of 3 partners:
  Grid Ireland
  The National Grid Service (NGS) - Leeds/Manchester/Oxford/RAL
  GridPP - currently the lead partner, based on a Tier-2 structure within the LHC Computing Grid project (LCG) [see T Doyle's talk tomorrow, 11am, CR2]
The UKI structure:
  Regional Operations Centre (ROC): helpdesk, communications, liaison with other ROCs and CICs, monitoring of resources
  Core Infrastructure Centre (CIC): the team takes shifts to monitor core services and follow up on site problems

8 GridPP is a major contributor to the growth of EGEE resources

9 When sites join EGEE, the ROC:
  Records site details in a central Grid Operations Centre database (GOCDB), with certificate-controlled access
  Ensures that the site has agreed to and signed the Acceptable Use and Incident Response procedures
  Runs tests against the site to ensure that the setup is correctly configured
  (NB: page access requires an appropriate grid certificate)

10 Experience has revealed growing requirements for the GOCDB:
  ROC manager control - the ability to update site information, change the monitoring status of sites, or remove them
  A structure that allows easy population of structured views (such as accounting according to regional structures)
  The ability to differentiate pure production sites from test resources (e.g. preproduction services)

11 EGEE middleware is still evolving based on operational needs. [Diagram: the middleware stack, spanning computing clusters, network resources and data storage]
  Application level services: user interfaces, applications
  Collective services: Resource Broker, data management, application monitoring system (EU DataGrid information system)
  Basic services: user access, security, data transfer, information schema - VDT (Condor, Globus, GLUE)
  System software: operating system (Scientific Linux, RHEL...), local scheduler (PBS, Condor, LSF...), file system (NFS...), managed storage (dCache-SRM, DPM...)
  Hardware: computing cluster, network resources, data storage

12 An overview of the (changing) middleware release process
  Certification is run daily; the CIC re-certifies releases
  EIS updates the user guides; GIS updates the release notes and installation guides
  Client releases: every month, deployed in user space by GIS
  Service releases: every month, optional, deployed by CICs and RCs at their own pace
  Major releases: every 3 months on fixed dates, mandatory, deployed by ROCs and RCs
Site deployment of middleware:
  YAIM - a bash script; simple and transparent, and much preferred by administrators (install sketch below)
  QUATTOR - a steep learning curve, but it allows tighter control over installation
  Open tensions: patches & functionality vs stability; porting to non-standard LCG operating systems
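YAIM's appeal is that the whole site configuration lives in one editable file. A minimal sketch of an LCG-2-era worker-node install with YAIM, assuming script paths and node-type names of that generation (these varied between releases, so treat the details as illustrative rather than authoritative):

# Copy the example configuration and edit it for this site
cp /opt/lcg/yaim/examples/site-info.def /root/site-info.def
vi /root/site-info.def    # set CE_HOST, SE_HOST, supported VOs, queues, ...

# Install the middleware meta-package for this node type, then
# configure the node from the same single file - this transparency
# is what administrators liked about YAIM
/opt/lcg/yaim/scripts/install_node /root/site-info.def lcg-WN
/opt/lcg/yaim/scripts/configure_node /root/site-info.def WN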

13 A mixed infrastructure is inevitable and local variations must be manageable
  Releases take time to be adopted - how will more frequent updates be tagged and handled?
  Grid Ireland has a completely different deployment model to GridPP (central vs site-based)

14 Additional components are added, such as for managed storage
  The Storage Resource Management (SRM) interface provides a protocol for large-scale storage systems on the grid
  Clients can retrieve and store files, and control file lifetimes and file space
  Sites will need to offer an SRM-compliant Storage Element (SE) to VOs (usage sketch below); these SEs are basically filesystem mount points on specific servers
  There are few solutions available, and deployment at test sites has proved time-consuming (integration at sites, understanding the hardware setup - though documentation is improving)
  [Diagram: the grid Storage Element - interfaces and handlers in front of tape (or disk) storage, with access control and file metadata]
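To make the SE discussion concrete, a hedged sketch of how a VO user exercises an SRM-fronted SE with the lcg_util tools of this period; the SE hostname and logical file name are hypothetical, and option details varied between lcg_util versions:

# Copy a local file to a named storage element and register it
# in the catalogue under a logical file name (LFN)
lcg-cr --vo dteam -d srm.example.ac.uk \
       -l lfn:/grid/dteam/demo/test.dat file:/tmp/test.dat

# Later, fetch a replica back via the logical name
lcg-cp --vo dteam lfn:/grid/dteam/demo/test.dat file:/tmp/test.copy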

15-17 Once sites are part of the grid they are actively monitored
  The Site Functional Tests (SFTs) are a series of jobs reporting whether a site is able to do basic transfers, publishes the required information, etc.
  These have recently been updated, as certain critical tests gave a misleading impression of a site
  The tests are being used (and expanded) by Virtual Organisations (VOs) to select stable sites, to improve efficiency
  They have proved very useful to sites and can now be run by them on demand (a miniature probe job is sketched below)
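For flavour, a miniature of what an SFT-style probe amounts to: a trivial JDL job pushed through the LCG-2 workload tools, with the output collected and inspected afterwards. The JDL is deliberately minimal and the file names are our own:

# A trivial test job: does the site run a job and return its output?
cat > sft-probe.jdl <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "probe.out";
StdError      = "probe.err";
OutputSandbox = {"probe.out","probe.err"};
EOF

edg-job-submit -o jobids sft-probe.jdl          # goes via the Resource Broker
edg-job-status -i jobids                        # poll until Done
edg-job-get-output -i jobids --dir ./sft-results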

18 The tests form part of a suite of information used by the Core Infrastructure Centres (CICs)
  There are currently 5 CICs in EGEE
  Introduction of a CIC-on-duty rota (whereby each CIC oversees EGEE operations for 1 week at a time) saw a great improvement in grid stability
  Available information is captured in a trouble ticket and sent to problem sites (and their ROC) informing them that there is a problem
  Tickets are automatically escalated if not resolved
  Core services are monitored in addition to sites

19 Good, reliable and easy-to-access information has been extremely useful to sites and ROC staff. At a glance we can see, for each site (a query sketch follows the list):
  whether it passes or fails the functional tests
  whether there are configuration errors (via sanity checks)
  what middleware version is deployed
  the total job slots available and used, as published by the site
  basic storage information
  average and maximum published job slots, showing deviations
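These at-a-glance numbers come from what sites publish into the GLUE information system, so they can be pulled directly from a top-level BDII. A sketch assuming a hypothetical BDII host (port 2170 and the o=grid base were the conventions of the time; attribute names are from the GLUE 1.x schema):

# Query published CE capacity and load from the information system
ldapsearch -x -H ldap://bdii.example.ac.uk:2170 -b o=grid \
    '(objectClass=GlueCE)' \
    GlueCEUniqueID GlueCEInfoTotalCPUs GlueCEStateFreeCPUs \
    GlueCEStateRunningJobs GlueCEStateWaitingJobs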

20 With a rapidly growing number of sites and widening geographic coverage, many tools have had to evolve

21 ... and new ones developed. EGEE and LCG metrics are an increasing area of focus – how else are we to manage?

22 We need to develop a better understanding of grid dynamics. Is this several sites with large farms upgrading? Is it the result of a loss of the Tier-1 scheduler? Or just a problem with the tests?

23 The good news is that UKI is currently the largest contributor to EGEE resources

24 … and resource usage is growing (55% for August; 26% for the period from June 2004)
  Utilisation may worry some people, but note that the majority of resources are being deployed for High Energy Physics experiments, which will ramp up usage quickly in 2007
  Recent activity is partly due to a biomedical data challenge in August

25 Several sites have been running full for July/August. The plot below is for the Tier-1 in August

26 However, full does not always mean well used! The plot shows weighted job efficiencies for the ATLAS VO in July 2005 (a toy calculation of the metric follows below)
  Straight-line structures show jobs which ran for a period of time before blocking on an external resource and eventually being killed by an elapsed-time limit
  Clusters at low efficiency probably show performance problems on external storage elements
  Many problems seen here are now fixed
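The efficiency plotted here is essentially CPU time over elapsed wall time, summed per job set; jobs that block on storage burn wall time without consuming CPU and drag the figure down. A minimal sketch over hypothetical accounting records of the form "jobid cpu_seconds wall_seconds":

# Weighted efficiency = total CPU time / total wall time
awk '{ cpu += $2; wall += $3 }
     END { printf "weighted efficiency: %.1f%%\n", 100 * cpu / wall }' \
    atlas-july2005.acct    # hypothetical accounting dump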

27 … and some sites have specific scheduling requirements
  [Diagram: a batch server (pbs_server) holding the configuration, job queue and state table; a scheduler plug-in exchanging node and job start/stop/status information; execution hosts (pbs_mom); users interacting via qsub, qdel and qstat - optionally with a separate scheduler and an additional cluster]
  The tension: grid scheduling (using user-specified requirements to select resources) vs local policies (the site prefers certain VOs) - a queue-configuration sketch follows
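One common way a site encodes the "local policy" half of this tension is a per-VO batch queue with a group ACL, sketched below for PBS/Torque; the queue and group names are hypothetical:

# Create a queue that only the locally mapped 'atlas' group may use
qmgr -c "create queue atlas"
qmgr -c "set queue atlas queue_type = Execution"
qmgr -c "set queue atlas acl_group_enable = True"
qmgr -c "set queue atlas acl_groups = atlas"
qmgr -c "set queue atlas resources_max.walltime = 48:00:00"
qmgr -c "set queue atlas enabled = True"
qmgr -c "set queue atlas started = True"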

28 The user community is expanding, creating new problems
  Over 900 users in some 60+ VOs; UK sites support about 10 VOs
  Opening up resources for non-traditional site VOs/users requires effort
  Negotiation between VOs and the regional sites has required the creation of an Operational Advisory Group
  New Acceptable Use policies which apply across countries and are agreeable (and actually readable) are taking time to develop

29 Aggregation of job accounting is recording VO usage [Diagram: sites send accounting records to the GOC, which provides a web summary view of the data]

30 Aggregation of job accounting is recording VO usage, but ...
  Not all batch systems are covered
  Not all sites are publishing data
  Farm normalisation factors are not consistent (sketch below)
  Publishing across grids is yet to be tackled (but the solution in EGEE does use a GGF schema)
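Why inconsistent normalisation matters: raw CPU seconds from farms with different processors only become comparable once scaled by a per-farm benchmark figure (SpecInt2000 at this time). A toy sketch with made-up numbers:

# Scale raw CPU time by the farm's benchmark relative to the reference
raw_cpu_seconds=36000
farm_si2k=1400        # hypothetical SpecInt2000 rating of this farm's CPUs
reference_si2k=1000   # the agreed reference unit (1 kSI2K)

normalised=$(( raw_cpu_seconds * farm_si2k / reference_si2k ))
echo "normalised CPU time: ${normalised} kSI2K-seconds"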

31 GridPP data is reasonably complete for recent months. Note the usage by non-particle-physics organisations - this is what the EGEE grid is all about.

32 Support is proving difficult because the project is so large and diverse
  [Diagram: users, experiments/VOs and site administrators feed tickets into the UKI ROC ticket-tracking system (Footprints), which links to GGUS (Remedy), regional services, the Tier-1 helpdesk (Request Tracker), the Grid-Ireland helpdesk (Request Tracker), the GOSC (Footprints), the CIC-on-duty, Savannah bug tracking, and the LCG-ROLLOUT and TB-SUPPORT mailing lists]
  This is ONLY the view for the UKI operations centre - there are 9 ROCs

33 The EGEE model uses a central helpdesk facility and Ticket Process Managers (TPMs)
  "I need help!" - the user sends e-mail to vo-user-support@ggus.org; the e-mail is automatically converted into a GGUS ticket, which can be addressed to the VO's TPM only, the general TPM only, or both
  Ticket Process Manager: monitors ticket assignments, directs tickets to the correct support unit (VO, ROC, middleware, other-grid or CIC support units, or mailing lists), and notifies users of specific actions and ticket status
  TPM VO Support: people from the VOs; they receive and follow VO-related tickets, solve or forward VO-specific problems, and recognise grid-related problems, assigning them to specific support units or back to the TPM

34 The EGEE model uses a central helpdesk facility and Ticket Process Managers, but ...
  Mailing lists are very active on their own!
  Linking up ROC helpdesks is taking time
  Getting VOs to populate their follow-up lists is not happening quickly
  Some users are confused - mixed messages
  The central GGUS facility is taking time to become stable
  Ticket Process Managers are difficult to provide - EGEE funding did not account for them
  VOs still have independent support lists and routes - especially the larger VOs

35 Interoperability is another area to be developed, in terms of:
  Operations
  Support
  Job submission
  Job monitoring
  …
  Currently the VOs/experiments develop their own solutions to this problem

36 Some other areas which are talks in themselves!
  Security: getting all sites to adopt best practices (checking patches, checking port changes, reviewing log files - sketch below) and scanning for grid-wide intrusion
  Network monitoring: aggregation of data from site network boxes; a mediator for integrated network checks
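As a flavour of "reviewing log files": a hedged one-liner summarising failed authentications in a gatekeeper log. The path and message text are typical of Globus gatekeepers of this era but differ between sites and versions, so adjust both to the local setup:

# Count failed-authentication lines per timestamp prefix (fields 1-2
# assumed to be the date; adapt to the local log format)
grep -i "authentication fail" /var/log/globus-gatekeeper.log \
    | awk '{ print $1, $2 }' | sort | uniq -c | sort -rn | head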

37 Going forward, one of the main drivers pushing the service is a series of service challenges in LCG
  The main UK site is connected to CERN via UKLIGHT, with up to 650 Mb/s sustained transfers
  3 Tier-2 centres deployed an SRM and managed sustained data transfer rates of up to 550 Mb/s over SJ4; one is connected via UKLIGHT (transfer sketch below)
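The sustained rates quoted here were achieved with GridFTP-style wide-area transfers. A minimal sketch with hypothetical endpoints; -p sets the number of parallel TCP streams and -tcp-bs the TCP buffer size, the two main knobs for filling a long fat pipe:

# Third-party copy between two GridFTP servers with 8 parallel streams
globus-url-copy -p 8 -tcp-bs 1048576 \
    gsiftp://gridftp.tier1.example.ac.uk/data/sc3/file001 \
    gsiftp://gridftp.tier2.example.ac.uk/data/sc3/file001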

38 Summary
  1 UK&I has a strong presence in EGEE and LCG
  2 Our grid management tools are now evolving rapidly
  3 Grid utilisation is improving - we are starting to look at the dynamics
  4 Growing focus areas include support and interoperation (and gLite!)
  5 There is a lot of work not covered here - fabric, security, networking, ...
  6 Come and visit the GridPP (PPARC) and CCLRC stands!

39 gLite vs LCG-2 components
  [Diagram: a site running LCG-2 and gLite services side by side - shared VOMS, MyProxy and SRM-SE; the LFC (LCG) vs FIREMAN (gLite) catalogues; the LCG RB vs the gLite WLM; the LCG CE vs the gLite-CE; gLite-IO; FTS on both sides; R-GMA and BDII; APEL and DGAS accounting; plus UIs and WNs]
  Data from LCG is owned by VO and role; the gLite-IO service owns gLite data
  FTS for LCG uses the user proxy; gLite uses a service certificate
  R-GMAs can be merged (security ON)
  The CEs use the same batch system
  Independent information systems; independent catalogues and access control

