Operating the LCG and EGEE Production Grid for HEP


1 Operating the LCG and EGEE Production Grid for HEP
Ian Bird
IT Department, CERN
LCG Deployment Area Manager & EGEE Operations Manager
CHEP'04, 28th September 2004
EGEE is a project funded by the European Union under contract IST

2 LCG Operations in 2004
Goal: deploy & operate a prototype LHC computing environment
Scope:
  Integrate a set of middleware and coordinate and support its deployment to the regional centres
  Provide operational services to enable running as a production-quality service
  Provide assistance to the experiments in integrating their software and deploying in LCG; provide direct user support
Deployment goals for LCG-2:
  Production service for the Data Challenges in 2004
  Experience in close collaboration between the Regional Centres
  Learn how to maintain and operate a global grid
  Focus on building a production-quality service
  Understand how LCG can be integrated into the sites' physics computing services
Set up the EGEE project and migrate the existing structure towards the EGEE structure
  By design, the LCG and EGEE services and operations teams are the same

3 LCG – from certification to production
Some history:
  March 2003: LCG-0 – existing middleware, waiting for the EDG-2 release
  September 2003: LCG-1 – 3 months late -> reduced functionality; extensive certification process -> improved stability (RB, information system); integrated 32 sites, ~300 CPUs; first use for production
  December 2003: LCG-2 – full set of functionality for the DCs, first MSS integration; deployed in January to 8 core sites; DCs started in February -> testing in production; large sites integrate resources into LCG (MSS and farms); introduced a pre-production service for the experiments; alternative packaging (tool-based and generic installation guides)
  May 2004 -> now: monthly incremental releases – not all releases are distributed to external sites; improved services, functionality, stability and packaging step by step; timely response to experiences from the data challenges
The formal certification process has been invaluable
The process to stabilise existing middleware and put it in production is expensive: testbeds, people, time
Now have monthly incremental middleware releases – not all are deployed
Expanding now with a pre-production service

4 LCG-2 Software
LCG-2 core packages:
  VDT (Globus 2, Condor)
  EDG WP1 (Resource Broker, job submission tools)
  EDG WP2 (replica management tools) + lcg tools – one central RMC and LRC for each VO, located at CERN, Oracle backend
  Several bits from other WPs (config objects, information providers, packaging, …)
  GLUE 1.1 (information schema) + a few essential LCG extensions
  MDS-based information system with significant LCG enhancements (replacements, simplified; see poster)
  Mechanism for application (experiment) software distribution
Almost all components have gone through some re-engineering: robustness, scalability, efficiency, adaptation to local fabrics
The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture)
Data management is still far from perfect
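To make the GLUE 1.1 information system concrete, here is a minimal query sketch using the third-party python-ldap package. The endpoint host is hypothetical; port 2170 and the base DN "mds-vo-name=local,o=grid" are assumptions based on common LCG-2 BDII practice, and the attribute names are standard GLUE 1.1 computing-element attributes.

```python
# A minimal sketch, assuming a reachable LCG-style BDII/MDS endpoint:
# query GLUE 1.1 computing-element records over LDAP.
import ldap

BDII_URI = "ldap://lcg-bdii.example.org:2170"   # hypothetical endpoint
BASE_DN = "mds-vo-name=local,o=grid"            # typical LCG-2 base DN

conn = ldap.initialize(BDII_URI)
entries = conn.search_s(
    BASE_DN,
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",                     # computing-element entries
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)

for dn, attrs in entries:
    ce = attrs["GlueCEUniqueID"][0].decode()
    free = int(attrs.get("GlueCEStateFreeCPUs", [b"0"])[0].decode())
    waiting = int(attrs.get("GlueCEStateWaitingJobs", [b"0"])[0].decode())
    print(f"{ce}: {free} free CPUs, {waiting} waiting jobs")
```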

5 LCG-2/EGEE-0 Status, 24-09-2004
Total: 78 sites, ~9000 CPUs, 6.5 PByte
(map of participating sites, including Cyprus)

6 Experiences in deployment
LCG now covers many sites (>70) – both large and small
  Large sites – existing infrastructures – need to add on grid interfaces etc.
  Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
  Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures; a lot of effort had to be invested in this area
There are many problems – but in the end we are quite successful
  The system is reasonably stable
  The system is used in production
  The system is reasonably easy to install – now ~80 sites
We now have a basis on which to incrementally build essential functionality, and from which to measure improvements
This infrastructure now also forms the EGEE production service

7 Operations services for LCG – 2004
Deployment and operational support
  Hierarchical model: CERN acts as 1st-level support for the Tier 1 centres; Tier 1 centres provide 1st-level support for associated Tier 2s
  "Tier 1 sites" -> "primary sites"
Grid Operations Centres (GOC)
  Provide operational monitoring, troubleshooting, coordination of incident response, etc.
  RAL (UK) led a sub-project to prototype a GOC
  Operations support from the CERN team, the GOC, and Taipei, with many individual contributions on the mailing list
User support
  Central model: FZK provides the user support portal
  Problem tracking system, web-based and available to all LCG participants
  Experiments provide triage of problems
  The CERN team provides in-depth support and support for integration of experiment software with the grid middleware

8 Experiences during the data challenges

9 Data Challenges
Large-scale production effort of the LHC experiments:
  test and validate the computing models
  produce needed simulated data
  test the experiments' production frameworks and software
  test the provided grid middleware
  test the services provided by LCG-2
All experiments used LCG-2 for all or part of their productions

10 Data Challenges – ALICE
Phase I
  120k Pb+Pb events produced in 56k jobs
  1.3 million files (26 TByte) in total
  Total CPU: 285 MSI2k hours (one 2.8 GHz PC working for 35 years)
  ~25% produced on LCG-2
Phase II (underway)
  1 million jobs, 10 TB produced, 200 TB transferred, ,500 MSI2k hours CPU
  ~15% on LCG-2
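A back-of-the-envelope check of the "2.8 GHz PC working 35 years" comparison. The benchmark rating assumed for such a PC (~0.93 kSI2k) is not from the slide; it is a typical value for the period, chosen so the arithmetic reproduces the quoted 35 years.

```python
# Sanity check of the Phase I CPU figure (285 MSI2k hours) expressed as
# the running time of a single 2.8 GHz PC.
msi2k_hours = 285                  # total Phase I CPU, from the slide
pc_rating_ksi2k = 0.93             # assumed SPECint2000 rating of one 2.8 GHz PC
hours_on_one_pc = msi2k_hours * 1000 / pc_rating_ksi2k
years = hours_on_one_pc / (24 * 365.25)
print(f"~{years:.0f} years on a single 2.8 GHz PC")   # ~35
```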

11 Data Challenges – ATLAS
Phase I
  7.7 million events fully simulated (Geant 4) in jobs
  22 TByte
  Total CPU: 972 MSI2k hours
  >40% produced on LCG-2 (used LCG-2, Grid3, NorduGrid)

12 Data Challenges – CMS
~30 M events produced
25 Hz reached (only once for a full day)
RLS, Castor, control systems, T1 storage, …
Not a CPU challenge, but a full-chain demonstration
Pre-challenge production in 2003/04:
  70 M Monte Carlo events (30 M with Geant 4) produced
  Classic and grid (CMS/LCG-0, LCG-1, Grid3) productions
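A quick way to read the 25 Hz figure: at a sustained 25 Hz the challenge would produce roughly 2 million events per day, so ~30 M events correspond to about two weeks of such running. Only the 25 Hz and ~30 M numbers come from the slide; everything else is derived below.

```python
# Derived arithmetic only: events per day at a sustained 25 Hz, and how
# many such days the ~30 M events of the challenge would represent.
rate_hz = 25
events_per_day = rate_hz * 24 * 3600
print(f"{events_per_day:,} events/day at 25 Hz")                              # ~2.16 M
print(f"~{30_000_000 / events_per_day:.0f} days at 25 Hz for ~30 M events")   # ~14
```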

13 Data Challenges – LHCb
Phase I
  186 M events, 61 TByte
  Total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites)
  Up to 5600 concurrent running jobs in LCG-2
  This is 5-6 times what was possible at CERN alone
(plot of jobs/day: DIRAC alone, then LCG in action, LCG paused, LCG restarted)

14 Data challenges – summary
Probably the first time such a set of large-scale grid productions has been done
Significant efforts invested on all sides – very fruitful collaborations
Unfortunately, the DCs were the first time the LCG-2 system had been used
  Adaptations were essential – adapting experiment software to middleware and vice versa – as limitations/capabilities were exposed
  Many problems were recognised and addressed during the challenges
A systematic confrontation of the functional problems with experiment requirements has recently been made (GAG)
The middleware is actually quite stable now
  But job efficiency is not high – for many reasons (see below)
Started to see some basic underlying issues:
  of implementation (lack of error handling, scalability, etc.)
  of underlying models (workload management)
  perhaps also of fabric services – batch systems?
But the single largest issue is the lack of stable operations

15 Problems during the data challenges
Common functional issues seen by all experiments:
  Sites suffering from configuration and operational problems
    inadequate resources at some sites (hardware, human, …)
    this is now the main source of failures
  Load balancing between different sites is problematic
    jobs can be "attracted" to sites that do not have adequate resources
    modern batch systems are too complex and dynamic to summarise their behaviour in a few values in the IS
  Identification of problems in LCG-2 is difficult
    distributed environment, access to many log files needed, …
    status of monitoring tools
  Handling thousands of jobs is time consuming and tedious
    support for bulk operations is not adequate
  Performance and scalability of services
    storage (access and number of files), job submission, information system, file catalogues
  Services suffered from hardware problems
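To illustrate the load-balancing point, here is a deliberately naive toy matchmaker, not real broker code: it ranks sites purely on an advertised "free CPUs" value, as a coarse information-system snapshot would suggest, so a site publishing a stale or misconfigured number attracts every job. Site names and numbers are invented for the example.

```python
# Toy illustration of a "black hole": naive matchmaking on a single
# published value sends all work to the site with the bogus figure.
sites = {
    "good-site.example.org":   {"advertised_free_cpus": 12,  "healthy": True},
    "broken-site.example.org": {"advertised_free_cpus": 400, "healthy": False},  # stale value
}

def match(site_table):
    # Naive ranking: the highest advertised FreeCPUs wins.
    return max(site_table, key=lambda s: site_table[s]["advertised_free_cpus"])

for job in range(3):
    chosen = match(sites)
    outcome = "ran" if sites[chosen]["healthy"] else "failed or queued forever"
    print(f"job {job} -> {chosen}: {outcome}")
```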

16 Configuration and stability problems
This is the largest source of problems
Many are "well-known" fabric problems:
  batch systems that cause "black holes"
  NFS problems
  clock skew at a site
  software not installed or configured correctly
  lack of configuration management – fixed problems reappear
  firewall issues – often less than optimal coordination between grid admins and firewall maintainers
Others are due to lack of experience
  Many grid sites have not run such services before and do not have procedures, tools, diagnostics
  Not limited to small sites
Lack of support
  Maintaining stable operation is still labour intensive – requires adequate operations staff trained in grid management
  Slow response – problems are reported daily, but may last for weeks
  No vacations …
Experiments expect 24x365 stable operation
The grid successfully integrates these problems from 80 sites
Building a stable operation is the highest priority – this is what EGEE is funded to do
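As an example of the kind of fabric sanity check implied by the "clock skew" item (a skewed clock makes short-lived grid proxy certificates appear not yet valid or already expired), here is a small sketch that measures the local offset against an NTP server. It assumes the third-party ntplib package; the server and the 60-second threshold are arbitrary choices.

```python
# Measure local clock skew against NTP, one of the simple site checks
# that prevents hard-to-diagnose grid authentication failures.
import ntplib

MAX_SKEW_SECONDS = 60.0            # illustrative threshold

response = ntplib.NTPClient().request("pool.ntp.org", version=3)
skew = abs(response.offset)        # seconds between local and NTP time
if skew > MAX_SKEW_SECONDS:
    print(f"WARNING: clock skew {skew:.1f} s, grid authentication may fail")
else:
    print(f"clock skew {skew:.1f} s, within tolerance")
```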

17 EGEE and Evolving the Operations model

18 EGEE
(diagram: applications on top of the grid infrastructure and the GÉANT network)
Goal
  Create a Europe-wide production-quality grid infrastructure on top of present regional grid programmes
  Despite its name, the project has a worldwide scope
  Multi-science project
Scale
  70 leading institutes in 27 countries
  ~300 FTEs
  Aim: 20,000 CPUs
  Initially a 2-year project
Activities
  48% service activities (operation, support)
  24% middleware re-engineering
  28% management, training, dissemination, international cooperation
Builds on
  LCG to establish a grid operations service – a single team for deployment and operations
  Experience gained from running services for the LHC experiments
  HEP experiments are the pilot application for EGEE, together with biomedical

19 LCG and EGEE Operations
EGEE is funded to operate and support a research grid infrastructure in Europe
The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
  LCG includes US and Asia-Pacific sites; EGEE includes other sciences
  A substantial part of the infrastructure is common to both
  The LCG Deployment Manager is the EGEE Operations Manager
The CERN team (Operations Management Centre) provides coordination, management, and 2nd-level support
Support activities are expanded with the provision of
  Core Infrastructure Centres (CIC) (4)
  Regional Operations Centres (ROC) (9)
  The ROCs are coordinated by Italy, outside of CERN (which has no ROC)

20 Operations: LCG -> EGEE in Europe
User support: becomes hierarchical
  Through the Regional Operations Centres (ROC)
  ROCs act as front-line support for user and operations issues
  Provide local knowledge and adaptations
Coordination: at CERN (Operations Management Centre) and CIC for HEP-LHC
Operational support
  The LCG GOC is the model for the EGEE CICs
  CICs replace the European GOC at RAL
  Also run essential infrastructure services
  Provide support for other (non-LHC) applications
  Provide 2nd-level support to the ROCs

21 The Regional Operations Centres
The ROC organisation is the focus of EGEE operations activities:
  Coordinate and support deployment
  Coordinate and support operations
  Coordinate Resource Centre management
  Negotiate application access to resources within the region
  Coordinate planning and reporting within the region
  Negotiate and monitor SLAs within the region
Teams:
  Deployment team
  24-hour support team (answers user and Resource Centre problems)
  Operations training at the Resource Centres
  Organise tutorials for users
The ROC is the first point of contact for all:
  New sites joining the grid, and support for them
  New users and user support

22 Core Infrastructure Centres
"Grid Operations Centres" – behaving as a single organisation
Operate infrastructure services, e.g.:
  VO services: VO servers, VO registration service
  RBs, UIs, information services
  RLS and other database services
  Ensure recovery procedures and fail-over (between CICs)
Act as Grid Operations Centres
  Monitoring, proactive troubleshooting
  Performance monitoring
  Control sites' participation in the production service
  Use the work done at RAL for the LCG GOC as a starting point
Support to ROCs for operational problems
Operational configuration management and change control
Accounting and resource usage/availability monitoring
Take responsibility for operational "control" (tbd) – rotates through the 4 CICs

23 Future activities

24 Future activities
All experiments expect to have significant ongoing productions for the foreseeable future
  Some will also have their next data challenges one year from now
LCG will run a series of "service challenges"
  Complementary to the data challenges / ongoing productions
  Demonstrate essential service-level issues (e.g. Tier 0 – Tier 1 reliable data transfer)
Essential that we are able to build a manageable production service
  Based on the existing infrastructure
  Reasonable improvements
In parallel, build a "pre-production" service where:
  New middleware (gLite, …) can be demonstrated and validated before being deployed in production
  We understand the migration strategy to 2nd-generation middleware
  Use the existing production service as the baseline comparison
It takes a long time to make new software production quality
Must be careful not to step backwards – even though what we have is far from perfect

25 What next? -> Service challenges
Proposed to be used in addition to ongoing data challenges and production use
Goal is to ensure baseline services can be demonstrated
  Demonstrate the resolution of the problems mentioned earlier
  Demonstrate that operational and emergency procedures are in place
4 areas proposed:
  Reliable data transfer – demonstrate the fundamental service for Tier 0 -> Tier 1 by end 2004
  Job flooding/exerciser – understand the limitations and baseline performance of the system; may be achieved by the ongoing real productions
  Incident response – ensure the procedures are in place and work, before real life tests them
  Interoperability – how can we bring together the different grid infrastructures?

26 Issues
Operational management
  How much control can/should be assumed by an operations centre?
    Small sites with little support – can GOCs restart services?
    More intelligence in the services to recognise problems
  Strong organisation needed to take operational responsibility
    Ensure that problems are addressed, traced, reported
    Need site management to take responsibility
    Ensure that an operational security group is in place with good communications
  Simplify service configurations – to avoid mistakes
Weight of VOs
  EGEE has many VOs (most still national in scope)
  Deploying a VO is very heavyweight – must become much simpler

27 Summary
Data challenges have been running for 8 months
  Major effort and collaboration between all involved
Distributed operations in such a large system are hard
  Requires significant effort – EGEE will help here
Many lessons have been learned
  Essential that 2nd-generation middleware takes account of all these issues
  Not just functionality, but manageability, scalability, accountability, robustness -> operational requirements are important requirements for users too …
Now moving to a phase of sustained, continuous operation
  While building a parallel service to validate next-generation middleware
We have come a long way in the last few months
There is still much to be done

28 Related papers
Distributed Computing Services track
  Evolution of Data Management in LCG-2 (278); Jean-Philippe Baud
Distributed Computing Systems and Experiences track
  Deploying and Operating LCG-2 (389); Markus Schulz
  Many other papers on experience in using LCG-2
Poster Session 2
  Several papers on LCG-2 – all aspects: certification, usage, information systems, integration/certification

