LCG and EGEE Operations
Markus Schulz, IT-GD, CERN
markus.schulz@cern.ch
OSG Operations Workshop, CERN IT-GD, December 2004
EGEE is a project funded by the European Union under contract IST-2003-508833.
Outline
- LCG software
- EGEE
- History of the LCG production service
- Impact of the Data Challenges on operations: problems
- Operating LCG: preparing releases; support (how it was planned, how it was done)
- Summary of the operations workshop at CERN: new structure
- Summary
- Interoperation (status by L. Field)
EGEE in a nutshell
- Goal: create a Europe-wide production-quality grid infrastructure on top of the present regional grid programmes
  - despite its name, the project has a worldwide scope
  - multi-science project
- Scale: 70 leading institutes in 27 countries, ~300 FTEs; aim: 20,000 CPUs; initially a 2-year project
- Activities: 48% service activities (operation, support), 24% middleware re-engineering, 28% management, training, dissemination, international cooperation
- Builds on LCG to establish a grid operations service
  - joint team for deployment and operations
  - experience gained from running services for the LHC experiments
  - HEP experiments are the pilot application for EGEE
EGEE Middleware
- New design driven by the requirements of the experiments, the biomedical community and operations (strong multi-science aspect)
  - process includes partners from the EU and the USA
  - involves experienced middleware providers from AliEn, EDG and VDT
  - monthly meetings in the EU and the USA
- Prototyping approach, as required by ARDA
  - allows rapid release cycles and fast feedback from early adopters
- Formal integration and testing mechanisms driven from CERN
  - should ensure quality and coherence among the developments coming from distributed teams
  - includes a formal defect tracking system
- First "stabilized" version to be available by the end of the year
  - initial prototype available since May 2004, currently with 2 releases/month addressing user and testing feedback
- Target is to deploy components onto the LCG preproduction service as soon as possible
The LCG Project (and what it isn't)
- Mission: to prepare, deploy and operate the computing environment for the experiments to analyze the data from the LHC detectors
- Two phases:
  - Phase 1 (2002-2005): build a prototype based on existing grid middleware; deploy and run a production service; produce the Technical Design Report for the final system
  - Phase 2 (2006-2008): build and commission the initial LHC computing environment
- LCG is NOT a development project for middleware, but problem fixing is permitted (even if writing code is required)
- LCG-2 is the first production service for EGEE
- Ian Bird is Operations Officer for both projects
LCG-2 Software
- LCG-2 core packages:
  - VDT (Globus 2, Condor)
  - EDG: Resource Broker, job submission tools
  - Replica Management tools + lcg tools: one central RMC and LRC per VO, located at CERN, with an Oracle backend
  - SRM- and GridFTP-based access to MSS (Castor, dCache)
  - several bits from other WPs (config objects, information providers, packaging, ...)
  - GLUE 1.1 information schema plus a few essential LCG extensions; MDS-based information system with significant LCG enhancements (replacements, simplification, scalability, LCG-BDII); a query sketch follows below
  - mechanism for application (experiment) software distribution
  - VOMS (in preparation)
- Almost all components have gone through some re-engineering: robustness, scalability, efficiency, adaptation to local fabrics
- The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture)
Experience
- January 2003: GDB agreed to take VDT and EDG components
- September 2003: LCG-1
  - extensive certification process
  - integrated 32 sites, ~300 CPUs
  - first use for production
- December 2003: LCG-2
  - deployed in January to 8 core sites
  - introduced a pre-production service for the experiments
  - alternative packaging (tool-based and generic installation guides)
- May 2004 -> now: monthly incremental releases (not all distributed)
  - driven by the experience from the data challenges
  - balance between stable operation and improved versions (driven by users)
  - 2-1-0, 2-1-1, 2-2-0, 2-3-0; production services (RBs + BDIIs) patched on demand
- >90 sites, ~9300 CPUs (3-5 sites failed to come online)
Adding Sites
- Sites contact the GD group or a Regional Operation Center
- Sites go to the release page and decide on manual or tool-based installation
- Sites provide security and contact information; GD forwards this to the GOC and the security officer
- >200 pages of documentation and FAQs are available
- Sites install and use the provided tests for debugging
  - large sites integrate their local batch system
  - support from the ROCs or CERN
- CERN GD certifies sites and adds them to the monitoring and information system
  - sites are re-certified daily and problems are traced in Savannah
- Experiments install their software and add the site to their IS
- Adding new sites is now a quite smooth process: it takes between a few days and a few weeks; it has worked 90+ times and failed 3-5 times
Adding a Site (sequence diagram between the new site, the Deployment Team/ROC, the LCG Security Group and the Grid Operations Center): request to join; initial advice and installation guides; filled site contact form and acknowledgement; exchange of security contact information and security policies; site added to and registered with the GOC-DB; installation and local tests with support; status and problem reports; request for certification, corrections, and report on the certification result; site added to the information system and to the monitoring/map.
LCG-2 Status (18/11/2004)
- Total: 90 sites, ~9500 CPUs, ~6.5 PByte (site map, including Cyprus)
- New interested sites should look at the release page
Preparing a Release
- Monthly process:
  - gathering of new material, prioritization
  - integration of the items on the list
  - deployment on testbeds, first tests, feedback
  - release to the EIS testbed for experiment validation
  - full testing (functional and stress), feedback to patch/component providers, final list of new components
  - internal release (LCFGng)
- On demand:
  - preparation/update of release notes for LCFGng
  - preparation/update of the manual installation documentation
  - test installations on GIS testbeds
  - announcement on the LCG-Rollout list
- Abbreviations: C&T = Certification & Testing, EIS = Experiment Integration Support, GIS = Grid Infrastructure Support, GDB = Grid Deployment Board
Preparing a Release: Initial List, Prioritization, Integration, EIS, Stress Test
Bugs, patches and tasks are tracked in Savannah; wishes arrive by e-mail from C&T, EIS, GIS, developers and applications; changes are recorded in LCFGng and the change record.
1. Wish list for the next release (from developers, applications, sites, C&T, EIS, GIS)
2. Prioritization and selection by the Head of Deployment: list for the next release (can be empty)
3. Integration and first tests (C&T)
4. Internal release
5. Deployment on the EIS testbed (EIS)
6. Full deployment on test clusters, functional/stress tests, ~1 week (C&T)
7. Final internal release
Preparing a Release: Preparations for Distribution, Upgrading
8. In parallel (synchronized): update user guides (EIS); update release notes, finalize the LCFGng configuration and prepare the manual installation guide (GIS)
9. Resulting documents: release notes, installation guides, user guides
10. LCFGng install test and manual install test (GIS)
11. Release
12. Announce the release on the LCG-Rollout list (GIS)
13. Sites upgrade/install at their own pace
14. Re-certification (GIS); certification is run daily
Process Experience
- The process was decisive in improving the quality of the middleware
- The process is time consuming
  - there are many sequential operations
  - the format of the internal and external releases will be unified
  - multiple packaging formats slow down release preparation: tool-based (LCFGng) and manual (+ tarball-based)
- All components are treated equally: the same level of testing for core components and non-vital tools
  - a special process is needed for accepting tools already in use by other projects
- The process of including new components is not sufficiently transparent
- Picking a good time for a new release is difficult: conflict between users (NOW) and sites (planned)
- Upgrading has proven to be a high-risk operation: some sites suffered from acute configuration amnesia
- The process was one of the topics in the "LCG Operations Workshop"
Impact of Data Challenges
- Large-scale production effort of the LHC experiments
  - test and validate the computing models
  - produce needed simulated data
  - test the experiments' production frameworks and software
  - test the provided grid middleware
  - test the services provided by LCG-2
- All experiments used LCG-2 for part of their production
Data Challenges (ALICE, ATLAS)
- ALICE Phase I: 120k Pb+Pb events produced in 56k jobs; 1.3 million files (26 TByte) in Castor@CERN; total CPU: 285 MSI2k hours (a 2.8 GHz PC working 35 years; a worked check follows below); ~25% produced on LCG-2
- ALICE Phase II (underway): 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours of CPU; ~15% on LCG-2
- ATLAS Phase I: 7.7 million events fully simulated (Geant4) in 95,000 jobs, 22 TByte; total CPU: 972 MSI2k hours; >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)
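A quick worked check of the "2.8 GHz PC working 35 years" figure, under the assumption (not stated on the slide) that a 2.8 GHz PC of that era is rated at roughly 0.9-1.0 kSI2k:

```latex
% assuming a 2.8 GHz PC is rated at roughly 1 kSI2k (assumption, not from the slide)
\frac{285\ \text{MSI2k}\cdot\text{h}}{\sim 1\ \text{kSI2k}}
  \approx 2.9 \times 10^{5}\ \text{h} \approx 33\ \text{years}
```

which is consistent with the quoted 35 years for a rating slightly below 1 kSI2k.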
Data Challenges (CMS)
- ~30 M events produced
- 25 Hz reached (only once for a full day)
Data Challenges (LHCb)
- Production-rate plot annotations: DIRAC alone; LCG in action at ~1.8x10^6 events/day; LCG paused; LCG restarted at ~3-5x10^6 events/day
- Phase I: 186 M events, 61 TByte; total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites)
- Up to 5600 concurrently running jobs in LCG-2
Problems during the data challenges (DC summary)
- All experiments encountered similar problems on LCG-2
- LCG sites suffering from configuration and operational problems
  - inadequate resources on some sites (hardware, human, ...)
  - this is now the main source of failures
- Load balancing between different sites is problematic (see the sketch after this list)
  - jobs can be "attracted" to sites that do not have adequate resources
  - modern batch systems are too complex and dynamic to summarize their behaviour in a few values in the IS
- Identification and location of problems in LCG-2 is difficult
  - distributed environment; access to many logfiles needed (but hard)
  - status of monitoring tools
- Handling thousands of jobs is time consuming and tedious; support for bulk operations is not adequate
- Performance and scalability of services: storage (access and number of files), job submission, information system, file catalogues
- Services suffered from hardware problems (no failover; a design problem)
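To make the "attraction" effect above concrete, here is a purely illustrative Python sketch. It is not the Resource Broker's actual matchmaking; the CE names, numbers and ranking formula are invented. The point is that when every scheduling decision ranks sites on the same few, infrequently refreshed information-system values, all jobs converge on the same "best" site until the next refresh corrects the picture.

```python
# Illustrative sketch (not LCG code): why ranking jobs on a few coarse,
# periodically refreshed information-system values can "attract" too many
# jobs to one site. CE names and numbers are invented for the example.

# Snapshot of GLUE-like values a broker might see between IS refreshes.
ce_snapshot = {
    "ce01.site-a.example.org": {"free_cpus": 120, "waiting_jobs": 0},
    "ce02.site-b.example.org": {"free_cpus": 15,  "waiting_jobs": 4},
    "ce03.site-c.example.org": {"free_cpus": 60,  "waiting_jobs": 1},
}

def rank(ce_values):
    # Naive rank: prefer many free CPUs and few waiting jobs.
    return ce_values["free_cpus"] - 10 * ce_values["waiting_jobs"]

def match(snapshot):
    # Every decision taken between refreshes sees the same stale snapshot,
    # so it keeps picking the same "best" CE.
    return max(snapshot, key=lambda ce: rank(snapshot[ce]))

submitted = [match(ce_snapshot) for _ in range(1000)]
for ce in ce_snapshot:
    print(ce, submitted.count(ce))
# All 1000 jobs land on ce01.site-a, even though its real capacity may be
# exhausted after the first ~120; the picture is only corrected at the
# next information-system refresh.
```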
Operational issues (selection)
- Slow response from sites: upgrades, response to problems, etc.
  - problems reported daily - some problems last for weeks
- Lack of staff available to fix problems: vacation period, other high-priority tasks
- Various mis-configurations (see next slide)
- Lack of configuration management - problems that were fixed re-appear
- Lack of fabric management (mostly smaller sites): scratch space, single nodes draining queues, incomplete upgrades, ...
- Lack of understanding: admins reformat disks of the SE, ...
- Provided documentation often not (carefully) read
  - new activity to develop adaptive documentation
  - simpler way to install the middleware (YAIM), which opens ways to maintain the middleware remotely in user space
- Firewall issues - often less than optimal coordination between grid admins and firewall maintainers
- OpenPBS problems: scalability, robustness (switching to Torque helps)
Site (mis)configurations
Site mis-configuration was responsible for most of the problems that occurred during the experiments' Data Challenges. A non-exhaustive list of problems:
- The variable VO_SW_DIR points to a non-existent area on the WNs
- The ESM is not allowed to write to the area dedicated to software installation
- Only one certificate allowed to be mapped to the ESM local account
- Wrong information published in the information system (GLUE object classes not linked)
- Queue time limits published in minutes instead of seconds, and not normalized
- /etc/ld.so.conf not properly configured; shared libraries not found
- Machines not synchronized in time
- Grid-mapfiles not properly built
- Pool accounts not created, but the rest of the tools configured with pool accounts
- Firewall issues
- CA files not properly installed
- NFS problems for home directories or ESM areas
- Services configured to use the wrong Information Index (BDII), or none
- Wrong user profiles
- Default user shell environment too big
These problems are only partly related to middleware complexity. Training for system administrators has started, with the first courses organized by EGEE-NA3. In effect, all the common small problems were integrated into ONE BIG PROBLEM. A hedged sanity-check sketch for a few of these items follows below.
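As an illustration of how some of the items above can be caught locally, here is a hedged sketch of a worker-node sanity-check script. It is not an official LCG tool; the variable name, paths, commands and defaults are assumptions chosen for the example.

```python
# Illustrative sketch (not an official LCG tool): a few local sanity checks a
# site admin could run on a worker node, mirroring items from the list above.
# The environment variable name, paths and defaults are assumptions.
import os
import shutil
import subprocess

def check(label, ok):
    print(f"{'OK  ' if ok else 'FAIL'} {label}")
    return ok

def vo_sw_dir_exists(var="VO_SW_DIR"):
    # The experiment software area should exist and be writable.
    path = os.environ.get(var, "")
    return bool(path) and os.path.isdir(path) and os.access(path, os.W_OK)

def ld_so_conf_mentions(lib_dir="/opt/globus/lib", conf="/etc/ld.so.conf"):
    # Middleware shared-library paths should appear in ld.so.conf.
    try:
        with open(conf) as f:
            return any(lib_dir in line for line in f)
    except OSError:
        return False

def clock_roughly_synchronized(ntp_server="pool.ntp.org"):
    # Very rough check: ntpdate in query-only mode (-q) succeeds.
    # A fuller check would parse the reported offset from the output.
    if shutil.which("ntpdate") is None:
        return False
    out = subprocess.run(["ntpdate", "-q", ntp_server],
                         capture_output=True, text=True)
    return out.returncode == 0

def grid_mapfile_nonempty(path="/etc/grid-security/grid-mapfile"):
    return os.path.isfile(path) and os.path.getsize(path) > 0

if __name__ == "__main__":
    check("VO software area exists and is writable", vo_sw_dir_exists())
    check("/etc/ld.so.conf lists middleware library path", ld_so_conf_mentions())
    check("clock roughly synchronized (ntpdate -q)", clock_roughly_synchronized())
    check("grid-mapfile present and non-empty", grid_mapfile_nonempty())
```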
Operating Services for the DCs (DC summary)
- Multiple instances of core services for each of the experiments
  - separates problems, avoids interference between experiments
  - improves availability
  - allows experiments to maintain individual configurations
  - addresses scalability to some degree
- Monitoring tools for services are currently not adequate
  - tools under development to implement a control system
  - moving tools to a common transport and storage format (R-GMA)
- Access to storage via load-balanced interfaces: CASTOR, dCache
- Load balancing for the information-system index service: a load-balanced BDII deployed at CERN
Support during the DCs
User (experiment) support:
- GD at CERN worked very closely with the experiments' production managers
- Informal exchange (e-mail, meetings, phone)
  - "no secrets" approach; GD people on the experiments' mailing lists and vice versa
  - ensured fast response
  - tracking of problems was tedious, but both sides have been patient
  - clear learning curve on BOTH sides
- LCG GGUS (grid user support) at FZK became operational after the start of the DCs
  - due to the importance of the DCs, the experiments are switching to the new service only slowly
- Very good end-user documentation by GD-EIS
- Dedicated testbed for the experiments with the next LCG-2 release
  - rapid feedback, influenced what made it into the next release
Installation and site operations support:
- GD prepared releases and supported sites (certification, re-certification)
- Regional centres supported their local sites (some more, some less)
- Community-style help via the mailing list (high traffic!)
- FAQ lists for troubleshooting and configuration issues: Taipei, RAL
Support during the DCs
Operations service:
- RAL (UK) is leading the sub-project on developing operations services
- Initial prototype: http://www.grid-support.ac.uk/GOC/
  - basic monitoring tools
  - mailing lists for problem resolution
  - working on defining policies for operation and responsibilities (draft document)
  - working on grid-wide accounting (APPLE)
Monitoring:
- GridICE (development of DataTag Nagios-based tools)
- GridPP job submission monitoring: every few hours, all RBs, all sites
- Information system monitoring and consistency check every 5 minutes: http://goc.grid.sinica.edu.tw/gstat/
- CERN GD daily re-certification of sites (including history)
  - escalation procedure
  - tracing of site-specific problems via the problem tracking tool
  - tests core services and configuration
Screen Shots (two slides of monitoring screenshots)
Some More Monitoring (screenshot slide)
Monitoring and Controls
- Many monitoring tools and sources of information are available
  - hard to combine the information to spot problems early
- Split of monitoring into three parts: sensors; transport and storage; display
- Transport and storage based on R-GMA ("monitoring bus")
  - already ported: GIIS monitor, re-certification, job submission, (GridICE, (LB of the RB))
  - general display based on R-GMA
  - building of complex alarms via SQL queries (a hedged sketch follows below)
- Controls: Taipei is building a message system that can be used for interaction with sites
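To illustrate "building complex alarms via SQL queries" over a monitoring bus, here is a hedged sketch. It does not use the real R-GMA API or schema; the table and column names are invented, and an in-memory SQLite database stands in for the R-GMA archiver.

```python
# Illustrative sketch: a "complex alarm" expressed as an SQL query over
# monitoring records, in the spirit of the R-GMA-based monitoring bus
# described above. NOT the R-GMA API; tables, columns and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site_cert (site TEXT, test TEXT, status TEXT, ts INTEGER);
CREATE TABLE job_submission (site TEXT, result TEXT, ts INTEGER);
INSERT INTO site_cert VALUES
  ('site-a', 'replica-mgmt', 'FAIL', 100), ('site-a', 'ce-submit', 'OK', 100),
  ('site-b', 'replica-mgmt', 'OK', 100),   ('site-b', 'ce-submit', 'FAIL', 100);
INSERT INTO job_submission VALUES
  ('site-a', 'ABORTED', 101), ('site-a', 'ABORTED', 102),
  ('site-b', 'DONE', 101),    ('site-b', 'DONE', 102);
""")

# Alarm: sites that both failed a certification test and aborted
# more than half of their recent test jobs.
alarm_query = """
SELECT c.site
FROM site_cert c
JOIN (SELECT site,
             SUM(result = 'ABORTED') * 1.0 / COUNT(*) AS abort_rate
      FROM job_submission GROUP BY site) j ON j.site = c.site
WHERE c.status = 'FAIL' AND j.abort_rate > 0.5
GROUP BY c.site;
"""

for (site,) in conn.execute(alarm_query):
    print(f"ALARM: {site} failing certification and aborting most test jobs")
```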
Problem Handling: PLAN for LCG (diagram)
- Actors: VO A, VO B, VO C; GD CERN; GGUS (Remedy); GOC; primary sites P-Site-1 and P-Site-2, each with secondary sites S-Site-1 and S-Site-2
- Labelled activities: triage (VO / GRID), monitoring/follow-up, escalation
Problem Handling: Community Operation (most cases, diagram)
- Actors: VO A, VO B, VO C; GD CERN; GGUS; GOC; P-Site-1 with secondary sites S-Site-1, S-Site-2, S-Site-3
- Labelled activities: triage, monitoring, FAQs, the Rollout mailing list, certification, follow-up
LCG Workshop on Operational Issues
- Motivation
  - the LCG -> (LCG & EGEE) transition requires changes
  - lessons learned need to be implemented
  - many different activities need to be coordinated
- 02-04 November at CERN, >80 participants, including from GRID3 and NorduGrid
- Agenda:
  - 1.5 days of plenary sessions: describe status and stimulate discussion
  - 1 day of parallel/joint working groups: very concrete work, resulting in task lists with names attached to items
  - 0.5 days of reports from the WGs
LCG Workshop on Operational Issues - WGs I
- Operational security
  - incident handling process
  - variance in site support availability
  - reporting channels
  - service challenges
- Operational support
  - workflow for operations & security actions
  - what tools are needed to implement the model
  - "24x7" global support: sharing the operational load (taking turns)
  - communication
  - problem tracking system
  - defining responsibilities: problem follow-up, deployment of new releases
  - interface to user support
LCG Workshop on Operational Issues - WGs II
- Fabric management
  - system installations
  - batch/scheduling systems
  - fabric monitoring
  - software installation
  - representation of site status (load) in the information system
- Software management
  - operations on and for VOs (add/remove/service discovery)
  - fault tolerance, operations on running services (stops, upgrades, re-starts)
  - link to developers
  - what level of intrusion can be tolerated on the WNs (farm nodes)
  - application (experiment) software installation
  - removing/(re-adding) sites with (fixed) troubles
  - multiple views in the information system (maintenance)
LCG Workshop on Operational Issues - WGs III
- User support
  - defining what "user support" means
  - models for implementing a working user support
    - need for a Central User Support Coordination Team (CUSC): mandate and tasks
    - distributed/central (CUSC/RUSC) workflow
  - VO support
    - continuous support on integrating the VOs' software with the middleware
    - end-user documentation
    - FAQs
LCG Workshop on Operational Issues - Summary
- Very productive workshop
- Partners (sites) assumed responsibility for tasks
- Discussions very much focused on practical matters
- Some problems call for architectural changes; gLite has to address these
- It became clear that not all sites are created equal
- Removing troubled sites is inherently problematic: removing storage can have grid-wide impact
- The key issue in all aspects is to define the split between local, regional and central control and responsibility
- All WGs discussed communication
New Operations Model - EGEE Structure
- OMC: Operations Management Center
- CICs: Core Infrastructure Centers
  - services like file catalogues, RBs, "central infrastructure", operation support
  - CERN, France, Italy, UK, (Russia, Taipei)
- ROCs: Regional Operation Centers
  - regional support
  - France, Italy, UK+Ireland, Germany+Switzerland, N-Europe, SW-Europe, Central Europe, Russia
- RCs: Resource Centers - data and CPUs
New Operations Model
- The Operations Center role rotates through the CICs: CIC on duty for one week
- Procedures and tasks are currently being defined
  - a first operations manual is available (living document): tools, frequency of checks, escalation procedures, hand-over procedures
  - CIC-on-duty website
- Problems are tracked with a tracking tool
  - now central in Savannah
  - migration to GGUS (Remedy) with links to the ROCs' PT tools
  - problems can be added at GGUS or ROC level
- CICs: monitor the service, spot and track problems
  - interact with sites on short-term problems (service restarts, etc.)
  - interact with ROCs on longer, non-trivial problems
  - all communication with a site is visible to the ROC
  - build FAQs
- ROCs: support installation and first certification, resolve complex problems
New Operations Model (diagram): hierarchy of OMC, CICs, ROCs and RCs, with links to the RCs of other grids
Summary
- LCG-2 services have been supporting the data challenges
- Many middleware problems have been found - many addressed
- The middleware itself is reasonably stable
- The biggest outstanding issues are related to providing and maintaining stable operations
- Future middleware has to take this into account:
  - it must be more manageable, trivial to configure and install
  - management and monitoring must be built into the services from the start
- The Operational Workshop has started many activities
  - follow-up and keeping up the momentum is now essential
  - indicates a clear shift away from the "CERNtralized" operation
  - CIC on duty is a first step towards distributing the operational load