EGEE is a project funded by the European Union under contract IST-2003-508833

“LCG2 Operational Experience and Status”
Markus Schulz, IT-GD, CERN
LHC Review, CERN, November 2004

Outline
- Building LCG-2
- Impact of the Data Challenges on operations
- Problems
- Operating LCG: how it was planned, how it was done
- Summary of the operations workshop at CERN

Experience
- Jan 2003: GDB agreed to take VDT and EDG components
- September 2003: LCG-1
  - Extensive certification process
  - Integrated 32 sites, ~300 CPUs; first use for production
- December 2003: LCG-2
  - Deployed in January to 8 core sites
  - Introduced a pre-production service for the experiments
  - Alternative packaging (tool-based and generic installation guides)
- May 2004 -> now: monthly incremental releases (not all distributed)
  - Driven by the experience from the data challenges
  - Balance between stable operation and improved versions (driven by users)
  - 2-1-0, 2-1-1, 2-2-0, (2-3-0)
  - Production services (RBs + BDIIs) patched on demand
  - > 80 sites (3-5 failed)

Adding Sites
- Sites contact the GD Group or a Regional Operation Centre (ROC)
- Sites go to the release page
- Sites decide on manual or tool-based installation
  - Sites provide security and contact information
  - GD forwards this to the GOC and the security officer
- Sites install and use the provided tests for debugging
  - large sites integrate their local batch system
  - support from ROCs or CERN
- CERN GD certifies sites
  - adds them to the monitoring and information system
  - sites are re-certified daily and problems are traced in Savannah
- Experiments install their software and add the site to their IS
- Adding new sites is now a quite smooth process: worked 80+ times, failed 3-5 times

Adding a Site (workflow between New Site, Deployment Team/ROC, LCG Security Group, and Grid Operations Centre)
1. The site sends a request to join; the DT/ROC replies with initial advice and the installation guides
2. The site returns the filled site contact form; the DT/ROC acknowledges and forwards the security contact information to the Security Group, which sends back the security policies
3. The Security Group adds the site to the GOC-DB; the site registers with the GOC-DB
4. The site installs and runs the local tests, reporting status and problems; the DT/ROC provides support and advice
5. The site asks for certification; the DT/ROC certifies, reports on the certification result, and the site corrects remaining problems
6. The GOC adds the site to the information system and to the monitoring/map
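The daily re-certification and Savannah tracing mentioned above are easy to picture as a small loop. A minimal sketch in Python, where everything except the workflow itself is hypothetical: sites.txt, cert-tests.sh and the ticket helper are placeholders, not the actual GD tooling.

```python
#!/usr/bin/env python
"""Minimal sketch of a daily site re-certification loop.

Assumptions (hypothetical, for illustration only):
- sites.txt lists one CE hostname per line
- cert-tests.sh stands in for the test suite GD ran against a site
- open_savannah_ticket() stands in for tracing problems in Savannah
"""
import subprocess

def run_cert_tests(ce_host):
    """Run the certification test suite against one site's CE.
    Returns (passed, log); here we just shell out to a placeholder script."""
    proc = subprocess.run(["./cert-tests.sh", ce_host],
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def open_savannah_ticket(ce_host, log):
    """Placeholder for creating/updating an entry in the problem tracker."""
    print("Would open Savannah ticket for %s:\n%s" % (ce_host, log[:500]))

def main():
    with open("sites.txt") as f:
        sites = [line.strip() for line in f if line.strip()]
    for ce in sites:
        passed, log = run_cert_tests(ce)
        print("%-40s %s" % (ce, "OK" if passed else "FAILED"))
        if not passed:
            open_savannah_ticket(ce, log)

if __name__ == "__main__":
    main()
```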

LCG-2 Status (18/11/2004)
Total: 91 sites, ~9500 CPUs, ~6.5 PByte; Cyprus newly added.
New interested sites should look at the release page.

Preparing a Release
Monthly process:
- Gathering of new material
- Prioritization
- Integration of the items on the list
- Deployment on testbeds
- First tests, feedback
- Release to the EIS testbed for experiment validation
- Full testing (functional and stress); feedback to patch/component providers; final list of new components
- Internal release (LCFGng)
On demand:
- Preparation/update of the release notes for LCFGng
- Preparation/update of the manual installation documentation
- Test installations on GIS testbeds
- Announcement on the LCG-Rollout list
Actors: C&T (Certification & Testing), EIS (Experiment Integration Support), GIS (Grid Infrastructure Support), GDB (Grid Deployment Board), Applications, Sites

Preparing a Release: initial list, prioritization, integration, EIS, stress test
1. C&T/EIS/GIS collect a wish list for the next release from developers and applications (bugs/patches/tasks in Savannah)
2. The Head of Deployment does prioritization & selection, producing the list for the next release (can be empty)
3. C&T performs integration & first tests
4. Internal release (LCFGng, with change record)
5. EIS deploys it on the EIS testbed
6. C&T does a full deployment on the test clusters with functional/stress tests (~1 week)
7. Final internal release

Preparing a Release: preparations for distribution and upgrading
8. Synchronize: GIS updates the release notes, finalizes the LCFGng configuration and prepares the manual installation guide; EIS updates the user guides
9. GIS runs the LCFGng install test and the manual install test
10. Release: release notes, installation guides, user guides
11. GIS announces the release on the LCG-Rollout list
12. Sites upgrade and install at their own pace
13. GIS re-certifies the upgraded sites; certification is run daily
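Since the sequence above is strictly linear, it can be encoded as a simple checklist. A toy sketch whose step names are taken from the slides, while the data structure and printout are purely illustrative:

```python
# Toy encoding of the monthly release pipeline described above.
# Step names come from the slides; the structure itself is illustrative.
RELEASE_STEPS = [
    ("collect wish list (Savannah bugs/patches/tasks)",        "C&T/EIS/GIS"),
    ("prioritization & selection",                             "Head of Deployment"),
    ("integration & first tests",                              "C&T"),
    ("internal release (LCFGng, change record)",               "C&T"),
    ("deployment on EIS testbed, experiment validation",       "EIS"),
    ("full deployment on test clusters, stress tests (~1 wk)", "C&T"),
    ("final internal release",                                 "C&T"),
    ("release notes, LCFGng config, manual install guide",     "GIS"),
    ("update user guides",                                     "EIS"),
    ("LCFGng and manual installation tests",                   "GIS"),
    ("announce release on the LCG-Rollout list",               "GIS"),
    ("sites upgrade at their own pace",                        "Sites"),
    ("daily re-certification of upgraded sites",               "GIS"),
]

for number, (step, owner) in enumerate(RELEASE_STEPS, start=1):
    print("%2d. [%-18s] %s" % (number, owner, step))
```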

Process Experience
- The process was decisive in improving the quality of the middleware
- The process is time consuming
  - There are many sequential operations
  - The format of the internal and external release will be unified
  - Multiple packaging formats slow down release preparation: tool-based (LCFGng) and manual (+ tarball-based)
- All components are treated equally
  - same level of testing for core components and non-vital tools
  - no difference for tools already in use in other projects
- The process of including new components is not totally transparent
- Picking a good time for a new release is difficult
  - conflict between users (NOW) and sites (planned)
- Upgrading has proven to be a high-risk operation
  - some sites suffered from acute configuration amnesia
- The process was one of the topics in the “LCG Operations Workshop”

Impact of the Data Challenges
Large-scale production effort of the LHC experiments:
- test and validate the computing models
- produce needed simulated data
- test the experiments' production frameworks and software
- test the provided grid middleware
- test the services provided by LCG-2
All experiments used LCG-2 for part of their production. Details will be given by Dario Barberis.

General Observations (DC summary)
- Real production is different from certification tests
  - Usage characteristics of real production can't be simulated
  - (Detailed) planning is of limited use; the time is better spent on small(er)-scale pilot DCs
- Time matters
  - Delays in support and operation are deadly during DCs
  - Several iterations are needed to get it right
  - Communication between sites, operations, and experiments matters; not all players handled the DCs with the same priority (communication problem)
- Middleware matures quickly during DCs
  - Scalability, robustness, functionality
  - Several of the experiments will do continuous production from now on
- Probing multiple “dimensions” of the system concurrently can be confusing
  - Problems with different components mix, making it easy to confuse cause and effect
  - Dedicated “Service Challenges” will help to get a clear matrix

Problems During the Data Challenges (DC summary)
All experiments encountered similar problems on LCG-2:
- LCG sites suffering from configuration and operational problems
  - inadequate resources at some sites (hardware, human, ...)
  - this is now the main source of failures
- Load balancing between different sites is problematic
  - jobs can be “attracted” to sites that have no adequate resources
  - modern batch systems are too complex and dynamic to summarize their behaviour in a few values in the IS (see the query sketch below)
- Identification and location of problems in LCG-2 is difficult
  - distributed environment, access to many log files needed
  - status of the monitoring tools
- Handling thousands of jobs is time consuming and tedious
  - support for bulk operations is not adequate
- Performance and scalability of services
  - storage (access and number of files)
  - job submission
  - information system
  - file catalogues
- Services suffered from hardware problems
  - no fail-over (design problem)
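Several of these problems (load balancing on stale or wrong state, debugging a distributed system) come back to what sites publish in the LDAP-based information system (Glue schema, served through the BDII). A quick way to see what a broker would see is to query it directly. A sketch using the python-ldap package; the hostname is a placeholder, and the base DN and attribute names should be checked against the Glue schema version actually deployed:

```python
"""Inspect what sites publish in the information system (BDII).

Sketch only: the BDII host is a placeholder, and the base DN and
attribute names are typical for LCG-2-era Glue 1.x but should be
verified against the deployed schema. Requires python-ldap.
"""
import ldap

BDII_URI = "ldap://lcg-bdii.example.org:2170"   # placeholder host
BASE_DN = "mds-vo-name=local,o=grid"            # typical LCG-2 base DN

conn = ldap.initialize(BDII_URI)
conn.simple_bind_s()  # BDII queries were anonymous

# Pull the per-CE state a broker would use for matchmaking/ranking.
results = conn.search_s(
    BASE_DN,
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCEState)",
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    free = attrs.get("GlueCEStateFreeCPUs", [b"?"])[0].decode()
    waiting = attrs.get("GlueCEStateWaitingJobs", [b"?"])[0].decode()
    print("%-60s free=%s waiting=%s" % (ce, free, waiting))
```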

Outstanding Middleware Issues
- Collection: the “Outstanding Middleware Issues” document
- Important: first systematic confrontation of the required functionalities with the capabilities of the existing middleware
  - Some can be patched or worked around
  - Those related to fundamental problems with the underlying models and architectures have to be input as essential requirements to future developments (EGEE)
- The middleware is now not perfect, but quite stable
  - Much has been improved during the DCs
  - A lot of effort is still going into improvements and fixes
- Easy-to-deploy-and-operate storage management is still an issue
  - especially for Tier-2 sites
  - effort to adapt dCache
  - effort in a simple Disk Pool Manager

Operational Issues (selection)
- Slow response from sites
  - Upgrades, response to problems, etc.
  - Problems reported daily; some problems last for weeks
- Lack of staff available to fix problems
  - Vacation period, other high-priority tasks
- Various misconfigurations (see next slide)
- Lack of configuration management; problems that were fixed re-appear
- Lack of fabric management (mostly smaller sites)
  - scratch space, single nodes draining queues, incomplete upgrades, ...
- Lack of understanding
  - Admins reformat the disks of the SE ...
- Provided documentation often not (carefully) read
  - new activity to develop adaptive documentation
  - simpler way to install middleware on farm nodes
  - opens ways to maintain middleware remotely in user space
- Firewall issues
  - often less-than-optimal coordination between grid admins and firewall maintainers
- openPBS problems
  - Scalability, robustness (switching to Torque helps)

Site (Mis)configurations
Site misconfiguration was responsible for most of the problems that occurred during the experiments' Data Challenges. A non-exhaustive list of problems:
- The variable VO_SW_DIR points to a non-existent area on the WNs
- The ESM is not allowed to write in the area dedicated to software installation
- Only one certificate allowed to be mapped to the ESM local account
- Wrong information published in the information system (Glue object classes not linked)
- Queue time limits published in minutes instead of seconds and not normalized
- /etc/ld.so.conf not properly configured; shared libraries not found
- Machines not synchronized in time
- Grid-mapfiles not properly built
- Pool accounts not created, but the rest of the tools configured with pool accounts
- Firewall issues
- CA files not properly installed
- NFS problems for home directories or ESM areas
- Services configured to use the wrong information index (BDII), or none
- Wrong user profiles
- Default user shell environment too big
Only partly related to middleware complexity: the sum of all the common small problems added up to ONE BIG PROBLEM (a few such checks are sketched below).
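Most items on this list are mechanically checkable, which is essentially what the daily re-certification automated. A small stand-alone sketch of four such checks; all paths, variable names and commands here are illustrative assumptions, not the actual certification tests:

```python
"""Sanity checks for a few of the misconfigurations listed above.
Sketch only: paths, variable names and commands are illustrative,
and the real certification tests covered far more."""
import os
import subprocess

def check_vo_sw_dir(var="VO_ATLAS_SW_DIR"):  # variable name is an example
    """The per-VO software area must exist (and be writable by the ESM)."""
    path = os.environ.get(var)
    if not path or not os.path.isdir(path):
        return "%s unset or points to a non-existent area" % var
    return None

def check_ld_so_conf(lib_dir="/opt/globus/lib"):  # example library path
    """/etc/ld.so.conf must include the middleware library directories."""
    with open("/etc/ld.so.conf") as f:
        if lib_dir not in f.read():
            return "%s missing from /etc/ld.so.conf" % lib_dir
    return None

def check_ca_files(ca_dir="/etc/grid-security/certificates"):
    """CA certificates must be installed for mutual authentication."""
    if not os.path.isdir(ca_dir) or not os.listdir(ca_dir):
        return "CA directory %s missing or empty" % ca_dir
    return None

def check_clock():
    """Unsynchronized clocks break proxy lifetime checks; ask NTP."""
    try:
        ok = subprocess.run(["ntpdate", "-q", "pool.ntp.org"],
                            capture_output=True, timeout=30).returncode == 0
        return None if ok else "ntpdate query failed (clock unsynchronized?)"
    except (OSError, subprocess.SubprocessError):
        return "ntpdate not available"

if __name__ == "__main__":
    for check in (check_vo_sw_dir, check_ld_so_conf,
                  check_ca_files, check_clock):
        problem = check()
        print("%-20s %s" % (check.__name__, problem or "OK"))
```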

Operating Services for the DCs (DC summary)
- Multiple instances of the core services for each of the experiments
  - separates problems, avoids interference between experiments
  - improves availability
  - allows experiments to maintain individual configurations
  - addresses scalability to some degree
- Monitoring tools for services are currently not adequate
  - tools under development to implement a control system
- Access to storage via load-balanced interfaces
  - CASTOR
  - dCache
- Services that carry “state” are problematic to restart on new nodes
  - needed after hardware or security problems
- “State transition” between partial and full usage of resources
  - required changes in queue configuration (fair share, individual queues per VO)
  - the next release will come with a description of fair-share configuration (smaller sites)
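The per-experiment service instances are, in effect, a static mapping from VO to dedicated endpoints. A toy illustration (hostnames are placeholders) of why one VO's reconfiguration cannot disturb another's:

```python
# Toy mapping of VOs to dedicated core-service instances (placeholder hosts).
# Separate instances mean a restart or reconfiguration for one experiment
# leaves the other experiments' brokers and information indices untouched.
SERVICES = {
    "alice": {"rb": "rb101.example.ch", "bdii": "bdii101.example.ch"},
    "atlas": {"rb": "rb102.example.ch", "bdii": "bdii102.example.ch"},
    "cms":   {"rb": "rb103.example.ch", "bdii": "bdii103.example.ch"},
    "lhcb":  {"rb": "rb104.example.ch", "bdii": "bdii104.example.ch"},
}

print(SERVICES["atlas"]["rb"])  # each VO submits through its own RB
```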

Support During the DCs
User (experiment) support:
- GD at CERN worked very closely with the experiments' production managers
- Informal exchange (e-mail, meetings, phone)
  - “no secrets” approach; GD people on the experiments' mailing lists and vice versa
  - ensured fast response
  - tracking of problems was tedious, but both sides have been patient
  - clear learning curve on BOTH sides
- LCG GGUS (grid user support) at FZK became operational after the start of the DCs
  - due to the importance of the DCs, the experiments are switching slowly to the new service
- Very good end-user documentation by GD-EIS
- Dedicated testbed for the experiments with the next LCG-2 release
  - rapid feedback, influenced what made it into the next release
Installation and site operations support:
- GD prepared releases and supported sites (certification, re-certification)
- Regional centres supported their local sites (some more, some less)
- Community-style help via the mailing list (high traffic!)
- FAQ lists for troubleshooting and configuration issues (Taipei, RAL)

Support During the DCs (continued)
Operations service:
- RAL (UK) is leading the sub-project on developing operations services
- Initial prototype:
  - Basic monitoring tools
  - Mailing lists for problem resolution
  - Working on defining policies for operation and responsibilities (draft document)
  - Working on grid-wide accounting
- Monitoring:
  - GridICE (development of the DataTag Nagios-based tools)
  - GridPP job submission monitoring
  - Information system monitoring and consistency checks
- CERN GD:
  - daily re-certification of sites (including history)
  - escalation procedure
  - tracing of site-specific problems via the problem tracking tool
  - tests of the core services and configuration

Screen Shots (two slides of monitoring screenshots)

Some More Monitoring (monitoring screenshot)

Problem Handling (plan)
Planned flow: VO A / VO B / VO C report problems to GGUS (Remedy); GGUS triages them into VO or grid problems; the GOC does monitoring and follow-up, with GD CERN involved; problems are escalated to the primary sites (P-Site-1, P-Site-2) and on to their secondary sites (S-Site-1, S-Site-2).

Problem Handling (operation, most cases)
Actual flow in most cases: the VOs raise problems on the Rollout mailing list, where triage, monitoring and FAQs happen community-style; GD CERN contributes monitoring, certification, follow-up and FAQs; GGUS and the GOC sit alongside; problems reach the primary site (P-Site-1) and the secondary sites (S-Site-1, S-Site-2, S-Site-3).
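The difference between the two diagrams is mostly where triage happens; the decision itself is a simple routing rule. A toy sketch with the category names from the diagrams and an invented ticket structure:

```python
# Toy triage routing in the spirit of the two diagrams above.
# The VO/GRID split comes from the slides; the ticket fields are invented.
def route_ticket(ticket):
    """Return the support unit that should own the ticket first."""
    if ticket["kind"] == "vo":                    # experiment-side problem
        return "VO support (%s)" % ticket["vo"]
    if ticket["site"]:                            # localized site problem
        return "primary site / ROC for %s" % ticket["site"]
    return "GOC / GD CERN"                        # grid-wide problem

tickets = [
    {"kind": "vo",   "vo": "ATLAS", "site": None},
    {"kind": "grid", "vo": None,    "site": "Taipei"},
    {"kind": "grid", "vo": None,    "site": None},
]
for t in tickets:
    print(route_ticket(t))
```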

LCG Workshop on Operational Issues
Motivation:
- The LCG -> (LCG & EGEE) transition requires changes
- Lessons learned need to be implemented
- Many different activities need to be coordinated
Held 2-4 November 2004 at CERN:
- >80 participants, including from Grid3 and NorduGrid
- 1.5 days of plenary sessions to describe the status and stimulate discussion
- 1 day of parallel/joint working groups: very concrete work, resulting in task lists with names attached to items
- 0.5 days of reports from the WGs

LCG Workshop on Operational Issues: WGs I
Operational Security:
- Incident handling process
- Variance in site support availability
- Reporting channels
- Service challenges
Operational Support:
- Workflow for operations & security actions
- What tools are needed to implement the model
- “24x7” global support: sharing the operational load (taking turns)
- Communication
- Problem tracking system
- Defining responsibilities: problem follow-up, deployment of new releases
- Interface to user support

LCG Workshop on Operational Issues: WGs II
Fabric Management:
- System installation
- Batch/scheduling systems
- Fabric monitoring
- Software installation
- Representation of site status (load) in the information system
Software Management:
- Operations on and for VOs (add/remove/service discovery)
- Fault tolerance, operations on running services (stop, upgrades, restarts)
- Link to developers
- What level of intrusion can be tolerated on the WNs (farm nodes): application (experiment) software installation
- Removing/(re-adding) sites with (fixed) troubles
- Multiple views in the information system (maintenance)

LCG Workshop on Operational Issues: WGs III
User Support:
- Defining what “user support” means
- Models for implementing a working user support
  - need for a Central User Support Coordination Team (CUSC): mandate and tasks
  - distributed/central (CUSC/RUSC) workflow
- VO support
  - continuous support on integrating the VOs' software with the middleware
  - end-user documentation
  - FAQs

LCG Workshop on Operational Issues: Summary
- Very productive workshop
- Partners (sites) assumed responsibility for tasks
- Discussions very much focused on practical matters
- Some problems ask for architectural changes; gLite has to address these
- It became clear that not all sites are created equal
- Removing troubled sites is inherently problematic: removing storage can have grid-wide impact
- A key issue in all aspects is to define the split between local, regional, and central control and responsibility
- All WGs discussed communication

Summary
- LCG-2 services have been supporting the data challenges
  - Many middleware problems have been found; many addressed
  - The middleware itself is reasonably stable
- The biggest outstanding issues are related to providing and maintaining stable operations
- Future middleware has to take this into account:
  - It must be more manageable and trivial to configure and install
  - Management and monitoring must be built into the services from the start
- The Operations Workshop has started many activities
  - Follow-up and keeping up the momentum is now essential
  - Indicates a clear shift away from the CERNtralized operation
