CERN IT Department, CH-1211 Geneva 23, Switzerland
Service Reliability & Critical Services
January 15th 2008

Experiment Requests

- For the most critical services, a maximum downtime of 30' (30 minutes) has been requested.
- As has been stated on several occasions, including at the WLCG Service Reliability workshop and at the OB, a maximum downtime of 30' is impossible to guarantee at affordable cost.
- 30' cannot be guaranteed even as the maximum time for a human to begin to intervene – e.g. yesterday's IT department meeting!
- But much can be done in terms of reliability – by design! (See next slides…)
- A realistic time for intervention (out of hours – when failures are likely to occur!) is 4 hours; the Christmas shutdown text typically says half a day.
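
As a rough illustration of why a 30' ceiling is so demanding (a minimal worked example, not from the slides; the incident numbers are hypothetical): even a modest number of out-of-hours incidents with a 4-hour human response pushes annual downtime far past any 30-minute budget.

    # Minimal sketch (hypothetical incident numbers): what a human-response
    # ceiling implies for annual availability.
    def availability(incidents_per_year, hours_per_incident):
        downtime = incidents_per_year * hours_per_incident   # hours per year
        return 1.0 - downtime / (365 * 24)

    # 10 out-of-hours incidents/year, 4 h each to intervene and restore:
    print(f"{availability(10, 4):.4%}")   # 99.5434% – 40 h down, vs. a 30' budget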

Reliable Services – The Techniques

- DNS load balancing (a minimal sketch follows this list).
- Oracle "Real Application Clusters" (RAC) & Data Guard.
- H/A Linux (less recommended… because it's not really H/A…) – Murphy's law of Grid Computing!
- Standard operations procedures: contact name(s); basic monitoring & alarms; procedures; hardware matching requirements.
- No free lunch! The work must be done right from the start (design – see next slide) through to operations; it is much harder to retrofit. Reliable services take less effort(!) to run than unreliable ones!
- At least one WLCG service (VOMS) has middleware that does not currently meet the stated service availability requirements.
- Also, 'flexibility' not needed by this community has sometimes led to excessive complexity, and complexity is the enemy of reliability (WMS).
- We also need to work through the experiment services using a 'service dashboard', as was done for the WLCG services (see the draft service map).
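
A minimal sketch of the DNS load-balancing technique (illustrative only – the alias name is hypothetical, and a production load balancer adds health checks and weighting): a service alias publishes several A records, and the client simply tries each address until one answers, so a dead node is skipped rather than fatal.

    # Minimal sketch of DNS-based load balancing (the alias is hypothetical).
    import socket

    def connect_load_balanced(alias, port, timeout=5):
        # getaddrinfo returns every A record published under the alias
        records = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
        for family, socktype, proto, _, sockaddr in records:
            try:
                s = socket.socket(family, socktype, proto)
                s.settimeout(timeout)
                s.connect(sockaddr)       # first reachable node wins
                return s
            except OSError:
                continue                  # dead node: fall through to the next record
        raise RuntimeError(f"no reachable node behind {alias}")

    # Usage (hypothetical alias):
    # conn = connect_load_balanced("myservice.example.org", 8443)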

CHEP 2007 LCG Middleware Developers' Tips

"The key point about designing middleware for robustness and resilience is to incorporate these aspects into the initial design. This is because many of the deployment and operational features already discussed have an impact on the basic architecture and design of the software; it is typically much more expensive to retrofit high-availability features onto a software product after the design and implementation (although it is possible). In general, the service scaling and high-availability needs typically mandate a more decoupled architecture."
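
To make the "more decoupled architecture" point concrete, here is a hedged sketch (not from the talk): putting a queue between request acceptance and request processing lets either side be restarted or replicated independently, which is exactly what makes high-availability features cheap to add.

    # Minimal sketch of decoupling via a work queue (illustrative only; a real
    # deployment would use a persistent, replicated queue, not an in-process one).
    import queue, threading

    work = queue.Queue()              # stand-in for a durable message queue

    def frontend(request):
        work.put(request)             # accept quickly, never process inline
        return "accepted"

    def worker():
        while True:                   # workers can be added or restarted freely
            req = work.get()
            try:
                print(f"processing {req}")   # real processing would go here
            finally:
                work.task_done()

    threading.Thread(target=worker, daemon=True).start()
    frontend("transfer file A")
    work.join()                       # block until the queue is drained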

CHEP 2007 LCG Service Reliability: Follow-up Actions

1. Check middleware (prioritized) against the techniques – which can / do use them and which cannot? → Priorities for (service) development.
2. Experiments' lists of critical services: service map (FIO+GD criteria).
3. Measured improvement – how do we do it?
4. VO boxes → VO services.
5. Tests – do they exist for all the requested 'services'? → SAM tests for experiments (a sketch of such a probe follows this list).
6. ATLAS & CMS: warrant a dedicated coordinator on both sides.
7. Database services: IT & experiment specific.
8. Storage – does this warrant a dedicated coordinator? Follow-up by implementation.
9. Revisit for Tier1s (and larger Tier2s).
10. Overall coordination? → LCG SCM → GDB → MB/OB.
11. Day 1 of the WLCG Collaboration workshop in April (21st).
12. Long-term follow-up? → a solved problem by CHEP.

"Cook-book" – the current "knowledge" is scattered over a number of papers; should we put it all together in one place? (Probably a paper of at least 20 pages, but this should not be an issue.)
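
On item 5, a minimal sketch of what a per-service test might look like (illustrative; the endpoint is hypothetical, and the real SAM framework has its own submission and reporting machinery): each critical service gets a lightweight probe that reports OK or CRITICAL with Nagios-style exit codes.

    # Minimal per-service availability probe (hypothetical endpoint; not the
    # actual SAM test framework).
    import socket, sys, time

    def probe_tcp(host, port, timeout=10):
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "OK", time.time() - start
        except OSError as exc:
            return f"CRITICAL: {exc}", time.time() - start

    status, elapsed = probe_tcp("lfc.example.org", 5010)   # hypothetical endpoint
    print(f"{status} ({elapsed:.2f}s)")
    sys.exit(0 if status.startswith("OK") else 2)          # Nagios-style exit codes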

VOBOX Hardware

- Resource requirements and planning: it is not always easy to get an additional disk on demand when "/data" becomes full.
- Hardware warranty: plan for hardware renewal, and check the warranty duration before moving to production.
- Hardware naming and labeling: use aliases to facilitate hardware replacement, and have a "good" name on the sticker. E.g. all lxbiiii machines may be switched off by hand in case of a cooling problem – and some "critical services" run over Christmas were just that, with the node name hard-coded in the application! (A sketch of the alias idea follows.)
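
A minimal sketch of the alias point (all names hypothetical): the application should know only a stable service alias, never the physical node name, so that swapping the box behind the alias needs no change to the application.

    # Minimal sketch (hypothetical alias and node names): resolve a service
    # alias at run time instead of hard-coding the physical node, so replacing
    # the hardware only requires repointing the DNS alias.
    import socket

    SERVICE_ALIAS = "vobox-myexp.example.org"   # hypothetical; read from config

    def current_node(alias=SERVICE_ALIAS):
        # The alias is all the application knows; the node behind it may change.
        return socket.gethostbyname(alias)

    # BAD:  connect("lxb1234.example.org")   # dies with that particular box
    # GOOD: connect(SERVICE_ALIAS)           # survives a hardware swap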

CHEP 2007 LCG On-going Follow-up

- We have one day (Monday 21st April) dedicated to following up on measured improvement in service reliability.
- Using the Grid Service Map, we will focus on the most critical services and work through the lists of all experiments.
- Making concrete progress requires work on all sides: service providers, developers and experiments.
- For the services where the guidelines are already implemented, production experience is consistent with the experiments' requests.
- We should use the features of the Grid to ensure that the overall service reliability is consistent with the requirements, i.e. no SPOFs (single points of failure).
- Individual components may fail, but the overall service can and should continue!