Reaching MoU Targets at Tier0
20 December 2005
Tim Bell, IT/FIO/TSI

Slide 2: How to do it
- Choose fail-safe hardware
- Have ultra-reliable networking
- Write bug-free programs
- Use administrators who never make mistakes
- Find users who read the documentation

Slide 3: Agenda
- MoU Levels
- Procedures
- High Availability approaches

Slide 4: LCG Services Class
- Ref:
- Defines availability rather than raw performance metrics

Class | Description | Downtime | Reduced  | Degraded | Avail
C     | Critical    | 1 hour   |          | 4 hours  | 99%
H     | High        | 4 hours  | 6 hours  |          | 99%
M     | Medium      | 6 hours  |          | 12 hours | 99%
L     | Low         | 12 hours | 24 hours | 48 hours | 98%
U     | Unmanaged   | None     |          |          |

Slide 5: Downtime from a failure

Stage                 | Description
Failure Occurs        | Something breaks
Failure Detected      | Latencies due to polling of status
Failure Noticed       | Console, …, Siren, …
Investigation Started | Login, have a look
Problem Identified    | Root cause found
Procedure Found       | How to solve it
Problem Solved        | Execute the procedure
Restore Production    | Cleanup
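
Each of these stages adds latency before service is restored. As a rough illustration (the per-stage minutes below are invented for the example, not measurements), the total time to restore is simply the sum of the stage latencies:

```python
# Hypothetical per-stage latencies in minutes; the values are illustrative only.
stages = {
    "failure detected (monitoring poll)": 5,
    "failure noticed (console/alarm)":    10,
    "investigation started (login)":      15,
    "problem identified":                 30,
    "procedure found":                    10,
    "problem solved":                     20,
    "production restored (cleanup)":      15,
}

total = sum(stages.values())
print(f"Time to restore: {total} minutes ({total / 60:.1f} hours)")
# With these example numbers a single incident already exceeds the 1 hour
# maximum downtime allowed for a class C (Critical) service.
```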

Slide 6: MoU is not very ambitious
- 99% uptime
  - 1.7 hours / week down
  - 4 days / year down
- Does not cover impact of failure
  - Lost jobs / Recovery / Retries
  - Problem analysis
  - Glitch effects
- Core services have domino effects
  - MyProxy, VOMS, SRMs, Network
- User availability is the sum of dependencies
  - FTS, RB, CE
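
The 99% figure translates directly into a downtime budget, and the dependency point can be made concrete by chaining a few 99% services together. A minimal sketch of the arithmetic (the three-service chain is an arbitrary example):

```python
HOURS_PER_WEEK = 7 * 24
HOURS_PER_YEAR = 365 * 24

availability = 0.99  # MoU target for the C/H/M service classes

# Downtime budget implied by 99% availability.
print(f"Per week: {(1 - availability) * HOURS_PER_WEEK:.1f} hours")      # ~1.7 hours
print(f"Per year: {(1 - availability) * HOURS_PER_YEAR / 24:.1f} days")  # ~3.7 days

# A user who needs several services at once (e.g. FTS + RB + CE) sees the
# product of their availabilities, which is worse than any single one.
combined = 1.0
for a in [0.99, 0.99, 0.99]:
    combined *= a
print(f"Three chained 99% services: {combined:.1%} available")  # ~97%
```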

Slide 7: Coverage
- Standard availability does not cover:
  - Weekends
  - Night time
  - Working time = 40 hours / week = 24%
- Dead time:
  - Meetings / Workshops
  - No checks before morning status reviews and coffee
  - Illness / Holidays
- Response time (assuming someone is available):
  - If on site: < 5 minutes
  - If at home with sufficient access: < 30 minutes
  - If on-site presence is required: ~ 1 hour?

Slide 8: Changes
- New release needed rapidly:
  - Security patches
  - Interface changes
- Slow quiesce time to drain:
  - 1 week for jobs to complete
  - 1 week proxy lifetime
- Many applications do not provide drain or migrate functionality:
  - Continue to serve existing requests
  - Do not accept new requests

Slide 9: How to Reconcile
- People and procedures:
  - Call trees and on-call presence coverage
  - Defined activities for available skills
- Technical:
  - Good quality hardware
  - High availability
  - Degraded services

Slide 10: People and Procedures – Bottom Up
[Diagram: escalation path from Lemon alerts to the sysadmin on call and then to the application specialist]

Slide 11: People and Procedures
- Alerting:
  - 24x7 operator receives the problem from Lemon
  - Follows a per-alert procedure to fix the problem or identify the correct next-level contact
- SysAdmin / Fabric Services:
  - 24x7 for more complex procedures
- Application Expert:
  - As defined by the grid support structure
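
The escalation chain can be thought of as a lookup from alert type to a per-alert procedure plus a next-level contact. The sketch below only illustrates that idea; the alert names, procedures and contact labels are invented, not the actual Lemon or operator configuration:

```python
# Hypothetical mapping from Lemon alert name to the procedure the 24x7
# operator follows and the next escalation level if the procedure fails.
PROCEDURES = {
    "bdii_no_response":   ("restart the BDII service",      "sysadmin_on_call"),
    "disk_raid_degraded": ("open a vendor repair ticket",    "sysadmin_on_call"),
    "fts_agent_stuck":    ("check agent log, restart agent", "application_expert"),
}

def handle_alert(alert_name: str) -> None:
    """24x7 operator step: apply the per-alert procedure or escalate."""
    procedure, next_level = PROCEDURES.get(
        alert_name, ("no procedure defined", "sysadmin_on_call"))
    print(f"[operator] alert '{alert_name}': {procedure}")
    print(f"[operator] if not resolved, escalate to: {next_level}")

handle_alert("bdii_no_response")
```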

Slide 12: Technical Building Blocks
- Minimal hardware for servers
- Load balancing
- RAC databases
- High availability toolkits
- Cluster file systems

Slide 13: Server Hardware Setup
- Minimal standards:
  - Rack mounted
  - Redundant power supplies
  - RAID on system and data disks
  - Console access
  - UPS
  - Physical access control
- Batch worker nodes do not qualify, even if they are readily available

Slide 14: Load Balancing Grid Applications
- The 'n' least loaded running machines are returned to the client in random order
- Lemon metrics are used to determine availability and load
- See upcoming talk at CHEP'06
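
A sketch of the selection logic described above: pick the n least loaded machines that are currently available according to Lemon-style metrics, and return them in random order so clients spread their requests. The host names and load values are made up for the example:

```python
import random

# (host, available, load) tuples as a Lemon-style snapshot; values are invented.
metrics = [
    ("lxb001", True,  0.35),
    ("lxb002", True,  0.10),
    ("lxb003", False, 0.00),   # failed health check, never returned
    ("lxb004", True,  0.80),
    ("lxb005", True,  0.25),
]

def best_hosts(metrics, n=2):
    """Return the n least loaded available hosts, shuffled."""
    candidates = [(load, host) for host, up, load in metrics if up]
    chosen = [host for _, host in sorted(candidates)[:n]]
    random.shuffle(chosen)   # random order so clients do not all pick the first host
    return chosen

print(best_hosts(metrics))   # e.g. ['lxb005', 'lxb002']
```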

Slide 15: State Databases
[Diagram: Oracle RAC configuration; storage labelled RAID 0]
- Oracle RAC configuration with no single points of failure
- Used for all grid applications which can support Oracle
- Allows stateless, load-balanced application servers
- It really works
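
On the client side, a RAC service is typically reached through an Oracle Net descriptor that lists all cluster nodes and enables connect-time load balancing and failover. The snippet below is only a sketch of that pattern; the host names, service name and credentials are placeholders, and the exact descriptor options should be taken from the Oracle documentation:

```python
import cx_Oracle  # assumes the Oracle client libraries are installed

# Connect descriptor listing both (hypothetical) RAC nodes; the client can
# load-balance across them and fail over if one node is down.
dsn = (
    "(DESCRIPTION="
    "(ADDRESS_LIST=(LOAD_BALANCE=on)(FAILOVER=on)"
    "(ADDRESS=(PROTOCOL=TCP)(HOST=rac-node1.example.org)(PORT=1521))"
    "(ADDRESS=(PROTOCOL=TCP)(HOST=rac-node2.example.org)(PORT=1521)))"
    "(CONNECT_DATA=(SERVICE_NAME=lcg_state)))"
)

conn = cx_Oracle.connect("app_user", "app_password", dsn)
print(conn.version)
conn.close()
```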

Slide 16: High Availability Toolkits
- FIO is using Linux-HA
  - Running at 100s of sites on Linux, Solaris and BSD
- Switch when:
  - Service goes down
  - Administrator request
- Switch with:
  - IP address of the master machine
  - Shared disk (requires Fibre Channel)
  - Application-specific procedures
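
The failover mechanism amounts to: detect that the service is gone, move the service IP alias to the standby node, and run any application-specific takeover steps. The sketch below illustrates that idea in plain Python; it is not Linux-HA itself, and the interface name, address and health-check target are placeholders:

```python
import socket
import subprocess
import time

SERVICE = ("master.example.org", 2170)   # placeholder host/port to health-check
SERVICE_IP = "192.0.2.10/24"             # placeholder service alias
INTERFACE = "eth0"

def service_alive(host, port, timeout=5):
    """Return True if a TCP connection to the service succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def take_over():
    """Bring up the service IP alias locally and run app-specific steps."""
    subprocess.run(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE], check=True)
    # ... application-specific takeover procedure would go here ...
    print("standby node has taken over the service address")

while True:
    if not service_alive(*SERVICE):
        take_over()
        break
    time.sleep(30)
```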

Slide 17: Typical Configuration with HA
- Redundancy eliminates single points of failure (SPOF)
- Monitoring determines when things need to change
- Can be administrator-initiated for planned changes

Slide 18: Failure Scenario with HA
- Monitoring detects failures (hardware, network, applications)
- Automatic recovery from failures (no human intervention)
- Managed restart or failover to standby systems and components

Slide 19: Cluster File Systems
- NFS does not work in production conditions under load
- FIO has tested 7 different cluster file systems to try to identify a good shared, highly available file system
- Basic tests (disconnect servers, kill disks) show instability or corruption
- No silver bullet, as all solutions are immature in the high availability area
- Therefore, we try to avoid any shared file systems in the CERN grid environment

Slide 20: BDII
- BDII is easy since the only state data is the list of sites
- Load balancing based on a Lemon sensor which checks the longitude/latitude of CERN
- Lemon monitoring of the current load based on the number of LDAP searches
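
A Lemon-style load sensor for the BDII can be as simple as counting LDAP search operations in the slapd log since the last sample. The sketch below assumes an OpenLDAP-style log where each search appears as a 'SRCH' line; the log path and the log format are assumptions made for the example:

```python
import re

LOG_FILE = "/var/log/bdii/slapd.log"   # assumed location of the BDII slapd log

def count_ldap_searches(path):
    """Count LDAP search operations ('SRCH' lines) in the slapd log."""
    searches = 0
    with open(path) as log:
        for line in log:
            if re.search(r"\bSRCH\b", line):
                searches += 1
    return searches

if __name__ == "__main__":
    # A real Lemon sensor would report the value to the monitoring system;
    # here we simply print it.
    print(count_ldap_searches(LOG_FILE))
```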

Slide 21: BDII Lemon Monitoring
- New machine started production in mid November
- Load balancing turned on at the end of November

Slide 22: MyProxy
- MyProxy has a replication function to create a slave server
- The slave server is read-only, for proxy retrieval only
- A second copy is made at regular intervals in case of server failure
- The TCP/IP network alias is switched by Linux-HA in the event of the master proxy server going down
- The slave monitors the master to check that all is running OK
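
The "slave monitors the master" step can be pictured as a periodic TCP health check of the master MyProxy port, raising an alarm (or letting Linux-HA switch the alias) when the master stops answering. The host name and port number below are assumptions for the sketch:

```python
import socket
import time

MASTER = "myproxy-master.example.org"   # placeholder master host
PORT = 7512                             # assumed MyProxy server port
CHECK_INTERVAL = 60                     # seconds between checks

def master_ok(host=MASTER, port=PORT, timeout=10):
    """Return True if the master MyProxy server accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    if not master_ok():
        # In the real setup Linux-HA would switch the network alias to the
        # read-only slave at this point; here we only raise the alarm.
        print("master MyProxy not responding; failover should be triggered")
    time.sleep(CHECK_INTERVAL)
```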

Slide 23: RBs and CEs – No HA Solution
- Currently no high availability solution, as the state data is on a local file system
- Plan to run two machines with manual switch-over using an IP alias
- The 2nd machine can be used by production super-users when the 1st machine is running OK
- Could consider a shared-disk solution with a standby machine
- Drain time is around 1 week

Slide 24: LFC
- Application front ends are stateless
- RAC databases provide the state data

Slide 25: FTS
- Load-balanced front end
- Agents are warm, becoming hot

Slide 26: VOMS
- The gLite VOMS front end is made highly available using DNS load balancing; the slave reports itself as very low priority compared to the master, for log stability
- LDAP access is to be reduced, so it is less critical
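
One way to keep the slave in the DNS pool but rarely selected is to have its load metric report a deliberately inflated value, so the load balancer almost always prefers the master while the slave stays available as a fallback. A toy sketch of that idea (the metric values and the penalty are invented):

```python
SLAVE_PENALTY = 1000.0   # arbitrary large value so the slave sorts last

def reported_load(real_load: float, is_slave: bool) -> float:
    """Load value published to the DNS load balancer."""
    return real_load + SLAVE_PENALTY if is_slave else real_load

# Example: the master under some load still wins against an idle slave.
print(reported_load(0.7, is_slave=False))   # 0.7    -> master is preferred
print(reported_load(0.1, is_slave=True))    # 1000.1 -> slave only used if the master is gone
```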

Slide 27: Summary of Approaches
- Highly available services using HA toolkits / Oracle RAC: a single failure is covered by a switch to an alternative system
- VO-based services with spares: a single failure may cause one VO to lose function, but other VOs remain up
- File-system-based stateful services are problematic; they need one of:
  - A cluster file system, or
  - Application re-architecting, or
  - User acceptance of increased time to recover / manual intervention

Slide 28: Other Applications
- Only Critical and High class products have been considered for high availability so far
- Others may be worth considering:
  - SFT, GridView, GridPeek
  - R-GMA, MonBox

Slide 29: Current Status
- BDIIs now in production with procedures in place
- MyProxy and CEs nearing completion of automatic software installation and setup
- FTS, LFC, VOMS, GridView hardware ready
- RB not there yet

Slide 30: Conclusions
- Adding high availability is difficult, but sometimes possible at the fabric level
- Applications need to be designed with availability in mind (FTS and LFC are good examples of this)
- Planned changes are more frequent than hardware failures; change automation reduces their impact
- Procedures and problem determination guides minimise downtime