Northgrid Status Alessandra Forti Gridpp22 UCL 2 April 2009
Outline: Resilience – Hardware resilience – Software changes resilience – Manpower resilience – Communication – Site resilience status – General status – Conclusions
Resilience Definition: 1. The power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity. 2. Ability to recover readily from illness, depression, adversity, or the like; buoyancy. Translation: – Hardware resilience: redundancy and capacity – Manpower resilience: continuity – Software resilience: simplicity and ease of maintenance – Communication resilience: effectiveness
Hardware resilience The system has to be redundant and have enough capacity to take the load. There are many levels of redundancy and capacity, with increasing cost: – Single machine components: disks, memory, CPUs – Full redundancy: replication of services in the same room – Full redundancy, paranoid: replication of services in different places Clearly there is a trade-off between how important a service is and how much money a site has for the replication
Manpower resilience Manpower has to ensure continuity of service. This continuity is lost when people change. – It takes many months to train a new system administrator – It takes even longer in the grid environment, where there are no well defined guidelines, the documentation is dispersed and most of the knowledge is passed on by word of mouth Protocols and procedures for almost every action should be written down to ensure continuity: – How to shut down a service for maintenance – What to do in case of a security breach – Who to call if the main link to JANET goes down – What to do to update the software – What to do to reinsert a node in the batch system after a memory replacement (a sketch follows below) – ...
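As an illustration of how such a procedure can be pinned down, here is a minimal sketch for the last item, assuming a Torque/PBS batch system; the pbsnodes offline/clear commands are standard Torque, while the script name and node names are placeholders:

    #!/usr/bin/env python
    # Sketch: take a worker node out of, and back into, the batch system.
    # Assumes a Torque/PBS farm, where "pbsnodes -o <node>" marks the node
    # offline (running jobs finish, no new ones start) and "pbsnodes -c <node>"
    # clears the offline flag. Node names below are examples only.
    import subprocess
    import sys

    def set_node_state(node, offline):
        if offline:
            cmd = ["pbsnodes", "-o", node]   # drain before maintenance
        else:
            cmd = ["pbsnodes", "-c", node]   # return to service afterwards
        return subprocess.call(cmd)

    if __name__ == "__main__":
        # Usage:  node_maint.py offline wn042   (before the memory swap)
        #         node_maint.py online  wn042   (after the node checks out)
        action, node = sys.argv[1], sys.argv[2]
        sys.exit(set_node_state(node, action == "offline"))

The value is less in the script itself than in having the agreed steps, and the exact commands, written down in one place.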
Software resilience Simplicity and ease of maintenance are key to at least two things: – Service recovery if disaster strikes – A less steep learning curve for new people The grid software is neither simple nor easy to maintain. It is complicated, ill-documented and changes continuously, to say the least. – dCache is a flagship example of this, and this is why it is being abandoned by many sites. – But there is also a problem with continuous changes in the software itself: lcg-CE, glite-CE, cream-CE; 4 or 5 storage systems that are almost incompatible with each other; RB or WMS or the experiments' pilot frameworks; SRM yes, SRM no, SRM is dead
Communication Communication has to be effective. If one means of communication is not effective it should be replaced with a more effective one. – I was always missing SA1 ACL requests for the SVN repository, so I redirected them to the Manchester helpdesk; now I respond within 2 hours during working hours. – System admins in Manchester weren't listening to each other during meetings; now there is a rule to write EVERYTHING in the tickets. – ATLAS putting sites offline was a problem because the action was only recorded in the ATLAS shifter elogs; now they will also write it in the ticket, so the site is made aware immediately of what is happening.
Lancaster Twin CEs New kit has dual PSUs All systems in cfengine Daily backup of databases Current machine room has new redundant air con Temperature sensors with Nagios alarms have been installed (a sketch of such a check follows below) 2nd machine room with modern chilled racks – Available in July Only one fibre uplink to JANET
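The temperature alarms are the kind of check that is easy to encode as a small Nagios plugin. A minimal sketch, assuming the sensors expose their reading in a file; the sensor path and thresholds are placeholders, not Lancaster's actual configuration:

    #!/usr/bin/env python
    # Sketch of a Nagios-style temperature check. Exit codes follow the Nagios
    # plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    # The sensor file and thresholds are placeholders.
    import sys

    SENSOR_FILE = "/var/run/room_temperature"   # hypothetical sensor output
    WARN_C, CRIT_C = 25.0, 30.0                 # example thresholds (Celsius)

    try:
        temp = float(open(SENSOR_FILE).read().strip())
    except (IOError, ValueError):
        print("TEMP UNKNOWN - cannot read sensor")
        sys.exit(3)

    if temp >= CRIT_C:
        print("TEMP CRITICAL - %.1f C" % temp)
        sys.exit(2)
    elif temp >= WARN_C:
        print("TEMP WARNING - %.1f C" % temp)
        sys.exit(1)
    print("TEMP OK - %.1f C" % temp)
    sys.exit(0)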
Liverpool Strong points: Reviewed and fixed single points of failure 2 years ago. High spec servers with RAID1 and dual PSUs. UPS on critical servers, RAID arrays and switches. Distributed software servers with a high level of redundancy. Active rack monitoring with Nagios, Ganglia and custom scripts (a sketch follows below). RAID6 on SE data servers. WAN connection has redundancy and automatic failover recovery. Spares for long lead time items. Capability of maintaining our own hardware.
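The custom monitoring scripts mentioned above are typically of this flavour. A sketch of a RAID health check, assuming Linux software RAID (md) so that /proc/mdstat is available; arrays behind a hardware controller would need the vendor's CLI tool instead:

    #!/usr/bin/env python
    # Sketch of a custom RAID check to feed into Nagios. A degraded md array
    # shows an underscore in the [UU...] status string of /proc/mdstat.
    import re
    import sys

    try:
        mdstat = open("/proc/mdstat").read()
    except IOError:
        print("RAID UNKNOWN - no /proc/mdstat on this host")
        sys.exit(3)

    degraded = re.findall(r"\[U*_[U_]*\]", mdstat)
    if degraded:
        print("RAID CRITICAL - degraded array(s): %s" % " ".join(degraded))
        sys.exit(2)
    print("RAID OK - all md arrays clean")
    sys.exit(0)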
Liverpool (cont.) Weak points: BDII and MON nodes are old hardware. Single CE is single point of failure. Only 0.75 FTE over 3 years dedicated to grid admin. Air-con is ageing and in need of constant maintenance University has agreed to install new water-cooled racks for future new hardware.
Manchester Machine room: 2 generators + 3 UPS + 3 air conditioning units – University staff dedicated to the maintenance Two independent clusters (2 CEs, 2x2 SEs, 2 SW servers) All main services have RAID1, and memory and disks have also been upgraded They are in the same rack, attached to different PDUs Services can be restarted remotely All services and worker nodes are installed and maintained with kickstart+cfengine, which allows a system to be reinstalled within an hour – Anything that cannot go in cfengine goes in YAIM pre/local/post, in an effort to eliminate any forgettable manual steps All services are monitored A backup system for all databases is in place (a freshness check sketch follows below)
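Backups are only a safety net if someone notices when they stop arriving. A minimal sketch of a freshness check for the nightly dumps; the dump directory, file pattern and age limit are placeholders rather than the actual Manchester setup:

    #!/usr/bin/env python
    # Sketch: alarm if the newest database dump is missing or too old.
    # Directory, file pattern and threshold are placeholders.
    import glob
    import os
    import sys
    import time

    DUMP_DIR = "/backup/db"          # hypothetical location of the dumps
    MAX_AGE_H = 25.0                 # a little over one day

    dumps = glob.glob(os.path.join(DUMP_DIR, "*.sql.gz"))
    if not dumps:
        print("BACKUP CRITICAL - no dumps found in %s" % DUMP_DIR)
        sys.exit(2)

    age_h = (time.time() - max(os.path.getmtime(f) for f in dumps)) / 3600.0
    if age_h > MAX_AGE_H:
        print("BACKUP WARNING - newest dump is %.1f hours old" % age_h)
        sys.exit(1)
    print("BACKUP OK - newest dump is %.1f hours old" % age_h)
    sys.exit(0)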
Manchester (cont) We lack protocols and procedures for dealing with situations in a consistent way when they occur – We have started to write them, beginning with things as simple as switching off machines for maintenance Disaster recovery happens only when a disaster happens Irregular maintenance periods have led to clashes with routine generator tests The RT system is used for communication with users, but also to log everything that is done in the T2 – Bad communication between sys admins has been a major problem
Sheffield The main weak point for Sheffield is the limited physical access to the cluster: we have it 9-17 on weekdays only. We use a quite expensive SCSI disk for the experiment software area; because it is expensive we do not keep a spare in case of failure, so we would need time to order one and to restore all the experiment software. The CE and the MON box have only one power supply and only one disk each. In future perhaps a RAID1 system with 2 PSUs for the CE and the MON box. It would be good to have a UPS. The DPM head node already has 2 PSUs and a RAID5 system with an extra disk. The WNs, CE and MON box use similar hardware, so we can find spare parts. We have managed to maintain quite stable reliability.
General Status (1)
Site         Middleware  OS   SRM2.2 + Space Tokens  SRM brand            Storage usage %
Lancaster    gLite 3.1   SL4  yes                    DPM                  19%
Liverpool    gLite 3.1   SL4  yes                    dCache -> DPM        10%
Manchester   gLite 3.1   SL4  yes                    dCache/DPM/xrootd    15%
Sheffield    gLite 3.1   SL4  yes                    DPM                  17%
General Status (2)
General Status (3)
General Status (4)
Conclusions As was written on the building sites of the Milan 3rd underground line: We are working for you!