Tier-1 Overview
Andrew Sansum, 21 November 2007

Overview of Presentations
Morning presentations:
–Overview (me): not really an overview; at Tony's request, mainly MoU commitments
–CASTOR (Bonny): storing the data and getting it to tape
–Grid Infrastructure (Derek Ross): Grid services, dCache future, Grid-only access
–Fabric (Martin Bly): procurements, hardware infrastructure (including local network), operation
Afternoon presentations:
–Neil (RAL benefits)
–Site Networking (Robin Tasker)
–Machine Rooms (Graham Robinson)

What I’ll Cover
Mainly MoU commitments:
–Response times
–Reliability
–On-call
–Disaster planning
Also staffing.

GRIDPP2 Team Organisation
Grid Services / Grid and experiment support: Ross, Condurache, Hodges, Klein (PPS), vacancy
Fabric (hardware and OS): Bly, Wheeler, vacancy, Thorne, White (OS support), Adams (hardware support)
CASTOR software/robot (nominally 5.5 FTE): Corney (group leader), Strong (service manager), Folkes (hardware manager), deWitt, Jensen, Kruk, Ketley, Jackson (CASE), Prosser (contractor)
Machine room operations: 1.5 FTE
Networking support: 0.5 FTE
Database support: 0.5 FTE (Brown)
Project management: 1.5 FTE (Sansum/Gordon/(Kelsey))

Staff Evolution to GRIDPP3
Level:
–GRIDPP2: 13.5 (GRIDPP/e-Science)
–GRIDPP3: 17.0 (GRIDPP/e-Science)
Main changes:
–Hardware repair effort: 1 FTE to 2 FTE
–New incident response team (2 FTE)
–Extra CASTOR effort (0.5 FTE), though this is effort that has already been working on CASTOR unreported
–Small changes elsewhere
Main problem:
–We have injected 2 FTE of effort temporarily into CASTOR. The long-term GRIDPP3 plan funds less effort than current experience suggests we need.

WLCG/GRIDPP MoU Expectations

Maximum delay in responding to operational problems, and average availability measured on an annual basis:

Service | Service interruption | Degradation >50% | Degradation >20% | Availability (accelerator operation) | Availability (other times)
Acceptance of data from the Tier-0 | 12 hours | 12 hours | 24 hours | 99% | n/a
Networking service to the Tier-0 during accelerator operation | 12 hours | 24 hours | 48 hours | 98% | n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 centres | 24 hours | 48 hours | 48 hours | 98% | 98%
All other services – prime service hours [1] | 2 hours | 2 hours | 4 hours | 98% | 98%
All other services – other times | 24 hours | 48 hours | 48 hours | 97% | 97%

[1] Prime service hours are 08:00-18:00 during the working week of the centre, except public holidays.
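A matrix like this is easy to mis-read under pressure, so one option is to encode it as data and look deadlines up. A minimal sketch in Python; the service keys and field names are illustrative choices of mine, not an official schema, with the delay values taken from the table above:

```python
# Hedged sketch: the WLCG MoU response-time matrix as a lookup table.
# Service keys and field names are illustrative; delays are in hours.
MOU_RESPONSE = {
    "tier0-data-acceptance":   {"interruption": 12, "degraded_50": 12, "degraded_20": 24},
    "tier0-networking":        {"interruption": 12, "degraded_50": 24, "degraded_20": 48},
    "data-intensive-analysis": {"interruption": 24, "degraded_50": 48, "degraded_20": 48},
    "other-prime-hours":       {"interruption": 2,  "degraded_50": 2,  "degraded_20": 4},
    "other-out-of-hours":      {"interruption": 24, "degraded_50": 48, "degraded_20": 48},
}

def max_response_hours(service: str, impact: str) -> int:
    """Maximum delay (hours) before action is taken on a problem of this impact."""
    return MOU_RESPONSE[service][impact]

print(max_response_hours("tier0-data-acceptance", "degraded_20"))  # 24
print(max_response_hours("other-prime-hours", "interruption"))     # 2
```

Anything that routes tickets (or pages the on-call) can then consult one table rather than re-reading the MoU.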

Response Time
–Time to acknowledge a fault ticket: 24-hour response time outside the prime shift.
–The on-call system should easily cover this, provided problem tickets can be classified automatically by the level of service required.
–Cover during the prime shift (2-4 hours) is more challenging, but is already a routine task for the Admin on Duty.
–To hit the availability target we must be much faster (2 hours or less).
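The prime-shift/out-of-hours split can itself be automated. A hedged sketch, using the 2-hour and 24-hour figures from the "all other services" MoU rows; the function names are my own, and a real classifier would also account for the centre's public holidays:

```python
from datetime import datetime, time

# Prime service hours per the MoU footnote: 08:00-18:00 on working days.
# Public holidays are deliberately ignored in this sketch.
PRIME_START, PRIME_END = time(8, 0), time(18, 0)

def in_prime_hours(ts: datetime) -> bool:
    """True if the timestamp falls in prime service hours (Mon-Fri, 08:00-18:00)."""
    return ts.weekday() < 5 and PRIME_START <= ts.time() < PRIME_END

def response_deadline_hours(ts: datetime) -> int:
    """Response deadline for an 'all other services' interruption ticket."""
    return 2 if in_prime_hours(ts) else 24

print(response_deadline_hours(datetime(2007, 11, 21, 10, 0)))  # Wednesday morning -> 2
print(response_deadline_hours(datetime(2007, 11, 24, 10, 0)))  # Saturday -> 24
```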

Reliability
Good progress in the last 12 months:
–Prioritised issues affecting SAM test failures.
–Introduced "issue tracking" and weekly reviews of outstanding issues.
–Introduced resilience into trouble spots (but more still to do).
–Moved services to appropriately sized hardware, separated services, etc.
–Introduced a new team role, "Admin on Duty": monitoring farm operation, ticket progression and EGEE broadcast information.
Best Tier-1 averaged over the last 3 months (other than CERN).

RAL-LCG2 Availability

MoU Commitments (Availability)
Really reliability (availability while scheduled up). Still tough: 97-99% service availability will be hard (1% is only about 88 hours per year).
–OPN reliability is predicted to be 98% without resilience; the site SJ5 connection is much better (Robin will discuss).
–Most faults (75%) will fall outside normal working hours.
–Software components are still changing (e.g. CASTOR upgrades, WMS).
–Many faults in 2008 will be "new", only emerging as WLCG ramps up to full load.
–Emergent faults can take a long time (days) to diagnose and fix.
To improve on current availability we will need to:
–Improve automation
–Speed up the manual recovery process
–Improve monitoring further
–Provide on-call cover
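The downtime arithmetic behind these targets is worth making explicit; a quick check of the annual budgets they imply:

```python
# Annual downtime budget implied by an availability target.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_budget_hours(availability: float) -> float:
    """Hours of downtime per year permitted at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

for target in (0.97, 0.98, 0.99):
    print(f"{target:.0%} availability -> {downtime_budget_hours(target):.1f} h/year downtime")
# 97% -> 262.8, 98% -> 175.2, 99% -> 87.6
```

At 99%, a single multi-day emergent fault consumes a large fraction of the whole year's budget, which is why the automation and on-call points above matter.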

On-Call
–On-call will be essential to meet the response and availability targets.
–An on-call project is now running (Matt Hodges); the target is to have on-call operational by March.
–Automation, recovery and monitoring are all important parts of the on-call system: avoid callouts by avoiding problems.
–Some weekend on-call cover may be possible before March for some components.
–On-call will continue to evolve after March as we learn from experience.

Disaster Planning (I)
The extreme end of the availability problem. A risk analysis exists, but it is ageing and not fully developed.
Highest-impact risks:
–Extended environment problem in the machine room: fire, flood, power failure, cooling failure
–Extended network failure
–Major data loss through loss of CASTOR metadata
–Major security incident (site or Tier-1)

Disaster Planning (II)
Some disaster plan components exist:
–Disaster plan for the machine room: assuming equipment is undamaged, relocate and endeavour to sustain functions, but at much reduced capacity.
–Datastore (ADS) disaster recovery plan developed and tested.
–A network plan exists.
–Individual Tier-1 systems have documented recovery processes and fire-safe backups, or can be instanced from the kickstart server. Not all of these are simple, nor are all fully tested.
Key missing components:
–National/global services (RGMA/FTS/BDII/LFC/…): address by distributing elsewhere. Probably feasible, and necessary; roughly 6 months.
–CASTOR: all our data holdings depend on the integrity of the catalogue, and recovery from first principles is not tested. Flagged as a priority area, but balanced against the need to make CASTOR work.
–A second, independent Tier-1 build infrastructure that would let us rebuild the Tier-1 at a new physical location, addressing major risks such as fire. A major project: what priority?

Conclusions
Made a lot of progress in many areas this year: availability improving, hardware reliable, CASTOR working quite well and upgrades on track.
Main challenges for 2008 (data taking):
–Large hardware installations, and an almost immediate next procurement
–CASTOR at full load
–On-call and general MoU processes