Presentation transcript:

David Britton, 28/May/09.

2 The Large Hadron Collider at CERN. 14 TeV collisions; 27 km circumference; 1,200 14 m, 8.36 Tesla superconducting dipoles; 8,000 cryomagnets; 40,000 tons of metal at -271 °C; 700,000 L of liquid He; 12,000,000 L of liquid N2; 800,000,000 proton-proton collisions per second.

3 Data from the LHC Experiments. The four detectors, ATLAS (7,000 tonnes), CMS (12,500 tonnes), ALICE (10,000 tonnes) and LHCb (5,600 tonnes), read out between roughly 1.2 million and 100 million channels each, with per-experiment raw data rates ranging from about 50 MB/s to 320 MB/s. The combined raw data flow is ~700 MB/s, for a total of ~15 PB of data per year; one year's data from the LHC would fill a stack of CDs 20 km high (for comparison, Concorde cruised at 15 km and Mt. Blanc is 4.8 km).

4 Data Driven Grid Computing (ALICE, ATLAS, CMS, LHCb). A Grid architecture was chosen because:
– Costs of maintaining and updating resources are more easily shared in a distributed environment.
– Funding bodies can provide local resources and contribute to the global goal.
– It is easier to build in redundancy and fault tolerance and to minimise the risk from single points of failure.
– The LHC will operate around the clock for 8 months each year; spanning time zones means that monitoring and support are more readily provided.

5 Worldwide LHC Computing Grid (28/May/09).
– Tier 0: the CERN computer centre, fed by the experiments' online systems and offline farm; primary data store.
– Tier 1: 11 national centres (RAL in the UK; others in France, Italy, Germany, Spain, ...); reconstruction, storage, analysis.
– Tier 2: regional groups (in the UK: ScotGrid, NorthGrid, SouthGrid, London, with institutes such as Glasgow, Edinburgh and Durham), down to individual workstations; simulation and analysis.

6 Worldwide Resources. Worldwide: 55 countries, 283 sites, 180,000 CPUs. UK: 23 sites, 20,000 CPUs.

7 How does it work? Components. The stack runs from the WLCG fabric (Tier 0, Tier 1 and Tier 2 sites) through the grid middleware and experiment frameworks up to grid interfaces (e.g. Ganga) and production/analysis systems. The main middleware components are:
– Data movement: File Transfer Service (FTS)
– Storage interface: Storage Resource Manager (SRM)
– Authorisation/roles: Virtual Organisation Membership Service (VOMS)
– Metadata/replication: LCG File Catalogue (LFC)
– Batch submission: Workload Management System (WMS)
– Distributed conditions databases: Oracle Streams (3D)
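As an illustration of how one of these components is driven in practice, the sketch below wraps the gLite FTS command-line client from Python to queue a copy between two SRM endpoints and poll until it finishes. The service URL, SRM paths and the exact set of terminal state names are assumptions for illustration; in production the experiments drive FTS through their own data-management frameworks rather than raw calls like this.

```python
import subprocess
import time

# Hypothetical FTS endpoint and SRM URLs: placeholders, not real services.
FTS_ENDPOINT = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer"
SOURCE = "srm://se1.example.ac.uk/dpm/example.ac.uk/home/atlas/raw/file001"
DEST = "srm://se2.example.ac.uk/dpm/example.ac.uk/home/atlas/raw/file001"

def submit_transfer(source, dest):
    """Queue a single source->destination copy with the FTS CLI; returns the job id."""
    out = subprocess.check_output(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, source, dest], text=True)
    return out.strip()  # the CLI prints the job identifier on stdout

def wait_for_transfer(job_id, poll_seconds=60):
    """Poll the transfer until FTS reports a terminal state (names assumed)."""
    while True:
        state = subprocess.check_output(
            ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id], text=True).strip()
        if state in ("Done", "Finished", "FinishedDirty", "Failed", "Canceled"):
            return state
        time.sleep(poll_seconds)

if __name__ == "__main__":
    job = submit_transfer(SOURCE, DEST)
    print("FTS job", job, "ended in state", wait_for_transfer(job))
```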

8 How does it work? Workflow. From a grid user interface (gridui), the submitter: (1) runs voms-proxy-init to obtain credentials from VOMS; (2) submits a job described in JDL to the Workload Management System, whose job submission service (JS) and resource broker (RB) match it to a site using the BDII information system and the LFC file catalogue; the job then runs on grid-enabled resources (CPU nodes and storage) at one of the sites, with its state tracked by the Logging & Bookkeeping service; the user polls the job status and finally retrieves the output.
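A minimal sketch of this workflow from the user's side, assuming a gLite user interface with the standard command-line tools (voms-proxy-init, glite-wms-job-submit, glite-wms-job-status, glite-wms-job-output). The VO name, JDL contents, file names and polling interval are illustrative only, not taken from the talk.

```python
import subprocess
import time

# Illustrative JDL: executable, sandboxes and requirements are placeholders.
JDL = """\
Executable    = "run_analysis.sh";
Arguments     = "dataset.list";
StdOutput     = "job.out";
StdError      = "job.err";
InputSandbox  = {"run_analysis.sh", "dataset.list"};
OutputSandbox = {"job.out", "job.err"};
Requirements  = other.GlueCEPolicyMaxWallClockTime > 720;
"""

def run(cmd):
    """Run a gLite UI command and return its stdout."""
    return subprocess.check_output(cmd, text=True)

# 1. Authenticate: obtain a short-lived VOMS proxy carrying the user's VO role.
run(["voms-proxy-init", "--voms", "atlas"])

# 2. Submit: hand the JDL to the Workload Management System (-a delegates the proxy).
with open("analysis.jdl", "w") as f:
    f.write(JDL)
out = run(["glite-wms-job-submit", "-a", "analysis.jdl"])
job_id = next(tok for tok in out.split() if tok.startswith("https://"))  # the job id is a URL

# 3. Poll the Logging & Bookkeeping service until the job reaches a terminal state.
while "Done" not in (status := run(["glite-wms-job-status", job_id])):
    if "Aborted" in status:
        raise RuntimeError("Job aborted:\n" + status)
    time.sleep(300)

# 4. Retrieve the output sandbox from the WMS.
run(["glite-wms-job-output", "--dir", "output", job_id])
```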

9 Availability: The UK Tier-1. Availability is the fraction of time the site is up, so even scheduled maintenance counts against this metric. The target is 97% (achieved). It is measured by SAM (Service Availability Monitor) tests. There are also experiment-specific SAM tests, which are more demanding; the example shown here is from ATLAS, where the target is also 97%. Performance is improving but was degraded by the CASTOR mass storage system.
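The availability figure quoted here is essentially the fraction of monitoring samples in which the site passed its critical tests, with scheduled downtime still counted as unavailable. A toy version of that calculation, assuming the SAM results are available as simple (timestamp, passed) pairs, is sketched below; the real SAM/GridView bookkeeping aggregates per-service test suites and is more involved.

```python
from datetime import datetime, timedelta

# Toy SAM-style samples: (timestamp, test_passed). Synthetic data for illustration.
samples = [
    (datetime(2009, 5, 1) + timedelta(hours=h), h % 40 != 0)
    for h in range(24 * 30)
]

def availability(samples):
    """Fraction of samples in which the site passed its critical tests.
    Scheduled maintenance is *not* excluded, matching the metric on the slide."""
    passed = sum(1 for _, ok in samples if ok)
    return passed / len(samples)

print(f"Monthly availability: {availability(samples):.1%}")  # target is 97%
```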

10 Availability: Full UK Picture

11 Resilience and Disaster Planning The Grid must be made resilient to failures and disasters over a wide scale, from simple disk failures up to major incidents like the prolonged loss of a whole site. One of the intrinsic characteristics of the Grid is the use of inherently unreliable and distributed hardware in a fault-tolerant infrastructure. Service resilience is about making this fault-tolerance a reality. 28/May/09

12 Strategy. Fortifying the Service:
– Duplicate services or machines.
– Increase the hardware's capacity (to handle faults).
– Use (good) fault detection.
– Implement automatic restarts.
– Provide fast intervention.
– Fully investigate failures.
– Report bugs and ask for better middleware.
Disaster Planning:
– Taking control early enough.
– (Pre-)establishing possible options.
– Understanding user priorities.
– Timely action.
– Effective communication.
Hardware; Software; Location.

13 Duplicating Services or Machines. Example: multiple WMS instances (see the sketch below).
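One way duplicated services pay off is that clients can simply fail over: if one WMS instance refuses a submission, try the next. The sketch below shows that pattern; the endpoint URLs are invented, and the "-e" option used to pick an explicit WMProxy endpoint is an assumption about the gLite client, so treat this as an illustration of the idea rather than a recipe.

```python
import subprocess

# Hypothetical duplicated WMS endpoints; in practice the UI configuration lists several.
WMS_ENDPOINTS = [
    "https://wms01.example.ac.uk:7443/glite_wms_wmproxy_server",
    "https://wms02.example.ac.uk:7443/glite_wms_wmproxy_server",
]

def submit_with_failover(jdl_path, endpoints=WMS_ENDPOINTS):
    """Try each duplicated WMS in turn until one accepts the job."""
    errors = []
    for endpoint in endpoints:
        try:
            out = subprocess.check_output(
                ["glite-wms-job-submit", "-a", "-e", endpoint, jdl_path],
                text=True, stderr=subprocess.STDOUT)
            return out  # success: stop at the first WMS that accepts the job
        except subprocess.CalledProcessError as exc:
            errors.append(f"{endpoint}: {exc.output.strip()}")
    raise RuntimeError("All WMS endpoints failed:\n" + "\n".join(errors))
```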

14 Hardware Capacity and Fault Tolerance. Examples:
– Storage: use RAID arrays (RAID5/RAID6 for storage arrays, RAID1 for system disks); hot-spares allow automatic rebuilds.
– Memory: increase memory capacity; use ECC (error-correcting code) memory and monitor for a rise in the error-correction rate.
– Power: use redundant power supplies connected to different circuits where possible; UPS for critical systems.
– Interconnects: use two or more bonded network connections with cables routed separately.
– CPU: use more powerful machines.
– Databases: use Oracle RACs (Real Application Clusters), which allow multiple servers to access a database simultaneously.
Resilient hardware helps services survive common failure modes and keeps them operating until the failed component can be replaced and the service made resilient again.
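On Linux software RAID, the kind of degradation described above shows up in /proc/mdstat, so a monitoring hook can be as simple as the sketch below (hardware RAID controllers need their vendor tools instead). This is an illustrative check, not the Tier-1's actual tooling.

```python
import re
import sys

def degraded_md_arrays(mdstat_path="/proc/mdstat"):
    """Return the names of Linux software-RAID arrays with missing members.

    /proc/mdstat reports each array's member status as e.g. [UU] (healthy)
    or [U_] (one disk missing); any '_' means the array is degraded.
    """
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            status = re.search(r"\[([U_]+)\]", line)
            if current and status and "_" in status.group(1):
                degraded.append(current)
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("DEGRADED arrays:", ", ".join(bad))
        sys.exit(2)   # non-zero exit so a cron job or alarm wrapper can page someone
    print("All software RAID arrays healthy")
```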

15 Fault Detection. If it can be monitored, monitor it! Catch problems early, e.g. with Nagios alarms: load alarms; file systems near to full; certificates close to expiry; failed drives. Look for signatures of impending problems to predict component failure:
– Idle disks hide their faults: regular low-level verification runs push sick drives over the edge, so they can be replaced early in the failure cycle and do not fail during a rebuild.
– Increased error rates on network links point to failing line cards, transceivers or cable/fibre degradation; with redundant links, the faulty one can be replaced while the service keeps going.
Run a call-out system for problems that impact services.
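Checks of this kind are typically wired into Nagios as small plugins that exit 0/1/2 for OK/WARNING/CRITICAL. Below is a sketch of one such check for the "certificates close to expiry" case; it shells out to the openssl CLI, and the certificate path and thresholds are placeholders rather than the site's real configuration.

```python
import subprocess
import sys
from datetime import datetime, timezone

CERT = "/etc/grid-security/hostcert.pem"   # placeholder path
WARN_DAYS, CRIT_DAYS = 14, 3               # illustrative thresholds

def days_until_expiry(cert_path):
    """Parse the notAfter date of an X.509 certificate via the openssl CLI."""
    out = subprocess.check_output(
        ["openssl", "x509", "-in", cert_path, "-noout", "-enddate"], text=True)
    # Output looks like: notAfter=May 28 12:00:00 2010 GMT
    end = datetime.strptime(out.strip().split("=", 1)[1], "%b %d %H:%M:%S %Y %Z")
    return (end.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    days = days_until_expiry(CERT)
    if days <= CRIT_DAYS:
        print(f"CRITICAL: host certificate expires in {days} days")
        sys.exit(2)
    if days <= WARN_DAYS:
        print(f"WARNING: host certificate expires in {days} days")
        sys.exit(1)
    print(f"OK: host certificate valid for another {days} days")
    sys.exit(0)
```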

16 Intervention and Investigation (28/May/09). We run a 24x7 call-out system connected to a pager that is triggered by automatic alarms, with a 2-hour response time for critical failures. All incidents are examined to learn lessons: the call-out rate has dropped from 10/day to as low as 1/week. Reports are written up on serious incidents and reported to the wLCG so other sites around the world can see them.

17 Despite everything, disasters will happen (28/May/09). Illustrated by an excerpt from the wLCG weekly operations report of Feb-09, concerning the Taiwan Tier-1.

18 Disaster Planning (28/May/09). A Disaster Response plan is needed that is well understood; use it regularly for anything that could turn into a disaster!
– Stage 1, Disaster Potential Identified: informally assess; monitor; set deadlines; do not interfere.
– Stage 2, Possible Disaster: add internal management oversight; formally assess; divert resources.
– Stage 3, Disaster Likely: add external experts and stakeholder representation to the oversight; hold regular meetings with the experiments; prepare contingencies; communicate widely.
– Stage 4, Actual Disaster: manage the disaster according to the high-level disaster plan and the contingencies identified at Stage 3; communicate widely.
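Purely as an illustration of how such an escalation ladder can be encoded so that incident-tracking tooling can reference it, here is a small Python sketch. The stage names follow the slide, the action wording is paraphrased, and any integration with a real ticketing or alarm system is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    level: int
    name: str
    actions: tuple

# Escalation ladder as described on the slide (actions paraphrased).
DISASTER_STAGES = (
    Stage(1, "Disaster potential identified",
          ("Informally assess", "Monitor", "Set deadlines", "Do not interfere")),
    Stage(2, "Possible disaster",
          ("Add internal management oversight", "Formally assess", "Divert resources")),
    Stage(3, "Disaster likely",
          ("Add external experts and stakeholders to oversight",
           "Hold regular meetings with the experiments",
           "Prepare contingencies", "Communicate widely")),
    Stage(4, "Actual disaster",
          ("Follow the high-level disaster plan and Stage-3 contingencies",
           "Communicate widely")),
)

def escalate(current_level):
    """Return the next stage up the ladder, or the top stage if already there."""
    return DISASTER_STAGES[min(current_level, len(DISASTER_STAGES) - 1)]

if __name__ == "__main__":
    for stage in DISASTER_STAGES:
        print(f"Stage {stage.level}: {stage.name} -> " + "; ".join(stage.actions))
```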

19 Summary. In the UK we have spent the last 6 years preparing for the LHC data challenge and have deployed 20,000 CPUs as part of a worldwide Grid of 180,000 CPUs: the largest scientific computing Grid in the world. The last year has focused on making the service reliable and resilient: our Tier-1 centre currently delivers 97% availability and our Tier-2 centres average over 90%. We have initiated planning to understand the possible responses to major disasters and to set up a disaster management process to handle such incidents. We look forward to the arrival of LHC data!