Hardware failures
Wayne Salter on behalf of Olof Bärring
Computing Facilities (CF), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it

Outline
Failures
– What fails?
– How often?
– When?
Repairs
– How?
– By whom?
– How quickly?
Conclusions

What fails? And how do we know?
The only things we know for sure about hardware are:
1. It will fail
2. Some of it fails more often than the rest… disk drives, for instance
Monitoring failures
– Disks: assume fail-stop, but the reality is more complex
– At CERN we base our decision on SMART counters and failed media scans
Monitoring 'repairs' rather than 'failures':
– Vendor tickets (~4k)
– Changes in the serial number inventory (~10k)
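
Watching the serial number inventory for changes can be sketched as a diff between two snapshots. This is only an illustration, not the tooling used at CERN: the snapshot layout, mapping (host, slot) to a serial number, and every host name and serial below are hypothetical.

```python
def replaced_parts(before, after):
    """Return the slots whose serial number differs between two snapshots.

    Each snapshot is assumed to map (host, slot) -> serial number.
    Slots present in only one snapshot are ignored.
    """
    changes = []
    for key in sorted(before.keys() & after.keys()):
        if before[key] != after[key]:
            host, slot = key
            changes.append((host, slot, before[key], after[key]))
    return changes


# Hypothetical snapshots taken on two different days.
monday = {("node01", "sda"): "SN-A1", ("node01", "sdb"): "SN-B7"}
tuesday = {("node01", "sda"): "SN-A1", ("node01", "sdb"): "SN-C3"}

for host, slot, old, new in replaced_parts(monday, tuesday):
    print(f"{host} {slot}: {old} -> {new}")
```

A changed serial number counts as a repair here, which is why this approach measures 'repairs' rather than 'failures'.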

Failure space: CERN IT by numbers (14/9/2011)
Number of systems                8,792
Number of processors            14,972
Memory modules                  55,729
Number of HDDs                  62,023
Number of RAID controllers       3,607
Number of Fibre Channel ports      742
Number of 1G ports              16,773
Number of 10G ports                622

How often?
Monitoring changes in serial numbers gives an idea
[Chart; annotation: Bulk campaigns]

How often?
Monitoring changes in serial numbers gives an idea
– Excluding campaigns: ~170 disks/month (5/day)
HDD failures/day: 5; hours/day: 24 → ~1 failure per 5 hrs
64,000 drives in the centre → MTTF = 320,000 hrs (spec: 1.2 Mhrs)
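
The arithmetic behind the MTTF figure can be reproduced in a few lines. The inputs are the numbers quoted on the slide; the function name is hypothetical, and the annualized failure rate uses the common approximation of hours per year divided by MTTF, which is an addition here and only reasonable for small rates.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365


def fleet_mttf_hours(drives, hours_between_failures):
    """Per-drive MTTF estimate: fleet size times the mean time between failures seen in the fleet."""
    return drives * hours_between_failures


failures_per_day = 5          # from the slide, excluding bulk campaigns
drives = 64_000               # drive count used on the slide
spec_mttf_hours = 1_200_000   # specification quoted on the slide

exact_gap = HOURS_PER_DAY / failures_per_day       # 4.8 hours between failures
mttf_exact = fleet_mttf_hours(drives, exact_gap)   # without rounding the gap
mttf_rounded = fleet_mttf_hours(drives, 5)         # the slide rounds the gap to 5 hours

print(f"MTTF (gap rounded to 5 h): {mttf_rounded:,} hours")
print(f"MTTF (exact gap):          {mttf_exact:,.0f} hours")
print(f"Spec / observed:           {spec_mttf_hours / mttf_rounded:.2f}x")
print(f"Approx. annual failure rate, observed: {HOURS_PER_YEAR / mttf_rounded:.2%}")
print(f"Approx. annual failure rate, spec:     {HOURS_PER_YEAR / spec_mttf_hours:.2%}")
```

Either way of rounding, the observed MTTF is several times shorter than the specified one.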

When?
Failure rates of hardware products typically follow a "bathtub curve", with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle [1].
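
One common way to picture a bathtub curve is as the sum of three hazard rates: a decreasing one for infant mortality, a constant one for random failures, and an increasing one for wear-out. The sketch below uses Weibull hazards for the first and last term. All parameter values are invented for illustration and are not fitted to any CERN data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(age_months):
    """Illustrative bathtub curve: failures per disk per month at a given age."""
    infant_mortality = weibull_hazard(age_months, shape=0.5, scale=200.0)  # shape < 1: decreasing
    random_failures = 0.002                                                # constant background rate
    wear_out = weibull_hazard(age_months, shape=5.0, scale=60.0)           # shape > 1: increasing
    return infant_mortality + random_failures + wear_out


for age in (1, 12, 24, 36, 60):
    print(f"age {age:2d} months: hazard {bathtub_hazard(age):.4f}")
```

With these made-up parameters the rate is high in the first months, lowest around the second and third year, and rises again towards month 60.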

When?
Process and categorize vendor calls according to 'warranty age' when the call was opened
[Chart; annotation: 10x disks to CPU servers]

When?
Quarterly disk failure rate normalized to number of disks
[Chart; annotation: Early failures (infant mortality)]
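
A quarterly failure rate by warranty age could be computed along these lines. This is a simplified sketch with hypothetical names and invented dates: it treats a quarter as 91 days and divides by a fixed number of disks in service, whereas a real analysis would track how the fleet size changes over time.

```python
from collections import Counter
from datetime import date

DAYS_PER_QUARTER = 91  # approximation


def quarterly_failure_rate(failures, disks_in_service):
    """Return {warranty quarter index: failures per disk}.

    failures is a list of (delivery_date, failure_date) pairs;
    quarter 0 covers the first ~91 days after delivery.
    """
    counts = Counter()
    for delivered, failed in failures:
        age_days = (failed - delivered).days
        counts[age_days // DAYS_PER_QUARTER] += 1
    return {q: n / disks_in_service for q, n in sorted(counts.items())}


# Hypothetical batch of 1,000 disks delivered on the same day.
delivered = date(2010, 1, 1)
sample = [
    (delivered, date(2010, 2, 1)),
    (delivered, date(2010, 3, 1)),
    (delivered, date(2010, 8, 1)),
    (delivered, date(2011, 6, 1)),
]

for quarter, rate in quarterly_failure_rate(sample, disks_in_service=1000).items():
    print(f"quarter {quarter}: {rate:.3%} of disks failed")
```

Early failures show up as a higher rate in the first quarters after delivery.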

When? Other failure types
Swappable: RAM, PSU, BBU, BMC, …
Complex repairs: cabling, backplane, main board, …
no clue…

Repairs
[Diagram; labels: Alarm, Vendor call, New sn: WD3342ABC]

By whom?
Vendor

How quickly?
Two contract types; 'Normal' is only used for CPU servers
Type     Time to intervene    Repair time
Normal   24 working hours     40 working hours
Fast     4 working hours      12 working hours
[Slide annotation: ~30%]

Ongoing Improvements
Tracking changes to servers
– Keep current tools that report HW info:
    Controller 0: Vendor="Intel Corporation" Model="82801JI (ICH10 Family) SATA AHCI Controller" Location="/sys/devices/pci0000:00/0000:00:1f.2" BBU="None" Cache="None" Serial="None" Version="None" Driver="ahci" Type="sata"
    Controller 0 Port 0: Vendor="WDC" Model="WD1002FBYS-02A6B0" Size="953869" Serial="WD-WMATV " Version="03.00C06" Device="sda"
    Controller 0 Port 1: Vendor="WDC" Model="WD1002FBYS-02A6B0" Size="953869" Serial="WD-WMATV " Version="03.00C06" Device="sdb"
    Controller 0 Port 2: Vendor="WDC" Model="WD1002FBYS-02A6B0" Size="953869" Serial="WD-WMATV " Version="03.00C06" Device="sdc"
    BIOS: Vendor="American Megatrends Inc." Version=" (07/20/2009)" smt="enabled"
    BMC: Vendor="Winbond" Model="IPMI 2.0" IPMI Version="2.0" MAC="00:00:00:00:00:0A" Serial="" Version="1.12"
    CPU 0: Vendor="GenuineIntel" Model="Intel(R) Xeon(R) CPU 2.27GHz" Cores="4" Speed="2270"
    CPU 1: Vendor="GenuineIntel" Model="Intel(R) Xeon(R) CPU 2.27GHz" Cores="4" Speed="2270"
    NIC 0: Vendor="Intel Corporation" Model="82574L Gigabit Network Connection" Location="/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0" MAC="00:00:00:00:00:00" Speed=" " Bus="pci" Media="ethernet" Version="1.9-0"
    NIC 1: Vendor="Intel Corporation" Model="82574L Gigabit Network Connection" Location="/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0" MAC="00:00:00:00:00:0F" Speed=" " Bus="pci" Media="ethernet" Version="1.9-0"
    RAM 0: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM1A" Type="Other" Serial=" "
    RAM 1: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM1B" Type="Other" Serial=" "
    RAM 2: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM2A" Type="Other" Serial=" "
    RAM 3: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM2B" Type="Other" Serial=" "
    RAM 4: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM3A" Type="Other" Serial=" "
    RAM 5: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P2-DIMM3B" Type="Other" Serial=" "
    RAM 6: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM1A" Type="Other" Serial=" "
    RAM 7: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM1B" Type="Other" Serial=" "
    RAM 8: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM2A" Type="Other" Serial=" "
    RAM 9: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM2B" Type="Other" Serial=" "
    RAM 10: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM3A" Type="Other" Serial=" "
    RAM 11: Vendor="Hyundai" Model="HMT151R7BFR4C-G7" Size="4096" Data Rate="1066" Location="P1-DIMM3B" Type="Other" Serial=" "
    Serial: "SDFGSDFG34DFGDFG345DFGDFG345"
– Will store each server's HW info as a document (HW inventory)
– Key is a unique id stored in the BMC when the hardware is purchased
– Change log, e.g. replaced parts, for each server
– Goals:
   – Better accessibility and usability of data
   – Provide a base for a more comprehensive HW inventory tool
   – Systematic tracking of parts replacement due to failure
   – Trending and potential action (e.g. #disk replacements in last month > X)
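
The ideas on this slide (a per-server HW document, a change log of replaced parts, and a threshold on recent replacements) can be sketched as follows. This is an assumption-laden illustration, not the planned CERN tool: the function names, the regular expression, the 30-day window and the sample serial numbers are all hypothetical. Only the `Component: Key="value"` line layout is taken from the report above.

```python
import re
from datetime import date, timedelta

# Matches Key="value" pairs; keys may contain spaces, e.g. "Data Rate".
FIELD = re.compile(r'([A-Za-z][A-Za-z ]*?)="([^"]*)"')


def parse_hw_info(text):
    """Parse 'Component: Key="value" ...' lines into {component: {key: value}}."""
    doc = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        component, rest = line.split(":", 1)  # values may contain ':' too, so split once
        doc[component.strip()] = dict(FIELD.findall(rest))
    return doc


def serial_changes(old, new):
    """List (component, old serial, new serial) where the Serial field changed."""
    return [
        (name, old[name].get("Serial"), fields.get("Serial"))
        for name, fields in new.items()
        if name in old and old[name].get("Serial") != fields.get("Serial")
    ]


def too_many_replacements(change_dates, today, threshold, window_days=30):
    """True if more than `threshold` replacements fall within the last `window_days`."""
    window = timedelta(days=window_days)
    recent = [d for d in change_dates if timedelta(0) <= today - d <= window]
    return len(recent) > threshold


# Hypothetical reports for the same server before and after a disk swap.
before = parse_hw_info(
    'Controller 0 Port 0: Vendor="WDC" Model="WD1002FBYS-02A6B0" Serial="WD-OLD-0001" Device="sda"\n'
    'RAM 0: Vendor="Hyundai" Data Rate="1066" Serial="R-1"'
)
after = parse_hw_info(
    'Controller 0 Port 0: Vendor="WDC" Model="WD1002FBYS-02A6B0" Serial="WD-NEW-0002" Device="sda"\n'
    'RAM 0: Vendor="Hyundai" Data Rate="1066" Serial="R-1"'
)

print(serial_changes(before, after))
print(too_many_replacements([date(2011, 9, 1), date(2011, 9, 10)], date(2011, 9, 14), threshold=1))
```

Storing the parsed dictionary as the per-server document keeps the change log a simple comparison between two versions of that document.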

Conclusions
Hardware fails
– As expected
– More often than expected: MTTF ~320 khours rather than 1.2 Mhours
– When expected:
   – Effect of early failures (infant mortality) in the first year
   – No sign of wear-out at the end of the 3-year warranty
Repairs are currently carried out by the vendor
– Missed repair targets in ~30% of cases
– Looking at a different model…

Questions?