
Slide 1: Measurements of Hardware Reliability in the Fermilab Farms
HEPiX/HEPNT, Oct 25, 2002
S. Timm, Fermilab Computing Division, Operating Systems Support Dept., Scientific Computing Support Group

Slide 2: Introduction
- Four groups of Linux nodes have made it through a three-year life cycle (186 machines), all from commodity "white box" vendors.
- Our goal: to measure the hardware failure rate and calculate the total cost of ownership.

Slide 3: Burn-in and Service
- All nodes are given a 30-day burn-in: the CPU is tested with seti@home, disks with bonnie, and the network with nettest.
- Failures during the burn-in period are the vendor's problem to fix (parts and labor).
- After the burn-in period there is a 3-year warranty on parts; Fermilab covers the labor through the on-site service provider, Decision One.
- Lemon law: any node down for 5 straight days, or on 5 separate occasions, must be completely replaced. (A sketch of this rule as code follows below.)
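
The lemon-law test is mechanical enough to automate. A minimal sketch, assuming a hypothetical outage log of (start, end) date pairs per node; the record format and function name are illustrative, not from the talk:

```python
from datetime import date, timedelta

def needs_replacement(outages: list[tuple[date, date]]) -> bool:
    """Lemon law from the slide above: replace a node that was down
    for 5 straight days or on 5 separate occasions."""
    long_outage = any(end - start >= timedelta(days=5) for start, end in outages)
    return long_outage or len(outages) >= 5

# Example: a single 6-day outage already triggers replacement.
print(needs_replacement([(date(2002, 3, 1), date(2002, 3, 7))]))  # True
```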

Slide 4: Definition of Hardware Fault
- A hardware failure severe enough that the machine is not usable.
- Hardware changed out during the burn-in period doesn't count.
- Fan replacements (routine maintenance) don't count.
- Sometimes we replaced a disk and it didn't solve the problem; that is still counted.
- Multiple service calls in the same incident count as a single hardware fault. (See the counting sketch below.)
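
These counting rules are easy to get wrong when tallying by hand. A minimal sketch, assuming a hypothetical service-call log; the field names and the incident-grouping key are illustrative, not from the talk:

```python
# Count hardware faults from a service-call log under the rules above:
# burn-in swaps and fan replacements are excluded, and multiple calls
# against the same incident collapse into one fault.
def count_faults(calls: list[dict]) -> int:
    incidents = set()
    for call in calls:
        if call["during_burn_in"]:       # burn-in swap-outs don't count
            continue
        if call["component"] == "fan":   # routine maintenance doesn't count
            continue
        incidents.add((call["node"], call["incident_id"]))  # dedupe calls
    return len(incidents)

calls = [
    {"node": "fnpc1", "incident_id": 7, "component": "disk", "during_burn_in": False},
    {"node": "fnpc1", "incident_id": 7, "component": "motherboard", "during_burn_in": False},
    {"node": "fnpc2", "incident_id": 1, "component": "fan", "during_burn_in": False},
]
print(count_faults(calls))  # 1: two calls on one incident, fan call excluded
```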

Slide 5: Infant Mortality
- The routine hardware-call counts don't include swap-outs during the burn-in period.
- We expect, and are prepared for, initial quality problems.
- During install and burn-in we have demanded and received total swap-outs of:
  – motherboards (twice)
  – cases (once)
  – racks (once)
  – power supplies (twice)
  – system disks (twice in the same group of nodes)

Slide 6: IDE/DMA Errors
- The ServerWorks LE chipset had a broken IDE controller.
- Observed in the following Pentium III boards: Tyan 2510, Tyan 2518, Intel STL2, Intel SCB2, Supermicro 370DLE, ASUS CUR-DLS; basically anything for sale in 2001. (The Tyan 2518 was the best of a bad lot.)
- The hardware fault was observed under both Windows and Linux, and with a hardware logic analyzer: the chipset thought DMA was still on even though the drive had finished its transfer.
- Systems were most sensitive when trying to write to the system disk and swap at the same time.

Slide 7: IDE/DMA Errors, cont'd
- Behavior varied by disk drive: Seagate drives showed file corruption, Western Digital drives caused occasional system hangs, and IBM drives were OK (up to the 2.4.9 kernel).
- The vendor did two complete system-disk swaps, first to Western Digital, then to IBM.
- The problem reappears with a new 2.4.18 kernel "feature" that shuts down the drive and halts the machine if one of these errors occurs.
- Most IDE/DMA errors are not counted in the error summary below.

Slide 8: CPU Power in Fermi Cycles
- CPU clock speeds are not comparable across the Intel PIII, Intel Xeon (P4), and AMD Athlon MP.
- SPEC CPU2000 numbers don't go back far enough for historical comparison.
- We define a 1 GHz PIII = 1000 Fermi Cycles.
- The compilers Fermilab is tied to can't deliver the full performance promised by SPEC CPU2000 numbers; an AMD Athlon MP 1800+ is faster than a 2.0 GHz Xeon.
- Performance is measured by the real performance of our applications on the systems.

Slide 9: Farms Buying History

  Purchased   Type        # nodes   Cost ($)   Fermi Cycles   FC per $
  Jun 1998    PII 333     36        85,128     21,600         0.25
  Sep 1999    PIII 500    150       409,400    141,900        0.34
  Sep 2000    PIII 750    50        212,955    75,000         0.37
  Jan 2001    PIII 800    40        110,410    64,000         0.57
  Jun 2001    PIII 1000   136       341,060    272,000        0.80
  Dec 2001    PIII 1000   32        61,980     64,000         0.96
  Feb 2002    PIII 1266   16        33,768     42,176         1.24
  Mar 2002    PIII 1266   32        77,760     84,352         1.08
  Sep 2002    AMD 2000    240       403,000    810,240        2.01
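
The last column follows directly from the cost and Fermi-cycle columns. A quick check in code, using a few rows taken from the table above:

```python
# Fermi Cycles per dollar for selected purchases (cost in $, FC from table).
purchases = [
    ("Jun 1998", 85_128, 21_600),
    ("Jun 2001", 341_060, 272_000),
    ("Sep 2002", 403_000, 810_240),
]
for when, cost_usd, fermi_cycles in purchases:
    print(f"{when}: {fermi_cycles / cost_usd:.2f} FC per dollar")
# Jun 1998: 0.25, Jun 2001: 0.80, Sep 2002: 2.01, matching the table.
```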

Slide 10: First Linux Farm
- 36 nodes, ran from 1998 to 2001.
- 32 hardware failures: 25 system disks, 6 power supplies, 1 memory.
- These nodes had only one disk, used for system, staging, swap, everything, and they swapped heavily due to low memory.
- Failures were correlated with power outages.
- Rate: 0.024 failures/machine-month.

Slide 11: Mini-Tower Farms, 1999
- 150 nodes, organized into 3 farms of 50: CDF, D0, and Fixed Target.
- Bought Sep 1999; just out of warranty as of Sep 2002.
- 140 nodes are still in the farms; statistics are based on them.
- 3 disks in each node: one system and two data.

Slide 12: Mini-Tower Farms, cont'd
- Fixed Target: 50 nodes, only 5 service calls over 3 years (1 memory problem, 1 bad data disk, 3 bad motherboards, one of those caused by a failed BIOS upgrade).
- CDF: 50 nodes, 19 service calls over 3 years (5 system disks, 2 power supplies, 9 data disks, 2 motherboards, 1 CPU).
- D0: 40 nodes, 18 service calls over 3 years (9 system disks, 2 power supplies, 3 data disks, 3 motherboards, 1 network card).

Slide 13: (chart)

Slide 14: Analysis
Four different failure rates:
- Old farm: 0.024 failures/machine-month
- FT farm: 0.0028 +/- 0.0012 failures/machine-month
- CDF: 0.0083 +/- 0.0021 failures/machine-month
- D0: 0.0130 +/- 0.0044 failures/machine-month
Statistical analysis reveals the distributions are not statistically consistent with each other, and are not Poisson either. CDF and D0 are identical hardware in the same computer room.
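
The quoted uncertainties behave like simple Poisson counting errors (square root of the failure count over the exposure). A minimal sketch that reproduces the FT number from slide 12's 5 service calls on 50 nodes over 3 years:

```python
import math

def failure_rate(failures: int, nodes: int, months: float) -> tuple[float, float]:
    """Failures per machine-month with a Poisson (sqrt-N) uncertainty."""
    exposure = nodes * months          # machine-months of running
    return failures / exposure, math.sqrt(failures) / exposure

rate, err = failure_rate(failures=5, nodes=50, months=36)
print(f"{rate:.4f} +/- {err:.4f}")    # 0.0028 +/- 0.0012, as quoted above
```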

Slide 15: Analysis, continued
The failure rate could depend on any of the following:
– Frequency of use (the D0 farm is typically loaded > 98%, the others less)
– Vigilance of system administrators in finding and addressing hardware errors
– Phase of the moon
– Dependability of the hardware
– Cooling efficiency

Slide 16: Residual Value
- The latest farm purchase got us 2 Fermi Cycles per dollar.
- The residual value of the 140 nodes bought in 1999 is $70K; they could be replaced with 40 of the nodes we are buying today.
- Cost of electricity = 180 W * 150 machines * 26,280 hrs * $0.047/kWh = $33.3K.
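
Both figures can be reproduced directly. A sketch of the arithmetic (26,280 hours is 3 years of continuous running; the dual-CPU assumption for the 1999 nodes is ours, not stated on the slide):

```python
# Electricity: 180 W per machine, 150 machines, 3 years, $0.047/kWh.
kwh = 180 / 1000 * 150 * 26_280
print(f"${kwh * 0.047 / 1000:.1f}K")        # $33.3K

# Residual value: 140 surviving PIII 500 nodes (assumed dual-CPU) are
# 140,000 Fermi Cycles; at the Sep 2002 price of ~2.01 FC per dollar
# that is about $70K, or roughly 40 of today's AMD nodes.
fc_old = 140 * 2 * 500
print(f"${fc_old / 2.01 / 1000:.0f}K")      # ~$70K
print(round(fc_old / (810_240 / 240)))      # ~41 new nodes
```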

Slide 17: Total Cost of Ownership
- Depreciation: $339K
- Maintenance: $20K (estimate)
- Electricity: $33K (estimate)
- Memory upgrades: $23K
- Total: $415K
- Personnel: 2 FTE * 3 years; how much? (This doesn't count developer time or user time.)

Slide 18: Lessons Learned
- Hitech has been out of business for more than a year.
- Decision One was still able to get replacement parts from component vendors, at least for processors and disk drives.
- Decision One identified a replacement motherboard, since the original one is no longer manufactured.
- Conclusion: we can survive if a vendor doesn't stay in business for the length of the 3-year warranty.

Slide 19: Cost Forecast for 2U Units
Maintenance costs will be higher:
– We have already racked up $10K of maintenance in 1.5 years of deployment on 64 CDF nodes, for example.
– Costs are dominated by memory upgrades and disk swaps.

Slide 20: 2U Intel Boards
- 50 2U nodes for D0, bought Sep '00. 9 power supplies were replaced during burn-in. Since then: 1 system disk, 2 power supplies, 6 memory, 4 data disks, 6 motherboards, 1 network card. Four nodes have been to the shop > 3 times. 0.016 failures/machine-month.
- 23 nodes for CDF, bought Jan '01. So far: 1 system disk, 11 power supplies, 1 data disk, 1 network card. 0.031 failures/machine-month.
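
As a cross-check on the D0 rate, the listed post-burn-in calls sum to 20; assuming roughly 25 months of service from Sep '00 to the date of this talk (our assumption), the quoted rate falls out:

```python
failures = 1 + 2 + 6 + 4 + 6 + 1           # post-burn-in calls listed above
machine_months = 50 * 25                   # ~25 months, Sep '00 to Oct '02
print(f"{failures / machine_months:.3f}")  # 0.016 failures/machine-month
```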

Slide 21: 2U Supermicro Boards
- 64 nodes for CDF, bought Jun '01. 10 system disks, 2 data disks, 3 motherboards, 1 floppy, 2 batteries (not to mention the total swap of system disks, twice). 0.010 failures/machine-month.
- 40 nodes for FT, bought Jun '01. Only 1 problem so far: memory. 0.002 failures/machine-month.
- Identical hardware in the two groups, but the failure rates differ by a factor of five!

Slide 22: 2U Tyan Boards
- 32 bought for D0, arrived Dec 28, 2001 (after being sent back for new motherboards and cases). 3 hardware calls so far, all system disks. 0.003 failures/machine-month.
- 16 bought for KTeV, arrived March '02. 1 hardware call so far: a data disk. 0.009 failures/machine-month.
- 32 bought for CDF, arrived April '02. 2 hardware calls so far: a system disk and a CPU.

Slide 23: Summary

  Farm name            Failures/machine-month
  Old fnpc1-37         0.024
  fncdf1-50            0.010
  fnd01-40             0.012
  fnpc201-250          0.003
  fnd051-100           0.015
  fncdf51-73           0.027
  fncdf91-154          0.018
  fnpc51-90            0.002
  fnd0101-132          0.009
  fnpc1-16             0.008
  fncdf75-90,155-170   0.009

Slide 24: Hardware Errors by Type (chart)

Slide 25: Conclusions Thus Far
- We now format a disk and check it for bad blocks before placing a service call to replace it; this can often rescue a disk.
- At the moment, software-related hangs are a much greater problem than hardware errors, and are more time-consuming to diagnose.
- With 750 machines and 0.01 failures/machine-month we can expect about 8 hardware failures per month.
- Grand total: 10,692 machine-months so far, at 0.0122 failures per machine-month. Machines currently running average 0.0105 failures per machine-month.
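
The expected-failure figure in the third bullet is a simple mean; the Poisson spread around it (our addition, not on the slide) shows how much month-to-month variation to plan for:

```python
import math

machines, rate = 750, 0.01        # fleet size, failures per machine-month
expected = machines * rate        # mean failures per month
spread = math.sqrt(expected)      # Poisson standard deviation
print(f"{expected:.1f} +/- {spread:.1f} failures/month")  # 7.5 +/- 2.7
```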

Slide 26: Cluster Errors over Time (chart)
