Slide 1 What Happens Before A Disk Fails? Randi Thomas, Nisha Talagala

Presentation transcript:

Slide 1 What Happens Before A Disk Fails? Randi Thomas, Nisha Talagala

Slide 2 Motivation
ISTORE:
–Proposes to take advantage of predicted failures to improve system robustness
–Uses a switched network design to connect intelligent devices to each other to improve system performance.
»Therefore ISTORE devices do not share electrical connections
»Is this another ISTORE advantage?
This talk examines:
–The potential to predict failures for disk devices
–If and how the failure of a device sharing electrical connections with other devices affects those other devices

Slide 3 Just Before a Disk Fails...
Can we predict the disk failure? To answer we will investigate:
–What kind of log messages does the system generate?
–When do these messages get generated?
–How do we distinguish a failing disk from a non-failing disk?
Are the other connected devices in the system affected in any way? To answer we will investigate:
–Are there correlations between the logged messages?

Slide 4
* Which Logs on What System?
–The Error Logs Generated by Berkeley’s Tertiary Disk System
–Log Dates: January to November, 1998
* The Tertiary Disk Application
–A WEB Accessible Image Collection
–Available 24 hours/day, 7 days/week

Slide 5 Outline
* Tertiary Disk Architecture
Example of a log Message
What Kind of Messages are generated?
Can we predict the disk failure?
Are the other connected devices in the system affected in any way?
Summary and Conclusion

Slide 6 The Tertiary Disk Architecture
20 PCs (m0-m19):
–200 MHz Pentium Pros
–96 MB of RAM
–Running FreeBSD version 2.2
–Connected through a switched Ethernet network
–Hosts a set of disks using fast-wide SCSI 2 in the single ended mode
»Using twin channel SCSI controllers
Total of 368 Disks
–8 GB each
–State of the Art in 1996

Slide 7 The Tertiary Disk Architecture
4 PCs (m0 - m3) have 28 or more disks each:
–2-3 SCSI Chains per PC
–9-15 Disks per SCSI chain
16 PCs (m4 - m19) have 16 disks each:
–2 SCSI Chains per PC
–8 Disks per SCSI chain
SCSI bus made up of:
–SCSI cable: Connects the controller and enclosure
–Backplane of the enclosure
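
The per-host figures on this slide can be checked against the 368-disk total from the previous slide. The sketch below is only an arithmetic cross-check; it assumes the 16-disk hosts hold exactly 16 disks and uses 28 as the minimum for the four larger hosts. The variable names are illustrative and do not come from the talk.

```python
# Figures taken from slides 6 and 7; names are illustrative only.
small_hosts = 16               # m4 - m19
disks_per_small_host = 2 * 8   # 2 SCSI chains x 8 disks per chain
large_hosts = 4                # m0 - m3
min_disks_per_large_host = 28  # "28 or more disks each"
total_reported = 368           # "Total of 368 Disks"

# Smallest total consistent with the per-host figures.
min_total = (small_hosts * disks_per_small_host
             + large_hosts * min_disks_per_large_host)

# Disks left for the four larger hosts once the 16 smaller hosts are counted.
large_host_disks = total_reported - small_hosts * disks_per_small_host

print(min_total, large_host_disks)
```

Under these assumptions the minimum total already equals the reported total, which suggests the four larger hosts average 28 disks each.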

Slide 8 The Tertiary Disk Architecture
[Diagram; labels: To Ethernet Switch, SCSI Cable, SCSI Backplane, Disk Enclosure, SCSI Controller, Ethernet Terminator]

Slide 9 Outline
Tertiary Disk Architecture
* Example of a log Message
What Kind of Messages are generated?
Can we predict the disk failure?
Are the other connected devices in the system affected in any way?
Summary and Conclusion

Slide 10 Example of A Log Message
Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): WRITE(06). CDB: a c b1 bf 80 0
Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): HARDWARE FAILURE info:cb1bf asc:44,0
Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): Internal target failure field replaceable unit: 1 sks:80,3
Month Day Time --> Oct 22 14:53:50
Machine name --> m6
Source of message --> kernel reporting message
Error Device --> disk = da1, SCSI bus = ahc0
Description of Error --> Write request had a write fault and caused a HW Failure
More information --> Driver & SCSI Controller Codes
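
The field breakdown above can be sketched as a small parser. This is a minimal sketch, not the tooling used for the talk: the regular expression is inferred from the three sample lines on this slide only, so other message layouts in the real logs may not match it, and the names `LOG_RE`, `LogRecord`, and `parse_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample lines on the slide (an assumption):
# "<Mon> <day> <hh:mm:ss> <machine> <source>: (<disk>:<bus>:...): <text>"
LOG_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<machine>\S+)\s+(?P<source>\S+):\s+"
    r"\((?P<disk>[^:()]+):(?P<bus>[^:()]+):[^)]*\):\s+"
    r"(?P<text>.*)$"
)


class LogRecord(NamedTuple):
    month: str
    day: int
    time: str
    machine: str
    source: str
    disk: str
    bus: str
    text: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into the fields named on the slide, or return None."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    return LogRecord(
        month=match["month"],
        day=int(match["day"]),
        time=match["time"],
        machine=match["machine"],
        source=match["source"],
        disk=match["disk"],
        bus=match["bus"],
        text=match["text"],
    )


sample = ("Oct 22 14:53:50 m6 /kernel: (da1:ahc0:0:1:0): "
          "HARDWARE FAILURE info:cb1bf asc:44,0")
print(parse_line(sample))
```

Lines that do not follow this layout return `None` rather than raising, so a caller can count and inspect unmatched lines separately.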

Slide 11 Outline
Tertiary Disk Architecture
Example of a log Message
* What Kind of Messages are generated?
Can we predict the disk failure?
Are the other connected devices in the system affected in any way?
Summary and Conclusion

Slide 12 What kind of messages are generated?
Data Disk Error Messages:
–Hardware Error: The command terminated unsuccessfully due to a non-recoverable hardware failure. (Type is given in the message)
–Medium Error: The operation was unsuccessful due to a flaw in the medium --> usually recommends reassigning sectors
–Recoverable Error: The last command completed with the help of some error recovery at the target --> e.g. if the drive dynamically reassigned a bad sector to an available spare sector
–Not Ready: The drive cannot be accessed at all
SCSI Error Messages:
–Time Outs: Can happen in any of the SCSI bus phases, e.g. message, data, idle. Response: a BUS RESET command
–Parity: Cause of an aborted request
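
The two message families above can be sketched as a keyword classifier over the message text. Only the string "HARDWARE FAILURE" appears verbatim in the sample log on Slide 10; every other substring below is an assumption about how the kernel words these messages, and the names `DISK_KEYS`, `SCSI_KEYS`, and `classify` are hypothetical.

```python
from typing import Optional, Tuple

# Substring -> short label. Only "HARDWARE FAILURE" is confirmed by the
# sample log; the remaining substrings are assumed spellings.
DISK_KEYS = {
    "HARDWARE FAILURE": "hardware",
    "MEDIUM ERROR": "medium",
    "RECOVERED ERROR": "recovered",
    "NOT READY": "not_ready",
}
SCSI_KEYS = {
    "TIMED OUT": "timeout",
    "PARITY": "parity",
}


def classify(text: str) -> Optional[Tuple[str, str]]:
    """Return (family, kind) for a message text, or None if unrecognized."""
    upper = text.upper()
    for key, kind in DISK_KEYS.items():
        if key in upper:
            return ("disk", kind)
    for key, kind in SCSI_KEYS.items():
        if key in upper:
            return ("scsi", kind)
    return None


print(classify("HARDWARE FAILURE info:cb1bf asc:44,0"))
print(classify("WRITE(06). CDB: a c b1 bf 80 0"))
```

Disk keywords are checked first, so a line that happens to match both families is reported as a disk error.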

Slide 13 Outline
Tertiary Disk Architecture
Example of a log Message
What Kind of Messages are generated?
* Can we predict the disk failure?
Are the other connected devices in the system affected in any way?
Summary and Conclusion

Slide 14 m0: SCSI Time Outs+Recovered Errors SCSI Bus 0

Slide 15 m0: SCSI Time Outs+Recovered Errors SCSI Bus 4

Slide 16 m0: SCSI Time Outs+Recovered Errors SCSI Bus 0

Slide 17 m0: SCSI Time Outs+Recovered Errors SCSI Bus 0

Slide 18 Can we predict a disk failure?
Yes, we can look for Recovered Error messages --> on :
–There were 433 Recovered Error Messages
–These messages lasted for slightly over an hour between:
»12:43 and 14:10
On : Disk 5 on m0 was “fired”, i.e. it was about to fail so it was swapped
Another example...

Slide 19 m11: SCSI Time Outs SCSI Bus 0

Slide 20 m11: SCSI Time Outs + Hardware Failures SCSI Bus 0

Slide 21 Can we predict a disk failure?
Yes, we can also look for Hardware Failure messages -->
–These messages lasted for 8 days between:
» and
–On disk 9 there were:
»1763 Hardware Failure Messages, and
»297 Timed Out Messages
Disk 9 on SCSI Bus 0 of m11 was “fired”, i.e. it was about to fail so it was swapped on
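
Per-disk figures like the message counts and time spans above could be produced by a tally of this shape. This is a sketch under assumptions, not the authors' analysis code: the event tuple layout and the name `summarize` are hypothetical, and the sample events are made up for illustration rather than taken from the Tertiary Disk logs.

```python
from datetime import datetime


def summarize(events):
    """Tally messages per (machine, disk, kind).

    events: iterable of (timestamp, machine, disk, kind) tuples.
    Returns {(machine, disk, kind): {"count", "first", "last"}}.
    """
    summary = {}
    for ts, machine, disk, kind in events:
        key = (machine, disk, kind)
        if key not in summary:
            summary[key] = {"count": 0, "first": ts, "last": ts}
        entry = summary[key]
        entry["count"] += 1
        entry["first"] = min(entry["first"], ts)
        entry["last"] = max(entry["last"], ts)
    return summary


# Made-up events for illustration only.
events = [
    (datetime(1998, 1, 1, 12, 0), "mX", 9, "hardware"),
    (datetime(1998, 1, 1, 12, 5), "mX", 9, "timeout"),
    (datetime(1998, 1, 3, 8, 30), "mX", 9, "hardware"),
]
for key, entry in summarize(events).items():
    print(key, entry["count"], entry["last"] - entry["first"])
```

The difference between `last` and `first` gives the duration of each message run, which is the quantity the slide reports as "8 days".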

Slide 22 Outline
Tertiary Disk Architecture
Example of a log Message
What Kind of Messages are generated?
Can we predict the disk failure?
* Are the other connected devices in the system affected in any way?
Summary and Conclusion

Slide 23 Are the other connected devices in the system affected in any way?
Yes, observe the Time Out message traffic on other disks on the same SCSI bus for -->
–The same 8 day period:
» and
What about predicting other kinds of failures besides just disk failures? -->
–Distinguishing between failing and non-failing disks...

Slide 24 m2: SCSI Bus 2 Parity Errors

Slide 25 m2: SCSI Bus 2 Parity Errors

Slide 26 Can We Predict Other Kinds of Failures?
Yes, the flurry of parity errors on m2 occurred between:
– and , as well as
– and
On
–m2 had a bad enclosure --> cables or connections defective
–The enclosure was then swapped
Note: The activity logs are not available for the earlier time period.

Slide 27 Can We Distinguish a Failing Disk From a Non-Failing Disk?
Yes...
SCSI Error Messages alone --> No impending disk failure
–As in the m2 Parity example
Disk Error Messages alone or accompanied by SCSI Error Messages --> High Probability of an impending disk failure e.g.
–ALONE: m0 had only Recovered Error Messages:
»Disk 5 was about to fail and therefore was “fired”
–BOTH: m11 had both Hardware Failure Disk Messages and Time Out SCSI Messages:
»Disk 9 was about to fail and therefore was “fired”
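
The rule on this slide can be written down as a short decision function. This is a sketch of the heuristic as stated, not a validated predictor: treating any non-zero count as significant is an assumption (the slides give no threshold), the function name `assess` and its return labels are hypothetical, and the talk itself rests on only two failed disks.

```python
def assess(disk_errors: int, scsi_errors: int) -> str:
    """Apply the slide's rule to message counts for one disk.

    disk_errors: count of Disk Error Messages (hardware, medium,
                 recovered, not ready).
    scsi_errors: count of SCSI Error Messages (time outs, parity).
    """
    if disk_errors > 0:
        # Disk errors alone, or together with SCSI errors.
        return "disk_failure_likely"
    if scsi_errors > 0:
        # SCSI errors alone, as in the m2 parity example.
        return "no_disk_failure_indicated"
    return "no_errors"


print(assess(433, 0))     # m0, disk 5: Recovered Error messages only
print(assess(1763, 297))  # m11, disk 9: Hardware Failure plus Timed Out
print(assess(0, 50))      # SCSI errors only; the count 50 is made up
```

A "no_disk_failure_indicated" result does not mean the system is healthy: in the m2 case the parity errors pointed to a bad enclosure instead.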

Slide 28 Outline
Tertiary Disk Architecture
Example of a log Message
What Kind of Messages are generated?
Can we predict the disk failure?
Are the other connected devices in the system affected in any way?
* Summary and Conclusion

Slide 29 Total Disk & SCSI Errors Per Machine

Slide 30 Summary and Conclusion
Disks don’t fail very often
–In the 10 months of logs, only two disks failed
–We have only 2 data points for these conclusions!
We can predict disk failures and other kinds of failures with enough time to do something about it
There are correlations between the logged messages:
–Hardware Failure Messages on one disk device propagate as Time Out Messages on:
»not only the failing disk,
»but also other disks on the same SCSI bus
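
The correlation described above can be looked for by grouping messages per SCSI bus. The sketch below assumes events have already been reduced to (machine, bus, disk, kind) tuples with the labels "hardware" and "timeout"; that layout, the function name `timeout_spread`, and the sample events are all hypothetical, and co-occurrence on a bus is only suggestive, not proof of propagation.

```python
from collections import defaultdict


def timeout_spread(events):
    """Find buses where a disk logged hardware failures and list the
    other disks on that bus that logged time outs.

    events: iterable of (machine, bus, disk, kind) tuples.
    Returns {(machine, bus): (failing_disks, other_disks_with_timeouts)}.
    """
    failing = defaultdict(set)
    timed_out = defaultdict(set)
    for machine, bus, disk, kind in events:
        if kind == "hardware":
            failing[(machine, bus)].add(disk)
        elif kind == "timeout":
            timed_out[(machine, bus)].add(disk)
    return {
        key: (sorted(disks), sorted(timed_out[key] - disks))
        for key, disks in failing.items()
    }


# Made-up events for illustration only.
events = [
    ("mX", 0, 9, "hardware"),
    ("mX", 0, 9, "timeout"),
    ("mX", 0, 3, "timeout"),
    ("mX", 1, 4, "timeout"),
]
print(timeout_spread(events))
```

Buses with time outs but no hardware failure messages are left out of the result, since the slide's claim concerns propagation from a failing disk.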

Slide 31 Back Up Slides

Slide 32 m0: SCSI Time Outs SCSI Bus 2