Hardware Redundancy MTAT.03.240 Seminar on Enterprise Software Olgun Cakabey B06337, Othmar Mwambe B06324 December 2010
Agenda What is Redundancy? Introduction to Hardware Redundancy Hardware Components Disk Storage RAID (Redundant Array of Independent Disks) RAID Configurations Hardware Redundancy Techniques Conclusions References Demo
What is Redundancy? In engineering, redundancy is the duplication of critical components of a system with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe.
Concept of Redundancy Hardware redundancy is the addition of extra hardware, usually for the purpose of either detecting or tolerating faults. Software redundancy is the addition of extra software, beyond what is needed to perform a given function, to detect and possibly tolerate faults. Information redundancy is the addition of extra information beyond that required to implement a given function; for example, error detection codes. Time redundancy uses additional time to perform the functions of a system such that fault detection and often fault tolerance can be achieved. Transient faults are tolerated by this.
Introduction to Hardware Redundancy Hardware redundancy concentrates not only on recovery from failures but also on protection against them. It always demands a trade-off between cost and achievable dependability. Costs: additional components, area, power, shielding, ... Figure: Computer Without Redundancy
Hardware Components Several parts of a computer system are central to any discussion of hardware redundancy: CPU, memory, backplane and system bus, I/O and network cards, power supplies, and cables and connections. Some systems have a layer between the CPUs and the operating system, sometimes called a hypervisor.
Disk Storage
Types of Storage Disks Disks are one of the most important parts of a computer system, as they store the data, application programs, and operating systems. Types: data disks; operating system disks (e.g. a bootable CD).
RAID (Redundant Array of Independent Disks) RAID is a way of storing the same data in different places (thus, redundantly) on multiple hard disks. There are different RAID levels, but they all follow the same idea: the data of one I/O request (read or write) coming from the computer system is sent to the RAID group and distributed there across multiple disks, enriched with redundant information to provide protection against disk failure(s).
RAID Configurations If a disk drive fails, the redundant RAID group is able to reconstruct the lost information. There are two parameters which describe a stripe: the number of disks (also called stripe width) and the number of bytes written to a disk as a chunk.
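How the reconstruction of data works Parity checking is a rudimentary method of detecting simple, single-bit errors in a memory system. Parity-based RAID levels apply the same idea across disks: the parity block is the XOR of the data blocks in a stripe, so any single lost block can be recomputed from the surviving blocks and the parity.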
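
As a minimal sketch of parity-based reconstruction, assuming a stripe of three equally sized data blocks plus one XOR parity block (the block contents and the helper name `xor_blocks` are illustrative, not taken from any RAID implementation):

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three data blocks of one stripe (illustrative contents).
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Assume the disk holding data[1] fails: XOR of the survivors
# and the parity block yields the lost block again.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

The same function computes the parity and rebuilds a block, because XOR is its own inverse. Note that every surviving block of the stripe has to be read to rebuild one lost block.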
RAID 0: block-level striping without parity or mirroring. Provides improved performance and additional storage, but no redundancy or fault tolerance. It combines several disks into one stripe, with the goal that the I/O load is evenly distributed between the disks.
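
How a stripe spreads load can be sketched with the two stripe parameters, the number of disks and the chunk size. This is a simplified round-robin layout under assumed values (4 disks, 64 KiB chunks); the function name `locate` is hypothetical and real controllers may lay data out differently:

```python
def locate(offset, num_disks, chunk_size):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    chunk_index = offset // chunk_size      # which chunk of the logical volume
    disk = chunk_index % num_disks          # chunks rotate over the disks
    stripe = chunk_index // num_disks       # full stripes before this chunk
    return disk, stripe * chunk_size + offset % chunk_size


CHUNK = 64 * 1024  # assumed chunk size
for offset in (0, CHUNK, 4 * CHUNK + 10):
    print(offset, locate(offset, num_disks=4, chunk_size=CHUNK))
```

Consecutive chunks land on consecutive disks, so a large sequential request keeps all disks busy at once.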
RAID 0
RAID 1: mirroring without parity or striping. This is the first, and simplest, level of redundancy: data is written identically to multiple disks (a "mirrored set"). No parity has to be computed, which keeps overhead low and performance good. Mirroring can decrease write performance slightly, as twice the amount of data needs to be transferred.
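
A toy model of a mirrored set, assuming in-memory dictionaries stand in for disks (the class `Mirror` and its methods are hypothetical names for illustration only):

```python
class Mirror:
    """Toy mirrored set: every write goes to all healthy disks."""

    def __init__(self, copies=2):
        self.disks = [dict() for _ in range(copies)]
        self.failed = set()

    def fail(self, index):
        """Mark one disk as failed."""
        self.failed.add(index)

    def write(self, block_no, data):
        # The same data is transferred once per healthy disk.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                disk[block_no] = data

    def read(self, block_no):
        # Any healthy copy can serve the read.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                return disk[block_no]
        raise IOError("all mirrors have failed")


m = Mirror()
m.write(0, b"payload")
m.fail(0)
print(m.read(0))
```

After disk 0 fails, the read is served from the remaining copy without any reconstruction step.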
RAID 1
RAID 3: byte-level striping with dedicated parity. Each single I/O request is distributed over all data disks. The performance of RAID 3 is very good for large, single requests, as all disks are used equally. Disadvantage: to reconstruct a failed drive, all the data needs to be read, which makes reconstruction much slower than with RAID 1.
RAID 3
RAID 5: block-level striping with distributed parity. On small writes, RAID 5 is inefficient: each time a block is written, the old data block and the old parity block must first be read so that the new parity can be computed. Disadvantage: like RAID 3, RAID 5 has slow redundancy recovery times, since all the data needs to be read in order to reconstruct the lost data.
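
The small-write cost can be illustrated with XOR parity: the new parity is the old parity with the old data removed and the new data added. A sketch under the assumption of a three-block stripe with illustrative contents (`small_write` is a hypothetical helper name):

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(old_data, old_parity, new_data):
    """Return the new parity after overwriting one data block.

    Needs two reads (old data, old parity) before the two writes
    (new data, new parity); the other data blocks are not touched.
    """
    return xor(xor(old_parity, old_data), new_data)


d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor(xor(d0, d1), d2)

new_d1 = b"ZZZZ"
new_parity = small_write(d1, parity, new_d1)

# Same result as recomputing parity over the whole stripe.
print(new_parity == xor(xor(d0, new_d1), d2))
```

One logical write thus turns into two reads and two writes, which is why small random writes are the weak spot of this level.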
RAID 5
RAID 6 / Double-Parity RAID: provides fault tolerance against two drive failures. This makes larger RAID groups more practical, especially for high-availability systems.
RAID 6 / Double-Parity RAID
RAID 10 and RAID 01: Combining Stripes and Mirrors. Sometimes it is useful to combine multiple RAID groups with different RAID levels. Disk outages in the RAID 10 configuration leave the mirror intact, though without redundancy.
RAID 10 and RAID 01
Comparison
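
Two simple axes for comparing the levels above are usable capacity and the number of disk failures that are guaranteed to be survived. The following is a rough sketch assuming n identical disks, RAID 1 as an n-way mirror, and RAID 10 built from two-disk mirrors (so n must be even); the function name `raid_summary` is hypothetical:

```python
def raid_summary(level, n, disk_size):
    """Return (usable capacity, guaranteed tolerated disk failures)."""
    table = {
        "RAID0": (n, 0),        # all disks hold data, no redundancy
        "RAID1": (1, n - 1),    # n copies of one disk's worth of data
        "RAID3": (n - 1, 1),    # one dedicated parity disk
        "RAID5": (n - 1, 1),    # one disk's worth of distributed parity
        "RAID6": (n - 2, 2),    # two disks' worth of parity
        "RAID10": (n // 2, 1),  # assumes two-disk mirrors; more only if lucky
    }
    data_disks, tolerated = table[level]
    return data_disks * disk_size, tolerated


for level in ("RAID0", "RAID1", "RAID5", "RAID6", "RAID10"):
    print(level, raid_summary(level, n=4, disk_size=1000))
```

Performance and rebuild time are deliberately left out of this sketch, since they depend on workload and hardware.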
Hardware Redundancy Techniques Passive techniques Active techniques Hybrid techniques
Passive Techniques Also known as static techniques. They implement fault masking: the fault does not show up, since it is transparently removed. No action from the system is required. No reconfiguration; inherently fault tolerant. Examples: voting, error-correcting codes, N-modular redundancy (NMR), flux summing, special logic, TMR with duplex.
Fault Masking Fault masking “hides” faults that occur. It does not require detecting faults, but it does require containment of faults (the effect of every fault should be local).
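
Fault masking by voting can be sketched as a majority voter over module outputs. This is a software illustration of the idea, assuming three modules (TMR) whose outputs can be compared for equality; the function name `vote` is hypothetical:

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among module outputs")
    return value


# One of three modules delivers a wrong result; the voter hides it.
print(vote([7, 7, 9]))
```

The faulty module is neither detected nor removed here; its wrong output is simply outvoted, which is what makes the technique passive.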
Active Techniques Also known as dynamic techniques. Actions are required for a correct result: detection, localization, containment, recovery. There is no fault masking: these techniques do not attempt to prevent faults from producing errors within the system. After fault detection, the system is reconfigured to avoid a failure by removing the faulty hardware from the system.
Active Techniques (continued) Most common in applications that can tolerate temporary erroneous results, e.g. satellite systems, where it is preferable to have temporary failures than a high degree of redundancy. Examples: stand-by sparing, duplication with comparison, pair-and-a-spare, watchdog timer.
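
Stand-by sparing can be sketched as detect-then-reconfigure: the active module runs, a check detects a bad result, and a spare is switched in. The sketch assumes modules are plain callables and that fault detection is a caller-supplied plausibility check; the class `StandbySparing` and its parameters are hypothetical:

```python
class StandbySparing:
    """One active module plus spares, switched in after fault detection."""

    def __init__(self, modules, is_valid):
        self.modules = list(modules)  # modules[0] is active, the rest are spares
        self.is_valid = is_valid      # assumed detection mechanism
        self.active = 0

    def run(self, x):
        while self.active < len(self.modules):
            try:
                result = self.modules[self.active](x)
                if self.is_valid(x, result):
                    return result
            except Exception:
                pass  # a crash also counts as a detected fault
            self.active += 1  # reconfigure: retire the faulty module
        raise RuntimeError("no healthy module left")


faulty = lambda x: -1     # always returns an implausible value
healthy = lambda x: x * x
unit = StandbySparing([faulty, healthy], is_valid=lambda x, r: r >= 0)
print(unit.run(3), unit.active)
```

Unlike voting, a wrong result is produced first and only then detected, so the caller sees a delay while the system reconfigures.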
Hybrid Techniques A combination of passive and active techniques: fault masking plus reconfiguration. Fault masking is used to prevent erroneous results (temporary errors), and spares are provided to replace faulty hardware (high reliability).
Hybrid Techniques (continued) Expensive, but better suited to achieving higher reliability and more fault tolerance. Types: self-purging redundancy, N-modular redundancy with spares, triple-duplex architecture.
Conclusions Redundancy is never free! The choice is application-dependent. Critical computation (momentary erroneous results are not acceptable): passive or hybrid. Long-life, high-availability (system should be restored quickly): active. Very critical applications (highest reliability): hybrid.
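
N-modular redundancy with spares combines both ideas: a voter masks the fault, and any module that disagrees with the majority is replaced by a spare. A simplified sketch, assuming modules are callables and replacement happens immediately after one disagreement (the function name `nmr_with_spares` is hypothetical):

```python
from collections import Counter


def nmr_with_spares(active, spares, x):
    """Vote over the active modules, then swap out any dissenting module."""
    outputs = [module(x) for module in active]
    majority, count = Counter(outputs).most_common(1)[0]
    if count <= len(active) // 2:
        raise RuntimeError("no majority among module outputs")
    for i, out in enumerate(outputs):
        if out != majority and spares:
            active[i] = spares.pop(0)  # reconfigure using a spare
    return majority                    # the fault itself stays masked


good = lambda x: x + 1
bad = lambda x: 0
active, spares = [good, bad, good], [good]
print(nmr_with_spares(active, spares, 1), len(spares))
```

The caller receives a correct result on the very first call, and the module set is back to full strength for the next one, at the price of extra hardware.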
References [1] Schmidt, Klaus. High Availability and Disaster Recovery: Concepts, Design, Implementation. Springer, 2009. [2] http://en.wikipedia.org/wiki/Redundancy_%28engineering%29 [3] http://en.wikipedia.org/wiki/RAID [4] Siewiorek, Daniel P., and Swarz, Robert S. Reliable Computer Systems, 3rd ed. Wellesley, MA: A. K. Peters, Ltd., 1998. ISBN 156881092X.
THANK YOU ANY QUESTIONS?