Lecture 13 Fault Tolerance Networked vs. Distributed Operating Systems.



Fault-Tolerance

Fault-tolerance, or graceful degradation, is the property that enables a system (often computer-based) to continue operating properly when one or more of its components fail. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, whereas in a naïvely designed system even a small failure can cause total breakdown. Fault-tolerance is particularly sought after in high-availability or life-critical systems.

Fault-tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even over communication links that are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity and only reduce throughput by a proportional amount.
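The four conditions named for TCP (loss, duplication, reordering and corruption) can be illustrated with a toy receiver. This is only a sketch of the idea, not TCP itself: real TCP numbers bytes rather than segments, uses a different checksum, and relies on acknowledgements and retransmission. The names `Segment`, `make_segment` and `reassemble` are hypothetical, and CRC-32 is an assumption chosen for convenience.

```python
import zlib
from typing import NamedTuple


class Segment(NamedTuple):
    seq: int        # position of this segment in the stream (0, 1, 2, ...)
    payload: bytes
    checksum: int   # CRC-32 of the payload, computed by the sender (assumption)


def make_segment(seq: int, payload: bytes) -> Segment:
    """Build a segment with a checksum over its payload."""
    return Segment(seq, payload, zlib.crc32(payload))


def reassemble(received: list[Segment]) -> bytes:
    """Rebuild the longest intact prefix of the stream from unreliable input."""
    good: dict[int, bytes] = {}
    for seg in received:
        if zlib.crc32(seg.payload) != seg.checksum:
            continue  # corruption: drop it; a real sender would retransmit
        if seg.seq in good:
            continue  # duplication: ignore the extra copy
        good[seg.seq] = seg.payload
    out = []
    expected = 0
    while expected in good:  # reordering: deliver strictly in sequence
        out.append(good[expected])
        expected += 1
    return b"".join(out)     # loss: delivery stops at the first gap


segments = [make_segment(i, chunk) for i, chunk in enumerate([b"fault", b"-", b"tolerant"])]
corrupted = Segment(1, b"X", segments[1].checksum)  # payload damaged in transit
arrivals = [segments[2], corrupted, segments[0], segments[0], segments[1]]
print(reassemble(arrivals))  # b'fault-tolerant'
```

Data integrity is preserved in every case; the faults only delay delivery, which is the "proportional" reduction in throughput described above.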

Fault Tolerant Computer Systems

Most fault-tolerant computer systems are designed to handle several kinds of failure: hardware-related faults such as hard disk failures, input or output device failures, or other temporary or permanent failures; software bugs and errors; interface errors between the hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences, or installing unexpected software; and physical damage or other flaws introduced to the system from an outside source.

[Figure: A conceptual design of a segregated-component fault-tolerant computer]

RAID

RAID, an acronym for Redundant Array of Inexpensive Disks or Redundant Array of Independent Disks, is a technology that achieves high levels of storage reliability from low-cost, less reliable PC-class disk drives by arranging the devices into arrays for redundancy. All implementations of RAID except RAID 0 are examples of a fault-tolerant storage device that uses data redundancy.

RAID 0 (striped disks) distributes data across multiple disks, which improves speed because the disks can be accessed in parallel. If one disk fails, however, all of the data on the array is lost: there is neither parity nor mirroring, so RAID 0 is not redundant.

RAID 1 mirrors the contents of the disks, making a form of 1:1 real-time backup. The contents of each disk in the array are identical to those of every other disk in the array. A RAID 1 array requires a minimum of two drives.

RAID 3 or 4 (striped disks with dedicated parity) combines three or more disks in a way that protects data against the loss of any one disk. Fault tolerance is achieved by adding an extra disk to the array, which is dedicated to storing parity information; the overall capacity of the array is reduced by one disk.

RAID 5 (striped set with distributed, or interleaved, parity) requires three or more disks. The array needs all drives but one to be present to operate; a failed drive must be replaced, but a single drive failure does not destroy the array. After a drive failure, subsequent reads are calculated from the distributed parity, so the failure is masked from the end user.

RAID 6 (striped disks with dual parity) combines four or more disks in a way that protects data against the loss of any two disks.

RAID 1+0 (or 10) is a mirrored data set (RAID 1) which is then striped (RAID 0), hence the "1+0" name. A RAID 1+0 array requires a minimum of four drives: two mirrored drives to hold half of the striped data, plus another two mirrored drives for the other half.

RAID 0+1 (or 01) is a striped data set (RAID 0) which is then mirrored (RAID 1). A RAID 0+1 array requires a minimum of four drives: two to hold the striped data, plus another two to mirror the first pair.
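The single-parity levels (RAID 3, 4 and 5) typically compute parity as a byte-wise XOR across the data blocks of a stripe, which is what lets a lost block be recalculated from the survivors. The following is a minimal sketch of that idea, assuming equally sized in-memory blocks; `xor_blocks` is a hypothetical helper, not part of any RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data disks, plus its parity block.
data_disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_disks)

# Simulate losing disk 1: XOR of the surviving blocks and the parity
# block yields the missing block, because x ^ x == 0 cancels the rest.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'BBBB'
```

A single XOR parity block can recover only one missing block per stripe, which matches the one-disk protection described for RAID 3, 4 and 5; RAID 6 needs a second, independent parity calculation to survive two failures.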

Characteristics of Fault Tolerance

The basic characteristics of fault tolerance require:

1. No single point of failure
2. No single point of repair
3. Fault isolation to the failing component
4. Fault containment to prevent propagation of the failure
5. Availability of reversion modes

In addition, fault-tolerant systems are characterized in terms of both planned and unplanned service outages. These are usually measured at the application level and not just at the hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.
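An availability percentage translates directly into an allowed amount of downtime. The sketch below works through that arithmetic, assuming a 365-day year and the conventional reading of "N nines" (for example, five nines as 99.999%); the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label}: {downtime_minutes_per_year(pct):.2f} minutes/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, from roughly 526 minutes per year at three nines to a little over 5 minutes per year at five nines.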

No Single Point of Failure

Spare components address the first fundamental characteristic of fault-tolerance in three ways:

Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum.
Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover).
Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
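The difference between replication with a quorum and redundancy with failover can be sketched in a few lines. This is an illustration under simplifying assumptions: instances are plain Python callables, calls are made one after another rather than in parallel, and a strict majority counts as the quorum. All names are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

Instance = Callable[[int], int]


def replicated_call(replicas: Sequence[Instance], x: int) -> int:
    """Replication: ask every replica and accept the majority answer."""
    answers = []
    for replica in replicas:
        try:
            answers.append(replica(x))
        except Exception:
            pass  # a crashed replica simply contributes no vote
    if not answers:
        raise RuntimeError("no quorum")
    value, votes = Counter(answers).most_common(1)[0]
    if votes <= len(replicas) // 2:  # require a strict majority of all replicas
        raise RuntimeError("no quorum")
    return value


def failover_call(instances: Sequence[Instance], x: int) -> int:
    """Redundancy: use the first instance that responds."""
    last_error = None
    for instance in instances:
        try:
            return instance(x)
        except Exception as exc:
            last_error = exc  # remember the failure, try the next spare
    raise RuntimeError("all instances failed") from last_error


def healthy(x: int) -> int:
    return x * x


def buggy(x: int) -> int:
    return x * x + 1  # wrong answer, but no crash


def crashed(x: int) -> int:
    raise ConnectionError("instance down")


print(replicated_call([healthy, buggy, healthy], 4))  # 16
print(failover_call([crashed, healthy], 4))           # 16
```

Voting masks a replica that returns a wrong answer, while failover only masks one that fails outright. Diversity corresponds to passing independently written implementations of the same function to `replicated_call`, so that a bug in one implementation is outvoted by the others.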

No Single Point of Repair

If a system experiences a failure, it must continue to operate without interruption during the repair process. Fault-tolerant servers go beyond the concept of high availability to enter the era of "continuous availability". Such servers are designed to guarantee a level of availability that corresponds, on average, to less than 5 minutes of unplanned interruption per year, including the time necessary for repairs, updates, and general maintenance.

Fault isolation to the failing component

When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation.

Cisco IOS XR Software Architecture

Cisco IOS XR Software is the first fully modular, fully distributed internetwork operating system built on a microkernel-based, memory-protected architecture that strictly segments all operating system components, from device drivers and file systems to management interfaces and routing protocols, helping to ensure complete process separation and fault isolation.
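One common form of dedicated failure detection is a heartbeat monitor: each component reports in periodically, and a component that stays silent for too long is singled out as the suspect. The sketch below is an assumption about how such a mechanism might look, not something taken from the lecture or from Cisco IOS XR; the class, the component names and the explicit `now` timestamps are all hypothetical, the last chosen to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Flags components whose last heartbeat is older than a timeout."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def beat(self, component: str, now: float) -> None:
        """Record that a component reported in at time `now` (seconds)."""
        self.last_seen[component] = now

    def failed(self, now: float) -> list[str]:
        """Return the components that have been silent longer than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
for name in ("disk0", "nic0", "psu0"):
    monitor.beat(name, now=0.0)

# disk0 and psu0 keep reporting; nic0 goes silent.
monitor.beat("disk0", now=4.0)
monitor.beat("psu0", now=4.0)
print(monitor.failed(now=5.0))  # ['nic0']
```

Because the monitor tracks each component separately, the failure is attributed to the single silent component rather than to the system as a whole.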

Fault containment

Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. From a computing perspective, containers are objects that can "hold" a collection of other objects or entities. The Cisco hierarchical Containment Model can reflect the real-world topology of the network that is being modelled, in a physical, logical or business-oriented sense.

Fault extensions for fault management and root cause analysis.
Configuration extensions for configuration management and policy assurance.
Accounting extensions for the accounting aspect of network management.
Performance extensions for monitoring and analyzing a network's performance.
Security extensions for managing a network's security.