Fault Tolerance Distributed Web-based Systems

Fault Tolerance Distributed Web-based Systems BY Avinash Thadamatha

Introduction Fault tolerance is the property that enables a system to continue operating properly when one or more of its components fail. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, whereas in a naively designed system even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.

Why Fault Tolerance? Fault tolerance is needed in order to provide three main features to distributed systems: 1) Reliability 2) Availability 3) Security

Distributed Systems A distributed system is a collection of loosely coupled nodes interconnected by a communication network. From the point of view of a specific node in a distributed system, the rest of the nodes and their respective resources are remote, whereas its own resources are local. The nodes in a distributed system may vary in size and function. They may include small microprocessors, personal computers, and large general-purpose computer systems.

Why? There are four major reasons for building distributed systems: resource sharing, computation speedup, reliability, and communication.

Web-based Distributed Systems • Fundamentally the same as other distributed systems • Relatively simple client-server architecture • Access to local file system • Several different protocols

Fault Tolerance Systems Dependability covers several useful requirements for a fault-tolerant system: availability, reliability, safety, and maintainability. Availability: the system is in a ready state and can deliver its functions to its users. A highly available system is one that will most likely be working at any given instant in time. Reliability: the ability of a computer system to run continuously without failure. A highly reliable system works for a long period of time without interruption. Safety: when a system temporarily fails to operate correctly, nothing catastrophic happens. Maintainability: how easily a failed system can be repaired. A highly maintainable system can also show a high degree of availability, especially if failures can be detected and repaired automatically.
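As a rough illustration of the availability definition above, steady-state availability is commonly estimated as MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The function names and the sample figures below are assumptions chosen for illustration; they do not come from the presentation.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability as MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected unavailable hours in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    # Hypothetical figures: fails on average every 999 h, takes 1 h to repair.
    a = availability(999.0, 1.0)
    print(f"availability: {a:.3f}")
    print(f"expected downtime: {downtime_hours_per_year(a):.2f} h/year")
```

Note that a system can be highly available without being highly reliable: frequent failures that are repaired very quickly keep availability high while the system still cannot run for long without interruption.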

Failures in fault-tolerant systems are separated into the following categories: performance, omission, timing, crash, and fail-stop.

Performance: the hardware or software components cannot meet the demands of the user. Omission: components fail to carry out some of the commands they receive. Timing: components cannot carry out a command at the right time. Crash: components stop with no response and cannot be repaired. Fail-stop: when the software identifies an error, it ends the process or action. This is the easiest category to handle, but its simplicity means it often does not match real situations.
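A minimal sketch of how an observer might tell three of these categories apart from the outside. It assumes the observer knows whether the component is still alive (for example via a heartbeat) and how long a reply took; the names `Failure`, `classify`, `alive`, `reply_time`, and `deadline` are hypothetical and not part of the presentation.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "none"
    OMISSION = "omission"
    TIMING = "timing"
    CRASH = "crash"


def classify(alive: bool, reply_time: Optional[float], deadline: float) -> Failure:
    """Classify one request from what the caller can observe.

    alive      -- whether the component still answers a liveness probe (assumed)
    reply_time -- seconds until the reply arrived, or None if no reply came
    deadline   -- latest acceptable reply time in seconds
    """
    if not alive:
        return Failure.CRASH      # component stopped and gives no response
    if reply_time is None:
        return Failure.OMISSION   # component is up but dropped this command
    if reply_time > deadline:
        return Failure.TIMING     # reply arrived, but too late
    return Failure.NONE


if __name__ == "__main__":
    print(classify(alive=True, reply_time=2.5, deadline=1.0))
```

In practice a crash and an omission can look identical to a remote caller, which is one reason fail-stop behavior, where the failure is announced, is easier to handle.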

In terms of duration, three forms of error can be distinguished: 1) Permanent errors: these damage software components so that the program can no longer run or function. In this case the program is restarted; an example is a program crash. 2) Temporary errors: these cause only brief damage to a software component; the damage resolves itself after some time and the software continues to function normally. 3) Periodic errors: these occur occasionally, for example when two programs conflict while running at the same time. To deal with this type of error, one of the programs is exited to resolve the conflict.
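Because temporary errors resolve themselves after some time, a common way to handle them is to retry the operation after a short wait, while permanent errors are passed on so the program can be restarted. The sketch below assumes the caller can tell the two apart; the exception classes and `call_with_retry` are hypothetical names introduced for illustration.

```python
import time


class TransientError(Exception):
    """A temporary error that is expected to clear up by itself."""


class PermanentError(Exception):
    """An error that retrying cannot fix; a restart or repair is needed."""


def call_with_retry(operation, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff.

    PermanentError (and any other exception) propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # still failing after the last attempt: give up
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Fails twice, then recovers, imitating a temporary error.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("resource briefly unavailable")
        return "done"

    print(call_with_retry(flaky), "after", state["calls"], "calls")
```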

Basic concepts of FT systems Fault tolerance mechanisms can be divided into three levels: hardware, software, and system fault tolerance. Hardware Fault Tolerance: This involves providing supplementary backup hardware such as CPUs, memory, hard disks, and power supply units. Hardware fault tolerance only supports the hardware itself by providing a basic backup; it cannot detect or stop errors such as accidental interference with programs or program errors. There are two approaches to hardware fault recovery: fault masking and dynamic recovery.
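Fault masking is often illustrated with triple modular redundancy (TMR): three replicas compute the same result and a voter passes on the majority value, so a single faulty replica is hidden from the rest of the system. The presentation does not name TMR; the sketch below is one standard instance of fault masking, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has a strict majority, because the
    fault can then no longer be masked.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replica outputs")
    return value


if __name__ == "__main__":
    # Two healthy replicas agree; the third one is faulty.
    print(majority_vote([42, 42, 7]))
```

Dynamic recovery, by contrast, detects the faulty unit and switches to a spare instead of outvoting it.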

Software Fault Tolerance: This is special software designed to tolerate errors that originate from software or programming faults. Software fault tolerance uses static and dynamic redundancy methods similar to those used for hardware faults. System Fault Tolerance: This is a complete system that automatically stores memory blocks and program checkpoints and detects errors in applications. When a fault or an error occurs, the system provides a mechanism to correct it.
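One well-known form of dynamic redundancy in software is the recovery block: a primary routine runs first, an acceptance test checks its result, and an alternate routine is tried if the test fails. The presentation does not name this technique, so the sketch below is an assumption about what such redundancy could look like; `recovery_block`, `buggy_sqrt`, and `accept_sqrt` are hypothetical.

```python
import math


def recovery_block(variants, acceptance_test, x):
    """Try each variant in order and return the first accepted result."""
    for variant in variants:
        try:
            result = variant(x)
        except Exception:
            continue  # a variant that raises counts as failed; try the next
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def buggy_sqrt(x):
    # Deliberately wrong primary routine, standing in for a programming error.
    return x / 2


def accept_sqrt(x, result):
    # Acceptance test: squaring the result must give back the input.
    return abs(result * result - x) < 1e-9


if __name__ == "__main__":
    print(recovery_block([buggy_sqrt, math.sqrt], accept_sqrt, 9.0))
```

The static counterpart would run several independently written versions on every input and vote on their results, much like the hardware voter used for fault masking.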

Comparison between different FT techniques

Replication

Type 1 of Replication

Check Pointing

User-triggered check pointing
• Requires user intervention.
• Useful only when users understand the computation.
• It is not easy to identify when a check point should be created.
Uncoordinated check pointing
• Also known as independent check pointing.
• Processes do not communicate with each other and create their own check points.
• No communication overhead.
Coordinated check pointing
• Processes communicate with each other.
• Temporary check points are created first and then made permanent.
• Recovery time is high, due to communication.
Message-based check pointing
• Suitable when communication is through message passing only.
• The state of each process is stored in the form of messages.
• If one process goes down, another takes its place and acquires its state with the help of the messages.
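The two-step idea behind coordinated check pointing, where temporary check points are taken first and made permanent only if every process succeeded, can be sketched in a single Python program. This is a simplified simulation under stated assumptions: real systems exchange messages between machines and must also deal with messages in flight, and the `Process` class and `coordinated_checkpoint` function are hypothetical names.

```python
import copy


class Process:
    """A simulated process with a tentative and a permanent check point."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.tentative = None
        self.permanent = copy.deepcopy(state)

    def take_tentative(self):
        # Phase 1: save the current state as a temporary check point.
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        # Phase 2: the temporary check point becomes the permanent one.
        self.permanent, self.tentative = self.tentative, None

    def abort(self):
        self.tentative = None

    def rollback(self):
        # Recovery: restore the last permanent check point.
        self.state = copy.deepcopy(self.permanent)


def coordinated_checkpoint(processes):
    """Commit a check point on all processes, or on none of them."""
    ok = all([p.take_tentative() for p in processes])
    for p in processes:
        if ok:
            p.commit()
        else:
            p.abort()
    return ok


if __name__ == "__main__":
    p1 = Process("p1", {"count": 5})
    p2 = Process("p2", {"count": 7})
    coordinated_checkpoint([p1, p2])
    p1.state["count"] = 9      # work done after the check point
    p1.rollback()              # simulated failure and recovery
    print(p1.state, p2.state)
```

Because all processes commit together, the permanent check points always form a consistent global state, which is the property uncoordinated check pointing gives up in exchange for having no communication overhead.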

Limitations

Conclusion Fault tolerance is a major part of distributed systems because it ensures the continuity and functionality of a system when a fault or failure occurs. This research presented different types of fault tolerance techniques in distributed systems, namely the check pointing and replication-based techniques. Each mechanism has advantages over the other, and each is costly to deploy. Software fault tolerance comprises check point storage and rollback recovery mechanisms, while system fault tolerance is a complete system that provides both software and hardware fault tolerance to ensure availability of the system during a failure, error, or fault.

Future Work Future research will compare the various data security mechanisms and their performance metrics.

Thank You