1 ACTIVE FAULT TOLERANT SYSTEM for OPEN DISTRIBUTED COMPUTING (Autonomic and Trusted Computing 2006) Giray Kömürcü.

Presentation transcript:

1 ACTIVE FAULT TOLERANT SYSTEM for OPEN DISTRIBUTED COMPUTING (Autonomic and Trusted Computing 2006) Giray Kömürcü

2 OPEN DISTRIBUTED SYSTEMS One of the most successful structures designed in the computer community. Side effects include: unanticipated runtime events; reconfiguration burdens due to environmental changes; increasing complexity, which limits development.

3 OPEN DISTRIBUTED SYSTEMS Reliability depends on both failures and performance. The required reliability has to be maintained. Fluctuations in the environment and its unpredictability create a complex set of requirements.

4 ACTIVE FAULT-TOLERANT MODEL Exploits knowledge of pre-fault behaviour to predict environmental faults and failures. Reduces the unpredictable nature of failures up to a certain limit. Provides a proactive approach to achieving the required reliability.

5 ACTIVE FAULT-TOLERANT MODEL Tolerates current failures that could not be predicted. Maintains the user-specified reliability through appropriate replication strategies. Uses information extracted from the system.

6 ACTIVE FAULT-TOLERANT MODEL

7 PROACTIVE APPROACH of AFT MODEL Designs a mechanism to forecast faults and failures. If AFT predicts a high chance of system failure, it takes the necessary steps to avoid the failure. The aim is to use available information about suspected failures to provide the required reliability.

8 REAL-TIME APPROACH of AFT MODEL Some failures cannot be predicted before they actually occur. This approach is based on real-time decision making and reconfiguration in response to current failures. It first identifies a failure, then tolerates it through adaptation strategies.

9 AFT STRATEGIES Replication is a complex function involving: replication degree, replica placement, replication protocol, and communication between replicas. A single replication strategy is not enough to achieve the required reliability.

10 ADJUSTING the DEGREE of REPLICATION The AFT model can achieve an optimal degree of replication. The AFT policy may increase the degree of replication if a failure becomes more probable. It may decrease the degree of replication if a member leaves the system, or to reduce communication costs.
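
As a rough illustration of adjusting the replication degree, the sketch below picks the smallest number of replicas that meets a target reliability. It assumes replicas fail independently with the same probability and that the service survives while at least one replica is alive. Neither assumption comes from the slides, and the function names are hypothetical.

```python
def reliability(failure_prob: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies survives."""
    return 1.0 - failure_prob ** replicas


def min_replicas(failure_prob: float, target: float, max_replicas: int = 64) -> int:
    """Smallest replication degree whose reliability reaches `target`.

    Assumption (not from the slides): independent, identically likely failures.
    """
    if not 0.0 <= failure_prob < 1.0:
        raise ValueError("failure_prob must be in [0, 1)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    for n in range(1, max_replicas + 1):
        if reliability(failure_prob, n) >= target:
            return n
    raise ValueError("target not reachable within max_replicas")
```

Under these assumptions, if the predicted per-replica failure probability rises from 0.1 to 0.3, the smallest sufficient degree for a 0.995 target moves from 3 to 5. This mirrors the policy of adding replicas when failure becomes more probable.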

11 MIGRATION of CURRENT REPLICAS Reliability depends not just on the number of replicas but also on their placement. Prime concern: which nodes should host replicas. The workload, storage capacity, bandwidth, and reliability of each server are considered.
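
One way to picture the placement decision is a weighted score over the four criteria named on this slide. The sketch below is only an illustration. The `Node` fields, the normalisation of every metric to the range 0..1, and the equal weights are assumptions, not the scoring rule of the AFT model.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    workload: float      # fraction of capacity in use, 0..1 (lower is better)
    free_storage: float  # fraction of storage free, 0..1
    bandwidth: float     # normalised available bandwidth, 0..1
    reliability: float   # estimated probability the node stays up, 0..1


def score(node: Node, weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Higher is better; the weights are hypothetical and sum to 1."""
    w_load, w_store, w_bw, w_rel = weights
    return (w_load * (1.0 - node.workload)
            + w_store * node.free_storage
            + w_bw * node.bandwidth
            + w_rel * node.reliability)


def choose_hosts(nodes: Sequence[Node], k: int) -> List[str]:
    """Return the names of the k best-scoring nodes, best first."""
    ranked = sorted(nodes, key=score, reverse=True)
    return [n.name for n in ranked[:k]]
```

A replica would then migrate when its current host drops out of the top-ranked set.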

12 SHIFTING into a SUITABLE REPLICATION PROTOCOL ADAPTIVELY

13 PRIMARY COPY REPLICATION Any update of data is sent to the primary copy first. Updates are propagated to backup nodes asynchronously. Efficient in terms of communication when many write messages occur. Suffers from a single point of failure.
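
The toy model below sketches the primary-copy behaviour described on this slide. Writes land on the primary at once and reach the backups only when a later propagation step runs. The class and method names are hypothetical, and a real system would propagate over the network rather than through in-memory dictionaries.

```python
from collections import deque


class PrimaryCopyGroup:
    """In-memory toy: one primary, several backups, asynchronous propagation."""

    def __init__(self, backup_count: int) -> None:
        self.primary = {}
        self.backups = [{} for _ in range(backup_count)]
        self.pending = deque()  # updates applied at the primary but not yet propagated

    def write(self, key, value) -> None:
        # The update goes to the primary first; backups are not touched yet.
        self.primary[key] = value
        self.pending.append((key, value))

    def propagate(self) -> int:
        """Push all pending updates to every backup; return how many were sent."""
        sent = 0
        while self.pending:
            key, value = self.pending.popleft()
            for backup in self.backups:
                backup[key] = value
            sent += 1
        return sent
```

Anything still in `pending` when the primary fails is lost, which is one way to see the single-point-of-failure problem.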

14 READ-ONE WRITE-ALL REPLICATION Updates can be performed anywhere in the system. Important when information has to be replicated immediately. Efficient when dealing with failures. Slow when a significant number of write operations is needed.

15 MAJORITY REPLICATION An intermediate solution between Primary Copy and ROWA (read-one write-all) replication. May be done in a pair-wise manner. Selection is based mainly on the trade-off between reliability and communication cost.
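
The communication-cost side of this trade-off can be sketched with standard quorum sizes. With n replicas, read-one write-all reads from 1 replica and writes to all n, while a majority scheme contacts floor(n/2) + 1 replicas for both. These are the textbook definitions rather than details taken from the slides, and the function names are hypothetical.

```python
from typing import Tuple


def quorum_sizes(protocol: str, n: int) -> Tuple[int, int]:
    """Return (read_quorum, write_quorum) for n replicas."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if protocol == "rowa":
        return 1, n
    if protocol == "majority":
        m = n // 2 + 1
        return m, m
    raise ValueError(f"unknown protocol: {protocol}")


def quorums_intersect(read_q: int, write_q: int, n: int) -> bool:
    """True if every read quorum overlaps every write quorum,
    and any two write quorums overlap each other."""
    return read_q + write_q > n and 2 * write_q > n
```

For five replicas this gives (1, 5) for ROWA and (3, 3) for majority. Majority writes are cheaper than ROWA writes, and majority reads cost more.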

16 SHIFTING into a SUITABLE REPLICATION PROTOCOL ADAPTIVELY

17 RELAXED vs STRICT Message synchronization: the choice depends on the network traffic caused by replication and communication overheads.
Relaxed:
–A set of updates is sent in a single message within a time period
–Less traffic
–Guarantees consistency only at certain points
–Loss of work is higher during a failure
–Not always consistent, but efficient
Strict:
–Each update is sent in its own message
–More traffic
–Consistent at each point
–Consistent but expensive
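
To make the traffic difference concrete, the sketch below counts messages under both modes for a list of update timestamps. Treating the "time period" of the relaxed mode as fixed-length windows starting at time zero is an assumption made for illustration. The function names are hypothetical.

```python
from typing import Sequence


def strict_message_count(update_times: Sequence[float]) -> int:
    """Strict mode: one message per update."""
    return len(update_times)


def relaxed_message_count(update_times: Sequence[float], window: float) -> int:
    """Relaxed mode: all updates in the same time window share one message."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Each distinct window index that contains an update produces one message.
    return len({int(t // window) for t in update_times})
```

With updates at times 0.1, 0.4, 0.9, 1.2 and 3.5 and a window of 1.0, strict mode sends 5 messages and relaxed mode sends 3. The cost is that the updates in an unsent window can be lost in a failure.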

18 DESIGN of AFT MODEL ON JUICE OBJECT Juice Model: the model for each replica –Based on an adaptable object model –Reconfigures its internal objects at run time –Consists of five internal elements

19 DESIGN of AFT MODEL ON JUICE OBJECT AFT provides adaptation facilities designed on the Juice Object model: Adaptation Handler (AH), Replication Handler (RH), Underlying System Information Evaluator (USIE), and Client Member Information Evaluator (CMIE).

20 AFT FRAMEWORK Collection of Information: USIE runs on each replica to collect local resource information, namely resource usage patterns and information about underlying system failures. Each machine holds a monitor object.

21 Collection of Information CMIE handles both the current replica's information and the most recently connected client's information (message failure rate, response time, network latency). This information is gathered from the communicator of the Juice Model.
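
A minimal sketch of the per-client bookkeeping this implies is shown below. It tracks two of the metrics listed on the slide, message failure rate and response time. The `ClientStats` class is hypothetical and is not the CMIE interface of the Juice model.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class ClientStats:
    sent: int = 0
    failed: int = 0
    response_times_ms: List[float] = field(default_factory=list)

    def record(self, ok: bool, response_time_ms: Optional[float] = None) -> None:
        """Record one message; a response time is stored only for successes."""
        self.sent += 1
        if not ok:
            self.failed += 1
        elif response_time_ms is not None:
            self.response_times_ms.append(response_time_ms)

    def failure_rate(self) -> float:
        """Fraction of recorded messages that failed (0.0 if none recorded)."""
        return self.failed / self.sent if self.sent else 0.0

    def mean_response_time_ms(self) -> Optional[float]:
        """Average response time of successful messages, or None if there are none."""
        return mean(self.response_times_ms) if self.response_times_ms else None
```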

22 Collection of Information

23 Information Analysis The Adaptation Handler (AH) analyses suspected or known system faults and failures using the available information. It predicts future faults and estimates the current reliability of the system. It carries out a cost-benefit analysis that considers user requirements. If needed, AH selects the best strategy: number of replicas, placement, replication protocol.
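
The slides do not say how the cost-benefit analysis is carried out. The sketch below therefore substitutes a simple rule: among candidate strategies that meet the user's required reliability, pick the cheapest, and if none qualifies, fall back to the most reliable one. The `Strategy` tuple, its fields, and the rule itself are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class Strategy(NamedTuple):
    name: str
    reliability: float  # estimated reliability if this strategy is adopted
    cost: float         # estimated cost, e.g. communication overhead (arbitrary units)


def select_strategy(candidates: Sequence[Strategy], required: float) -> Strategy:
    """Cheapest candidate meeting `required`, else the most reliable candidate."""
    if not candidates:
        raise ValueError("no candidate strategies")
    feasible = [s for s in candidates if s.reliability >= required]
    if feasible:
        return min(feasible, key=lambda s: s.cost)
    return max(candidates, key=lambda s: s.reliability)
```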

24 Information Analysis Selecting a suitable protocol requires agreement among the AHs of the replica group. One randomly chosen member collects the votes of the replicas. Replicas switch to the new protocol simultaneously, according to the decision.
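
The vote collection could look like the sketch below, which runs on whichever member has been chosen as collector. The slides do not state the decision rule, so the plurality count and the alphabetical tie-break used here are assumptions. The tie-break only ensures that every collector would reach the same decision from the same votes.

```python
from collections import Counter
from typing import Mapping


def decide_protocol(votes: Mapping[str, str]) -> str:
    """Return the protocol with the most votes.

    `votes` maps a replica id to the protocol name it prefers.
    Ties are broken by choosing the alphabetically first protocol name.
    """
    if not votes:
        raise ValueError("no votes collected")
    tally = Counter(votes.values())
    best = max(tally.values())
    return min(name for name, count in tally.items() if count == best)
```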

25 Execution of New Strategy AH notifies the Replication Handlers (RH) to replace themselves with the new object. Since the model is based on two configuration levels, switching between strategies does not lead to inconsistencies.
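
Replacing the replication object at run time resembles the strategy pattern, sketched below with two toy protocols. All class names are hypothetical. The sketch shows only the swap and does not model the two configuration levels that the slide credits with avoiding inconsistencies.

```python
from typing import List, Sequence


class PrimaryCopy:
    name = "primary-copy"

    def write_targets(self, replicas: Sequence[str]) -> List[str]:
        # Only the first replica (the primary) receives the write synchronously.
        return list(replicas[:1])


class Rowa:
    name = "rowa"

    def write_targets(self, replicas: Sequence[str]) -> List[str]:
        # Every replica receives the write.
        return list(replicas)


class ReplicationHandler:
    """Holds the current protocol object and lets it be replaced at run time."""

    def __init__(self, protocol) -> None:
        self._protocol = protocol

    def switch(self, new_protocol) -> str:
        """Install `new_protocol`; return the name of the protocol it replaced."""
        old_name = self._protocol.name
        self._protocol = new_protocol
        return old_name

    def write_targets(self, replicas: Sequence[str]) -> List[str]:
        return self._protocol.write_targets(replicas)
```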

26 CONCLUSION The paper describes the design of the AFT model, which allows the user to specify reliability and performance. AFT employs a combination of proactive and real-time fault-tolerant approaches in open distributed systems. The proactive approach exploits knowledge from USIE and CMIE to warn against probable faults, reduce failures, and increase performance significantly.

27 CONCLUSION The real-time approach deals with current faults. A single replication protocol cannot cope with environmental fluctuations. AFT uses three main strategies to fulfill the needs of the system. AFT allows the system to reconfigure and execute under different situations, and is therefore tightly integrated with environmental changes.

28 REFERENCE Lanka R., Oda K., Yoshida T.: ACTIVE FAULT TOLERANT SYSTEM for OPEN DISTRIBUTED COMPUTING. Autonomic and Trusted Computing (2006)

29 QUESTIONS? THANK YOU FOR LISTENING