1. Availability Study of Dynamic Voting Algorithms
Kyle Ingols and Idit Keidar
MIT Lab for Computer Science

2. Primary Component
For fault tolerance under network partitions:
–The network can partition into several components
–One component is designated primary
Only members of the primary component can modify shared data
Usually, the primary contains a majority of the processes (a quorum)
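As a baseline, the static majority rule is a one-line test. The sketch below is ours, not from the talk; the names and set representation are invented for illustration.

```python
def is_primary(component, universe):
    """Static voting: a component is primary iff it contains
    a strict majority of the fixed process universe."""
    return len(set(component) & set(universe)) > len(universe) / 2

universe = {1, 2, 3, 4, 5}
assert is_primary({1, 2, 3}, universe)   # 3 of 5: a quorum
assert not is_primary({4, 5}, universe)  # 2 of 5: not a quorum
```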

3. Dynamic Voting
Defines quorums adaptively: a new primary component must contain a majority of the previous one
Example of successive primaries:
{1, 2, 3, 4, 5, 6, 7, 8, 9} → {1, 2, 3, 4, 5} → {2, 3, 4} → {3, 4, 6, 10, 11}
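A minimal sketch of the adaptive test (ours; an idealized, centralized rendering, since the point of the talk is precisely that real systems cannot apply it atomically): replace the fixed universe with the last primary, and replay the slide's example chain.

```python
def becomes_primary(component, last_primary):
    """Dynamic voting: a component may form a new primary iff it
    contains a majority of the previous primary component."""
    lp = set(last_primary)
    return len(set(component) & lp) > len(lp) / 2

# The slide's example chain: each set is a majority of its predecessor.
chain = [{1, 2, 3, 4, 5, 6, 7, 8, 9},
         {1, 2, 3, 4, 5},
         {2, 3, 4},
         {3, 4, 6, 10, 11}]
for prev, new in zip(chain, chain[1:]):
    assert becomes_primary(new, prev)
```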

4. Dynamic Voting Benefits
Dynamic universe:
–New processes can join at any time
–Processes that leave are unlikely to return
Higher availability:
–Repeated failures may reduce the chance of a connected (static) majority

5. Previous Availability Studies
Stochastic analysis, stochastic Petri nets, simulations, empirical measurements, …
All assume a static universe
They show that dynamic voting forms a primary component most often

6. What Previous Studies Overlooked…
The transition to a new primary cannot be atomic in a distributed system

7. Bug Example
Initially, all are connected: {1, 2, 3, 4, 5}
1, 2 suspect 4, 5 and move to {1, 2, 3}
3 does not move with them; it detaches
1, 2 suspect 3 and move to {1, 2}
At the same time, 3, 4, 5 move to {3, 4, 5}
Two primaries!
–{1, 2} is a majority of {1, 2, 3}, while {3, 4, 5} is a majority of {1, 2, 3, 4, 5}, the last primary that 3, 4, 5 recorded
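Replaying the scenario with the naive test from the earlier sketch makes the split brain mechanical: because the move to {1, 2, 3} was not atomic, the two sides apply the majority test against different "previous primary" sets.

```python
def becomes_primary(component, last_primary):
    # naive dynamic-voting test from the earlier sketch
    lp = set(last_primary)
    return len(set(component) & lp) > len(lp) / 2

# 1 and 2 completed the move to {1, 2, 3}; 3 never recorded it,
# so 3, 4, 5 still remember the original primary {1, 2, 3, 4, 5}.
assert becomes_primary({1, 2}, {1, 2, 3})           # {1, 2} forms a primary
assert becomes_primary({3, 4, 5}, {1, 2, 3, 4, 5})  # so does {3, 4, 5}
```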

8. If a Failure Occurs in the Middle…
Some suggested algorithms are wrong
–See the example on the previous slide
Correct algorithms differ in:
1. How fast they recover
2. How many processes need to reconnect to allow recovery
Previous studies overlooked these differences

9. The (Correct) Algorithms
1-Pending:
–Uses 2-phase commit
–Has a single point of failure
Majority-Resilient 1-Pending (MR1P):
–Uses 3-phase commit
–Recovery takes 2 communication steps
–Then, the algorithm for the new primary takes 2 more
YKD: pipelines recovery with new algorithm instances
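The common thread of the fixes (our paraphrase, not the papers' pseudocode) is to record an attempted primary before it takes effect, and to make later quorum tests respect the unresolved attempt. Below is a sketch of the 1-Pending idea with the messaging, failure handling, and garbage collection of the real protocol omitted.

```python
class Process:
    """1-Pending sketch: a 2-phase-commit-style rule where at most
    one attempted primary may be unresolved at a time."""

    def __init__(self, last_primary):
        self.last_primary = set(last_primary)  # last committed primary
        self.pending = None                    # unresolved attempt, if any

    def may_form(self, component):
        # A new primary needs a majority of the last committed primary
        # AND of any attempt whose outcome this process does not know.
        def majority(s):
            return len(set(component) & s) > len(s) / 2
        return majority(self.last_primary) and (
            self.pending is None or majority(self.pending))

    def attempt(self, component):              # phase 1: record the attempt
        if self.may_form(set(component)):
            self.pending = set(component)
            return True
        return False

    def commit(self):                          # phase 2: attempt takes effect
        self.last_primary, self.pending = self.pending, None

# In the bug scenario, 3 records the attempt {1, 2, 3} before 1 and 2
# commit it, so {3, 4, 5} is blocked: it lacks a majority of {1, 2, 3}.
p3 = Process({1, 2, 3, 4, 5})
assert p3.attempt({1, 2, 3})
assert not p3.may_form({3, 4, 5})
```

Intuitively, the recorded attempt is also where the single point of failure shows up: a process left holding an unresolved pending set cannot resolve it alone, which is what MR1P and YKD improve on.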

10. Our Study
Simulations
No failures, only partitions:
–Same probability of being faulty as of being detached
–This gives an advantage to 2-phase commit
Multiple frequent connectivity changes:
–A fault is injected with fixed probability in each step
–Then, a stable period: check whether a primary exists
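The shape of the experiment can be sketched as a loop. This is schematic only: the algorithm object and its methods (`inject_connectivity_change`, `step`, `stabilize`, `has_primary`) are stand-ins we invented, not the study's actual harness.

```python
import random

def run_trial(algorithm, steps, p_fault, rng):
    """One trial: inject connectivity changes with fixed probability
    per step, then a stable period, then check for a primary."""
    for _ in range(steps):
        if rng.random() < p_fault:
            algorithm.inject_connectivity_change(rng)  # partition or merge
        algorithm.step()
    algorithm.stabilize()           # stable period: no further faults
    return algorithm.has_primary()

def availability(make_algorithm, trials=10_000, steps=100, p_fault=0.05):
    rng = random.Random(0)
    wins = sum(run_trial(make_algorithm(), steps, p_fault, rng)
               for _ in range(trials))
    return wins / trials
```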

11. Observations
The algorithms differ greatly in availability
–Especially in their degradation as changes become more frequent
2-phase commit suffers due to its single point of failure
3-phase commit suffers because it is more likely to be interrupted
YKD: no degradation in lengthy executions

12. Conclusion
Analysis of any kind may fail to consider important cases…
General lessons:
–Minimize the number of processes needed for recovery
–Speed up recovery by pipelining / parallelizing multiple instances