Fault Tolerance Computer Programs that Can Fix Themselves! Prof’s Paul Sivilotti and Tim Long Dept. of Computer & Info. Science The Ohio State University.

Slides:



Advertisements
Similar presentations
Computer Science and Engineering The Ohio State University To Ponder Does a problem get easier or harder to solve if I give you less information?
Advertisements

RAID Oh yes Whats RAID? Redundant Array (of) Independent Disks. A scheme involving multiple disks which replicates data across multiple drives. Methods.
Data and Computer Communications
CS 542: Topics in Distributed Systems Diganta Goswami.
Chapter 6 - Convergence in the Presence of Faults1-1 Chapter 6 Self-Stabilization Self-Stabilization Shlomi Dolev MIT Press, 2000 Shlomi Dolev, All Rights.
COS 461 Fall 1997 Routing COS 461 Fall 1997 Typical Structure.
Teaser - Introduction to Distributed Computing
CS 484. Discrete Optimization Problems A discrete optimization problem can be expressed as (S, f) S is the set of all feasible solutions f is the cost.
1 Chapter 1 Why Parallel Computing? An Introduction to Parallel Programming Peter Pacheco.
Master/Slave Architecture Pattern Source: Pattern-Oriented Software Architecture, Vol. 1, Buschmann, et al.
U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Emery Berger University of Massachusetts Amherst Operating Systems CMPSCI 377 Lecture.
Lecture 4: Elections, Reset Anish Arora CSE 763 Notes include material from Dr. Jeff Brumfield.
Business Continuity and DR, A Practical Implementation Mich Talebzadeh, Consultant, Deutsche Bank
Protocols. Basics Defining Interactions VERTICAL Application Presentation Session Transport Network Data Link Physical Please do this for me OK It’s.
Concurrent Processes Lecture 5. Introduction Modern operating systems can handle more than one process at a time System scheduler manages processes and.
1 Availability Study of Dynamic Voting Algorithms Kyle Ingols and Idit Keidar MIT Lab for Computer Science.
Dissuasive Methods Against Cheaters in Distributed Systems Kévin Huguenin Ph.D. defense, December 10 th 2010 TexPoint fonts used in EMF. Read the TexPoint.
Computer Programming and Basic Software Engineering 4. Basic Software Engineering 1 Writing a Good Program 4. Basic Software Engineering 3 October 2007.
CS294, YelickSelf Stabilizing, p1 CS Self-Stabilizing Systems
CS603 Process Synchronization February 11, Synchronization: Basics Problem: Shared Resources –Generally data –But could be others Approaches: –Model.
Last Class: Weak Consistency
CS352- Link Layer Dept. of Computer Science Rutgers University.
Computer Science Lecture 16, page 1 CS677: Distributed OS Last Class:Consistency Semantics Consistency models –Data-centric consistency models –Client-centric.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
GS 3 GS 3 : Scalable Self-configuration and Self-healing in Wireless Networks Hongwei Zhang & Anish Arora.
1 Chapter 10 Introduction to Metropolitan Area Networks and Wide Area Networks Data Communications and Computer Networks: A Business User’s Approach.
Storage System: RAID Questions answered in this lecture: What is RAID? How does one trade-off between: performance, capacity, and reliability? What is.
This is the way an organisation distributes the data across its network. It uses different types of networks to communicate the information across it.
Wireless Sensor Networks Self-Healing Professor Jack Stankovic University of Virginia 2005.
Introduction to Programming Prof. Rommel Anthony Palomino Department of Computer Science and Information Technology Spring 2011.
CPSC 171 Introduction to Computer Science 3 Levels of Understanding Algorithms More Algorithm Discovery and Design.
Section 4 : The OSI Network Layer CSIS 479R Fall 1999 “Network +” George D. Hickman, CNI, CNE.
Algorithms and Programming
1 Chapter 1 Analysis Basics. 2 Chapter Outline What is analysis? What to count and consider Mathematical background Rates of growth Tournament method.
Andreas Larsson, Philippas Tsigas SIROCCO Self-stabilizing (k,r)-Clustering in Clock Rate-limited Systems.
Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.
Distributed Systems: Concepts and Design Chapter 1 Pages
1 Next Few Classes Networking basics Protection & Security.
CS4231 Parallel and Distributed Algorithms AY 2006/2007 Semester 2 Lecture 10 Instructor: Haifeng YU.
Computer Science & Engineering An Introduction (and Some Advanced Concepts Too!) Prof. Paul Sivilotti Dept. of Computer Science & Engineering The Ohio.
Introducing 8 th Grade Girls to Fault Tolerant Computing (An Experience Report) Paul A.G. Sivilotti Murat Demirbas Dept. of Computer & Info. Science The.
The McGraw- AS Computing LAN Topologies. The McGraw- Categories of LAN Topology.
Computer Science An Introduction and Some Advanced Concepts Too! Prof. Paul Sivilotti Dept. of Computer & Info. Science The Ohio State University
Data Communications and Networking Chapter 11 Routing in Switched Networks References: Book Chapters 12.1, 12.3 Data and Computer Communications, 8th edition.
Computer Science and Engineering Parallel and Distributed Processing CSE 8380 February 10, 2005 Session 9.
MapReduce Algorithm Design Based on Jimmy Lin’s slides
CS 1120: Computer Science II Software Life Cycle Slides courtesy of: Prof. Ajay Gupta and Prof. James Yang (format and other minor modifications by by.
Chapter 4 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
1 Fault tolerance in distributed systems n Motivation n robust and stabilizing algorithms n failure models n robust algorithms u decision problems u impossibility.
EEC 688/788 Secure and Dependable Computing Lecture 9 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Computer Network Architecture Lecture 2: Fundamental of Network.
Distributed Error- Confinement Shay Kutten (Technion) with Boaz Patt-Shamir (Tel Aviv U.) Yossi Azar (Tel Aviv U.)
Distributed Mutual Exclusion Synchronization in Distributed Systems Synchronization in distributed systems are often more difficult compared to synchronization.
Mutual Exclusion Algorithms. Topics r Defining mutual exclusion r A centralized approach r A distributed approach r An approach assuming an organization.
1/50 University of Turkish Aeronautical Association Computer Engineering Department Ceng 541 Introduction to Parallel Computing Dr. Tansel Dökeroğlu
COMPUTER NETWORKS Lecture-8 Husnain Sherazi. Review Lecture 7  Shared Communication Channel  Locality of Reference Principle  LAN Topologies – Star.
Testing All Input is Evil pt 2.
The University of Adelaide, School of Computer Science
Malwarebytes System Scanning Error
Chapter 6.3 Mutual Exclusion
MapReduce Algorithm Design Adapted from Jimmy Lin’s slides.
EEC 688/788 Secure and Dependable Computing
Link Layer and LANs Not everyone is meant to make a difference. But for me, the choice to lead an ordinary life is no longer an option 5: DataLink Layer.
EEC 688/788 Secure and Dependable Computing
Dr. Tansel Dökeroğlu University of Turkish Aeronautical Association Computer Engineering Department Ceng 442 Introduction to Parallel.
Testing.
Distributed Error- Confinement
EEC 688/788 Secure and Dependable Computing
Testing.
Distributed Systems and Concurrency: Distributed Systems
Presentation transcript:

Fault Tolerance Computer Programs that Can Fix Themselves! Prof’s Paul Sivilotti and Tim Long Dept. of Computer & Info. Science The Ohio State University

Ubiquity of Computers

Distributed Systems Computers are increasingly connected Can cooperate to solve a common task Example: Voting day Ballots counted in individual polling stations District office sums these counts State office sums the district numbers Example: Traffic flow Calculate best route from OSU to downtown Columbus Based on: roads, distances, construction information, traffic jams, other cars,…

Distributed Programs: Advantages Speed Divide a big problem into smaller parts Solve smaller parts in parallel Communication Problem is inherently distributed across space Must gather information and coordinate Robustness Give the same task to many computers Even if one fails, others are probably ok

So What’s the Downside? With more computers, there are more things that can go wrong! Despite our best efforts, “faults” occur Examples: “blue screen”, “segmentation fault”, “crash”,… Possible causes: Environmental conditions Computer unplugged, wireless connection lost Software bug Human error in engineering the software User error Program given bad input, used improperly

Fault-Tolerant Software A program that still does the right thing, despite faults Can fix itself, and eventually recover For sequential software Restart system But for distributed software? Restart isn’t practical! System must be “self-stabilizing”

Tokens for Taking Turns Consider all the chefs across the city Say they need to take turns Only one can be on vacation at a time How do they coordinate when to go on vacation? Solution: use a “token” Pass token around Rule: If you have the token, you can go on vacation

Token Ring Problem: What about faults? What happens if token is lost? One fault means disaster!

Prevent Loss of Token: Binary Ring Rule (for most chefs): if left neighbor is different from me, then I have the token Make my number equal to that neighbor’s

Completing the Ring One chef is special: if left neighbor is same as me, then I have the token Make my number differ from that neighbor’s

Fault: Corruption of Values Problem: multiple tokens in ring Tokens chase each other around ring One fault means disaster “1” token “0” token

k-State Token Ring Solution: use more values than chefs! Same rule If left neighbor different from me: I have the token! (use it) Change my value to be equal to neighbor Again, one chef is special If left neighbor same as me I have the token (use it) Change my value to be one bigger

Activity: Anthropomorphism Form a ring Each person has number cards Each person has a chime When you get the token: Play your chime Then change your number We’ll run different versions I’ll introduce “faults” and see if you can recover!

Can We Do Better? Problems with this approach? 1. Consider a very large ring Need a very large number of states! 2. Consider a dynamic ring When the size changes, everyone needs to be updated! A better solution would use a fixed number of states Independent of ring size

Constant State Size Recall difficulty with binary ring Many tokens in ring, of different “types” Chase each other around the ring, never colliding Solution A: Force them to “collide” by having more types than chefs Solution B?

Don’t Use a Ring! Solution: chop the ring! Tokens can’t circulate They reflect back and forth All chefs responsible for stabilization

4-State Token “Ring” Chefs pass tokens right and left State = value (0/1) and direction (left/right) Left rule (same as before) If left neighbor different from me: I have the token! (use it) Change my value to be equal to neighbor Face right Right rule (other direction) If right neighbor same as me and facing me: I have the token! (use it) Face left

Activity 2 Form a “ring” again Pairs at each station One person for left rule, one for right Left rule person: Always “on” Copy value and face right (may already be facing that way) Right rule person: Only “on” when you and right neighbor face each other Change direction to face left

Take-Home Messages Ubiquity of Distributed Systems Computers are getting smaller, cheaper, faster Increased connectivity Faults Some errors are outside of our control Sequential programs can be reset Distributed Fault Tolerance Distributed programs that fix themselves Eventually stabilize in spite of faults Subtlety of Distributed Algorithms Concurrency and nondeterminism How can we convince ourselves these algorithms are correct?