1
ECE 753: FAULT-TOLERANT COMPUTING
4/7/2019
Kewal K. Saluja
Department of Electrical and Computer Engineering
Reconfiguration
2
Overview
- Introduction and basic concept
- Fault model and fault coverage
- Two example architectures
  - n-cubes
  - de Bruijn networks
- Summary
3
Introduction and basic concept
References
- Text and some other material
Basic concept
- Avoid using the faulty unit(s), whether a process, processor, program, data, link between a pair of units, etc.
Two types of reconfiguration
- Fault tolerance via degraded performance
- Fault tolerance provided by sufficient redundancy at the design stage
4
Fault model and fault coverage
Candidate architectures
- Bus-based systems
- Crossbar-based systems
- Mesh-connected systems
- Hypercube networks
- de Bruijn networks
- Tree networks
- Hexagonal networks
- Other regular architectures
5
Fault model and Fault coverage (contd.)
System model
- Units are represented as nodes
- Interconnects are represented as links between nodes
Failure models
- A node may fail or go down: the corresponding unit is unable to interact with other units
- An interconnect may fail or go down: no units can communicate using the failed or down link
6
Fault model and Fault coverage (contd.)
Objective of fault tolerance
- Any pair of units must be able to interact in the presence of
  - Node failures
  - Link failures
Performance metrics
- How many faults (node or link failures) can be tolerated (fault coverage)
- Impact on the route length: the number of hops between a pair of nodes (the length of the shortest path between them)
- Can consider either the worst-case scenario or the impact on the average path length
7
Two example architectures
Hypercube architecture
- An n-cube contains 2^n nodes
- Encode the 2^n nodes as binary n-tuples
- Two nodes are connected by a bidirectional link if and only if the Hamming distance between them is exactly 1
[Figures: a 3-cube and a 2-cube with nodes 00, 01, 10, 11]
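The Hamming-distance rule can be sketched in a few lines of Python. This is only an illustration: node labels are stored as n-bit integers, and the function names hypercube_neighbors, hamming_distance, and hypercube_links are assumptions made for this example, not names from the lecture.

```python
def hypercube_neighbors(node: int, n: int) -> list[int]:
    """Return the n neighbors of `node` in an n-cube (labels are n-bit integers)."""
    # Flipping exactly one bit yields a label at Hamming distance 1.
    return [node ^ (1 << i) for i in range(n)]


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two labels differ."""
    return bin(a ^ b).count("1")


def hypercube_links(n: int) -> set[tuple[int, int]]:
    """All bidirectional links, each stored once as (smaller, larger)."""
    return {
        (u, v)
        for u in range(2 ** n)
        for v in hypercube_neighbors(u, n)
        if u < v
    }


if __name__ == "__main__":
    # Print the links of the 2-cube using 2-bit labels.
    for u, v in sorted(hypercube_links(2)):
        print(f"{u:02b} -- {v:02b}")
```

Each node has n neighbors, so an n-cube has n * 2^(n-1) links; the 3-cube has 12.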
8
Two example architectures (contd.)
Hypercube architecture (contd.)
- Sending a message between a pair of nodes requires finding a route between them
- An algorithm for finding a route between nodes n1 and n2
  - Use the binary encodings of n1 and n2; let them be a1 a2 … ak and b1 b2 … bk
  - Determine the positions in which the two strings differ, and complement one differing bit at a time to trace a route between the two nodes
  - The length of such a path equals the number of differing positions, so it can be no larger than k
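A minimal sketch of this routing procedure, assuming node labels are given as bit strings and that differing bits are corrected from left to right (the slide does not fix an order). The name bit_fix_route is hypothetical.

```python
def bit_fix_route(src: str, dst: str) -> list[str]:
    """Route from src to dst by complementing one differing bit at a time."""
    if len(src) != len(dst):
        raise ValueError("labels must have the same number of bits")
    path = [src]
    current = list(src)
    for i, (a, b) in enumerate(zip(src, dst)):
        if a != b:
            current[i] = b                 # correct bit i
            path.append("".join(current))  # each correction is one hop
    return path


if __name__ == "__main__":
    route = bit_fix_route("0011", "0101")
    print(" -> ".join(route))   # 0011 -> 0111 -> 0101
    print("hops:", len(route) - 1)
```

The hop count is the Hamming distance between the two labels, which is why it never exceeds k.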
9
Two example architectures (contd.)
Hypercube architecture (contd.)
- Finding a route in the presence of a faulty node
  - Example: find a path between nodes 0011 and 0101 when node 0111 is faulty
  - A possible path is 0011 → 0001 → 0101
- Result: between every pair of nodes there are k node-disjoint paths
  - Construction: start each path by complementing a different bit, beginning with the leftmost bit, and keep that bit complemented. This gives k starts, which lead to k disjoint paths with some careful construction of the paths
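One way to make the "careful construction" concrete is sketched below. It is an assumption of this example, not necessarily the construction used in the lecture: paths that start on a differing bit correct the differing bits in a rotated order, and paths that start on an agreeing bit flip it, correct all differing bits, then flip it back. The names flip, walk, disjoint_paths, and route_avoiding are hypothetical, and both endpoints are assumed to be fault-free.

```python
def flip(label: str, i: int) -> str:
    """Complement bit i of a bit-string label."""
    return label[:i] + ("1" if label[i] == "0" else "0") + label[i + 1:]


def walk(src: str, order: list[int]) -> list[str]:
    """Follow a path from src by flipping the bits listed in `order`."""
    path = [src]
    for i in order:
        path.append(flip(path[-1], i))
    return path


def disjoint_paths(src: str, dst: str) -> list[list[str]]:
    """Build k paths from src to dst that share no intermediate node."""
    if len(src) != len(dst) or src == dst:
        raise ValueError("need two distinct labels of equal length")
    differ = [i for i in range(len(src)) if src[i] != dst[i]]
    agree = [i for i in range(len(src)) if src[i] == dst[i]]
    paths = []
    # Start on each differing bit: correct the differing bits in rotated order.
    for j in range(len(differ)):
        paths.append(walk(src, differ[j:] + differ[:j]))
    # Start on each agreeing bit: detour through it, then flip it back last.
    for p in agree:
        paths.append(walk(src, [p] + differ + [p]))
    return paths


def route_avoiding(src: str, dst: str, faulty: set[str]):
    """Return the first constructed path whose intermediate nodes are healthy."""
    for path in disjoint_paths(src, dst):
        if not faulty.intersection(path[1:-1]):
            return path
    return None


if __name__ == "__main__":
    for path in disjoint_paths("0011", "0101"):
        print(" -> ".join(path))
    print(route_avoiding("0011", "0101", {"0111"}))
```

Because the k paths share no intermediate node, a single faulty node can block at most one of them, so route_avoiding always finds a path when at most k-1 intermediate nodes are faulty.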
10
Two example architectures (contd.)
Hypercube architecture (contd.)
- In a hypercube of dimension k, up to k-1 node faults can be tolerated
- Some faults cause degradation, since the path length starts to increase after certain faults
- The number of link faults that can be tolerated is at least the number of tolerable node faults
- Problems that have been addressed in the literature
  - Centralized observer (as discussed above)
  - Distributed algorithms in which every node knows the location of the faulty node
  - Distributed algorithms in which only the neighbors of the faulty node know its status
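Both claims, tolerance of k-1 node faults and the growth in path length, can be checked by brute force on a small cube. The sketch below assumes a centralized view in which the full set of faulty nodes is known, and uses breadth-first search rather than any specific algorithm from the literature; shortest_hops and tolerates are hypothetical names.

```python
from collections import deque
from itertools import combinations


def shortest_hops(src: int, dst: int, k: int, faulty=frozenset()):
    """Fewest hops from src to dst in a k-cube, skipping faulty nodes (BFS)."""
    if src in faulty or dst in faulty:
        return None
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for i in range(k):
            v = u ^ (1 << i)  # neighbor across dimension i
            if v not in faulty and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # dst is unreachable


def tolerates(k: int, num_faults: int) -> bool:
    """True if every healthy pair stays connected under every fault set of this size."""
    nodes = range(2 ** k)
    for faulty in combinations(nodes, num_faults):
        fset = frozenset(faulty)
        healthy = [u for u in nodes if u not in fset]
        for a, b in combinations(healthy, 2):
            if shortest_hops(a, b, k, fset) is None:
                return False
    return True


if __name__ == "__main__":
    print(shortest_hops(0b000, 0b011, 3))                  # 2
    print(shortest_hops(0b000, 0b011, 3, {0b001, 0b010}))  # 4
    print(tolerates(3, 2), tolerates(3, 3))                # True False
```

In the 3-cube, faults at 001 and 010 leave 000 and 011 connected but double the route length from 2 hops to 4; three faults can isolate a node by taking out all of its neighbors.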
11
Two example architectures (contd.)
de Bruijn networks
- Contains 2^n nodes
- Encode the 2^n nodes as binary n-tuples
- Two nodes are connected by a bidirectional link if and only if the second node can be derived from the first by a one-position left or right shift (in the standard de Bruijn graph, the vacated position may take either bit value)
- An example de Bruijn network for n = 3 is given next
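The shift rule can be sketched as follows. This example assumes the standard de Bruijn graph, in which either a 0 or a 1 may enter the vacated position, and it drops the self-loops at the all-zeros and all-ones nodes; labels are n-bit integers and de_bruijn_neighbors is a hypothetical name.

```python
def de_bruijn_neighbors(node: int, n: int) -> set[int]:
    """Neighbors of `node` in the undirected binary de Bruijn network on n-bit labels."""
    mask = (1 << n) - 1
    neighbors = set()
    for bit in (0, 1):
        # Left shift: drop the leftmost bit, `bit` enters on the right.
        neighbors.add(((node << 1) & mask) | bit)
        # Right shift: drop the rightmost bit, `bit` enters on the left.
        neighbors.add((node >> 1) | (bit << (n - 1)))
    neighbors.discard(node)  # ignore self-loops (assumption of this sketch)
    return neighbors


if __name__ == "__main__":
    n = 3
    for u in range(2 ** n):
        linked = ", ".join(f"{v:0{n}b}" for v in sorted(de_bruijn_neighbors(u, n)))
        print(f"{u:0{n}b}: {linked}")
```

Unlike the hypercube, node degree here does not grow with n: every node has at most four neighbors, and 000 and 111 have only two.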
12
Two example architectures (contd.)
de Bruijn networks (contd.)
[Figure: de Bruijn network on the eight 3-bit nodes 000, 001, 100, 111, 011, 110, 010, 101]
13
Two example architectures (contd.)
de Bruijn networks (contd.)
- There are at least two node-disjoint paths between any pair of nodes
- Hence, in the presence of a single node failure, the remaining nodes can continue to interact
- Many such results are known for de Bruijn networks
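The single-failure claim can be checked exhaustively for the 3-bit network. The sketch below assumes the standard de Bruijn adjacency (either bit may be shifted in, self-loops ignored) and tests connectivity with breadth-first search; neighbors and connected_without are hypothetical names.

```python
from collections import deque


def neighbors(node: int, n: int) -> set[int]:
    """Undirected de Bruijn neighbors of an n-bit label (self-loops dropped)."""
    mask = (1 << n) - 1
    result = set()
    for bit in (0, 1):
        result.add(((node << 1) & mask) | bit)      # left shift
        result.add((node >> 1) | (bit << (n - 1)))  # right shift
    result.discard(node)
    return result


def connected_without(failed: set[int], n: int) -> bool:
    """True if all healthy nodes can still reach one another."""
    healthy = [u for u in range(2 ** n) if u not in failed]
    if not healthy:
        return True
    seen = {healthy[0]}
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in neighbors(u, n):
            if v not in failed and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(healthy)


if __name__ == "__main__":
    n = 3
    print(all(connected_without({f}, n) for f in range(2 ** n)))  # True
    print(connected_without({0b001, 0b100}, n))                   # False
```

The second check shows the bound is tight for this network: node 000 has only the neighbors 001 and 100, so losing both isolates it.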
14
Summary
- Described two network architectures in which message routes can be reconfigured to maintain network connectivity in the presence of faulty nodes and/or links