2
Computing in the Reliable Array of Independent Nodes. Vasken Bohossian, Charles Fan, Paul LeMahieu, Marc Riedel, Lihao Xu, Jehoshua Bruck, California Institute of Technology. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, May 5, 2000.
3
RAIN Project Collaboration: Caltech’s Parallel and Distributed Computing Group www.paradise.caltech.edu JPL’s Center for Integrated Space Microsystems www.csmt.jpl.nasa.gov
4
RAIN Platform: a heterogeneous network of nodes and switches (nodes attached to switches and bus networks).
5
RAIN Testbed www.paradise.caltech.edu 10 Pentium boxes w/multiple NICs 4 eight-way Myrinet Switches
6
Proof of Concept: Video Server. A video client and server run on every node (nodes A, B, C, D connected through switches).
7
Limited Storage: insufficient storage to replicate all the data on each node.
8
k-of-n Code. Erasure-correcting code: data symbols a, b, c, d are stored in four columns (a, d+c), (b, d+a), (c, a+b), (d, b+c). The data can be recovered from any k of the n columns, e.g. b = a + (a+b) and d = c + (c+d).
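A minimal sketch of this 2-of-4 code in Python (the column layout is taken from the slide; the slide's "+" is XOR here, and the function and variable names are only illustrative):

```python
# 2-of-4 array code from the slide: columns (a, c^d), (b, a^d), (c, a^b), (d, b^c).
# Symbols are small integers; ^ is XOR.
COLUMNS = [("a", ("c", "d")),   # column 0 stores a and c^d
           ("b", ("a", "d")),   # column 1 stores b and a^d
           ("c", ("a", "b")),   # column 2 stores c and a^b
           ("d", ("b", "c"))]   # column 3 stores d and b^c

def encode(data):
    """data: dict with keys a, b, c, d -> list of 4 (symbol, parity) columns."""
    return [(data[s], data[x] ^ data[y]) for s, (x, y) in COLUMNS]

def decode(available):
    """available: dict column_index -> (symbol, parity), for any 2 columns."""
    known, eqs = {}, []
    for i, (sym, par) in available.items():
        s, (x, y) = COLUMNS[i]
        known[s] = sym
        eqs.append((x, y, par))          # equation: x ^ y == par
    # peeling: each parity equation with exactly one unknown reveals a symbol
    progress = True
    while progress and len(known) < 4:
        progress = False
        for x, y, par in eqs:
            if x in known and y not in known:
                known[y] = known[x] ^ par
                progress = True
            elif y in known and x not in known:
                known[x] = known[y] ^ par
                progress = True
    return known

cols = encode({"a": 1, "b": 2, "c": 3, "d": 4})
print(decode({0: cols[0], 2: cols[2]}))   # any 2 of the 4 columns recover a, b, c, d
```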
9
Encoding: encode the video using the 2-of-4 code.
10
Decoding: retrieve the encoded data and decode.
11
Node Failure.
12
Node Failure: a node goes down.
13
Node Failure: dynamically switch to another node.
14
Link Failure.
15
Link Failure: a link goes down.
16
Link Failure: dynamically switch to another network path.
17
Switch Failure.
18
Switch Failure: a switch goes down.
19
Switch Failure: dynamically switch to another network path.
20
Node Recovery.
21
Node Recovery: continuous reconfiguration (e.g., load-balancing).
22
Features. High availability: tolerates multiple node/link/switch failures; no single point of failure; multiple data paths; redundant storage; graceful degradation. Efficient use of resources: dynamic scalability/reconfigurability. (Certified Buzz-Word Compliant.)
23
RAIN Project: Goals. Efficient, reliable distributed computing and storage systems. Key building blocks: networks, communication, storage, applications.
24
Today's Talk: Topics. Fault-Tolerant Interconnect Topologies, Connectivity, Group Membership, Distributed Storage (spanning the networks, communication, storage, and applications building blocks).
25
Interconnect Topologies. Goal: lose at most a constant number of computing/storage nodes for a given network loss.
26
Resistance to Partitions: large partitions are problematic for distributed services/computation.
27
Resistance to Partitions: large partitions are problematic for distributed services/computation.
28
Related Work. Embedding hypercubes, rings, meshes, and trees in fault-tolerant networks: Hayes et al., Bruck et al., Boesch et al. Bus-based networks resistant to partitioning: Ku and Hayes, 1997, "Connective Fault-Tolerance in Multiple-Bus Systems".
29
A Ring of Switches (a naïve solution): degree-2 compute nodes, degree-4 switches.
30
A Ring of Switches (a naïve solution): degree-2 compute nodes, degree-4 switches.
31
A Ring of Switches (a naïve solution): degree-2 compute nodes, degree-4 switches; easily partitioned.
32
Resistance to Partitioning: nodes on diagonals; degree-2 compute nodes, degree-4 switches.
33
Resistance to Partitioning: nodes on diagonals; degree-2 compute nodes, degree-4 switches.
34
Resistance to Partitioning: nodes on diagonals; degree-2 compute nodes, degree-4 switches; tolerates any 3 switch failures (optimal); generalizes to arbitrary node/switch degrees. Details: IPPS'98 paper, www.paradise.caltech.edu.
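To make the partition-resistance claim concrete, here is a brute-force sketch of how such a topology can be checked: remove every set of f switches and count how many compute nodes end up cut off from the largest surviving component. The exact "nodes on diagonals" wiring is given in the IPPS'98 paper; the wiring in the example call below (node i attached to switches i and i+4 in a ring of 8) is only an illustrative assumption, not the paper's construction.

```python
# Brute-force partition-resistance check for a ring-of-switches topology.
from itertools import combinations

def components(adj, alive):
    """Connected components of the subgraph induced by the `alive` vertices."""
    seen, comps = set(), []
    for v in alive:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(w for w in adj[u] if w in alive and w not in seen)
        comps.append(comp)
    return comps

def worst_case(n_switches, node_ports, f):
    """node_ports[i] = the two switches that degree-2 node i plugs into.
    Returns the most compute nodes cut off from the largest surviving
    component over all choices of f failed switches."""
    switches = [("s", i) for i in range(n_switches)]
    nodes = [("n", i) for i in range(len(node_ports))]
    adj = {v: set() for v in switches + nodes}
    for i in range(n_switches):                         # ring of switches
        a, b = ("s", i), ("s", (i + 1) % n_switches)
        adj[a].add(b)
        adj[b].add(a)
    for i, ports in enumerate(node_ports):              # node-to-switch links
        for p in ports:
            adj[("n", i)].add(("s", p))
            adj[("s", p)].add(("n", i))
    worst = 0
    for failed in combinations(range(n_switches), f):
        alive = set(adj) - {("s", i) for i in failed}
        comps = components(adj, alive)
        biggest = max(sum(1 for v in c if v[0] == "n") for c in comps)
        worst = max(worst, len(nodes) - biggest)
    return worst

# Illustrative wiring only (an assumption, not the IPPS'98 construction):
ports = [(i, (i + 4) % 8) for i in range(8)]
print(worst_case(8, ports, f=3))   # nodes cut off in the worst 3-switch failure
```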
35
Resistance to Partitioning: the two drawings of the topology are isomorphic. Details: IPPS'98 paper, www.paradise.caltech.edu.
36
Point-to-Point Connectivity: is the path from node A to node B up or down?
37
Connectivity. The link is seen as up or down (U, D) by each node. Bi-directional communication: each node sends out pings; a node may time out, deciding the link is down.
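A minimal sketch of the ping/time-out detection described above; the interval, timeout value, and class names are assumptions for illustration, not taken from the talk.

```python
# Ping-based link monitoring: mark the link Down after a period of silence.
PING_INTERVAL = 1.0   # seconds between pings (illustrative)
TIMEOUT = 3.0         # silence after which the link is declared Down

class LinkMonitor:
    def __init__(self, now=0.0):
        self.last_reply = now
        self.state = "U"          # current local view of the link: U or D

    def on_reply(self, now):
        """Call whenever a ping reply arrives from the peer."""
        self.last_reply = now
        self.state = "U"

    def on_tick(self, now):
        """Call once per PING_INTERVAL: send a ping (omitted) and check timeout."""
        if now - self.last_reply > TIMEOUT:
            self.state = "D"
        return self.state

m = LinkMonitor()
m.on_reply(1.0)
print(m.on_tick(2.0))   # "U": replies are recent
print(m.on_tick(5.0))   # "D": more than TIMEOUT seconds of silence
```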
38
Consistent History. [Figure: nodes A and B each record their own history of the link's state, U or D, over time.]
39
The Slack. Slack n=2: at most 2 unacknowledged transitions before a node waits. [Figure: once A is 2 transitions ahead of B, A waits for B to transition.]
40
Consistent History. Consistency in error reporting: if A sees a channel error, B sees a channel error (cf. Birman et al., "Reliability Through Consistency"). Details: IPPS'99 paper, www.paradise.caltech.edu.
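A minimal sketch of the "slack" bookkeeping from the preceding slides, assuming slack n = 2. The actual consistent-history protocol (the message exchange, acknowledgements, and timeouts) is in the IPPS'99 paper; the names and structure here are illustrative only.

```python
# Slack bookkeeping: a node records U/D transitions for a link but will not
# get more than SLACK unacknowledged transitions ahead of its peer.
SLACK = 2

class LinkHistory:
    def __init__(self):
        self.state = "U"          # current local view of the link: U or D
        self.history = ["U"]      # sequence of states this node has recorded
        self.acked = 0            # how many of our transitions the peer has seen

    def pending(self):
        return (len(self.history) - 1) - self.acked

    def observe(self, new_state):
        """Record a U<->D transition, unless we must wait for the peer."""
        if new_state == self.state:
            return True
        if self.pending() >= SLACK:
            return False          # wait: the peer is too far behind
        self.state = new_state
        self.history.append(new_state)
        return True

    def ack(self, count):
        """Peer acknowledges having seen `count` of our transitions."""
        self.acked = max(self.acked, min(count, len(self.history) - 1))

link = LinkHistory()
print(link.observe("D"), link.observe("U"), link.observe("D"))  # True True False
link.ack(2)
print(link.observe("D"))  # True: the peer caught up, transition allowed
```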
41
Group Membership: a consistent global view given only local, point-to-point connectivity information, despite link/node failures and dynamic reconfiguration.
42
Related Work. Systems: Totem, Isis/Horus, Transis. Theory: Chandra et al., on the impossibility of group membership in an asynchronous environment.
43
Group Membership: a token-ring based group membership protocol.
44
Token-Ring based Group Membership Protocol. The token carries the group membership list and a sequence number (e.g., 1: ABCD).
45
The token visits a node, which records the token's sequence number (token 1: ABCD).
46
The sequence number is incremented at each hop (token 2: ABCD).
47
The token continues around the ring (token 3: ABCD).
48
The token continues around the ring (token 4: ABCD).
49
The token completes the round and keeps circulating (sequence number 5).
50
Node or link fails.
51
Node or link fails: node B becomes unreachable.
52
Node or link fails: the token holder tries to pass the token to B and gets no response.
53
Node or link fails: the attempt to reach B times out.
54
If a node is inaccessible, it is excluded and bypassed: the token becomes 5: ACD.
55
The token continues around the reduced ring (6: ACD).
56
The token keeps circulating with B excluded (sequence number 7).
57
The reduced ring A, C, D keeps circulating the token.
58
Node with token fails.
59
Node with token fails: the node holding the token goes down.
60
Node with token fails: the remaining nodes time out waiting for the token.
61
If the token is lost, it is regenerated.
62
Surviving nodes regenerate the lost token.
63
Two regenerated tokens may coexist (5: ACD and 6: AD).
64
Highest sequence number prevails: 6: AD supersedes 5: ACD.
65
Highest sequence number prevails; the surviving token continues circulating (sequence number 7).
66
Node recovers.
67
Node recovers: recovering nodes are added back to the group.
68
The recovering node is added to the membership list (token 7: ADC).
69
The token circulates with the updated membership (8: ADC).
70
The token circulates with the updated membership (9: ADC).
71
The token continues around the restored ring (sequence number 10).
72
Group Membership features: unicast messages; dynamic reconfiguration; mean time-to-failure > convergence time. Details: publication forthcoming.
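The walkthrough above condenses into a small single-process sketch: the token is a (sequence number, membership list) pair, an unreachable next member is excluded and bypassed, and when two tokens meet the higher sequence number prevails. Real unicast messaging, timeouts, and recovery handling are omitted, and all names are illustrative.

```python
# Single-process sketch of the token-ring membership ideas from the slides.
from dataclasses import dataclass

@dataclass
class Token:
    seq: int
    members: list

def pass_token(token, holder, reachable):
    """Advance the token from `holder` to the next reachable member;
    unreachable members are excluded and bypassed."""
    members = list(token.members)
    while True:
        i = members.index(holder)
        nxt = members[(i + 1) % len(members)]
        if nxt in reachable:
            break
        members.remove(nxt)          # exclude and bypass the dead node
    return Token(token.seq + 1, members), nxt

def merge(a, b):
    """If two tokens ever coexist, the highest sequence number prevails."""
    return a if a.seq >= b.seq else b

token = Token(1, ["A", "B", "C", "D"])
token, holder = pass_token(token, "A", reachable={"A", "C", "D"})  # B is down
print(token)                         # Token(seq=2, members=['A', 'C', 'D'])

# If the token is lost and two nodes regenerate it, the higher number wins:
print(merge(Token(5, ["A", "C", "D"]), Token(6, ["A", "D"])))      # seq=6 prevails
```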
73
Distributed Storage: data stored on disk.
74
Distributed Storage: data spread across multiple disks. Focus: reliability and performance.
75
Array Codes ("B-code"): columns (a, d+c), (b, d+a), (c, a+b), (d, b+c), each holding one data symbol and one redundancy symbol. Ideally suited for distributed storage; low encoding/decoding complexity.
76
Array Codes: ideally suited for distributed storage; low encoding/decoding complexity. The data a, b, c, d can be recovered from any k of the n columns, e.g. b = a + (a+b) and d = c + (c+d).
77
Array Codes. B-Code and X-Code: optimally redundant, with optimal encoding/decoding complexity. Details: IEEE Trans. Information Theory, www.paradise.caltech.edu.
78
Summary: fault-tolerant interconnect topologies, connectivity, group membership, distributed storage.
79
Proof-of-Concept Applications. RAINVideo: high-availability video server. RAINCheck: distributed checkpoint rollback/recovery system. SNOW: Stable Network of Webservers.
80
Rainfinity (www.rainfinity.com): a start-up based on RAIN technology. Business plan: clustered solutions for Internet data centers, focusing on availability, scalability, and performance.
81
Rainfinity (www.rainfinity.com): a start-up based on RAIN technology. Company: founded Sept. 1998; released first product April 1999; received $15 million in funding in Dec. 1999; now over 50 employees.
82
Future Research: development of APIs; a fault-tolerant distributed filesystem; a fault-tolerant MPI/PVM implementation.
83
End of Talk Material that was cut...
84
Erasure Correcting Codes. Strategy: encode k data symbols into n encoded symbols with an erasure-correcting code.
85
Erasure Correcting Codes. Up to m of the n encoded coordinates may be lost.
86
Erasure Correcting Codes. The original k data symbols are reconstructed from the surviving coordinates.
87
Erasure Correcting Codes. A code is optimally redundant (MDS) if the k data symbols can be reconstructed from any k of the n coordinates, i.e. if m = n - k. Example: Reed-Solomon codes.
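For reference, the optimal-redundancy condition in standard coding-theory notation (my phrasing of the standard fact, not verbatim from the slide): with k data symbols encoded into n symbols, of which up to m may be erased,

```latex
% Recovery from any k of the n coordinates requires n - k >= m;
% the code is MDS (optimally redundant) exactly when the bound is tight.
\[
  n - k \;\ge\; m, \qquad \text{MDS} \iff m = n - k .
\]
```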
88
RAIN: Distributed Store. Encode the data with an (n, k) array code and store one symbol per node: columns (a, d+c), (b, d+a), (c, a+b), (d, b+c), one per disk.
89
RAIN: Distributed Retrieve. Retrieve encoded data from any k nodes and reconstruct the data a, b, c, d.
90
RAIN: Distributed Retrieve. Reliability (similar to RAID systems).
91
RAIN: Distributed Retrieve. Reliability (similar to RAID systems).
92
RAIN: Distributed Retrieve. Reliability (similar to RAID systems); performance: load-balancing.
93
RAIN: Distributed Retrieve. Reliability (similar to RAID systems); performance: load-balancing.
94
RAIN: Distributed Retrieve. If a node is busy, retrieve from other nodes instead: reliability (similar to RAID systems) plus performance through load-balancing.
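A minimal sketch of the load-balancing idea: since any k columns suffice, a reader can simply pick the k least-busy reachable nodes and skip busy or failed ones. The node/load bookkeeping below is illustrative; decoding then proceeds as in the earlier 2-of-4 sketch.

```python
# Pick the k least-loaded reachable nodes to retrieve columns from.
def choose_columns(nodes, k):
    """nodes: dict name -> {'up': bool, 'load': float, 'column': ...}.
    Returns the columns held by the k least-loaded reachable nodes."""
    candidates = [(info["load"], name) for name, info in nodes.items() if info["up"]]
    if len(candidates) < k:
        raise RuntimeError("fewer than k columns reachable; data unrecoverable")
    chosen = sorted(candidates)[:k]
    return {name: nodes[name]["column"] for _, name in chosen}

nodes = {
    "A": {"up": True,  "load": 0.9, "column": ("a", "c^d")},   # busy
    "B": {"up": False, "load": 0.0, "column": ("b", "a^d")},   # failed
    "C": {"up": True,  "load": 0.2, "column": ("c", "a^b")},
    "D": {"up": True,  "load": 0.1, "column": ("d", "b^c")},
}
print(choose_columns(nodes, k=2))   # picks D and C, avoiding busy A and failed B
```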