2 Computing in the Reliable Array of Independent Nodes Vasken Bohossian, Charles Fan, Paul LeMahieu, Marc Riedel, Lihao Xu, Jehoshua Bruck May 5, 2000 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems California Institute of Technology Marc Riedel

3 RAIN Project. Collaboration between Caltech's Parallel and Distributed Computing Group (www.paradise.caltech.edu) and JPL's Center for Integrated Space Microsystems (www.csmt.jpl.nasa.gov).

4 RAIN Platform. Heterogeneous network of nodes and switches. (Diagram: nodes attached to switches over a bus network.)

5 RAIN Testbed www.paradise.caltech.edu 10 Pentium boxes w/multiple NICs 4 eight-way Myrinet Switches

6 Proof of Concept: Video Server. Video client & server on every node. (Diagram: nodes A-D connected through switches.)

7 Limited Storage. Insufficient storage to replicate all the data on each node.

8 k-of-n Code. Erasure-correcting code: columns (a, d+c), (b, d+a), (c, a+b), (d, b+c); recover the data a, b, c, d from any k of the n columns, e.g. b = a + (a+b) and d = c + (d+c).
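The slide's 2-of-4 code is XOR-based, so its encoding and decoding can be sketched in a few lines. This is a hypothetical illustration of the parity layout shown on the slide (function names and the iterative decoder are mine, not the RAIN implementation):

```python
from itertools import combinations

# Parity layout from the slide: column i stores (data symbol i, XOR of two others).
# Column 0: (a, d+c)  Column 1: (b, d+a)  Column 2: (c, a+b)  Column 3: (d, b+c)
PARITY = {0: (3, 2), 1: (3, 0), 2: (0, 1), 3: (1, 2)}

def encode(data):
    """data = [a, b, c, d] as integers; '+' on the slide is XOR."""
    return {i: (data[i], data[x] ^ data[y]) for i, (x, y) in PARITY.items()}

def decode(columns):
    """Recover [a, b, c, d] from any 2 of the 4 columns."""
    known = {i: sym for i, (sym, _) in columns.items()}
    while len(known) < 4:
        progress = False
        for i, (_, parity) in columns.items():
            x, y = PARITY[i]
            if x in known and y not in known:
                known[y] = parity ^ known[x]; progress = True
            elif y in known and x not in known:
                known[x] = parity ^ known[y]; progress = True
        if not progress:
            raise ValueError("too many erasures")
    return [known[i] for i in range(4)]

# Any 2 of the 4 columns suffice to recover the data.
data = [1, 2, 3, 4]
cols = encode(data)
for pair in combinations(cols, 2):
    assert decode({i: cols[i] for i in pair}) == data
```

Each parity symbol involves two data symbols, so once one data symbol in a pair is known, the other falls out by XOR; the loop simply propagates known symbols until all four are recovered.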

9 Encoding. Encode video using a 2-of-4 code.

10 Decoding. Retrieve data and decode.

11-13 Node Failure. On a node failure, dynamically switch to another node.

14-16 Link Failure. On a link failure, dynamically switch to another network path.

17-19 Switch Failure. On a switch failure, dynamically switch to another network path.

20-21 Node Recovery. Recovered nodes rejoin; continuous reconfiguration (e.g., load-balancing).

22 Features. High availability: tolerates multiple node/link/switch failures; no single point of failure; multiple data paths; redundant storage; graceful degradation. Efficient use of resources: dynamic scalability/reconfigurability. (Certified Buzz-Word Compliant.)

23 RAIN Project: Goals. Efficient, reliable distributed computing and storage systems. Key building blocks: networks, communication, storage, applications.

24 Topics. Today's talk: fault-tolerant interconnect topologies, connectivity, group membership, and distributed storage (spanning networks, communication, storage, and applications).

25 Interconnect Topologies. Goal: lose at most a constant number of nodes for a given network loss. (Diagram: network of computing/storage nodes.)

26-27 Resistance to Partitions. Large partitions are problematic for distributed services/computation.

28 Related Work. Embedding hypercubes, rings, meshes, and trees in fault-tolerant networks: Hayes et al., Bruck et al., Boesch et al. Bus-based networks resistant to partitioning: Ku and Hayes, 1997, "Connective Fault-Tolerance in Multiple-Bus Systems."

29-31 A Ring of Switches (a naive solution). Degree-2 compute nodes attached to a ring of degree-4 switches: easily partitioned.

32-34 Resistance to Partitioning. Placing the nodes on diagonals (still degree-2 compute nodes and degree-4 switches) tolerates any 3 switch failures, which is optimal, and generalizes to arbitrary node/switch degrees. Details: IPPS'98 paper, www.paradise.caltech.edu

35 Resistance to Partitioning. (Diagram: two isomorphic drawings of the diagonal construction.) Details: IPPS'98 paper, www.paradise.caltech.edu

36 Point-to-Point Connectivity. Is the path from node A to node B up or down?

37 Connectivity. Bi-directional communication: each node sends out pings, and the link is seen as up or down ({U,D}) by each node. A node may time out, deciding the link is down.

38 Consistent History. (Diagram: timelines of each node's state for the channel between A and B, as sequences of U/D transitions.)

39 The Slack. Slack n=2: a node makes at most 2 unacknowledged transitions before it waits. (Diagram: A gets 1 transition ahead, then 2 ahead; now A waits for B to transition.)

40 Consistent History. Consistency in error reporting: if A sees a channel error, B sees a channel error (cf. Birman et al., "Reliability Through Consistency"). Details: IPPS'99 paper, www.paradise.caltech.edu
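The slack mechanism from slide 39 can be sketched as a small state machine. The class and method names here are hypothetical, a sketch of the idea rather than the RAIN wire protocol:

```python
class ChannelView:
    """One endpoint's view of a link, with bounded slack: at most
    `slack` unacknowledged U/D transitions before the endpoint waits."""

    def __init__(self, slack=2):
        self.slack = slack
        self.state = "U"      # current view of the link: up or down
        self.unacked = 0      # transitions the peer has not yet acknowledged

    def transition(self, new_state):
        """Flip the view; returns False (must wait) if the slack is used up."""
        if self.unacked >= self.slack:
            return False      # wait for the peer to catch up
        self.state = new_state
        self.unacked += 1
        return True

    def peer_ack(self):
        """The peer acknowledged one of our transitions."""
        if self.unacked > 0:
            self.unacked -= 1

# With slack 2, a third unacknowledged transition must wait.
a = ChannelView(slack=2)
assert a.transition("D") and a.transition("U")
assert not a.transition("D")   # 2 unacked transitions: A waits for B
a.peer_ack()
assert a.transition("D")       # after an ack, A may proceed
```

Bounding the number of unacknowledged transitions is what keeps the two histories consistent: neither side can run arbitrarily far ahead of what its peer has observed.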

41 Group Membership. Goal: a consistent global view given local, point-to-point connectivity information, in the presence of link/node failures and dynamic reconfiguration.

42 Related Work. Systems: Totem, Isis/Horus, Transis. Theory: Chandra et al., on the impossibility of group membership in an asynchronous environment.

43-49 Token-Ring Based Group Membership Protocol. A token circulates among the nodes A-D, carrying the group membership list and a sequence number; each hop increments the sequence number (1: ABCD, 2: ABCD, 3: ABCD, 4: ABCD, ...).

50-57 Node or link fails. If a node is inaccessible, it is excluded and bypassed: when B becomes unreachable, the token continues as 5: ACD, 6: ACD, and so on around the remaining nodes.

58-65 Node with token fails. If the token is lost, it is regenerated. When regeneration produces competing tokens (e.g., 5: ACD and 6: AD), the highest sequence number prevails.

66-71 Node recovers. Recovering nodes are added back to the membership list: when D rejoins, the token continues as 7: ADC, 8: ADC, 9: ADC, ...

72 Group Membership. Features: unicast messages; dynamic reconfiguration; mean time-to-failure > convergence time. Details: publication forthcoming.
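The token-passing walkthrough on the preceding slides can be sketched as a simplified simulation. Names and structure here are illustrative; the real protocol also handles detection, token regeneration, and rejoin, which are only caricatured below:

```python
class Token:
    def __init__(self, seq, members):
        self.seq = seq          # sequence number carried by the token
        self.members = members  # group membership list carried by the token

def pass_token(token, holder, alive):
    """Pass the token from `holder` to the next live member, excluding
    and bypassing any inaccessible nodes along the way."""
    i = token.members.index(holder)
    while True:
        nxt = token.members[(i + 1) % len(token.members)]
        if nxt in alive:
            token.seq += 1
            return nxt
        token.members.remove(nxt)   # inaccessible: exclude and bypass
        i = token.members.index(holder)

def merge(t1, t2):
    """If the token is regenerated twice, the highest sequence number prevails."""
    return t1 if t1.seq >= t2.seq else t2

# Normal circulation: the token moves A -> B, incrementing the sequence number.
t = Token(1, ["A", "B", "C", "D"])
assert pass_token(t, "A", {"A", "B", "C", "D"}) == "B" and t.seq == 2

# B fails: the current holder excludes and bypasses it.
t = Token(4, ["A", "B", "C", "D"])
nxt = pass_token(t, "A", {"A", "C", "D"})
assert nxt == "C" and t.members == ["A", "C", "D"] and t.seq == 5

# Two regenerated tokens: the highest sequence number prevails.
assert merge(Token(5, ["A", "C", "D"]), Token(6, ["A", "D"])).seq == 6
```

The sequence number plays the same role as in the slides: it totally orders token updates, so competing regenerated tokens resolve deterministically.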

73-74 Distributed Storage. Focus: reliability and performance. (Diagram: data bits spread across multiple disks.)

75-77 Array Codes ("B-code"). Data symbols a, b, c, d with redundancy d+c, d+a, a+b, b+c; recover the data from any k of the n columns (e.g., b = a + (a+b), d = c + (d+c)). Ideally suited for distributed storage; low encoding/decoding complexity. The B-Code and X-Code are optimally redundant with optimal encoding/decoding complexity. Details: IEEE Trans. Info Theory, www.paradise.caltech.edu

78 Summary. Fault-tolerant interconnect topologies; connectivity; group membership; distributed storage.

79 Proof-of-Concept Applications. RAINVideo: high-availability video server. RAINCheck: distributed checkpoint rollback/recovery system. SNOW: Stable Network of Webservers.

80 Rainfinity (www.rainfinity.com). Start-up based on RAIN technology. Business plan: clustered solutions for Internet data centers, focusing on availability, scalability, and performance.

81 Rainfinity (www.rainfinity.com). Start-up based on RAIN technology. Company: founded Sept. 1998; released first product April 1999; received $15 million in funding in Dec. 1999; now over 50 employees.

82 Future Research. Development of APIs; fault-tolerant distributed filesystem; fault-tolerant MPI/PVM implementation.

83 End of Talk Material that was cut...

84-86 Erasure-Correcting Codes. Strategy: encode k data symbols into n encoded symbols with an erasure-correcting code; even after losing up to m coordinates, the original k data symbols can be reconstructed.

87 Erasure-Correcting Codes. A code is optimally redundant (MDS) if m = n - k, i.e., the data survives the loss of any n - k coordinates. Example: Reed-Solomon code.
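A minimal MDS example is the (n, k) = (3, 2) single-parity code, where m = n - k = 1: any one coordinate may be erased. This is an illustrative sketch of the MDS property only; Reed-Solomon codes achieve the same property for arbitrary n and k:

```python
def encode(a, b):
    """(3,2) single-parity code: the third coordinate is the XOR parity."""
    return [a, b, a ^ b]

def decode(codeword):
    """Reconstruct (a, b) from any 2 of the 3 coordinates.
    `codeword` carries None in the erased position."""
    a, b, p = codeword
    if a is None:
        a = b ^ p
    elif b is None:
        b = a ^ p
    return a, b

# m = n - k = 1: losing any single coordinate is survivable.
word = encode(5, 9)                       # [5, 9, 12]
assert decode([None, word[1], word[2]]) == (5, 9)
assert decode([word[0], None, word[2]]) == (5, 9)
assert decode([word[0], word[1], None]) == (5, 9)
```

Losing two coordinates, by contrast, is unrecoverable for this code, which is exactly what m = n - k = 1 predicts.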

88 RAIN: Distributed Store. Encode the data with an (n, k) array code; store one symbol per node (columns (a, d+c), (b, d+a), (c, a+b), (d, b+c) across four disks).

89 RAIN: Distributed Retrieve. Retrieve encoded data from any k nodes and reconstruct the data a, b, c, d.

90-94 RAIN: Distributed Retrieve. Reliability: any k nodes suffice, similar to RAID systems. Performance: load-balancing, e.g., a busy node is simply bypassed.

