2
Computing in the Reliable Array of Independent Nodes. Vasken Bohossian, Charles Fan, Paul LeMahieu, Marc Riedel, Lihao Xu, Jehoshua Bruck, California Institute of Technology. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, May 5, 2000.
3
RAIN Project Collaboration: Caltech’s Parallel and Distributed Computing Group www.paradise.caltech.edu JPL’s Center for Integrated Space Microsystems www.csmt.jpl.nasa.gov
4
RAIN Platform: a heterogeneous network of nodes and switches (nodes attached to switches and bus networks).
5
RAIN Testbed www.paradise.caltech.edu 10 Pentium boxes w/multiple NICs 4 eight-way Myrinet Switches
6
Proof of Concept: Video Server. A video client and server run on every node (nodes A, B, C, D connected through switches).
7
Limited Storage: insufficient storage to replicate all the data on each node.
8
k-of-n Code. Erasure-correcting code: data symbols a, b, c, d are stored in four columns (a, d+c), (b, d+a), (c, a+b), (d, b+c). The data can be recovered from any k of the n columns, e.g. b = a + (a+b) and d = c + (c+d).
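A minimal sketch of this 2-of-4 code in Python (the column layout is taken from the slide; the slide's "+" is XOR here, and the function and variable names are only illustrative):

```python
# 2-of-4 array code from the slide: columns (a, c^d), (b, a^d), (c, a^b), (d, b^c).
# Symbols are small integers; ^ is XOR.
COLUMNS = [("a", ("c", "d")),   # column 0 stores a and c^d
           ("b", ("a", "d")),   # column 1 stores b and a^d
           ("c", ("a", "b")),   # column 2 stores c and a^b
           ("d", ("b", "c"))]   # column 3 stores d and b^c

def encode(data):
    """data: dict with keys a, b, c, d -> list of 4 (symbol, parity) columns."""
    return [(data[s], data[x] ^ data[y]) for s, (x, y) in COLUMNS]

def decode(available):
    """available: dict column_index -> (symbol, parity), for any 2 columns."""
    known, eqs = {}, []
    for i, (sym, par) in available.items():
        s, (x, y) = COLUMNS[i]
        known[s] = sym
        eqs.append((x, y, par))          # equation: x ^ y == par
    # peeling: each parity equation with exactly one unknown reveals a symbol
    progress = True
    while progress and len(known) < 4:
        progress = False
        for x, y, par in eqs:
            if x in known and y not in known:
                known[y] = known[x] ^ par
                progress = True
            elif y in known and x not in known:
                known[x] = known[y] ^ par
                progress = True
    return known

cols = encode({"a": 1, "b": 2, "c": 3, "d": 4})
print(decode({0: cols[0], 2: cols[2]}))   # any 2 of the 4 columns recover a, b, c, d
```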
9
Encoding: encode the video using the 2-of-4 code.
10
Decoding: retrieve the encoded data and decode.
11
Node Failure.
12
Node Failure: a node goes down.
13
Node Failure: dynamically switch to another node.
14
Link Failure.
15
Link Failure: a link goes down.
16
Link Failure: dynamically switch to another network path.
17
Switch Failure.
18
Switch Failure: a switch goes down.
19
Switch Failure: dynamically switch to another network path.
20
Node Recovery.
21
Node Recovery: continuous reconfiguration (e.g., load-balancing).
22
Features. High availability: tolerates multiple node/link/switch failures; no single point of failure; multiple data paths; redundant storage; graceful degradation. Efficient use of resources: dynamic scalability/reconfigurability. (Certified Buzz-Word Compliant.)
23
RAIN Project: Goals. Efficient, reliable distributed computing and storage systems. Key building blocks: networks, communication, storage, applications.
24
Today's Talk: Topics. Fault-Tolerant Interconnect Topologies, Connectivity, Group Membership, Distributed Storage (spanning the networks, communication, storage, and applications building blocks).
25
Interconnect Topologies. Goal: lose at most a constant number of computing/storage nodes for a given network loss.
26
Resistance to Partitions: large partitions are problematic for distributed services/computation.
27
Resistance to Partitions: large partitions are problematic for distributed services/computation.
28
Related Work. Embedding hypercubes, rings, meshes, and trees in fault-tolerant networks: Hayes et al., Bruck et al., Boesch et al. Bus-based networks resistant to partitioning: Ku and Hayes, 1997, "Connective Fault-Tolerance in Multiple-Bus Systems".
29
A Ring of Switches (a naïve solution): degree-2 compute nodes, degree-4 switches.
30
A Ring of Switches (a naïve solution): degree-2 compute nodes, degree-4 switches.
31
A Ring of Switches (a naïve solution): degree-2 compute nodes, degree-4 switches; easily partitioned.
32
Resistance to Partitioning: nodes on diagonals; degree-2 compute nodes, degree-4 switches.
33
Resistance to Partitioning: nodes on diagonals; degree-2 compute nodes, degree-4 switches.
34
Resistance to Partitioning: nodes on diagonals; degree-2 compute nodes, degree-4 switches; tolerates any 3 switch failures (optimal); generalizes to arbitrary node/switch degrees. Details: IPPS'98 paper, www.paradise.caltech.edu.
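To make the partition-resistance claim concrete, here is a brute-force sketch of how such a topology can be checked: remove every set of f switches and count how many compute nodes end up cut off from the largest surviving component. The exact "nodes on diagonals" wiring is given in the IPPS'98 paper; the wiring in the example call below (node i attached to switches i and i+4 in a ring of 8) is only an illustrative assumption, not the paper's construction.

```python
# Brute-force partition-resistance check for a ring-of-switches topology.
from itertools import combinations

def components(adj, alive):
    """Connected components of the subgraph induced by the `alive` vertices."""
    seen, comps = set(), []
    for v in alive:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(w for w in adj[u] if w in alive and w not in seen)
        comps.append(comp)
    return comps

def worst_case(n_switches, node_ports, f):
    """node_ports[i] = the two switches that degree-2 node i plugs into.
    Returns the most compute nodes cut off from the largest surviving
    component over all choices of f failed switches."""
    switches = [("s", i) for i in range(n_switches)]
    nodes = [("n", i) for i in range(len(node_ports))]
    adj = {v: set() for v in switches + nodes}
    for i in range(n_switches):                         # ring of switches
        a, b = ("s", i), ("s", (i + 1) % n_switches)
        adj[a].add(b)
        adj[b].add(a)
    for i, ports in enumerate(node_ports):              # node-to-switch links
        for p in ports:
            adj[("n", i)].add(("s", p))
            adj[("s", p)].add(("n", i))
    worst = 0
    for failed in combinations(range(n_switches), f):
        alive = set(adj) - {("s", i) for i in failed}
        comps = components(adj, alive)
        biggest = max(sum(1 for v in c if v[0] == "n") for c in comps)
        worst = max(worst, len(nodes) - biggest)
    return worst

# Illustrative wiring only (an assumption, not the IPPS'98 construction):
ports = [(i, (i + 4) % 8) for i in range(8)]
print(worst_case(8, ports, f=3))   # nodes cut off in the worst 3-switch failure
```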
35
Resistance to Partitioning: the two drawings of the topology are isomorphic. Details: IPPS'98 paper, www.paradise.caltech.edu.
36
Point-to-Point Connectivity: is the path from node A to node B up or down?
37
Connectivity. The link is seen as up or down (U, D) by each node. Bi-directional communication: each node sends out pings; a node may time out, deciding the link is down.
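A minimal sketch of the ping/time-out detection described above; the interval, timeout value, and class names are assumptions for illustration, not taken from the talk.

```python
# Ping-based link monitoring: mark the link Down after a period of silence.
PING_INTERVAL = 1.0   # seconds between pings (illustrative)
TIMEOUT = 3.0         # silence after which the link is declared Down

class LinkMonitor:
    def __init__(self, now=0.0):
        self.last_reply = now
        self.state = "U"          # current local view of the link: U or D

    def on_reply(self, now):
        """Call whenever a ping reply arrives from the peer."""
        self.last_reply = now
        self.state = "U"

    def on_tick(self, now):
        """Call once per PING_INTERVAL: send a ping (omitted) and check timeout."""
        if now - self.last_reply > TIMEOUT:
            self.state = "D"
        return self.state

m = LinkMonitor()
m.on_reply(1.0)
print(m.on_tick(2.0))   # "U": replies are recent
print(m.on_tick(5.0))   # "D": more than TIMEOUT seconds of silence
```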
38
Consistent History. [Figure: nodes A and B each record their own history of the link's state, U or D, over time.]
39
The Slack. Slack n=2: at most 2 unacknowledged transitions before a node waits. [Figure: once A is 2 transitions ahead of B, A waits for B to transition.]
40
Consistent History. Consistency in error reporting: if A sees a channel error, B sees a channel error (cf. Birman et al., "Reliability Through Consistency"). Details: IPPS'99 paper, www.paradise.caltech.edu.
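A minimal sketch of the "slack" bookkeeping from the preceding slides, assuming slack n = 2. The actual consistent-history protocol (the message exchange, acknowledgements, and timeouts) is in the IPPS'99 paper; the names and structure here are illustrative only.

```python
# Slack bookkeeping: a node records U/D transitions for a link but will not
# get more than SLACK unacknowledged transitions ahead of its peer.
SLACK = 2

class LinkHistory:
    def __init__(self):
        self.state = "U"          # current local view of the link: U or D
        self.history = ["U"]      # sequence of states this node has recorded
        self.acked = 0            # how many of our transitions the peer has seen

    def pending(self):
        return (len(self.history) - 1) - self.acked

    def observe(self, new_state):
        """Record a U<->D transition, unless we must wait for the peer."""
        if new_state == self.state:
            return True
        if self.pending() >= SLACK:
            return False          # wait: the peer is too far behind
        self.state = new_state
        self.history.append(new_state)
        return True

    def ack(self, count):
        """Peer acknowledges having seen `count` of our transitions."""
        self.acked = max(self.acked, min(count, len(self.history) - 1))

link = LinkHistory()
print(link.observe("D"), link.observe("U"), link.observe("D"))  # True True False
link.ack(2)
print(link.observe("D"))  # True: the peer caught up, transition allowed
```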
41
Group Membership: a consistent global view given only local, point-to-point connectivity information, despite link/node failures and dynamic reconfiguration.
42
Related Work. Systems: Totem, Isis/Horus, Transis. Theory: Chandra et al., on the impossibility of group membership in an asynchronous environment.
43
Group Membership: a token-ring based group membership protocol.
44
Token-Ring based Group Membership Protocol. The token carries the group membership list and a sequence number (e.g., 1: ABCD).
45
The token visits a node, which records the token's sequence number (token 1: ABCD).
46
The sequence number is incremented at each hop (token 2: ABCD).
47
The token continues around the ring (token 3: ABCD).
48
The token continues around the ring (token 4: ABCD).
49
The token completes the round and keeps circulating (sequence number 5).
50
Node or link fails.
51
Node or link fails: node B becomes unreachable.
52
Node or link fails: the token holder tries to pass the token to B and gets no response.
53
Node or link fails: the attempt to reach B times out.
54
If a node is inaccessible, it is excluded and bypassed: the token becomes 5: ACD.
55
The token continues around the reduced ring (6: ACD).
56
The token keeps circulating with B excluded (sequence number 7).
57
The reduced ring A, C, D keeps circulating the token.
58
Node with token fails.
59
Node with token fails: the node holding the token goes down.
60
Node with token fails: the remaining nodes time out waiting for the token.
61
If the token is lost, it is regenerated.
62
Surviving nodes regenerate the lost token.
63
Two regenerated tokens may coexist (5: ACD and 6: AD).
64
Highest sequence number prevails: 6: AD supersedes 5: ACD.
65
Highest sequence number prevails; the surviving token continues circulating (sequence number 7).
66
Node recovers.
67
Node recovers: recovering nodes are added back to the group.
68
The recovering node is added to the membership list (token 7: ADC).
69
The token circulates with the updated membership (8: ADC).
70
The token circulates with the updated membership (9: ADC).
71
The token continues around the restored ring (sequence number 10).
72
Group Membership features: unicast messages; dynamic reconfiguration; mean time-to-failure > convergence time. Details: publication forthcoming.
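The walkthrough above condenses into a small single-process sketch: the token is a (sequence number, membership list) pair, an unreachable next member is excluded and bypassed, and when two tokens meet the higher sequence number prevails. Real unicast messaging, timeouts, and recovery handling are omitted, and all names are illustrative.

```python
# Single-process sketch of the token-ring membership ideas from the slides.
from dataclasses import dataclass

@dataclass
class Token:
    seq: int
    members: list

def pass_token(token, holder, reachable):
    """Advance the token from `holder` to the next reachable member;
    unreachable members are excluded and bypassed."""
    members = list(token.members)
    while True:
        i = members.index(holder)
        nxt = members[(i + 1) % len(members)]
        if nxt in reachable:
            break
        members.remove(nxt)          # exclude and bypass the dead node
    return Token(token.seq + 1, members), nxt

def merge(a, b):
    """If two tokens ever coexist, the highest sequence number prevails."""
    return a if a.seq >= b.seq else b

token = Token(1, ["A", "B", "C", "D"])
token, holder = pass_token(token, "A", reachable={"A", "C", "D"})  # B is down
print(token)                         # Token(seq=2, members=['A', 'C', 'D'])

# If the token is lost and two nodes regenerate it, the higher number wins:
print(merge(Token(5, ["A", "C", "D"]), Token(6, ["A", "D"])))      # seq=6 prevails
```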
73
Distributed Storage: data stored on disk.
74
Distributed Storage: data spread across multiple disks. Focus: reliability and performance.
75
Array Codes ("B-code"): columns (a, d+c), (b, d+a), (c, a+b), (d, b+c), each holding one data symbol and one redundancy symbol. Ideally suited for distributed storage; low encoding/decoding complexity.
76
Array Codes: ideally suited for distributed storage; low encoding/decoding complexity. The data a, b, c, d can be recovered from any k of the n columns, e.g. b = a + (a+b) and d = c + (c+d).
77
Array Codes. B-Code and X-Code: optimally redundant, with optimal encoding/decoding complexity. Details: IEEE Trans. Information Theory, www.paradise.caltech.edu.
78
Summary: fault-tolerant interconnect topologies, connectivity, group membership, distributed storage.
79
Proof-of-Concept Applications. RAINVideo: high-availability video server. RAINCheck: distributed checkpoint rollback/recovery system. SNOW: Stable Network of Webservers.
80
Rainfinity (www.rainfinity.com): a start-up based on RAIN technology. Business plan: clustered solutions for Internet data centers, focusing on availability, scalability, and performance.
81
Rainfinity (www.rainfinity.com): a start-up based on RAIN technology. Company: founded Sept. 1998; released first product April 1999; received $15 million in funding in Dec. 1999; now over 50 employees.
82
Future Research: development of APIs; a fault-tolerant distributed filesystem; a fault-tolerant MPI/PVM implementation.
83
End of Talk Material that was cut...
84
Erasure Correcting Codes. Strategy: encode k data symbols into n encoded symbols with an erasure-correcting code.
85
Erasure Correcting Codes. Up to m of the n encoded coordinates may be lost.
86
Erasure Correcting Codes. The original k data symbols are reconstructed from the surviving coordinates.
87
Erasure Correcting Codes. A code is optimally redundant (MDS) if the k data symbols can be reconstructed from any k of the n coordinates, i.e. if m = n - k. Example: Reed-Solomon codes.
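For reference, the optimal-redundancy condition in standard coding-theory notation (my phrasing of the standard fact, not verbatim from the slide): with k data symbols encoded into n symbols, of which up to m may be erased,

```latex
% Recovery from any k of the n coordinates requires n - k >= m;
% the code is MDS (optimally redundant) exactly when the bound is tight.
\[
  n - k \;\ge\; m, \qquad \text{MDS} \iff m = n - k .
\]
```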
88
RAIN: Distributed Store. Encode the data with an (n, k) array code and store one symbol per node: columns (a, d+c), (b, d+a), (c, a+b), (d, b+c), one per disk.
89
RAIN: Distributed Retrieve. Retrieve encoded data from any k nodes and reconstruct the data a, b, c, d.
90
RAIN: Distributed Retrieve. Reliability (similar to RAID systems).
91
RAIN: Distributed Retrieve. Reliability (similar to RAID systems).
92
RAIN: Distributed Retrieve. Reliability (similar to RAID systems); performance: load-balancing.
93
RAIN: Distributed Retrieve. Reliability (similar to RAID systems); performance: load-balancing.
94
RAIN: Distributed Retrieve. If a node is busy, retrieve from other nodes instead: reliability (similar to RAID systems) plus performance through load-balancing.
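A minimal sketch of the load-balancing idea: since any k columns suffice, a reader can simply pick the k least-busy reachable nodes and skip busy or failed ones. The node/load bookkeeping below is illustrative; decoding then proceeds as in the earlier 2-of-4 sketch.

```python
# Pick the k least-loaded reachable nodes to retrieve columns from.
def choose_columns(nodes, k):
    """nodes: dict name -> {'up': bool, 'load': float, 'column': ...}.
    Returns the columns held by the k least-loaded reachable nodes."""
    candidates = [(info["load"], name) for name, info in nodes.items() if info["up"]]
    if len(candidates) < k:
        raise RuntimeError("fewer than k columns reachable; data unrecoverable")
    chosen = sorted(candidates)[:k]
    return {name: nodes[name]["column"] for _, name in chosen}

nodes = {
    "A": {"up": True,  "load": 0.9, "column": ("a", "c^d")},   # busy
    "B": {"up": False, "load": 0.0, "column": ("b", "a^d")},   # failed
    "C": {"up": True,  "load": 0.2, "column": ("c", "a^b")},
    "D": {"up": True,  "load": 0.1, "column": ("d", "b^c")},
}
print(choose_columns(nodes, k=2))   # picks D and C, avoiding busy A and failed B
```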