Isis 2 Design Choices A few puzzles to think about when considering use of Isis 2 in your work.

1 Isis 2 Design Choices A few puzzles to think about when considering use of Isis 2 in your work

2 A Service With Mobile Clients Suppose that you are creating a service that will have external clients using web apps or browsers. – Your goal is to load-balance the requests over your service, but your service depends upon some form of dynamically evolving replicated state. – The questions that follow relate to how best to use Isis 2 as a tool in solving this kind of problem. [Diagram: clients A, B and C. Your users are remote and mobile; your server will run in a cloud-hosted data center.]

3 A Service with Mobile Clients True or False: A good use of Isis 2 would be for direct communication of updates between the client systems.

4 A Service with Mobile Clients False: Isis 2 is poorly suited to P2P settings, where there can be a wide variety of communication barriers. The best use of Isis 2 is internal to a data center, where the server runs. [Diagram: clients A, B and C.] Direct peer-to-peer connectivity is often difficult due to firewalls, network address translation and slow links. This is an issue even within a single household! We can count on fairly good connections back to the hosted server in the data center.

5 Client to Server Connectivity Which of these is not a good choice? A. Connect the clients to the data center using a prebuilt web services solution, such as the RESTful service architecture. B. Employ Visual Studio and tell it you want to create a new WCF application. Build on the automatically created WCF client and server templates. C. Launch Isis 2 in all systems, but have the client applications use the built-in “Client of a group” API in Isis 2, and have the group run purely on nodes inside the data center.

6 Client to Server Connectivity C is a poor choice. By default, Isis 2 probably won’t even start correctly in this setting. – It uses IP multicast to find peers during its startup protocol. – Using ISIS_HOSTS and ISIS_UNICAST_ONLY you can help Isis 2 start in this setting, but the overheads of doing so would be pretty high compared to the WCF or RESTful approach. – The “Client” API internal to Isis 2 is intended for cases where one group is using services from another group, not for mobile external users.

7 Isis 2 can help with… A. Maintaining seamless connectivity, so that the mobile users never see a disconnection. B. Maintaining the game state, so that every user sees a consistent, dynamically updated state even when connected to different server instances. C. Real-time coordination, so that activities like multiuser battles are easier to script.

8 Isis 2 can help with… A. Maintaining seamless connectivity, so that the mobile users never see a disconnection. B. Maintaining the game state, so that every user sees a consistent, dynamically updated state even when connected to different server instances. C. Real-time coordination, so that activities like multiuser battles are easier to script. Isis 2 won’t even know about the network links, which will probably use TCP. The Cornell TCP-R technology offers unbreakable TCP links. You could deploy it side by side with Isis 2 to create seamless connectivity. Isis 2 is fast, but it is not a real-time technology. By synchronizing clocks (e.g. using NTP with a good-quality NTP stratum 0 time source) on your servers, you could employ Isis 2 as part of a real-time system.

9 The best option for guaranteed actions is… Suppose a mobile user does some action and we want to guarantee that it will be performed exactly once. We’re running Isis 2 within our data center on the game servers. A. Isis 2 can automatically handle this through a form of primary-backup coordination. B. Isis 2 lacks a solution to this but provides tools that can be used to create a solution in any of several ways, depending on your specific goals.

10 The best option for guaranteed actions is… Suppose a mobile user does some action and we want to guarantee that it will be performed exactly once. We’re running Isis 2 within our data center on the game servers. A. Isis 2 can automatically handle this through a form of primary-backup coordination. B. Isis 2 lacks a solution to this but provides tools that can be used to create a solution in any of several ways, depending on your specific goals. Isis 2 won’t even know about the incoming requests, since they will arrive as WCF or REST events, delivered as upcalls to individual group members. Also, Isis 2 lacks a built-in “do this fault-tolerantly” option.

11 Take an action fault-tolerantly Suppose our group has members {P, Q, R}. Some request arrives at member P from a client, and we wish to perform it exactly once even if failures occur. Which option is best? A. P should relay the request to the whole group, e.g. using g.OrderedSend(). If a client timeout occurs, the client can reissue the request. B. We will need to use the Isis 2 g.SafeSend() disk durability option to solve this problem.

12 Take an action fault-tolerantly Suppose our group has members {P, Q, R}. Some request arrives at member P from a client, and we wish to perform it exactly once even if failures occur. Which option is best? A. P should relay the request to the whole group, e.g. using g.OrderedSend(). If a client timeout occurs, the client can reissue the request. B. We will need to use the Isis 2 g.SafeSend() disk durability option to solve this problem. The SafeSend() protocol in Isis 2 is used when a group is employed as a “wrapper” around replicas of a database external to the group (e.g. a replicated MySQL database, or an Oracle database). For gaming applications, running over a replicated durable database would be too slow, so this is not a good design for the application we have in mind.

13 Relaying a Request In the previous question we decided that P should relay the request and that, if P fails, the mobile client system might reissue it. A. In this situation, Isis 2 would automatically sense a reissued request. Thus if P uses OrderedSend to relay client request X, but then the client asks Q to relay the same request, it will only be delivered once. B. Isis 2 cannot sense this form of duplication. Application code of your own would be needed to sense duplicate requests and perform them just once.

14 Relaying a Request In the previous question we decided that P should relay the request and that, if P fails, the mobile client system might reissue it. A. In this situation, Isis 2 would automatically sense a reissued request. Thus if P uses OrderedSend to relay client request X, but then the client asks Q to relay the same request, it will only be delivered once. B. Isis 2 cannot sense this form of duplication. Application code of your own would be needed to sense duplicate requests and perform them just once. When designing your gaming application, give each request a unique id. Then, if the group receives a duplicated request, you can just replay the same response under the assumption that the mobile application timed out for some reason and missed the original response.
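
The unique-id advice on slide 14 can be sketched roughly as follows. This is a minimal illustration, not Isis 2 code: the class name `RequestDeduplicator`, its `handle` method and the request-id format are all hypothetical, and a real server would keep the response table in the replicated group state so that any member (not just P) can answer a retry.

```python
from typing import Callable, Dict


class RequestDeduplicator:
    """Remembers the response for each request id so a retry is answered, not re-executed.

    Hypothetical helper for illustration; the table here is local and unbounded.
    """

    def __init__(self) -> None:
        self._responses: Dict[str, str] = {}

    def handle(self, request_id: str, action: Callable[[], str]) -> str:
        # Duplicate: replay the stored response without running the action again.
        if request_id in self._responses:
            return self._responses[request_id]
        # First sighting: perform the action once and remember what we answered.
        response = action()
        self._responses[request_id] = response
        return response


# Demo: the client times out and reissues the same request id.
counter = {"moves": 0}


def apply_move() -> str:
    counter["moves"] += 1
    return "move applied"


dedup = RequestDeduplicator()
first = dedup.handle("client-7:req-42", apply_move)
retry = dedup.handle("client-7:req-42", apply_move)
```

In the demo the action runs once even though the request arrives twice, and the retry receives the original response.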

15 Sensing Failures The best way to sense failures would be A. Let Isis 2 do this automatically. You are unlikely to do better and Isis 2 will be very fast in any case. B. One by one ask what failures can occur. For each case try to design a super-fast failure handling solution, which could include telling Isis 2 that one of the group members has failed. C. Connect your service to the Amazon EC2 fault sensing and reporting framework.

16 Sensing Failures The best way to sense failures would be A. Let Isis 2 do this automatically. You are unlikely to do better and Isis 2 will be very fast in any case. B. One by one ask what failures can occur. For each case try to design a super-fast failure handling solution, which could include telling Isis 2 that one of the group members has failed. C. Connect your service to the Amazon EC2 fault sensing and reporting framework. Isis 2 rapidly senses and resends lost messages internal to the data center, so that one case will be handled automatically. But outright failures of the group members will be sensed slowly, after 45-90s by default. Surprisingly, there is no EC2 fault sensing and reporting framework. Most gaming applications end up designing a rapid sensing framework of their own.
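
The slides do not say how such a rapid sensing framework is built. One common ingredient is a heartbeat timeout, sketched below under stated assumptions: the class `HeartbeatMonitor`, its methods and the 500 ms timeout are hypothetical, times are passed in explicitly as integer milliseconds, and the step of reporting a suspect to Isis 2 is deliberately left out because the slides do not name the call for it.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Flags members whose last heartbeat is older than a timeout (hypothetical sketch)."""

    def __init__(self, timeout_ms: int) -> None:
        self.timeout_ms = timeout_ms
        self.last_seen_ms: Dict[str, int] = {}

    def heartbeat(self, member: str, now_ms: int) -> None:
        # Record the most recent time we heard from this member.
        self.last_seen_ms[member] = now_ms

    def suspects(self, now_ms: int) -> List[str]:
        # A member is suspected once it has been silent for longer than the timeout.
        return sorted(
            member
            for member, seen in self.last_seen_ms.items()
            if now_ms - seen > self.timeout_ms
        )


# Demo: P keeps sending heartbeats, Q goes silent after time 0.
monitor = HeartbeatMonitor(timeout_ms=500)
monitor.heartbeat("P", now_ms=0)
monitor.heartbeat("Q", now_ms=0)
monitor.heartbeat("P", now_ms=400)
suspected = monitor.suspects(now_ms=700)
```

A timeout this short trades accuracy for speed: a member that is merely slow can be suspected wrongly, so the threshold has to be tuned for each failure case.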

17 Real-Time In Isis 2 Your gaming system needs a kind of real-time “pulse” that will trigger periodic actions by all the members. But you want consistency! A. Have one leader track the time and then use g.Send() to trigger the pulse. B. Same as A but use g.RawSend() for better speed. C. Synchronize time across the whole group, and just have each group member take actions at the pre-agreed “pulse time” points.

18 Real-Time In Isis 2 Your gaming system needs a kind of real-time “pulse” that will trigger periodic actions by all the members. But you want consistency! A. Have one leader track the time and then use g.Send() to trigger the pulse. B. Same as A but use g.RawSend() for better speed. C. Synchronize time across the whole group, and just have each group member take actions at the pre-agreed “pulse time” points. The CAP theorem tells us that we have a tradeoff here. g.Send() is always consistent, and will normally be very fast. If consistency matters, this is probably the best way to achieve it. RawSend() won’t necessarily reach every member. So it has more steady timing on delivery, but some members might fail to pulse (e.g. if a message is lost, RawSend() won’t try to recover it). This gives the best timing but completely lacks any kind of strong consistency. Also, keep in mind that on shared, virtualized platforms like EC2, even with NTP one may have trouble synchronizing clocks to better than 25-50ms. By renting heavy-weight EC2 instances you can reduce the risk of disruptive scheduling delays.
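
Option C can be illustrated with a small sketch in which each member computes the next pre-agreed pulse point from its own clock. The function name `next_pulse_ms` and the 100 ms period are assumptions made for the example; integer milliseconds are used to avoid floating-point rounding at the boundaries.

```python
def next_pulse_ms(now_ms: int, period_ms: int) -> int:
    """Return the first pulse boundary strictly after now_ms.

    Boundaries are the multiples of period_ms, so members that agree on the
    period compute the same pulse points without exchanging messages.
    """
    if period_ms <= 0:
        raise ValueError("period_ms must be positive")
    return (now_ms // period_ms + 1) * period_ms


# Two members whose clocks differ by 40 ms but sit in the same period agree.
agree_a = next_pulse_ms(1030, 100)
agree_b = next_pulse_ms(1070, 100)

# Two members with only 10 ms of skew that straddle a boundary disagree.
straddle_a = next_pulse_ms(1095, 100)
straddle_b = next_pulse_ms(1105, 100)
```

The second pair shows why this option gives no consistency guarantee: with 25-50ms of clock skew, members near a boundary can pick different pulse points.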

19 Duplicated Computing Certain gaming requests require fairly heavy computing. We want to have two group members perform each such request for fault-tolerance, but how should they be picked? A. Relay the request via OrderedSend, then on receipt, use the group view to select 2 members. They compute the identical answer because data is consistent and both reply. The client takes the first reply and ignores the duplicate. B. Have the external client just send the same request twice. Again, the client just takes the first reply.

20 Duplicated Computing Certain gaming requests require fairly heavy computing. We want to have two group members perform each such request for fault-tolerance, but how should they be picked? A. Relay the request via OrderedSend, then on receipt, use the group view to select 2 members. They compute the identical answer because data is consistent and both reply. The client takes the first reply and ignores the duplicate. B. Have the external client just send the same request twice. Again, the client just takes the first reply. For example, take the request-id and hash it to a number k ∈ {0, …, N-1}. Then have group members k and k+1 (mod N) run the operation for this request. This could work, but keep in mind that the two requests might end up assigned to the same group member. It is hard to completely control the EC2 load-balancer!
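
The hashing rule for option A might look like the sketch below. It assumes members are identified by their rank 0…N-1 in the current group view; the function name `pick_workers` is hypothetical. SHA-256 is used because every member must compute the same k, and Python's built-in `hash()` is randomized per process for strings.

```python
import hashlib
from typing import Tuple


def pick_workers(request_id: str, n_members: int) -> Tuple[int, int]:
    """Map a request id to the ranks k and (k + 1) mod N of the two members that run it."""
    if n_members < 1:
        raise ValueError("the group view must contain at least one member")
    # Deterministic hash: identical on every member that sees the same request id.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    k = int.from_bytes(digest[:8], "big") % n_members
    return k, (k + 1) % n_members


# Demo: each member of a 3-member view runs this on receipt of the relayed request
# and performs the work only if its own rank is one of the two returned.
primary, secondary = pick_workers("client-7:req-42", 3)
```

With a single-member view both ranks are the same, so the request is computed only once.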

21 TCP-R We mentioned the Cornell TCP-R technology. The role of TCP-R is: A. To allow a group member to “take over” a TCP endpoint seamlessly, thus allowing transparent fail-over or migration of computing roles. B. To enhance performance of TCP for real-time and gaming uses by changing flow-control behavior. C. To allow a TCP connection to terminate at a group of endpoints, like the members of an Isis 2 group. All endpoints would deliver identical data.

22 TCP-R We mentioned the Cornell TCP-R technology. The role of TCP-R is: A. To allow a group member to “take over” a TCP endpoint seamlessly, thus allowing transparent fail-over or migration of computing roles. B. To enhance performance of TCP for real-time and gaming uses by changing flow-control behavior. C. To allow a TCP connection to terminate at a group of endpoints, like the members of an Isis 2 group. All endpoints would deliver identical data. When used correctly, a new server (perhaps a backup) can “splice” a new TCP connection to an already-open one that connected to some prior server (perhaps a primary that crashed). TCP-R ensures that not a byte is duplicated or lost, but it does require application help: code you write to checkpoint the TCP-R state and the data sent/received on the connection. In fact there are a number of special versions of TCP for real-time settings. However, to use them on systems like EC2 you would need to run them as application-layer libraries, which is a little tricky to do. The same can be said of TCP-R: none of these options is transparent. There are also TCP fault-tolerance solutions that work this way.

23 TCP-R in action [Diagram labels: TCP connection, TCP-R black box, initial server, checkpoints, standby server, new TCP connection.] The mobile client sees no disruption at all, and the spliced TCP connection looks identical to the old one. Not a byte is duplicated or lost in either direction.

24 Isis 2 plus TCP-R When we say that the application could combine these, we mean that one could use TCP-R to talk to a server group that uses Isis 2 internally to maintain replicated state. – The replicated state functions as the checkpoint. – However, this is still not at all transparent: you must deploy both TCP-R and Isis 2, and your server must still include TCP-R state in the data replicated in the group, and must checkpoint at the proper points in time (as per the TCP-R user manual).

25 Backup takes over Consider a general setting in which a group replicates state such as “actions the external users have requested” or “the game state”. Now a member fails and a backup takes over. A. With Isis 2 this is transparent and seamless. B. Isis 2 delivers events that can trigger the take-over, but the backup will still need to “figure out” what the member had done prior to failing.

26 Backup takes over Consider a general setting in which a group replicates state such as “actions the external users have requested” or “the game state”. Now a member fails and a backup takes over. A. With Isis 2 this is transparent and seamless. B. Isis 2 delivers events that can trigger the take-over, but the backup will still need to “figure out” what the member had done prior to failing. The new-view event tells you who failed, and you also know that any multicasts sent prior to the failure either have been delivered, or were completely erased by the crash. But the backup would often need to query the “external world” to know if actions the failed process was performing had succeeded or not, e.g. if it was updating a database or activating a piece of hardware or performing other kinds of “external” actions.

27 Out-of-Band Tool The Isis 2 OOB (out-of-band file transfer) tool: A. Is used to copy memory-mapped files from node to node, at locations where an Isis 2 application has group members. B. Is helpful when dealing with remote clients that are using web services to send data outside of the Isis 2 system. C. Provides a way for an application to implement a control layer that oversees some other communication technology, such as with SDN.

28 Out-of-Band Tool The Isis 2 OOB (out-of-band file transfer) tool: A. Is used to copy memory-mapped files from node to node, at locations where an Isis 2 application has group members. B. Is helpful when dealing with remote clients that are using web services to send data outside of the Isis 2 system. C. Provides a way for an application to implement a control layer that oversees some other communication technology, such as with SDN. Isis 2 multicast works best for small objects, so with the OOB tools, you can move gigabyte objects as memory-mapped files. The multicasts talk about file names and sizes, but the data itself is moved externally to the group, at very high data rates using a form of nearly direct DMA transfer from source to destination(s). Although your application can certainly use WCF or RESTful technology to support remote mobile clients, Isis 2 wouldn’t have any direct knowledge about them. The OOB technology only works between members of an Isis 2 process group. Although it would certainly be possible to build new tools similar to the OOB tool for managing a software-defined network, we haven’t tried doing that yet with Isis 2.

29 OOB for State Transfer When using the OOB tool to accelerate a state transfer, which of the following is not true? A. One option is to put the state in a mapped file, transfer it via OOB, and have the state transfer itself just point to the mapped file. B. One option is to pre-transfer state, then have the state transfer include just the delta of updates that may have occurred after that pre-transfer. C. OOB cannot be used in this case because the process is not yet a member of the group.

30 When deleting an OOB replica… Suppose that in group {P, Q, R, …} P initially has some large object “X” and uses OOB replication to create new replicas at Q and R. A. The copy at P can be deleted in the same OOBRereplicate request that created the copies at Q and R. B. The copy at P should not be deleted until after the copies for Q and R have been made.

31 When deleting an OOB replica… Suppose that in group {P, Q, R, …} P initially has some large object “X” and uses OOB replication to create new replicas at Q and R. A. The copy at P can be deleted in the same OOBRereplicate request that created the copies at Q and R. B. The copy at P should not be deleted until after the copies for Q and R have been made.

32 OOB for State Transfer When using the OOB tool to accelerate a state transfer, which of the following is not true? A. One option is to put the state in a mapped file, transfer it via OOB, and have the state transfer itself just point to the mapped file. B. One option is to pre-transfer state, then have the state transfer include just the delta of updates that may have occurred after that pre-transfer. C. OOB cannot be used in this case because the process is not yet a member of the group.

33 OOB for State Transfer When using the OOB tool to accelerate a state transfer, which of the following is not true? A. One option is to put the state in a mapped file, transfer it via OOB, and have the state transfer itself just point to the mapped file. B. One option is to pre-transfer state, then have the state transfer include just the delta of updates that may have occurred after that pre-transfer. C. OOB cannot be used in this case because the process is not yet a member of the group. There are several ways to work around the “must be a member” limitation. One can do the OOB transfer in some other group, created just for the purpose, or one can perform the OOB ReReplicate “during” the state transfer event.
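
The pre-transfer-plus-delta idea in option B can be sketched without any Isis 2 calls. This assumes that every update carries a sequence number and that the pre-transferred copy records the last sequence number it reflects; the names `delta_since` and `apply_delta` and the key/value state are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, str, str]  # (sequence number, key, new value)


def delta_since(update_log: List[Update], snapshot_seq: int) -> List[Update]:
    """Select the updates that happened after the pre-transferred snapshot was taken."""
    return [update for update in update_log if update[0] > snapshot_seq]


def apply_delta(state: Dict[str, str], delta: List[Update]) -> Dict[str, str]:
    """Replay the delta in sequence order on top of the pre-transferred state."""
    for _seq, key, value in sorted(delta):
        state[key] = value
    return state


# Demo: the bulk state was pre-transferred when it reflected updates 1 and 2.
update_log: List[Update] = [
    (1, "door", "open"),
    (2, "score", "10"),
    (3, "door", "closed"),
    (4, "score", "15"),
]
pre_transferred = {"door": "open", "score": "10"}
delta = delta_since(update_log, snapshot_seq=2)
joined_state = apply_delta(dict(pre_transferred), delta)
```

Only the two later updates travel in the state transfer itself; the bulk of the state has already been moved.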

