
1 Reliable Distributed Systems Agenda for March 23rd, 2006: Group Membership Services, Use of Multicast to support replication, Virtual Synchrony Model, Applications - Examples All based on Ken Birman’s slide set

2 Architecture Membership Agreement, “join/leave” and “P seems to be unresponsive” 3PC-like protocols use membership changes instead of failure notification Applications use replicated data for high availability

3 Issues? How to “detect” failures Can use timeout Or could use other system monitoring tools and interfaces Sometimes can exploit hardware Tracking membership Basically, need a new replicated service System membership “lists” are the data it manages We’ll say it takes join/leave requests as input and produces “views” as output

4 Architecture [Diagram: application processes X, Y, Z above GMS processes A, B, C, D; join and leave events and an “A seems to have failed” report produce the successive membership views {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}]

5 Issues Group membership service (GMS) has just a small number of members This core set will track membership for a large number of system processes Internally it runs a group membership protocol (GMP) Full system membership list is just replicated data managed by GMS members, updated using multicast

6 GMP design What protocol should we use to track the membership of the GMS itself? Must avoid the split-brain problem Desire continuous availability We’ll see that a version of 3PC can be used But can’t “always” guarantee liveness

7 Multicast Primitives To support replication

8 Ordering properties: FIFO Fifo or sender ordered multicast: fbcast Messages are delivered in the order they were sent (by any single sender) [Timeline diagram: processes p, q, r, s; messages a and e in flight]

9 Ordering properties: FIFO Fifo or sender ordered multicast: fbcast Messages are delivered in the order they were sent (by any single sender) [Timeline diagram: processes p, q, r, s; messages a, b, c, d, e; delivery of c to p is delayed until after b is delivered]

10 Implementing FIFO order Basic reliable multicast algorithm has this property Without failures all we need is to run it on FIFO channels (like TCP, except “wired” to our GMS) With failures we need to be careful about the order in which things are done, but the problem is simple Multithreaded applications: must carefully use locking or order can be lost as soon as delivery occurs!
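The per-sender sequencing behind fbcast can be sketched in a few lines. This is an illustrative Python sketch (the class and names are mine, not part of any toolkit): each receiver tracks the next expected sequence number per sender and buffers anything that arrives early.

```python
from collections import defaultdict

class FifoReceiver:
    """Per-sender FIFO delivery: buffer messages that arrive out of order."""
    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected seq per sender
        self.pending = defaultdict(dict)   # sender -> {seq: msg}

    def receive(self, sender, seq, msg):
        """Return the list of messages that become deliverable."""
        self.pending[sender][seq] = msg
        delivered = []
        while self.next_seq[sender] in self.pending[sender]:
            delivered.append(self.pending[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1
        return delivered

r = FifoReceiver()
print(r.receive("p", 1, "b"))  # -> []  (out of order: buffered)
print(r.receive("p", 0, "a"))  # -> ['a', 'b']  (a releases b too)
```

Note that ordering is only enforced per sender; messages from different senders are delivered in whatever order they arrive, which is exactly fbcast's contract.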

11 Ordering properties: Causal Causal or happens-before ordering: cbcast If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations [Timeline diagram: processes p, q, r, s; messages a and b]

12 Ordering properties: Causal Causal or happens-before ordering: cbcast If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations [Timeline diagram: processes p, q, r, s; delivery of c to p is delayed until after b is delivered]

13 Ordering properties: Causal Causal or happens-before ordering: cbcast If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations [Timeline diagram: processes p, q, r, s; delivery of c to p is delayed until after b is delivered; e is sent (causally) after b]

14 Ordering properties: Causal Causal or happens-before ordering: cbcast If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations [Timeline diagram: processes p, q, r, s; delivery of c to p is delayed until after b is delivered; delivery of e to r is delayed until after b and c are delivered]

15 Implementing causal order Start with a FIFO multicast Frank Schmuck showed that we can always strengthen this into a causal multicast by adding vector time (no additional messages needed) If group membership were static this is easily done, small overhead With dynamic membership, at least abstractly, we need to identify each VT index with the corresponding process, which seems to double the size
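The vector-time delivery test can be illustrated directly. A hedged Python sketch, assuming the static-membership simplification the slide mentions (members are indices 0..n-1): a message from sender j with vector timestamp V is deliverable at a process with local clock L when V[j] == L[j] + 1 and V[k] <= L[k] for every other k.

```python
class CausalReceiver:
    """Sketch of vector-timestamp causal delivery for a static group
    of n members (dynamic membership, as the slide notes, is harder)."""
    def __init__(self, n):
        self.vt = [0] * n          # local vector clock
        self.pending = []          # buffered (sender, vt, msg) triples

    def deliverable(self, sender, mvt):
        return (mvt[sender] == self.vt[sender] + 1 and
                all(mvt[k] <= self.vt[k]
                    for k in range(len(self.vt)) if k != sender))

    def receive(self, sender, mvt, msg):
        """Buffer msg; return everything that becomes deliverable."""
        self.pending.append((sender, mvt, msg))
        delivered, progress = [], True
        while progress:
            progress = False
            for item in list(self.pending):
                s, v, m = item
                if self.deliverable(s, v):
                    self.pending.remove(item)
                    self.vt[s] = v[s]      # advance only the sender's entry
                    delivered.append(m)
                    progress = True
        return delivered

r = CausalReceiver(3)
# c from process 1 causally follows b from process 0
print(r.receive(1, [1, 1, 0], "c"))  # -> []  (b not yet delivered)
print(r.receive(0, [1, 0, 0], "b"))  # -> ['b', 'c']
```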

16 Insights about c/fbcast These two primitives are asynchronous: the sender doesn’t get blocked and can deliver a copy to itself without “stopping” to learn a safe delivery order If used this way, a multicast can sit in the output buffers a long time, leading to surprising behavior But this also gives the system a chance to concatenate multiple small messages into one larger one.

17 Concatenation Application sends 3 asynchronous cbcasts Multicast Subsystem Message layer of multicast system combines them in a single packet

18 State Machine Concept Sometimes, we want a replicated object or service that advances through a series of “state machine transitions” Clearly will need all copies to make the same transitions Leads to a need for totally ordered multicast

19 Ordering properties: Total Total or locally total multicast: abcast Messages are delivered in the same order to all recipients (including the sender) [Timeline diagram: processes p, q, r, s; all deliver a, b, c, d, then e]

20 Ordering properties: Total Can visualize as “closely synchronous” Real delivery is less synchronous, as on the previous slide [Timeline diagram: processes p, q, r, s; all deliver a, b, c, d, then e]

21 We often conceive of causal order as a form of total order! The point is that causal order is totally ordered along any single causal chain. We’ll use this observation later [Timeline diagram: processes p, q, r, s; all receive a, b, c, d, then e]

22 Implementing Total Order Many ways have been proposed Just have a token that moves around Token has a sequence number When you hold the token you can send the next burst of multicasts Extends to a cbcast-based version We use this when there are multiple concurrent threads sending messages Transis and Totem extend VT causal order to a total order But their scheme involves delaying messages and sometimes sends extra multicasts
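The moving-token scheme in the first bullet can be sketched as follows. This is illustrative Python, not Totem's or Transis's actual protocol: the token carries the next global sequence number, the holder stamps its burst of multicasts, and every receiver delivers in stamp order regardless of arrival order.

```python
class TokenSequencer:
    """Toy model of token-based total order: whoever holds the token
    stamps its burst of multicasts with consecutive global numbers."""
    def __init__(self):
        self.next_seq = 0

    def stamp(self, msgs):
        out = [(self.next_seq + i, m) for i, m in enumerate(msgs)]
        self.next_seq += len(msgs)
        return out

token = TokenSequencer()
burst_p = token.stamp(["a", "b"])   # p holds the token first
burst_q = token.stamp(["c"])        # token passes to q
# receivers may get the bursts in any order, but deliver by stamp:
arrived = sorted(burst_q + burst_p)
print([m for _, m in arrived])      # -> ['a', 'b', 'c']
```

The cost the later slide complains about is visible here: a would-be sender must wait until the token reaches it before any of its multicasts can be stamped.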

23 How to think about order? Usually, we think in terms of state machines and total order But all the total order approaches are costly There is always someone who may need to wait for a token or may need to delay delivery Loses benefit of asynchronous execution Could be several orders of magnitude slower! So often we prefer to find sneaky ways to use fbcast or cbcast instead

24 Reliable Distributed Systems Virtual Synchrony

25 A powerful programming model! Called virtual synchrony It offers Process groups with state transfer, automated fault detection and membership reporting Ordered reliable multicast, in several flavors Extremely good performance

26 Why “virtual” synchrony? What would a synchronous execution look like? In what ways is a “virtual” synchrony execution not the same thing?

27 A synchronous execution [Timeline diagram: processes p, q, r, s, t, u] With true synchrony, executions run in genuine lock-step.

28 Virtual Synchrony at a glance With virtual synchrony, executions only look “lock step” to the application [Timeline diagram: processes p, q, r, s, t, u]

29 What about membership changes? The virtual synchrony model synchronizes membership changes with multicasts The idea: between any pair of successive group membership views… … the same set of multicasts is delivered to all members If you implement code, this makes algorithms much simpler for you!
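The rule can be shown with a toy model: before a new view is installed, pending multicasts are flushed to every surviving member, so the two views bracket exactly the same delivery set. This is an illustrative Python sketch of the guarantee, not of any real GMS implementation:

```python
class VsyncGroup:
    """Toy model of the virtual synchrony rule: flush pending multicasts
    to all survivors before installing the next membership view."""
    def __init__(self, members):
        self.view = list(members)
        self.pending = []
        self.delivered = {m: [] for m in members}

    def multicast(self, msg):
        self.pending.append(msg)

    def install_view(self, new_members):
        # terminate pending multicasts in the old view first
        for msg in self.pending:
            for m in self.view:
                if m in new_members:       # survivors all deliver them
                    self.delivered[m].append(msg)
        self.pending = []
        for m in new_members:
            self.delivered.setdefault(m, [])
        self.view = list(new_members)

g = VsyncGroup(["p", "q", "r"])
g.multicast("m1")
g.install_view(["q", "r"])                 # p fails
assert g.delivered["q"] == g.delivered["r"] == ["m1"]
```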

30 Process groups with joins, failures [Diagram: processes p, q, r, s, t; r and s request to join and are added with state transfer, then p fails (crash), then t requests to join and is added with state transfer; successive views G0={p,q}, G1={p,q,r,s}, G2={q,r,s}, G3={q,r,s,t}]

31 Implementation? When membership view is changing, we also need to terminate any pending multicasts Involves wiring the fault-tolerance mechanism of the multicast to the view change notification Tricky but not particularly hard to do Resulting scheme performs well if implemented carefully

32 Virtual Synchrony at a glance [Timeline diagram: processes p, q, r, s, t, u] We use the weakest (hence fastest) form of communication possible

33 Chances to “weaken” ordering Suppose that any conflicting updates are synchronized using some form of locking The multicast sender will have mutual exclusion Hence simply because we used locks, cbcast delivers conflicting updates in the order they were performed! If our system ever does see concurrent multicasts… they must not have conflicted. So it won’t matter if cbcast delivers them in different orders at different recipients!

34 Causally ordered updates Each thread corresponds to a different lock In effect: red “events” never conflict with green ones! [Diagram: processes p, r, s, t; two independent causal chains of updates, numbered 1-5 and 1-2]

35 In general? Replace “safe” (dynamic uniformity) with a standard multicast when possible Replace abcast with cbcast Replace cbcast with fbcast Unless replies are needed, don’t wait for replies to a multicast

36 Why “virtual” synchrony? The user sees what looks like a synchronous execution Simplifies the developer’s task But the actual execution is rather concurrent and asynchronous Maximizes performance Reduces risk that lock-step execution will trigger correlated failures

37 Correlated failures Why do we claim that virtual synchrony makes these less likely? Recall that many programs are buggy Often these are Heisenbugs (order sensitive) With lock-step execution each group member sees group events in identical order So all die in unison With virtual synchrony orders differ So an order-sensitive bug might only kill one group member!

38 Programming with groups Many systems just have one group E.g. replicated bank servers Cluster mimics one highly reliable server But we can also use groups at finer granularity E.g. to replicate a shared data structure Now one process might belong to many groups A further reason that different processes might see different inputs and event orders

39 Embedding groups into “tools” We can design a groups API: pg_join(), pg_leave(), cbcast()… But we can also use groups to build other higher level mechanisms Distributed algorithms, like snapshot Fault-tolerant request execution Publish-subscribe

40 Distributed algorithms Processes that might participate join an appropriate group Now the group view gives a simple leader election rule Everyone sees the same members, in the same order, ranked by when they joined Leader can be, e.g., the “oldest” process
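The leader election rule above needs no messages at all once views exist. A minimal sketch (assuming, as the slide says, the view lists members ranked by join order):

```python
def leader(view):
    """View members are ranked by when they joined; the oldest leads."""
    return view[0]

view = ["p", "q", "r"]                  # p joined first
assert leader(view) == "p"
view = [m for m in view if m != "p"]    # p fails; GMS reports a new view
assert leader(view) == "q"              # everyone agrees without voting
```

The point is that agreement on the leader falls out of agreement on the view: since all members see the same membership list in the same order, they all pick the same process.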

41 Distributed algorithms A group can easily solve consensus Leader multicasts: “what’s your input”? All reply: “Mine is 0. Mine is 1” Initiator picks the most common value and multicasts that: the “decision value” If the leader fails, the new leader just restarts the algorithm
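The leader's decision step can be sketched in a few lines. An illustrative Python fragment (the function name is mine; this models only the "pick the most common value" step, with the collection and decision multicasts abstracted away):

```python
from collections import Counter

def consensus(inputs):
    """Leader picks the most common input as the decision value
    (ties broken arbitrarily, as the slide leaves unspecified)."""
    return Counter(inputs).most_common(1)[0][0]

print(consensus([0, 1, 1]))  # -> 1
```

If the leader fails before multicasting the decision, the next view names a new leader, which simply re-runs the collect-and-decide round.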

42 Distributed algorithms A group can easily do consistent snapshot algorithm Either use cbcast throughout system, or build the algorithm over gbcast Two phases: Start snapshot: a first cbcast Finished: a second cbcast, collect process states and channel logs

43 Distributed algorithms: Summary Leader election Consensus and other forms of agreement like voting Snapshots, hence deadlock detection, auditing, load balancing

44 More tools: fault-tolerance Suppose that we want to offer clients “fault-tolerant request execution” We can replace a traditional service with a group of members Each request is assigned to a primary (ideally, spread the work around) and a backup Primary sends a “cc” of the response to the request to the backup Backup keeps a copy of the request and steps in only if the primary crashes before replying Sometimes called “coordinator/cohort” just to distinguish from “primary/backup”

45 Publish / Subscribe Goal is to support a simple API: Publish(“topic”, message) Subscribe(“topic”, event_handler) We can just create a group for each topic Publish multicasts to the group Subscribers are the members
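The topic-per-group mapping is almost a one-liner. A minimal in-process Python sketch (a real system would back each topic with a process group and a reliable multicast rather than a dict of callbacks):

```python
class PubSub:
    """Map each topic to a 'group': publish = multicast to the group,
    subscribe = join the group. Handlers stand in for member processes."""
    def __init__(self):
        self.groups = {}   # topic -> list of handlers (the "members")

    def subscribe(self, topic, handler):
        self.groups.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        for handler in self.groups.get(topic, []):
            handler(message)

log = []
bus = PubSub()
bus.subscribe("quotes", log.append)
bus.publish("quotes", "IBM 97.25")
bus.publish("weather", "sunny")   # no subscribers: dropped
print(log)                        # -> ['IBM 97.25']
```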

46 Scalability warnings! Many existing group communication systems don’t scale particularly well E.g. JGroups, Ensemble, Spread Group sizes limited to perhaps 50-75 members And individual processes limited to joining perhaps 50-75 groups (Spread: see next slide) Overheads soar as these sizes increase Each group runs protocols oblivious of the others, and this creates huge inefficiency

47 Publish / Subscribe issue? We could have thousands of topics! Too many to directly map topics to groups Instead map topics to a smaller set of groups. The Spread system calls these “lightweight” groups The mapping will result in inaccuracies… Filter incoming messages to discard any not actually destined to the receiver process Cornell’s new QuickSilver system will instead directly support immense numbers of groups
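The hash-and-filter scheme can be sketched concretely. Illustrative Python (the names and the CRC-based mapping are mine, not Spread's actual implementation): many topics hash onto a few real groups, and a member discards deliveries for topics it never subscribed to.

```python
import zlib

N_GROUPS = 4   # far fewer real groups than topics

def group_of(topic):
    """Map a topic to one of a small, fixed set of real groups."""
    return zlib.crc32(topic.encode()) % N_GROUPS

class Member:
    def __init__(self, topics):
        self.topics = set(topics)
        self.joined = {group_of(t) for t in topics}  # real groups joined
        self.seen = []

    def on_multicast(self, group, topic, msg):
        # the mapping is inaccurate: unrelated topics share groups,
        # so each receiver filters on the actual topic
        if group in self.joined and topic in self.topics:
            self.seen.append((topic, msg))

m = Member(["weather"])
g = group_of("weather")
m.on_multicast(g, "weather", "sunny")          # wanted: delivered
m.on_multicast(g, "news", "colliding topic")   # same group, filtered out
print(m.seen)                                  # -> [('weather', 'sunny')]
```

The inefficiency the slide mentions is the filtered case: the message was transported and delivered to the process before being discarded.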

48 Other “toolkit” ideas We could embed group communication into a framework in a “transparent” way Example: CORBA fault-tolerance specification does lock-step replication of deterministic components The client simply can’t see failures But the determinism assumption is painful, and users have been unenthusiastic And exposed to correlated crashes

49 Other similar ideas There was some work on embedding groups into programming languages But many applications want to use them to link programs coded in different languages and systems Hence an interesting curiosity but just a curiosity More work is needed on the whole issue

50 Existing toolkits: challenges Tensions between threading and ordering We need concurrency (threads) for perf. Yet we need to preserve the order in which “events” are delivered This poses a difficult balance for the developers

51 Preserving order [Diagram: the Group Communication Subsystem, a library linked to the application (perhaps with its own daemon processes), delivers messages m1, m2, m3, m4 to application processes p, q, r across views G1={p,q} and G2={p,q,r}]

52 The tradeoff If we deliver these upcalls in separate threads, concurrency increases but order could be lost If we deliver them as a list of events, the application receives events in order, but if it uses thread pools (think SEDA), the order is lost

53 Solution used in Horus This system Delivered upcalls using an event model Each event was numbered User was free to Run a single-threaded app Use a SEDA model Toolkit included an “enter/leave region in order” synchronization primitive Forced threads to enter in event-number order

54 Other toolkit “issues” Does the toolkit distinguish members of a group from clients of that group? In Isis system, a client of a group was able to multicast to it, with vsync properties But only members received events Does the system offer properties “across group boundaries”? For example, using cbcast in multiple groups

55 Features of major virtual synchrony platforms Isis: First and no longer widely used But was perhaps the most successful; has major roles in NYSE, Swiss Exchange, French Air Traffic Control system (two major subsystems of it), US AEGIS Naval warship Also was first to offer a publish-subscribe interface that mapped topics to groups

56 Features of major virtual synchrony platforms Totem and Transis Sibling projects, shortly after Isis Totem (UCSB) went on to become Eternal and was the basis of the CORBA fault- tolerance standard Transis (Hebrew University) became a specialist in tolerating partitioning failures, then explored link between vsync and FLP

57 Features of major virtual synchrony platforms Horus, JGroups and Ensemble All were developed at Cornell: successors to Isis These focus on flexible protocol stack linked directly into application address space A stack is a pile of micro-protocols Can assemble an optimized solution fitted to specific needs of the application by plugging together “properties this application requires”, lego-style The system is optimized to reduce overheads of this compositional style of protocol stack JGroups is very popular. Ensemble is somewhat popular and supported by a user community. Horus works well but is not widely used.

58 JGroups (part of JBoss) Developed by Bela Ban Implements group multicast tools Virtual synchrony was on their “to do” list But they have group views, multicast, weaker forms of reliability Impressive performance! Very popular in the Java community Downloads from http://www.JGroups.org

59 Spread Toolkit Developed at Johns Hopkins Focused on a sort of “RISC” approach Very simple architecture and system Fairly fast, easy to use, rather popular Supports one large group within which the user sees many small “lightweight” subgroups that seem to be free-standing Protocols implemented by Spread “agents” that relay messages to apps

60 Summary? Role of a toolkit is to package commonly used, popular functionality into simple API and programming model Group communication systems have been more popular when offered in toolkits If groups are embedded into programming languages, we limit interoperability If groups are used to transparently replicate deterministic objects, we’re too inflexible Many modern systems let you match the protocol to your application’s requirements

61 Reliable Distributed Systems Applications

62 Applications of GCS Over the past three weeks we’ve heard about group communication Process groups Membership tracking and reporting “new views” Reliable multicast, ordered in various ways Dynamic uniformity (safety), quorum protocols So we know how to build group multicast… but what good are these things?

63 Applications of GCS Today, we’ll review some practical applications of the mechanisms we’ve studied Each is representative of a class Goal is to illustrate the wide scope of these mechanisms, their power, and the ways you might use them in your own work

64 Sample Applications Wrappers and Toolkits Distributed Programming Languages Wrapping a Simple RPC server Wrapping a Web Site Hardening Other Aspects of the Web Unbreakable Stream Connections Reliable Distributed Shared Memory

65 What should the user “see”? Presentation of group communication tools to end users has been a controversial topic for decades! Some schools of thought: Direct interface for creating and using groups Hide in a familiar abstraction like publish-subscribe or Windows event notification Use inside something else, like a cluster mgt. platform or a new programming language Each approach has pros and cons

66 Toolkits Most systems that offer group communication directly have toolkit interfaces User sees a library with various calls and callbacks These are organized into “tools”

67 Style of coding? User writes a program in Java, C, C++, C#... The program declares “handlers” for events like new views, arriving messages Then it joins groups and can send/receive multicasts Normally, it would also use threads to interact with users via a GUI or do other useful things

68 Toolkit approach: Isis
Join a group, with state transfer:
    gid = pg_join("group-name", PG_INIT, init_func, PG_NEWVIEW, got_newview, XFER_IN, rcv_state, XFER_OUT, snd_state, … 0);
A group is created when a join is first issued; in that case the group initializer function, which the user must supply, is called. The “new view” function, also supplied by the user, gets called when the group membership changes. If the group already exists, a leader is automatically selected and its XFER_OUT routine is called. It calls xfer_out repeatedly to send state; each call results in a message delivered to the XFER_IN routine, which extracts the state from the message.
Multicast to a group:
    nr = abcast(gid, REQ, "%s,%d", "a string", 1234, ALL, "%f", &fltvec);
To send a multicast (here, a totally ordered one), you specify the group identifier from a join or lookup, a request code (an integer), and then the message, built using a C-style format string. This abcast wants a reply from all members; the replies are floating point numbers, and the set of replies is stored in a vector specified by the caller. Abcast tells the caller how many replies it actually got (nr): namely, the number of members in the current view.
Register a callback handler for incoming messages:
    isis_entry(REQ, got_msg);
This is how an application registers a callback handler. Here the application is saying that messages with the specified request code should be passed to the procedure got_msg.
Receive a multicast:
    void got_msg(message *mp) { msg_scan("%s,%d", &astring, &anint); reply(mp, "%f", 123.45); }
got_msg gets invoked when a multicast arrives with the matching request code. This particular procedure extracts a string and an integer from the message and sends a reply.

69 Threading A tricky topic in Isis The user needs threads, e.g. to deal with I/O from the client while also listening for incoming messages, or to accept new requests while waiting for replies to an RPC or multicast But the user also needs to know that messages and new views are delivered in order, hence concurrent threads pose issues Solution? Isis acts like a “monitor” with threads, but running them one at a time unless the user explicitly “exits” the monitor

70 A tricky model to work with! We have… Threads, which many people find tricky Virtual synchrony, including choices of ordering A new distributed “abstraction” (groups) Developers will be making lots of choices, some with big performance implications, and this is a negative

71 Examples of tools in toolkit Group join, state xfer Leader selection Holding a “token” Checkpointing a group Data replication Locking Primary-backup Load-balancing Distributed snapshot

72 How toolkits work They offer a programmer API More procedures, e.g. Create_replicated_data(“name”, type) Lock_replica(“name”) Update_replica(“name”, value) V = (type)Read_replica(“name”) Internally, these use groups & multicast Perhaps, asynchronous cbcast as discussed last week… Toolkit builder optimizes extensively, etc…
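The "internally, these use groups & multicast" claim can be made concrete with a toy replica. Illustrative Python (the class and the stand-in abcast are mine): because every replica applies the same totally ordered stream of updates, all copies stay identical.

```python
class ReplicatedData:
    """Sketch of the toolkit internals: each replica applies the same
    ordered update stream, so all copies converge to the same state."""
    def __init__(self):
        self.store = {}

    def apply(self, update):
        name, value = update
        self.store[name] = value

def abcast(replicas, update):
    # stand-in for the real multicast: delivers the update to every
    # member in the same total order
    for r in replicas:
        r.apply(update)

replicas = [ReplicatedData() for _ in range(3)]
abcast(replicas, ("x", 1))
abcast(replicas, ("x", 2))
assert all(r.store == {"x": 2} for r in replicas)
```

If the toolkit can prove that updates to "x" are serialized by a lock, the same result holds with the cheaper asynchronous cbcast, which is exactly the optimization discussed earlier.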

73 How programmers use toolkits Two main styles Replicating a data structure For example, “air traffic sector D-5” Consists of all the data associated with that structure… could be quite elaborate Processes sharing the structure could be very different (maybe not even the same language) Replicating a service For high availability, load-balancing

74 Experience is mixed…. Note that many systems use group communication but don’t offer “toolkits” to developers/end users Major toolkit successes include New York and Swiss Stock Exchange, French Air Traffic Control System, US AEGIS warship, various VLSI Fab systems, etc But building them demanded special programmer expertise and knowledge of a large, complex platform Not every tool works in every situation! Performance surprises & idiosyncratic behavior common. Toolkits never caught on the way that transactions became standard But there are several popular toolkits, like JGroups, Spread and Ensemble. Many people do use them

75 Leads to notion of “wrappers” Suppose that we could have a magic wand and wave it at some system component “Replicatum transparentus!” Could we “automate” the use of tools and hide the details from programmers?

76 Wrapper examples Transparently… Take an existing service and “wrap” it so as to replicate inputs, making it fault- tolerant Take a file or database and “wrap” it so that it will be replicated for high availability Take a communication channel and “wrap” it so that instead of connecting to a single server, it connects to a group

77 Experience with wrappers? Transparency isn’t always a good thing CORBA has a fault-tolerance wrapper In CORBA, programs are “active objects” The wrapper requires that these be deterministic objects with no GUI (e.g. servers) CORBA replaces the object with a group, and uses abcast to send requests to the group. Members do the same thing, “state machine” style So replies are identical. Give the client the first one

78 Why CORBA f.tol. was a flop Users find the determinism assumption too constraining Prevents use of threads, shared memory, system clock, timers, multiple I/O channels… Real programs sometimes use these sorts of things unknown to the programmer Who knows how the .NET I/O library was programmed by Microsoft? Could it have threads inside, or timers? Moreover, costs were high Twice as much hardware… slower performance! Also, had to purchase the technology separately from your basic ORB (and for almost the same price)

79 Files and databases? Here, issue is that there are other ways to solve the same problem A file, for example, could be put on a RAID file server This provides high speed and high capacity and fault-tolerance too Software replication can’t easily compete

80 How about “TCP to a group?” This is a neat application and very interesting to discuss. We saw it before. Let’s look at it again, carefully Goals: Client system runs standard, unchanged TCP Server replaced by a group… leader owns the TCP endpoint but if it crashes, someone else takes over and client sees no disruption at all!

81 How would this work? Revisit idea from before: Reminder: TCP is a kind of state machine Events occur (incoming IP packets, timeouts, read/write requests from app) These trigger “actions” (sending data packets, acks, nacks, retransmission) We can potentially checkpoint the state of a TCP connection or even replicate it in realtime!

82 How to “move” a TCP connection We need to move the IP address We know that in the modern internet, IP addresses do move, all the time NATs and firewalls do this, why can’t we? We would also need to move the TCP connection “state” Depending on how TCP was implemented this may actually be easy!

83 Migrating a TCP connection [Diagram: client, initial server, new server] The client “knows” the server by its TCP endpoint: an IP address and port that speak TCP and hold the state of this connection. The server-side state consists of the contents of the TCP window (on the server), the socket to which the IP address and port are bound, and timeouts or ACK/NACK “pending actions”. We can write this into a checkpoint record. We transmit the TCP state (with any other tasks we migrate) to the new server, which opens a socket, binds to the SAME IP address, and initializes its TCP stack out of the checkpoint received from the old server. The old server discards its connection endpoint. The client never even notices that the channel endpoint was moved!

84 TCP connection state Includes: The IP address, port # used by the client and the IP address and port on the server Best to think of the server as temporarily exhibiting a “virtual address” That address can be moved Contents of the TCP “window” We can write this down and move it too ACK/NACK state, timeouts
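The checkpoint record the slide describes can be sketched as a plain data structure. This is an illustrative Python sketch: the field names are mine and this is not a real kernel structure, but it captures the pieces slide 84 says must move with the connection.

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class TcpConnState:
    """The connection state that must migrate (illustrative fields)."""
    client_addr: tuple            # client IP address and port
    server_addr: tuple            # the "virtual" server IP and port
    window: bytes = b""           # contents of the TCP window
    pending_acks: list = field(default_factory=list)
    timeouts: list = field(default_factory=list)

def checkpoint(state):
    """Serialize the connection state into a checkpoint record."""
    return pickle.dumps(state)

def restore(blob):
    """The new server rebuilds its TCP stack from the checkpoint,
    then rebinds the same virtual IP address and port."""
    return pickle.loads(blob)

s = TcpConnState(("10.0.0.7", 43210), ("192.168.1.5", 80), b"GET /")
assert restore(checkpoint(s)) == s
```

In a live system the record would be fbcast to the replica on every TCP event rather than written once, which is exactly the generalization the next slides develop.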

85 Generalizing the idea Create a process group Use multicasts when each event occurs (abcast) All replicas can track state of the leader Now if a new view shows that the leader has failed, a replica can take over by binding to the IP address

86 Fault-tolerant TCP connection client Initial Server New Server With replication technology we could continuously replicate the connection state (as well as any “per task” state needed by the server)

87 Fault-tolerant TCP connection client Initial Server New Server After a failure, the new server could take over, masking the fault. The client doesn’t notice anything

88 What’s new? Before we didn’t know much about multicast… now we do This lets us ask how costly the solution would be In particular Which multicast should be used? When would a delay be incurred?

89 Choice of multicast We need to be sure that everyone sees events in the identical order Sounds like abcast But in fact there is only a single sender at a time, namely the leader Fbcast is actually adequate! Advantage: leader doesn’t need to multicast to itself, only to the replicas

90 Timeline picture [Timeline diagram: client, leader, replica] The leader is bound to the IP address. An IP packet generated by TCP arrives from the client, and the leader fbcasts the “event description” to the replica. The leader doesn’t need to wait (to “sync”) here, because the client can’t see any evidence of the leader’s TCP protocol stack state. The leader does need to wait (to “sync”) before sending a reply IP packet to the client, to be sure that if it crashes, the client’s TCP stack will be in the same state as its own was. On the leader’s failure, the replica binds to the IP address and now owns the TCP stack.

91 Asynchronous multicast This term is used when we can send a multicast without waiting for replies Our example uses asynchronous fbcast An especially cheap protocol: often just sends a UDP packet Acks and so forth can happen later and be amortized over many multicasts “Sync” is slower: must wait for an ack But often occurs in background while leader is processing the request, “hiding” the cost!

92 Sources of delay? Building event messages to represent TCP state, sending them But this can occur concurrently with handing data to the application and letting it do whatever work is required Unless TCP data is huge, delay is very small Synchronization before sending packets of any kind to client Must be certain that replica is in the identical state

93 How visible will delay be? This version of TCP may show overhead for very small round-trip interactions: it puts the sync event right in the measured RTT path Although the replica is probably close by with a very fast connection to the leader, whereas the client is probably far away with a slow connection… But it could seem pretty much as fast as normal TCP if the application runs for a long time, since that time will hide the delay of synchronizing leader with replica!

94 Using our solution? Now we can wrap a web site or some other service Run one copy on each of two or more machines Use our replicated TCP Application sees identical inputs and produces identical outputs…

95 Repeat of CORBA f.tol. idea? Not exactly… We do need determinism with respect to the TCP inputs But in fact we don’t need to legislate that “the application must be a deterministic object” Users could, for example, use threads as long as they ensure that identical TCP inputs result in identical replies

96 Determinism worry Recall that CORBA transparently replicates objects But insists that they be deterministic And this was an unpopular requirement Our “Web Services wrapper” does too But only requires determinism with respect to the TCP inputs The server could be quite concurrent as long as its state and actions will be identical given same TCP request sequence: a less demanding requirement

97 Would users accept this? Unknown: This style of wrapping has never been explored in commercial products But the idea seems appealing… perhaps someone in the class will find out…

98 Distributed shared memory A new goal: software DSM Looks like a memory-mapped file But data is automatically replicated, so all users see identical content Requires a way for DSM server to intercept write operations

99 Some insights that might help Assume that programs have locality In particular, that there tends to be one writer in a given DSM page at a time Moreover, that both writers and readers get some form of locks first Why are these legitimate assumptions? Lacking them, the application would be highly non-deterministic and probably incorrect

100 So what’s the model? Application “maps” a region of memory While running, it sometimes Acquires a read or write lock Then for a period of time reads or writes some part of the DSM (some “pages”) Then releases the lock Gee… this is just our distributed replication model in a new form!

101 To implement this DSM… We need a way to Implement the mapping Detect that a page has become dirty Invoke our communication primitives when a lock is requested or released Idea: Use the Linux mapped file primitives and build a DSM “daemon” to send updates Intercept Linux semaphore operations for synchronization
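The release-triggered update the daemon performs can be sketched as a toy model. Illustrative Python (the class name and structure are mine, not the actual student DSMD): writes under a page's lock mark the page dirty, and releasing the lock multicasts the page contents to every replica.

```python
class DsmDaemon:
    """Toy DSMD: on lock release, propagate the dirty page so every
    replica's mapped region converges (release consistency, roughly)."""
    def __init__(self, n_pages, peers):
        self.pages = [b""] * n_pages
        self.dirty = set()
        self.peers = peers      # the other daemons' regions

    def write(self, page, data):
        # caller is assumed to hold the page's semaphore
        self.pages[page] = data
        self.dirty.add(page)

    def release(self, page):
        # the semaphore release is what triggers the multicast
        if page in self.dirty:
            for peer in self.peers:        # stand-in for fbcast/cbcast
                peer.pages[page] = self.pages[page]
            self.dirty.discard(page)

b = DsmDaemon(4, [])
a = DsmDaemon(4, [b])
a.write(0, b"hello")
a.release(0)
assert b.pages[0] == b"hello"
```

Notice that nothing propagates between acquire and release, which is precisely why the model depends on readers also taking the lock.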

102 DSM with a daemon [Diagram: application, wrapper, DSMD] The wrapper intercepts mmap and semaphore operations and redirects those associated with the shared memory region to the DSMD. We’ll assume that the developer comes up with a sensible convention for associating semaphores either with entire mapped regions, or with pages of them. Mmap creates shared memory regions. The DSMD will multicast the contents of a page when the associated semaphore lock is released. Properties of the multicast and of the locking “protocol” determine the DSM properties seen by the user. The user doesn’t use multicast directly.

103 Design choices? We need to decide how semaphores are associated with the mapped memory E.g. could have one semaphore for the whole region; treat it as an exclusive lock Or could have one per page Could even implement a readers/writers mechanism, although this would depart from the Linux semaphore API

104 Design choices? Must also pick a memory coherency model: Strong consistency: The DSM behaves like a single non-replicated memory Weak consistency: The DSM can be highly inconsistent. Updates propagate after an unspecified and possibly long delay, and copies of the mapped region may differ Release consistency (DASH project): Requires locking for mutual exclusion; consistent as long as locking is used Causal consistency (Neiger and Hutto): If DSM update a → b, then b will observe the results of a.

105 Best choice? We should probably pick release consistency or causal consistency Release consistency requires fbcast Causal consistency would use cbcast The updates end up totally ordered along mutual exclusion paths and the primitive is strong enough to maintain this delivery ordering at all copies

106 False sharing One issue designer must worry about Suppose multiple independent objects map to the same page but have distinct locks In a traditional hardware DSM page ends up ping-ponging between the machines In our solution, this just won’t work! Our mechanism requires that there be one lock per “page”

107 Would this work? In fact it can work extremely well In years past, students have implemented this form of DSM as a course project Performance is remarkably good if the application “understands” the properties of the DSM Notice that DSM is really just a different API for offering multicast to user…

108 “Tools” we didn’t discuss today Many people like publish-subscribe Could just map topics to groups But this requires that the group communication system scale extremely well in the numbers of groups, a property not all GCS platforms exhibit Interesting current research topic JGroups, Ensemble just have regular groups and can’t handle apps that create millions of them Spread tackles this with “lightweight” groups… but this has some overheads (it delivers, then discards, extra messages) QuickSilver is now investigating a new approach

109 Recap of today’s lecture … we’ve looked at each of these topics and seen that with a group multicast platform, the problem isn’t hard to solve Wrappers and Toolkits Distributed Programming Languages Wrapping a Simple RPC server Wrapping a Web Site Hardening Other Aspects of the Web Unbreakable Stream Connections Reliable Distributed Shared Memory [skipped]

