Virtual synchrony Ken Birman.

Virtual synchrony Ken Birman

Virtual Synchrony Goal: Simplifies distributed systems development by introducing emulating a simplified world – a synchronous one Features of the virtual synchrony model Process groups with state transfer, automated fault detection and membership reporting Ordered reliable multicast, in several flavors Fault-tolerance, replication tools layered on top Extremely good performance

Process groups Offered as a new and fundamental programming abstraction Just a set of application processes that cooperate for some purpose Could replicate data, coordinate handling of incoming requests or events, perform parallel tasks, or have a shared perspective on some sort of “fact” about the system Can create many of them* * Within limits... Many systems only had limited scalability

Why “virtual” synchrony?
What would a synchronous execution look like? In what ways is a “virtual” synchrony execution not the same thing?

A synchronous execution
p q r s t u With true synchrony executions run in genuine lock-step.

Virtual Synchrony at a glance
p q r s t u With virtual synchrony executions only look “lock step” to the application

Virtual Synchrony at a glance
p q r s t u We use the weakest (least ordered, hence fastest) form of communication possible

Chances to “weaken” ordering
Suppose that any conflicting updates are synchronized using some form of locking Multicast sender will have mutual exclusion Hence simply because we used locks, cbcast delivers conflicting updates in order they were performed! If our system ever does see concurrent multicasts… they must not have conflicted. So it won’t matter if cbcast delivers them in different orders at different recipients!

Causally ordered updates
Each thread corresponds to a different lock In effect: red “events” never conflict with green ones! 2 5 p 1 r s 3 t 2 1 4

In general? Replace “safe” (dynamic uniformity) with a standard multicast when possible Replace abcast with cbcast Replace cbcast with fbcast Unless replies are needed, don’t wait for replies to a multicast

Why “virtual” synchrony?
The user writes code as it will experience a purely synchronous execution Simplifies the developer’s task – very few cases to worry about, and all group members see the same thing at the same “time” But the actual execution is rather concurrent and asynchronous Maximizes performance Reduces risk that lock-step execution will trigger correlated failures

Why groups? Other concurrent work, such as Lamport’s state machines, treat the entire program as a deterministic entity and replicate it But a group replicates state at the “abstract data type” level Each group can correspond to one object This is a good fit with modern styles of application development

Correlated failures Perhaps surprisingly, experiments showed that virtual synchrony makes these less likely! Recall that many programs are buggy Often these are Heisenbugs (order sensitive) With lock-step execution each group member sees group events in identical order So all die in unison With virtual synchrony orders differ So an order-sensitive bug might only kill one group member!

Programming with groups
Many systems just have one group E.g. replicated bank servers Cluster mimics one highly reliable server But we can also use groups at finer granularity E.g. to replicate a shared data structure Now one process might belong to many groups A further reason that different processes might see different inputs and event orders

Embedding groups into “tools”
We can design a groups API: pg_join(), pg_leave(), cbcast()… But we can also use groups to build other higher level mechanisms Distributed algorithms, like snapshot Fault-tolerant request execution Publish-subscribe

Distributed algorithms
Processes that might participate join an appropriate group Now the group view gives a simple leader election rule Everyone sees the same members, in the same order, ranked by when they joined Leader can be, e.g., the “oldest” process

A group can easily solve consensus Leader multicasts: “what’s your input”? All reply: “Mine is 0. Mine is 1” Initiator picks the most common value and multicasts that: the “decision value” If the leader fails, the new leader just restarts the algorithm Puzzle: Does FLP apply here?

A group can easily do consistent snapshot algorithm Either use cbcast throughout system, or build the algorithm over gbcast Two phases: Start snapshot: a first cbcast Finished: a second cbcast, collect process states and channel logs

Distributed algorithms: Summary
Leader election Consensus and other forms of agreement like voting Snapshots, hence deadlock detection, auditing, load balancing

More tools: fault-tolerance
Suppose that we want to offer clients “fault-tolerant request execution” We can replace a traditional service with a group of members Each request is assigned to a primary (ideally, spread the work around) and a backup Primary sends a “cc” of the response to the request to the backup Backup keeps a copy of the request and steps in only if the primary crashes before replying Sometimes called “coordinator/cohort” just to distinguish from “primary/backup”

Coordinator-cohort p q r s t u Q assigned as coordinator for t’s request…. But p takes over if q fails

Coordinator-cohort p q r s t u P picked to perform u’s request. Q stands by until it sees request completion message

Parallel processing p q r s t P and Q split a task. P performs part 1 of 2; Q performs part 2 of 2. Such as searching a large database… they agree on the initial state in which the request was received

Parallel processing p q r s t In this example, r is the cohort and both p and q function as coordinators. If either fails, r can step in and take over its role….

…. As it did in this case, when q fails
Parallel processing p q r s t …. As it did in this case, when q fails

Publish / Subscribe Goal is to support a simple API:
Publish(“topic”, message) Subscribe(“topic”, event_hander) We can just create a group for each topic Publish multicasts to the group Subscribers are the members

Scalability warnings! Many existing group communication systems don’t scale incredibly well E.g. JGroups, Isis, Horus, Ensemble, Spread Group sizes limited to perhaps members And individual processes limited to joining perhaps groups (lightweight groups an exception) Overheads soar as these sizes increase Each group runs protocols oblivious of the others, and this creates huge inefficiency

Publish / Subscribe issue?
We could have thousands of topics! Too many to directly map topics to groups Instead map topics to a smaller set of groups. SPREAD system calls these “lightweight” groups (idea traces to work done by Glade on Isis) Mapping will result in inaccuracies… Filter incoming messages to discard any not actually destined to the receiver process Cornell’s new QuickSilver system instead directly supports immense numbers of groups

Other “toolkit” ideas We could embed group communication into a framework in a “transparent” way Example: CORBA fault-tolerance specification does lock-step replication of deterministic components The client simply can’t see failures But the determinism assumption is painful, and users have been unenthusiastic And exposed to correlated crashes

Other similar ideas There was some work on embedding groups into programming languages But many applications want to use them to link programs coded in different languages and systems Hence an interesting curiosity but just a curiosity Quicksilver: Transparently embeds groups into Windows

Existing toolkits: challenges
Tensions between threading and ordering We need concurrency (threads) for perf. Yet we need to preserve the order in which “events” are delivered This poses a difficult balance for the developers

Features of major virtual synchrony platforms
Isis: First and no longer widely used But was the most successful; has major roles in NYSE, Swiss Exchange, French Air Traffic Control system (two major subsystems of it), US AEGIS Naval warship Also was first to offer a publish-subscribe interface that mapped topics to groups

Totem and Transis Sibling projects, shortly after Isis Totem (UCSB) went on to become Eternal and was the basis of the CORBA fault-tolerance standard Transis (Hebrew University) became a specialist in tolerating partitioning failures, then explored link between vsync and FLP

Horus, JGroups and Ensemble All were developed at Cornell: successors to Isis These focus on flexible protocol stack linked directly into application address space A stack is a pile of micro-protocols Can assemble an optimized solution fitted to specific needs of the application by plugging together “properties this application requires”, lego-style The system is optimized to reduce overheads of this compositional style of protocol stack JGroups is very popular. Ensemble is somewhat popular and supported by a user community. Horus works well but is not widely used.

Horus/JGroups/Ensemble protocol stacks
Application belongs to process group total total parcld fc merge mbrshp mbrshp frag mbrshp frag frag nak frag nak nak comm nak comm comm comm comm

QuickSilver Scalable Multicast
Thinking beyond Web 2.0 Krzysztof Ostrowski, Ken Birman Cornell University We know that with MSN Live, the company is looking at an era beyond today's relatively static web and relatively static search technologies. And we think that scalable publish-subscribe may be the kind of disruptive technical story that can take the web to a completely new level by enabling a tremendous range of applications that are simply out of reach today.

Here's one example. Suppose that Live were to feature something we‘re going to call a 3-dimensional “virtual world” composed of a number of “virtual rooms”. The user experience would be somewhat like that of playing a massively multiplayer game, but the content would be provided by the users rather than Microsoft. Users would create such places much in a way they created web pages today, by designing the appearance, posting multimedia content, but unlike in the today’s static web, users could invite friends, and interact with them in 3-dimensional space, just as they would in a game. Virtual rooms would thus be interactive. For practical purposes, they would also need to be decentralized. It’s infeasible for the user who created a virtual room to have his home machine act as a server. Likewise, it’s infeasible for Microsoft to act as a game server farm and a proxy for all communication in all virtual rooms of the tens of millions of users, just as it’s infeasible for Microsoft to host all Internet content. Users should thus be able to interact directly with one another, without a server in the middle.

A virtual room is an example of a service that needs to be simultaneously interactive and decentralized. Today’s main service-building technologies are ill-suited for building such services, for they can be either interactive or decentralized, but rarely both.

For example, client-server allows to build interactive services, where a number of clients can concurrently, reliably, and consistently modify the service’s state. However, making such systems scale is costly, requires complex infrastructure and maintenance.

Peer to peer systems are very scalable, cheap, self-
Peer to peer systems are very scalable, cheap, self-*, maintenance-free etc., but since the flow of data is usually unidirectional, such systems are not interactive in the sense defined above, and they usually offer only very weak end-to-end reliability guarantees, if any.

We believe that a scalable and reliable variant of the publish-subscribe paradigm, smoothly integrated with the operating system and a development platform, could be a solution.

“Publish-Subscribe” Services (I)
Here’s how it would work in the context of our “virtual worlds” example. Each room in a game would represent a publish-subscribe topic. Players would subscribe or unsubscribe to it as they enter or leave the room. The state of the room, such as player’s positions, the appearance of the room, what each of the players just said etc. would be replicated among all players currently inside. Such state would be loaded by a player upon entering, and subsequently updated as the player receivers messages from other players. Such messages with updates to the room’s state would be disseminated by players who move, shoot, speak, post new content etc., by multicasting them to all players currently in the room reliably and applied in a consistent manner, e.g. in the same order everywhere.

“Publish-Subscribe” Services (II)
In other applications, publish-subscribe topics could represent e.g. collaboratively edited documents, sets of services in a data center that jointly process a given type of customer order, sets of devices governed by a given security policy etc. From this perspective, topics would be created and used very casually, like files or objects.

New Style of Programming
Topics = Objects Topic x = Internet.Enter(“Game X”); Topic y = x.Enter(“Room X”); y.OnShoot += new EventHandler(this.TurnAround); while (true) y.Shoot(new Vector(1,0,0)); This perspective leads to new ways of programming. Publish-subscribe topics can be neatly embedded in a programming language, similarly to how database access is integrated into .NET with Language Integrated Queries (LINQ) in Visual Studio “Orcas”. This would allow even users unfamiliar with publish-subscribe, or with distributed programming, to create such interactive, decentralized services, much in a way Windows Communication Foundation today allows a user unfamiliar with web technologies to create web services.

Similarly, publish-subscribe topics could be made to look and feel just like files to the user. In the example shown here, topics are shown as files in a virtual folder hierarchy through a shell extension (under development for QS/2). The list of topics is obtained by the OS by querying the known “contact points”, (in QS/2, typically a local, DNS-like service or a known provider). Upon right-clicking on a topic, the user is presented with the list of available actions, corresponding to the various ways the topic can be accessed by the user, e.g. “listen” or “record” in case of an audio stream. The actions represent different types of applications that are locally available, and the list is compiled automatically by the OS. Upon the user selecting an action, the OS makes the client application subscribe to the topic, and the application takes it from there, perhaps inspecting subtopics and letting the user navigate among them further.

Typed Publish-Subscribe
The matching is driven by type signatures, assigned to both topics and the client components. In QS/2, such signatures may describe the type of content that is being replicated or disseminated, reliability or security properties of it etc.

Where does QuickSilver belong?
To realize the vision laid out, a publish-subscribe engine must be an integral part of the operating system, somewhat like the .NET framework. The same publish-subscribe topics would then be accessible in a variety of ways: as files through a shell extension, as socket-like objects or other language primitives in the application, or in a way similar to web services in Visual Studio, where the developer can generate a client code stub with all the necessary behaviors and placeholders for the given topic by a single mouse-click.

A number of technologies exist today that are similar to “publish-subscribe”, but none provides all the features that are required to realize our vision. Reliability is essential to reliably disseminate and consistently apply the updates to the replicated state. Scalability is required, in all major dimensions, e.g. to support thousands of topics. Very high performance is needed to support video streaming where throughput is important, games where latency is essential, or to make the technology attractive to data center architects who would expect the engine to utilize the full power of the hardware platform on which it runs. We need clean system embeddings of the sort we described to make publish-subscribe easy to use and develop with. Finally, to make the new technology as successful as web services, we need interoperability standards.

The QuickSilver platform aims at providing all of these features
The QuickSilver platform aims at providing all of these features. The platform consists of two components, which represent our steps towards achieving this goal. QSM is a streaming multicast system that works at network speeds and provides a simple acknowledgement-based reliability property, but unlike other such systems, scales in all major dimensions. The main goal of this system, in the context of our work, is to demonstrate that performance and scalability is not a showstopper. A second, equally important goal is to demonstrate that the first goal is feasible to achieve in a managed environment, an essential requirement for us to be able to tightly integrated QuickSilver with .NET runtime and type system. QSM is a finished system, available for download today. QS/2, the second version of QuickSilver, is currently under construction. The latter will provide very strong reliability guarantees, including virtual synchrony, transaction and consensus-like semantics and even user-defined custom types of reliability, as well as system embeddings and an extremely flexible and extensible, modular architecture.

QuickSilver Scalable Multicast
Simple ACK-based reliability property Managed code (.NET, 95%C#, 5%MC++) Entire QuickSilver platform: ~250 KLOC Throughputs close to network speeds Scalable in multiple dimensions Tested with up to ~200 nodes, 8K groups Robust against a range of perturbances Free: QSM is written entirely in .NET, mostly in C#, with only a small amount of code in C++ to implement our event-scheduling engine using Windows I/O completion ports. On our 100Mbps network, with uni-processor 1.3GHz cluster nodes, QSM provides performance very close to network limits. Within the ranges in which we tested our system, with up to 192 nodes and 8192 groups, performance degrades by only a few percent. Furthermore, QSM well tolerates many types of perturbances.

Here are some numbers to illustrate this
Here are some numbers to illustrate this. In the two pictures on the left, we measure throughput as a function of the number of nodes (with a single topic, and 1 or 2 senders multicasting simultaneously), or as a function of the number of topics (all of which, in this experiment, completely and exactly overlap on the same set of 110 receivers). In the upper right corner, we measure the total CPU utilization as a function of throughput on sender and receiver nodes in an experiment with a single sender multicasting to a single 110-member topic. Finally, in the bottom right corner, we demonstrate how QSM responds to perturbances. Here again, a single sender multicasts to a single 110-node group of receivers, at the maximum possible rate. A single receiver is perturbed, either (in the “freezes” scenario) by periodically, every 5 seconds, spinning in a “for” loop for 0.5s (QSM is single-threaded, hence from the QSM perspective, this stops the flow of time), or (in the “losses” scenario) by periodically, every 1 second, making QSM drop all incoming data and control packets for 10ms, thus amounting to 1% loss (in reality, the measured loss is 4-5% due to the negative consequences of the system being disrupted, with the repair traffic flowing in parallel the regular multicast).

Summary? Role of a toolkit is to package commonly used, popular functionality into simple API and programming model Group communication systems have been more popular when offered in toolkits If groups are embedded into programming languages, we limit interoperability If groups are used to transparently replicate deterministic objects, we’re too inflexible Many modern systems let you match the protocol to your application’s requirements

Virtual synchrony Ken Birman.

Similar presentations

Presentation on theme: "Virtual synchrony Ken Birman."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Virtual synchrony Ken Birman.

Similar presentations

Presentation on theme: "Virtual synchrony Ken Birman."— Presentation transcript:

Similar presentations

About project

Feedback