1
Approved for Public Release, Distribution Unlimited
July 21, 2004

QuickSilver: Middleware for Scalable Self-Regenerative Systems

Cornell University: Ken Birman, Johannes Gehrke, Paul Francis, Robbert van Renesse, Werner Vogels
Raytheon Corporation: Lou DiPalma, Paul Work
2
Our topic
- Computing systems are growing larger and more complex, and we hope to use them in an increasingly "unattended" manner
- But the technology for managing growth and complexity is lagging
3
Our goal
- Build a new platform in support of massively scalable, self-regenerative applications
- Demonstrate it by offering a specific military application interface
- Work with Raytheon to apply it in other military settings
4
Representative scenarios
- Massive data centers maintained by the military (or by companies like Amazon)
- Enormous publish-subscribe information bus systems (broadly, OSD calls these GIG and NCES systems)
- Deployments of large numbers of lightweight sensors
- New network architectures to control autonomous vehicles over media shared with other "mundane" applications
5
How to approach the problem?
- The Web Services architecture has emerged as a likely standard for large systems
- But WS is "document oriented" and lacks:
  - High availability (or any kind of quick-response guarantees)
  - A convincing scalability story
  - Self-monitoring/adaptation features
6
Signs of trouble?
- Most technologies are pushed well beyond their normal scalability limits in this kind of center: we are "good" at small clusters but not huge ones
- Pub-sub was a big hit. No longer…
  - Curious side-bar: it is used heavily for point-to-point communication! (Why?)
- Problems are extremely hard to diagnose
7
We lack the right tools!
- Today, our applications navigate in the dark
  - They lack a way to find things
  - They lack a way to sense system state
  - There are no rules for adaptation, if/when it is needed
- In effect, we are starting to build very big systems, yet doing so in the usual client-server manner
  - This denies applications any information about system state, configuration, loads, etc.
8
QuickSilver
QuickSilver is a platform to help developers build these massive new systems. It has four major components:
- Astrolabe: a novel kind of "virtual database"
- Bimodal Multicast: for faster "few to many" data transfer patterns
- Kelips: a fast "lookup" mechanism
- Group replication technologies based on virtual synchrony or other similar models
9
QuickSilver Architecture
(Layer diagram: pub-sub (JMS, JBI) and native APIs; massively scalable group communication; composable microprotocol stacks; monitoring, indexing, and a message repository; overlay networks; distributed query and event detection.)
10
ASTROLABE
Astrolabe's role is to collect and report system state, which is used for many purposes including self-configuration and repair.
11
What does Astrolabe do?
- Astrolabe's role is to track information residing at a vast number of sources
- It is structured to look like a database
- Approach: "peer-to-peer gossip". Basically, each machine has a piece of a jigsaw puzzle; assemble it on the fly.
12
Astrolabe in a single domain

Name     | Load | Weblogic? | SMTP? | Word Version | …
swift    | 2.0  | 0         | 1     | 6.2          |
falcon   | 1.5  | 1         | 0     | 4.1          |
cardinal | 4.5  | 1         | 0     | 6.0          |

- A row can have many columns
- Total size should be kilobytes, not megabytes
- A configuration certificate determines what data is pulled into the table (and can change)
13
So how does it work?
- Each computer has:
  - Its own row
  - Replicas of some objects (configuration certificate, other rows, etc.)
- Periodically, but at a fixed rate, pick a friend "pseudo-randomly" and exchange states efficiently (bound the size of data exchanged)
- States converge exponentially rapidly. Loads are low and constant, and the protocol is robust against all sorts of disruptions!
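To make the exchange concrete, here is a minimal single-process sketch of such a gossip loop. The names (`AstrolabeNode`, `MAX_ROWS_PER_EXCHANGE`) and the two-column row are invented for illustration; this is not the real Astrolabe code or API.

```python
# Toy sketch of an Astrolabe-style gossip loop: each node owns one row, and at a
# fixed rate picks a peer pseudo-randomly and exchanges a bounded amount of
# state. Rows carry a Time field; the newer copy of each row wins on merge.
import random

MAX_ROWS_PER_EXCHANGE = 100   # bound the size of each exchange (assumption)

class AstrolabeNode:
    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.table = {name: {"Time": 0, "Load": 0.0}}   # this node's own row

    def update_own_row(self, load):
        self.clock += 1
        self.table[self.name] = {"Time": self.clock, "Load": load}

    def merge(self, remote_rows):
        # Keep whichever copy of each row has the larger Time stamp.
        for name, row in remote_rows.items():
            mine = self.table.get(name)
            if mine is None or row["Time"] > mine["Time"]:
                self.table[name] = dict(row)

    def gossip_once(self, peers):
        if not peers:
            return
        peer = random.choice(peers)   # pick a friend pseudo-randomly
        sample = dict(list(self.table.items())[:MAX_ROWS_PER_EXCHANGE])
        peer.merge(sample)            # push our (bounded) state...
        self.merge(dict(list(peer.table.items())[:MAX_ROWS_PER_EXCHANGE]))  # ...and pull theirs

# Example: three nodes converge after a few rounds of gossip.
# (A real node would sleep roughly five seconds between rounds.)
if __name__ == "__main__":
    nodes = [AstrolabeNode(n) for n in ("swift", "falcon", "cardinal")]
    for node, load in zip(nodes, (2.0, 1.5, 4.5)):
        node.update_own_row(load)
    for _ in range(10):
        for node in nodes:
            node.gossip_once([p for p in nodes if p is not node])
    print(nodes[0].table)   # swift now holds fresh rows for falcon and cardinal too
```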
14
State Merge: Core of the Astrolabe epidemic

swift.cs.cornell.edu's copy:
Name     | Time | Load | Weblogic? | SMTP? | Word Version
swift    | 2003 | .67  | 0         | 1     | 6.2
falcon   | 1976 | 2.7  | 1         | 0     | 4.1
cardinal | 2201 | 3.5  | 1         | 1     | 6.0

cardinal.cs.cornell.edu's copy:
Name     | Time | Load | Weblogic? | SMTP? | Word Version
swift    | 2011 | 2.0  | 0         | 1     | 6.2
falcon   | 1971 | 1.5  | 1         | 0     | 4.1
cardinal | 2004 | 4.5  | 1         | 0     | 6.0
15
State Merge: Core of the Astrolabe epidemic (the exchange)
The two nodes gossip, exchanging their freshest entries:
- swift: Time 2011, Load 2.0 (from cardinal.cs.cornell.edu's copy)
- cardinal: Time 2201, Load 3.5 (from swift.cs.cornell.edu's copy)
16
State Merge: Core of the Astrolabe epidemic (after the merge)
For each row, both copies now keep whichever entry carried the newer Time stamp:
- swift's row is now (Time 2011, Load 2.0) in both copies
- cardinal's row is now (Time 2201, Load 3.5) in both copies
- falcon's entries were not part of this exchange and are unchanged
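The merge rule shown on these three slides (keep whichever copy of each row carries the newer Time stamp) can be reproduced as a tiny worked example, using the Time and Load values from the tables above. This is an illustrative sketch, not Astrolabe's implementation.

```python
# The state-merge rule from the slides: for each name, keep the row whose Time
# stamp is larger. The values below are the ones shown in the tables above.
def merge_tables(a, b):
    merged = dict(a)
    for name, row in b.items():
        if name not in merged or row["Time"] > merged[name]["Time"]:
            merged[name] = row
    return merged

swift_copy = {
    "swift":    {"Time": 2003, "Load": 0.67},
    "falcon":   {"Time": 1976, "Load": 2.7},
    "cardinal": {"Time": 2201, "Load": 3.5},
}
cardinal_copy = {
    "swift":    {"Time": 2011, "Load": 2.0},
    "falcon":   {"Time": 1971, "Load": 1.5},
    "cardinal": {"Time": 2004, "Load": 4.5},
}

after = merge_tables(swift_copy, cardinal_copy)
# after["swift"]    -> Time 2011, Load 2.0  (cardinal's copy was fresher)
# after["cardinal"] -> Time 2201, Load 3.5  (swift's copy was fresher)
# after["falcon"]   -> Time 1976, Load 2.7  (swift keeps its own newer entry)
print(after)
```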
17
Observations
- The merge protocol has constant cost
  - One message sent and received (on average) per unit time
  - The data changes slowly, so there is no need to run it quickly – we usually run it every five seconds or so
- Information spreads in O(log N) time
  - But this assumes bounded region size
  - In Astrolabe, we limit regions to 50-100 rows
18
Scaling up… and up…
- With a stack of domains, we don't want every system to "see" every domain
  - The cost would be huge
- So instead, we'll see a summary
(Diagram: many leaf domains, each holding a table like the one above.)
19
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers

San Francisco leaf domain:
Name     | Load | Weblogic? | SMTP? | Word Version | …
swift    | 2.0  | 0         | 1     | 6.2          |
falcon   | 1.5  | 1         | 0     | 4.1          |
cardinal | 4.5  | 1         | 0     | 6.0          |

New Jersey leaf domain:
Name    | Load | Weblogic? | SMTP? | Word Version | …
gazelle | 1.7  | 0         | 0     | 4.5          |
zebra   | 3.2  | 0         | 1     | 6.2          |
gnu     | .5   | 1         | 0     | 6.2          |

Inner domain (one row per child domain):
Name  | Avg Load | WL contact   | SMTP contact
SF    | 2.6      | 123.45.61.3  | 123.45.61.17
NJ    | 1.8      | 127.16.77.6  | 127.16.77.11
Paris | 3.1      | 14.66.71.8   | 14.66.71.12

- An SQL query "summarizes" the data in each child domain
- The dynamically changing query output is visible system-wide
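Astrolabe expresses the summary as an SQL aggregation query; the sketch below is only a hypothetical Python stand-in showing the kind of per-domain aggregate an inner row might hold. The host-to-address mapping and the exact column set are assumptions made for the example.

```python
# Hypothetical stand-in for Astrolabe's SQL summarization: each inner-domain
# row is an aggregate computed over one child domain's leaf table, e.g. the
# average Load plus a representative contact address for each service.
def summarize_domain(domain_name, leaf_rows, addresses):
    """leaf_rows: {host: {"Load": float, "Weblogic?": int, "SMTP?": int}}
       addresses: {host: ip}  (illustrative mapping, not from the slides)"""
    avg_load = sum(r["Load"] for r in leaf_rows.values()) / len(leaf_rows)
    wl_hosts = [h for h, r in leaf_rows.items() if r["Weblogic?"] == 1]
    smtp_hosts = [h for h, r in leaf_rows.items() if r["SMTP?"] == 1]
    return {
        "Name": domain_name,
        "Avg Load": round(avg_load, 1),
        "WL contact": addresses[wl_hosts[0]] if wl_hosts else None,
        "SMTP contact": addresses[smtp_hosts[0]] if smtp_hosts else None,
    }

san_francisco = {
    "swift":    {"Load": 2.0, "Weblogic?": 0, "SMTP?": 1},
    "falcon":   {"Load": 1.5, "Weblogic?": 1, "SMTP?": 0},
    "cardinal": {"Load": 4.5, "Weblogic?": 1, "SMTP?": 0},
}
addresses = {"swift": "123.45.61.17", "falcon": "123.45.61.3", "cardinal": "123.45.61.5"}

print(summarize_domain("SF", san_francisco, addresses))
# -> {'Name': 'SF', 'Avg Load': 2.7, 'WL contact': '123.45.61.3', 'SMTP contact': '123.45.61.17'}
# (The diagram above shows 2.6 for SF; the loads are moving targets, so snapshots differ.)
```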
20
(1) The query goes out… (2) each domain computes locally… (3) the results flow to the top level of the hierarchy.
(Same hierarchy diagram as the previous slide, with the three steps numbered.)
21
Hierarchy is virtual… data is replicated
(Same hierarchy diagram: the inner tables are not held on any server; the participants replicate them.)
23
The key to self-* properties!
A flexible, reprogrammable mechanism:
- Which clustered services are experiencing timeouts, and what were they waiting for when they happened?
- Find 12 idle machines with the NMR-3D package that can download a 20MB dataset rapidly
- Which machines have inventory for warehouse 9?
- Where's the cheapest gasoline in the area?
Think of aggregation functions as small agents that look for information.
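As an illustration, the "find 12 idle machines with the NMR-3D package" request could be phrased as an aggregation function along the lines of the hypothetical sketch below. In Astrolabe this would be an SQL aggregate installed through a configuration certificate; the column names used here are invented.

```python
# Illustrative "agent": an aggregation function that, for one domain's table,
# reports up to 12 lightly loaded machines that have the NMR-3D package and a
# fast enough link to pull a 20MB dataset quickly.
def find_idle_nmr3d_hosts(rows, max_hosts=12, load_limit=1.0, min_mbps=100):
    """rows: {host: {"Load": float, "NMR-3D?": int, "LinkMbps": float}}
       The column names are hypothetical; an operator would define them in the
       configuration certificate."""
    candidates = [
        host for host, r in rows.items()
        if r["NMR-3D?"] == 1 and r["Load"] < load_limit and r["LinkMbps"] >= min_mbps
    ]
    # Prefer the least loaded machines; this becomes the domain's contribution,
    # and parent domains would concatenate and trim the child results.
    candidates.sort(key=lambda h: rows[h]["Load"])
    return candidates[:max_hosts]

domain = {
    "swift":  {"Load": 0.3, "NMR-3D?": 1, "LinkMbps": 1000},
    "falcon": {"Load": 1.5, "NMR-3D?": 1, "LinkMbps": 1000},
    "gnu":    {"Load": 0.5, "NMR-3D?": 0, "LinkMbps": 100},
}
print(find_idle_nmr3d_hosts(domain))   # -> ['swift']
```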
24
What about security?
- Astrolabe requires:
  - Read permissions to see the database
  - Write permissions to contribute data
  - Administrative permission to change aggregation or configuration certificates
- Users decide what data Astrolabe can see
- A VPN setup can be used to hide Astrolabe's internal messages from intruders
- New: Byzantine Agreement based on threshold cryptography is used to secure aggregation functions
25
Data Mining
- Quite a hot area, usually done by collecting information to a centralized node, then "querying" within that node
- Astrolabe does the comparable thing, but its query evaluation occurs in a decentralized manner
  - This is incredibly parallel, hence faster
  - And more robust against disruption too!
26
Cool Astrolabe Properties
- Parallel: everyone does a tiny bit of work, so we accomplish huge tasks in seconds
- Flexible: decentralized query evaluation, in seconds
- One aggregate can answer lots of questions. E.g. "where's the nearest supply shed?" – the hierarchy encodes many answers in one tree!
27
Aggregation and Hierarchy
- Nearby information:
  - Maintained in more detail; can be queried directly
  - Changes are seen sooner
- Remote information is summarized:
  - High-quality aggregated data
  - This also changes as the underlying information evolves
28
Astrolabe summary
- Scalable: could support millions of machines
- Flexible: can easily extend the domain hierarchy, define new columns or eliminate old ones; adapts as conditions evolve
- Secure: uses keys for authentication and can even encrypt
- Handles firewalls gracefully, including issues of IP address re-use behind firewalls
- Performs well: updates propagate in seconds
- Cheap to run: tiny load, small memory impact
29
Bimodal Multicast
- A quick glimpse of scalable multicast
- Think about really large Internet configurations:
  - A data center as the data source
  - A typical "publication" might be going to thousands of client systems
30
Swiss Stock Exchange
Problem: virtually synchronous multicast is "fragile".
(Diagram: most members are healthy… but one is slow.)
31
Performance degrades as the system scales up
(Graph: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members versus perturb rate, for group sizes 32, 64, and 96.)
32
Why doesn't multicast scale?
- With weak semantics…
  - Faulty behavior may occur more often as system size increases (think "the Internet")
- With stronger reliability semantics…
  - We encounter a system-wide cost (e.g. membership reconfiguration, congestion control)
  - That can be triggered more often as a function of scale (more failures, more network "events", or bigger latencies)
- A similar observation led Jim Gray to speculate that parallel databases scale as O(n²)
33
But none of this is inevitable
- Recent work on probabilistic solutions suggests that a gossip-based repair strategy scales quite well
- It also gives very steady throughput
- And it can take advantage of hardware support for multicast, if available
34
Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
35
Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.
36
The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
37
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
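The three steps above (digest, solicit, retransmit) can be sketched in a few lines. The class and method names below are invented for illustration; this is not the pbcast implementation, which runs over UDP/IP multicast and also bounds buffering and retransmission load.

```python
# Toy sketch of one round of Bimodal Multicast's anti-entropy gossip:
# a process sends a digest of the message IDs it holds, the recipient
# solicits anything it is missing, and the sender retransmits those messages.
class PbcastProcess:
    def __init__(self, name):
        self.name = name
        self.messages = {}           # msg_id -> payload (delivered/buffered messages)

    def receive_multicast(self, msg_id, payload):
        self.messages[msg_id] = payload      # unreliable IP multicast delivery

    def make_digest(self):
        return set(self.messages)            # IDs only, not the payloads themselves

    def solicit(self, digest):
        return digest - set(self.messages)   # IDs we are missing

    def retransmit(self, wanted):
        return {m: self.messages[m] for m in wanted if m in self.messages}

def gossip_round(sender, receiver):
    digest = sender.make_digest()            # sent periodically, e.g. every 100 ms
    wanted = receiver.solicit(digest)
    for msg_id, payload in sender.retransmit(wanted).items():
        receiver.receive_multicast(msg_id, payload)

# Example: p lost message 2 during the initial unreliable multicast.
p, q = PbcastProcess("p"), PbcastProcess("q")
for m in (1, 2, 3):
    q.receive_multicast(m, f"payload-{m}")
for m in (1, 3):
    p.receive_multicast(m, f"payload-{m}")
gossip_round(sender=q, receiver=p)
assert 2 in p.messages                       # the gap has been repaired
```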
38
This solves our problem! Bimodal Multicast rides out disturbances.
(Graphs: low-bandwidth and high-bandwidth comparisons of pbcast and traditional multicast throughput, measured at perturbed and unperturbed hosts, as the perturb rate increases from 0.1 to 0.9.)
39
Bimodal Multicast Summary
- An extremely scalable technology
- Remains steady and reliable:
  - Even with high rates of message loss (in our tests, as high as 20%)
  - Even with large numbers of perturbed processes (we tested with up to 25%)
  - Even with router failures
  - Even when IP multicast fails
- And we've secured it using digital signatures
40
Kelips
- Third in our set of tools
- A P2P "index":
  - Put("name", value)
  - Get("name")
- Kelips can do lookups with one RPC and is self-stabilizing after disruption
- Unlike Astrolabe, nodes can put varying amounts of data out there
41
Kelips
Take a collection of "nodes" (diagram: nodes 30, 110, 202, 230).
42
Kelips
Map nodes to affinity groups.
(Diagram: affinity groups 0, 1, 2, …, √N - 1; peer membership is determined by a consistent hash, giving about √N members per affinity group.)
43
Kelips
Affinity group pointers: node 110 knows about the other members of its own affinity group (230, 30, …), kept in an affinity group view:

id  | hbeat | rtt
30  | 234   | 90ms
230 | 322   | 30ms
44
Kelips
Contact pointers: in addition to its affinity group view, node 110 keeps a small set of contacts in each of the other affinity groups:

group | contactNode
…     | …
2     | 202

Here 202 is a "contact" for 110 in group 2.
45
Kelips
A gossip protocol replicates data cheaply. Node 110 also stores resource tuples:

resource | info
…        | …
dot.com  | 110

"dot.com" maps to group 2, so 110 tells group 2 to "route" inquiries about dot.com to it.
46
Kelips
To look up "dot.com", just ask some contact in group 2. It returns "110" (or forwards your request).
47
Kelips summary
- Split the system into √N subgroups
- Map (key, value) pairs to some subgroup by hashing the key
- Replicate within that subgroup
- Each node tracks:
  - Its own group membership
  - k members of each of the other groups
- To look up a key, hash it and ask one or more of your contacts in its group whether they know the value
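A toy, single-process sketch of these structures and of the one-hop lookup is given below. The class names and the central wiring helper are invented for illustration; in the real system, views, contacts, and replicas are maintained by background gossip rather than built in one place.

```python
# Toy Kelips: hash nodes and keys into sqrt(N) affinity groups, replicate each
# (key, value) tuple within the key's group, and resolve a lookup by asking one
# contact in that group -- a single hop in the common case.
import hashlib
import math

def bucket(s, num_groups):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % num_groups

class KelipsNode:
    def __init__(self, name, num_groups):
        self.name = name
        self.group = bucket(name, num_groups)
        self.group_view = []   # members of my own affinity group (names)
        self.contacts = {}     # group id -> a KelipsNode in that group
        self.tuples = {}       # resource -> info (replicated within my group)

class KelipsSystem:
    """Single-process stand-in for the gossip layer that maintains views,
    contacts, and replicas; the names here are invented for the sketch."""
    def __init__(self, names):
        self.num_groups = max(1, int(math.sqrt(len(names))))
        self.nodes = {n: KelipsNode(n, self.num_groups) for n in names}
        groups = {}
        for node in self.nodes.values():
            groups.setdefault(node.group, []).append(node)
        self.groups = groups
        for node in self.nodes.values():
            node.group_view = [m.name for m in groups[node.group]]
            node.contacts = {g: members[0] for g, members in groups.items()}

    def home_group(self, key):
        g = bucket(key, self.num_groups)
        while g not in self.groups:          # skip any empty group (toy detail)
            g = (g + 1) % self.num_groups
        return g

    def insert(self, key, value):
        for member in self.groups[self.home_group(key)]:
            member.tuples[key] = value       # stand-in for gossip replication

    def lookup(self, asker, key):
        g = self.home_group(key)
        node = self.nodes[asker]
        target = node if node.group == g else node.contacts[g]
        return target.tuples.get(key)        # one hop: ask a contact in group g

system = KelipsSystem(["30", "110", "230", "202"])
system.insert("dot.com", "110")              # 110 publishes its resource tuple
print(system.lookup("30", "dot.com"))        # -> "110"
```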
48
Kelips summary
- O(√N) storage overhead, which is higher than for other DHTs
  - The same space overhead covers the member list, the contact list, and the replicated data itself
- A heuristic is used to keep contacts fresh and to avoid contacts that seem to churn
- This buys us O(1) lookup cost
- And the background overhead is constant
49
Virtual Synchrony
- The last piece of the puzzle
- The outcome of a decade of DARPA-funded work; the technology core of:
  - The AEGIS "integrated" console
  - The New York and Swiss Stock Exchanges
  - The French Air Traffic Control System
  - The Florida Electric Power and Light System
50
Virtual Synchrony Model
(Diagram of the virtual synchrony execution model.)
51
Roles in QuickSilver?
Virtual synchrony provides a way for groups of components to:
- Replicate data and synchronize
- Perform tasks in parallel (like parallel database lookups, for improved speed)
- Detect failures and reconfigure to compensate by regenerating lost functionality
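A minimal sketch of what this buys an application, simulated in a single process with invented names: every member of the current view applies the same updates in the same order, and when a member fails the group installs a new view and regenerates the lost replica by state transfer. This illustrates the model only; it is not the group communication stack itself.

```python
# Toy illustration of virtually synchronous group replication: members of the
# current view all apply the same ordered updates; on failure, a new view is
# installed and a fresh replica is regenerated by state transfer.
class Member:
    def __init__(self, name):
        self.name = name
        self.state = {}                 # replicated key/value state

    def deliver(self, update):
        key, value = update
        self.state[key] = value

class Group:
    def __init__(self, members):
        self.view_id = 1
        self.view = list(members)       # the agreed membership for this view

    def multicast(self, update):
        # Totally ordered delivery within the view: every member sees the same
        # updates in the same sequence (this single-threaded loop stands in for
        # the real ordering protocol).
        for member in self.view:
            member.deliver(update)

    def handle_failure(self, failed, replacement=None):
        survivors = [m for m in self.view if m is not failed]
        if replacement is not None and survivors:
            replacement.state = dict(survivors[0].state)   # state transfer
            survivors.append(replacement)                  # regenerate the replica
        self.view_id += 1                                  # install the new view
        self.view = survivors

group = Group([Member("a"), Member("b"), Member("c")])
group.multicast(("config", "v1"))
group.handle_failure(group.view[2], replacement=Member("d"))   # c fails, d joins
group.multicast(("config", "v2"))
assert all(m.state == {"config": "v2"} for m in group.view)
```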
52
Replication: Key to understanding QuickSilver
(Diagram: Astrolabe, Bimodal Multicast, Kelips, and Virtual Synchrony all build on replication; a gossip protocol tracks membership, a consistent hash maps each member to an affinity group of about √N members, and gossip replicates data cheaply.)
53
Metrics
We plan to look at several:
- Robustness to externally imposed stress and overload: we expect to demonstrate significant improvements
- Scalability: graph performance/overheads as a function of scale, load, etc.
- End-user power: implement JBI, sensor networks, and a data-center management platform
- Total cost: with Raytheon, explore the impact on real military applications
Under DURIP funding we have acquired a clustered evaluation platform.
54
Our plan
- Integrate these core components
- Then:
  - Build a JBI layer over the system
  - Integrate Johannes Gehrke's data mining technology into the platform
  - Support scalable overlay multicast (Francis)
- Raytheon: teaming with us to tackle military applications, notably for the Navy
55
Approved for Public Release, Distribution Unlimited July 21, 200455 More information? www.cs.cornell.edu/Info/Projects/QuickSilver