CS5412: Bimodal Multicast Astrolabe

Presentation transcript:

CS5412: Bimodal Multicast Astrolabe. Lecture XIX, Ken Birman, CS5412 Spring 2016.

Gossip 201
Recall from early in the semester that gossip spreads in log(system size) time. But is this actually "fast"?
[Plot: fraction of nodes infected, from 0.0 to 1.0, versus time.]

Gossip in distributed systems
Log(N) can be a very big number! With N = 100,000, log(N) is about 12, so with one gossip round every five seconds, information needs a minute to spread through a large system.
Some gossip protocols therefore combine pure gossip with an "accelerator": a good way to get the word out quickly.

Bimodal Multicast
To send a message, this protocol uses IP multicast. We just transmit it without delay and don't expect any form of response: not reliable, no acks, no flow control (this can be an issue).
In data centers that lack IP multicast, we can simulate it by sending UDP packets 1:1, without acks.
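As a rough illustration of this "send and forget" step, here is a minimal Python sketch that transmits one message on an IP multicast group over UDP. The group address, port, and payload are made-up values for the example, and there is no ack, retry, or flow control, exactly as the slide describes; it is a sketch, not the protocol's actual transport code.

```python
import socket

GROUP, PORT = "239.1.2.3", 5412   # hypothetical multicast group and port

def multicast_send(payload: bytes) -> None:
    """Unreliable 'optimistic' send: one UDP datagram to the group, no acks."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)  # stay inside the data center
    sock.sendto(payload, (GROUP, PORT))
    sock.close()

multicast_send(b"message 17 from sender X")
```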

What's the cost of an IP multicast?
In principle, each Bimodal Multicast packet traverses the relevant data center links and routers just once per message, so this is extremely cheap... but how do we deal with systems that didn't receive the multicast?

Making Bimodal Multicast reliable
We can use gossip! Every node tracks the membership of the target group (using gossip, just like with Kelips, the DHT we studied early in the semester).
Bootstrap by learning "some node addresses" from some kind of server or web page; after that, the exchange of gossip is used to improve accuracy.

Making Bimodal Multicast reliable
Now, layer in a gossip mechanism that gossips about the multicasts each node knows about. Rather than sending the multicasts themselves, the gossip messages just carry "digests", which are lists. Node A might send node B:
"I have messages 1-18 from sender X"
"I have message 11 from sender Y"
"I have messages 14, 16 and 22-71 from sender Z"
These are compactly represented. This is a form of "push" gossip.
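A digest like the one above can be encoded compactly as per-sender lists of sequence-number ranges. The sketch below is an assumption about one possible encoding; the slide does not give a concrete format.

```python
from typing import Dict, List, Tuple

# Digest: for each sender, the ranges of sequence numbers we hold.
Digest = Dict[str, List[Tuple[int, int]]]   # sender -> [(lo, hi), ...], inclusive

my_digest: Digest = {
    "X": [(1, 18)],
    "Y": [(11, 11)],
    "Z": [(14, 14), (16, 16), (22, 71)],
}

def holds(digest: Digest, sender: str, seq: int) -> bool:
    """True if the digest says we already have message `seq` from `sender`."""
    return any(lo <= seq <= hi for (lo, hi) in digest.get(sender, []))
```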

Making Bimodal Multicast reliable
On receiving such a gossip message, the recipient checks which messages it has that the gossip sender lacks, and vice versa. Then it responds:
"I have copies of messages M, M' and M'' that you seem to lack"
"I would like a copy of messages N, N' and N'', please"
An exchange of the actual messages follows.
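Continuing the sketch above (same hypothetical Digest type), comparing two digests tells each side what to offer and what to request; the actual retransmissions then follow as ordinary messages.

```python
def missing_from(mine: Digest, theirs: Digest) -> Digest:
    """Sequence numbers present in `mine` but absent from `theirs`."""
    out: Digest = {}
    for sender, ranges in mine.items():
        gaps = [seq for (lo, hi) in ranges
                for seq in range(lo, hi + 1) if not holds(theirs, sender, seq)]
        if gaps:
            out[sender] = [(s, s) for s in gaps]   # singleton ranges, left unmerged for simplicity
    return out

# On receiving peer_digest in a gossip message:
# i_can_offer     = missing_from(my_digest, peer_digest)
# i_should_request = missing_from(peer_digest, my_digest)
```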

Optimizations
Bimodal Multicast resends using IP multicast if there is "evidence" that a few nodes may be missing the same thing, e.g. if two nodes ask for the same retransmission, or if a retransmission shows up from a very remote node (IP multicast doesn't always work in WANs).
It also prioritizes recent messages over old ones.
Reliability has a "bimodal" probability curve: either almost nobody gets a message or nearly everyone does.
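The "evidence" test can be as simple as counting duplicate retransmission requests for the same message within a short window. The threshold and window below are illustrative guesses, not values from the protocol.

```python
import time
from collections import defaultdict

RESEND_THRESHOLD = 2      # assumed: two requests for the same message
WINDOW_SECONDS = 10.0     # assumed: within a 10-second window

recent_requests = defaultdict(list)   # (sender, seq) -> [request timestamps]

def should_ip_multicast_resend(sender: str, seq: int) -> bool:
    """Re-multicast if several nodes appear to be missing the same message."""
    now = time.time()
    reqs = [t for t in recent_requests[(sender, seq)] if now - t < WINDOW_SECONDS]
    reqs.append(now)
    recent_requests[(sender, seq)] = reqs
    return len(reqs) >= RESEND_THRESHOLD
```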

lpbcast variation
In this variation on Bimodal Multicast, instead of gossiping with every node in the system, the protocol maintains a "peer overlay": each member only gossips with a small set of peers picked to be reachable with low round-trip times, plus a second small set of remote peers picked to ensure that the graph is very highly connected and has a small diameter. Jon Kleinberg calls this a "small worlds" structure.
lpbcast is often faster, yet equally reliable!
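A hedged sketch of how such a peer view might be chosen: mostly nearby peers (low measured round-trip time) plus a few random distant peers to keep the gossip graph well connected. The view sizes and the rtt_ms map are assumptions for illustration, not parameters taken from lpbcast.

```python
import random
from typing import Dict, List

def build_peer_view(rtt_ms: Dict[str, float], near_count: int = 8,
                    far_count: int = 3) -> List[str]:
    """Pick a small-worlds style gossip view: near peers plus a few random far peers."""
    ranked = sorted(rtt_ms, key=rtt_ms.get)          # closest peers first
    near = ranked[:near_count]
    far_candidates = ranked[near_count:]
    far = random.sample(far_candidates, min(far_count, len(far_candidates)))
    return near + far
```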

Speculation... about speed
When we combine IP multicast with gossip, we try to match the tool we're using with the need: get the messages through fast, but if loss occurs, have a very predictable recovery cost. Gossip has a totally predictable worst-case load, which is appealing at large scale. How can we generalize this concept?

A thought question
What's the best way to count the number of nodes in a system? Or to compute the average load, or find the most loaded or least loaded nodes?
Options to consider: a pure gossip solution, or constructing an overlay tree (via "flooding", like in our consistent snapshot algorithm) and then counting the nodes in the tree, or pulling the answer from the leaves to the root…

… and the answer is
Gossip isn't very good for some of these tasks! There are gossip solutions for counting nodes, but they give approximate answers and run slowly. It is tricky to compute something like an average because of the "re-counting" effect (best algorithm: Kempe et al.). On the other hand, gossip works well for finding the c most loaded or least loaded nodes (for constant c). Gossip solutions usually run in time O(log N) and generally give probabilistic answers.
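For the average, the slide cites Kempe et al.; their push-sum idea keeps a (sum, weight) pair per node and gossips halves of it, with sum/weight converging to the true average at every node. Below is a minimal simulation sketch (synchronous rounds, random targets) meant to illustrate the idea, not to reproduce the published algorithm exactly.

```python
import random

def push_sum_average(values, rounds=30):
    """Push-sum style gossip: each node halves (sum, weight) and pushes half to a random peer."""
    n = len(values)
    s = list(values)          # running sums, initialised to the local values
    w = [1.0] * n             # running weights
    for _ in range(rounds):
        inbox = [(0.0, 0.0)] * n
        for i in range(n):
            target = random.randrange(n)
            half_s, half_w = s[i] / 2, w[i] / 2
            # keep one half locally, push the other half to the chosen peer
            inbox[i] = (inbox[i][0] + half_s, inbox[i][1] + half_w)
            inbox[target] = (inbox[target][0] + half_s, inbox[target][1] + half_w)
        s = [pair[0] for pair in inbox]
        w = [pair[1] for pair in inbox]
    return [si / wi for si, wi in zip(s, w)]   # every node's estimate of the average

print(push_sum_average([2.0, 0.67, 4.5, 1.5]))
```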

Yet with flooding… easy!
Recall how flooding works. Basically, we construct a tree by pushing data towards the leaves and linking a node to its parent when that node first learns of the flood. We can do this with a fixed topology or in a gossip style by picking random next hops.
[Diagram: a flood spreading outward from a root; each node is labeled with its distance from the root (1, 2, 3).]
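A sketch of the flood: starting from a root, each node adopts as its parent the node from which it first hears the flood, yielding a BFS-style spanning tree. For clarity this version assumes a known neighbor graph and builds the tree centrally; a real deployment would do the same thing with messages.

```python
from collections import deque

def flood_spanning_tree(neighbors, root):
    """Return parent pointers: each node's parent is whoever reached it first."""
    parent = {root: None}
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for nbr in neighbors[node]:
            if nbr not in parent:          # first time this node hears the flood
                parent[nbr] = node
                frontier.append(nbr)
    return parent

# Example topology (assumed): a small graph given as adjacency lists.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
print(flood_spanning_tree(graph, "A"))     # {'A': None, 'B': 'A', 'C': 'A', 'D': 'B'}
```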

This is a "spanning tree"
Once we have a spanning tree:
To count the nodes, just have the leaves report 1 to their parents, and inner nodes sum the values from their children.
To compute an average, have the leaves report their values; the parents compute the sum, then divide by the count of nodes.
To find the least or most loaded node, inner nodes compute a min or max…
The tree should have roughly log(N) depth, but once we build it, we can reuse it for a while.
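Once the parent pointers exist, the aggregates the slide lists (node count, average load, least-loaded value) can be pulled from the leaves to the root. A small sketch, reusing the parent map from the flood above and an assumed per-node load value:

```python
def aggregate_up(parent, load):
    """Compute count, average load and minimum load bottom-up over the spanning tree."""
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)

    def visit(node):
        count, total, lo = 1, load[node], load[node]
        for child in children.get(node, []):
            c, t, l = visit(child)
            count, total, lo = count + c, total + t, min(lo, l)
        return count, total, lo

    root = next(n for n, p in parent.items() if p is None)
    count, total, lo = visit(root)
    return count, total / count, lo     # node count, average load, least-loaded value

parent = {"A": None, "B": "A", "C": "A", "D": "B"}
print(aggregate_up(parent, {"A": 2.0, "B": 0.67, "C": 4.5, "D": 1.5}))
```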

Not all logs are identical!
When we say that a gossip protocol needs time log(N) to run, we mean log(N) rounds, and a gossip protocol usually sends one message every five seconds or so; hence with 100,000 nodes, about 60 seconds.
But our spanning tree is constructed using a flooding algorithm that runs in a hurry: log(N) depth, but each "hop" takes perhaps a millisecond. So with 100,000 nodes we have our tree in 12 ms and answers in 24 ms!

Insight?
Gossip has time complexity O(log N), but the "constant" can be rather big (5000 times larger in our example). The spanning tree has the same time complexity but a tiny constant in front.
However, the network load for the spanning tree is much higher: in the last step we may reach roughly half the nodes in the system, so 50,000 messages are sent all at the same time!

Gossip vs. "urgent"?
With gossip, we have a slow but steady story: we know the speed and the cost, and both are low; a constant, low-key, background cost. Gossip is also very robust.
Urgent protocols (like our flooding protocol, or 2PC, or reliable virtually synchronous multicast) are way faster, but they produce load spikes and may be fragile, prone to broadcast storms, etc.

Introducing hierarchy
One issue with gossip is that the messages fill up: with constant-sized messages and a constant rate of communication, we'll inevitably reach the limit! Can we introduce hierarchy into gossip systems?

Astrolabe
Intended as help for applications adrift in a sea of information. Structure emerges from a randomized gossip protocol. This approach is robust and scalable even under stress that cripples traditional systems.
Developed at RNS, Cornell, by Robbert van Renesse, with many others helping… The technology was adopted at Amazon.com (but they built their own solutions rather than using it in this form).

Astrolabe is a flexible monitoring overlay
Periodically, pull data from monitored systems.
[Figure: swift.cs.cornell.edu and cardinal.cs.cornell.edu each hold a local copy of the region's table, with columns Name, Time, Load, Weblogic?, SMTP?, Word Version and one row per host (swift, falcon, cardinal). Each machine refreshes its own row as it samples itself, so the copies drift apart between gossip rounds.]

Astrolabe in a single domain
Each node owns a single tuple, like a management information base (MIB). Nodes discover one another through a simple broadcast scheme ("anyone out there?") and gossip about membership. Nodes also keep replicas of one another's rows. Periodically, each node picks a peer uniformly at random and merges its state with that peer's…

State Merge: Core of Astrolabe epidemic
[Figure, step 1: swift.cs.cornell.edu holds a fresh swift row (Time 2011, Load 2.0) but a stale cardinal row (Time 2004, Load 4.5); cardinal.cs.cornell.edu holds a fresh cardinal row (Time 2201, Load 3.5) but a stale swift row (Time 2003, Load .67).]

State Merge: Core of Astrolabe epidemic
[Figure, step 2: the two nodes gossip; swift's fresh row (Time 2011, Load 2.0) and cardinal's fresh row (Time 2201, Load 3.5) are exchanged.]

State Merge: Core of Astrolabe epidemic
[Figure, step 3: after the merge, both swift.cs.cornell.edu and cardinal.cs.cornell.edu hold the fresher version of each row: swift at Time 2011 with Load 2.0, and cardinal at Time 2201 with Load 3.5.]
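The merge itself is simple: each row carries a timestamp, and when two nodes exchange their local copies of the region's table, each keeps, per row, whichever version is fresher. A minimal sketch under that assumption (a table here is a dict keyed by host name, with a "Time" field as in the slides):

```python
def merge_tables(mine: dict, theirs: dict) -> dict:
    """Epidemic state merge: per row, keep whichever copy has the larger timestamp."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["Time"] > merged[name]["Time"]:
            merged[name] = row
    return merged

swift_view = {
    "swift":    {"Time": 2011, "Load": 2.0},
    "cardinal": {"Time": 2004, "Load": 4.5},
}
cardinal_view = {
    "swift":    {"Time": 2003, "Load": 0.67},
    "cardinal": {"Time": 2201, "Load": 3.5},
}
# After the exchange, both sides hold swift@2011 and cardinal@2201, as in the slides.
print(merge_tables(swift_view, cardinal_view))
```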

Observations
The merge protocol has constant cost: one message sent and one received (on average) per unit time. The data changes slowly, so there is no need to run it quickly; we usually run it every five seconds or so.
Information spreads in O(log N) time, but this assumes a bounded region size: in Astrolabe, we limit regions to 50-100 rows.

Big systems…
A big system could have many regions; it looks like a pile of spreadsheets. A node only replicates data from its neighbors within its own region.

Scaling up… and up…
With a stack of domains, we don't want every system to "see" every domain: the cost would be huge. So instead, we'll see a summary.
[Figure: many copies of the same per-region spreadsheet stacked up, suggesting a large pile of domains.]

Astrolabe builds a hierarchy using a P2P protocol that "assembles the puzzle" without any servers
Dynamically changing query output is visible system-wide. An SQL query "summarizes" the data.
[Figure: two leaf regions, San Francisco and New Jersey, one with hosts swift, falcon, cardinal and the other with gazelle, zebra, gnu, each a table with columns Name, Load, Weblogic?, SMTP?, Word Version. Above them sits an inner table with one row per region (SF, NJ, Paris) and columns Name, Avg Load, WL contact, SMTP contact, holding the aggregates computed by the summary query.]

Large scale: "fake" regions
These are computed by queries that summarize a whole region as a single row, and are gossiped in a read-only manner within a leaf region.
But who runs the gossip? Each region elects "k" members to run gossip at the next level up. We can play with the selection criteria and with "k".
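Each summary row is just the result of an aggregation query over the region's leaf table. The slides show it as SQL; below is a hedged Python equivalent that computes the kind of summary row shown (average load plus a contact picked from the rows). Which hosts run SMTP, and the rule for picking the contact, are assumptions for the example.

```python
def summarize_region(rows):
    """Compute one 'fake region' row: average load and a representative contact."""
    loads = [r["Load"] for r in rows]
    # e.g. pick the least-loaded node that runs an SMTP service as the SMTP contact
    smtp_nodes = [r for r in rows if r.get("SMTP?")]
    smtp_contact = min(smtp_nodes, key=lambda r: r["Load"])["Name"] if smtp_nodes else None
    return {"Avg Load": sum(loads) / len(loads), "SMTP contact": smtp_contact}

sf_rows = [
    {"Name": "gazelle", "Load": 1.7, "SMTP?": False},
    {"Name": "zebra",   "Load": 3.2, "SMTP?": True},
    {"Name": "gnu",     "Load": 0.5, "SMTP?": False},
]
print(summarize_region(sf_rows))   # one row standing in for the whole region
```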

Hierarchy is virtual… data is replicated
The yellow leaf node "sees" its neighbors and the domains on the path to the root.
[Figure: a leaf node's view, showing its own leaf region table plus the inner table with one summary row per region (SF, NJ, Paris). Falcon runs the level-2 epidemic for its region because it has the lowest load there; gnu runs the level-2 epidemic for the other region for the same reason.]

Hierarchy is virtual… data is replicated
The green node sees a different leaf domain, but has a consistent view of the inner domain.
[Figure: the same hierarchy drawn from the perspective of a node in the other leaf region; the inner (SF, NJ, Paris) summary table is identical in both views.]

Worst case load?
A small number of nodes end up participating in O(log_fanout(N)) epidemics; here the fanout is something like 50. In each epidemic, a message is sent and received roughly every 5 seconds. We limit message size, so even during periods of turbulence no message can become huge.

Who uses Astrolabe?
Amazon doesn't use Astrolabe in this identical form, but they built gossip-based monitoring systems based on the same ideas, and deploy these in S3 and EC2, throughout their big data centers! For them, Astrolabe-like mechanisms track the overall state of the system to diagnose performance issues. They also automate the reaction to temporary overloads.

Example of overload handling
Some service S is getting slow… Astrolabe triggers a "system-wide warning". Everyone sees the picture ("Oops, S is getting overloaded and slow!"), so everyone tries to reduce their frequency of requests against service S.
What about overload in Astrolabe itself? Could everyone do a fair share of the inner aggregation?

Idea that one company had
Start with the Astrolabe approach, but instead of electing nodes to play the inner roles, just assign them roles, left to right. There are N-1 inner nodes, hence N-1 nodes play 2 aggregation roles and one lucky node has just one role.
What impact will this have on Astrolabe?

World's worst aggregation tree!
[Diagram: a binary aggregation tree over leaf nodes A through P, with the inner roles simply assigned left to right to nodes A through O. An event e occurs at H; G gossips with H and learns e, but P learns of it only O(N) time units later!]

What went wrong?
In this horrendous tree, each node has an equal amount of "work to do", but the information-space diameter is larger! Astrolabe benefits from "instant" knowledge because the epidemic at each level is run by someone elected from the level below.

Insight: Two kinds of shape
We've focused on the aggregation tree, but in fact we should also think about the information flow tree.

Information space perspective
Bad aggregation graph: diameter O(n); information flows along a long chain, H – G – E – F – B – A – C – D – L – K – I – J – N – M – O – P.
Astrolabe version: diameter O(log(n)); information flows through a balanced structure whose leaves are the pairs A – B, C – D, E – F, G – H, I – J, K – L, M – N, O – P.

Where will this approach go next?
Cornell research: the Shared State Table (SST). The idea is to use RDMA to implement a super-fast shared state table for use in cloud data centers. The tradeoff: a simplified model, but WAY faster.

SST details
Starts like Astrolabe, with a table for the region (one row per node, as in the earlier examples), but with no hierarchy: SST is like a "one level" Astrolabe. And the implementation doesn't use gossip….
To understand SST, you need to know about RDMA. This is a new kind of hardware that allows DMA transfers over the optical network from sender memory to receiver memory, at full network data rates, which can be 10 Gbps or even 100 Gbps.

RDMA Concept
RDMA is a new standard for reliable DMA over the network: bits are moved by the optical network directly from a memory region in the sender node to a memory region in the receiver node.
[Diagram: sender node and receiver node, each with a memory region; the network moves bits directly between them.]

RDMA acronyms: enliven a party…
RDMA: short for Remote Direct Memory Access. It offers two main modes:
Block transfers (can be very large): binary data from sender to destination; the sender specifies the source and the destination provides a receive buffer.
One-sided read: the receiver allows the sender to read from its memory without any software action on the receiver side. A one-sided write is the same idea, but for writes.
Three ways to obtain RDMA:
RDMA on InfiniBand network hardware: popular in HPC.
RDMA on fast Ethernet: also called RDMA/RoCE. Very new, and we need experience to know how well this will work; the issue centers on a new approach to flow control.
RDMA in software: SoftRoCE, a fake RDMA for running RDMA code on old machines lacking RDMA hardware. It tunnels over TCP.
IB Verbs: this is the name of the software library used to access RDMA.

SST uses RDMA to replicate rows
With SST, when a node updates its row, RDMA is used to copy the row to the other nodes. The compute model then offers predicates on the table, aggregation, and upcalls to event handlers, as many as 1M or 2M per node, per second. This is so much faster than Astrolabe that the comparison is kind of meaningless… maybe 100,000x faster…

SST: Like Astrolabe, but one region, fast updates
[Figure: the same swift.cs.cornell.edu and cardinal.cs.cornell.edu tables from the Astrolabe example, now kept in sync by RDMA row transfers rather than by gossip.]

SST Predicate with Upcall
Basically, each program registers code like this: if (some-predicate-becomes-true) { do this "upcall" logic }. SST evaluates predicates and issues upcalls in parallel, on all machines, achieving incredible speed. After machine A changes some value, a predicate on machine B might fire with 1 µs or less delay…
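A rough model of this programming pattern, not the real SST API (whose calls are not given here): each node registers (predicate, upcall) pairs over a shared table of rows, and whenever a row changes, the predicates are re-evaluated and the matching upcalls fire. In real SST the row update would arrive by RDMA; here it is just a local call.

```python
from typing import Callable, Dict, List, Tuple

class MiniSST:
    """Toy stand-in for an SST-style shared state table: rows plus predicate-triggered upcalls."""
    def __init__(self):
        self.rows: Dict[str, dict] = {}
        self.triggers: List[Tuple[Callable[[dict], bool], Callable[[dict], None]]] = []

    def register(self, predicate, upcall):
        self.triggers.append((predicate, upcall))

    def update_row(self, node: str, row: dict):
        # In real SST this write would arrive via a one-sided RDMA transfer.
        self.rows[node] = row
        for predicate, upcall in self.triggers:
            if predicate(self.rows):
                upcall(self.rows)

sst = MiniSST()
sst.register(lambda rows: any(r["Load"] > 4.0 for r in rows.values()),
             lambda rows: print("upcall: some node is overloaded"))
sst.update_row("cardinal", {"Load": 4.5})
```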

Use case comparison
Astrolabe: kind of slow. Used at large scale to monitor big data centers, for "data mining". Uses a hierarchical structure to scale, and gossip as its communication protocol.
SST: crazy fast. Used mostly at rack scale: maybe 64 or 128 nodes at a time. Has a flat structure, and uses RDMA to transfer rows directly at optical network speeds of 10 or even 100 Gbps.

Summary
First we saw a way of using gossip in a reliable multicast (although the reliability is probabilistic). Astrolabe uses gossip for aggregation. Pure gossip isn't ideal for this… and competes poorly with flooding and other urgent protocols. But Astrolabe introduces hierarchy and is an interesting option that gets used in at least one real cloud platform.
SST uses a similar model, but is implemented with one-sided RDMA reads and writes: less scalable, but it runs 100,000x faster…