1
Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables
By: Dahlia Malkhi, Moni Naor & David Ratajczak Nov. 11, 2003 Presented by Zhenlei Jia Nov. 11, 2004
2
Acknowledgments Some of the following slides are adapted from the slides created by the authors of the paper
3
Outline
DHT Properties
Viceroy: Structure, Routing Algorithm, Join/Leave
Bounding In-degree: the Bucket Solution
Fault Tolerance
Summary
First, we will discuss some performance properties of a DHT; Viceroy is designed with these concerns in mind, and we will also see how Viceroy differs from other DHTs with respect to these properties. Then we will discuss Viceroy itself: the structure of a Viceroy network and how its nodes are connected to each other, the routing algorithm, and how node joins and leaves are supported. The bucket solution is an improvement on the basic Viceroy structure. Viceroy has no built-in fault-tolerance support, so we discuss how fault tolerance can be added to it, and then we summarize.
4
DHT
What is a DHT?
Store (key, value) pairs
Lookup
Join/Leave
Examples: CAN, Pastry, Tapestry, Chord, etc.
A DHT stores (key, value) pairs and performs lookups of keys. Nodes can also join and leave the network freely, and the DHT must adjust itself on these events. We have already seen several distributed hash tables.
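To fix terminology before the properties discussion, here is the interface a DHT exposes, as a minimal Python sketch (the method names are illustrative, not from any particular system):

```python
from typing import Optional, Protocol

class DHTNode(Protocol):
    """Minimal interface of a distributed hash table node."""

    def store(self, key: bytes, value: bytes) -> None:
        """Route to the node responsible for key and store the pair there."""

    def lookup(self, key: bytes) -> Optional[bytes]:
        """Route to the node responsible for key and return its value."""

    def join(self, bootstrap: "DHTNode") -> None:
        """Enter the network via a known node; the DHT adjusts itself."""

    def leave(self) -> None:
        """Depart, handing stored keys over so the DHT stays consistent."""
```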
5
DHT Properties
Dilation: efficient lookup, usually O(log(n)).
Maintenance cost: support for a dynamic environment, measured in control messages and affected servers.
Degree: number of open connections; servers impacted by a node join/leave; heartbeat and graceful-leave traffic.
Here are some properties related to the performance of a DHT. Dilation, which we called path length before, is the number of hops a message is forwarded from a source to a target. Obviously we want low dilation; for a system to scale well, O(log(n)) is usually required. We also want low maintenance cost, the cost of a node join or leave, usually measured by the number of control messages transferred and the number of nodes affected by the change. Degree: a node keeps connections to other nodes, to forward messages to them or receive requests from them. The number of connections should be small, since maintaining too many connections is a burden for a node: if a high-degree server fails or leaves, many nodes need to update their state, and for systems that use heartbeat messages or implement graceful leave, more connections mean more messages to send. The DHTs we have seen so far use O(log(n)) connections.
6
DHT Properties (cont.)
Congestion: peers should share the routing load evenly.
Load (of a node): the probability that it lies on a route with a random source and destination.
If the path length is O(log(n)), the n² source-destination pairs generate n² · O(log(n)) node-visits in total, so on average each of the n nodes lies on n² · O(log(n)) / n = O(n·log(n)) routes. The average load is therefore O(n·log(n)) / n² = O(log(n)/n).
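As a quick worked instance of this calculation (my numbers, not from the slides):

```latex
\[
\frac{n^2 \cdot O(\log n)}{n} = O(n \log n)\ \text{routes per node},
\qquad
\frac{O(n \log n)}{n^2} = O\!\left(\frac{\log n}{n}\right).
\]
```

For n = 2^10 = 1024, each node lies on about 1024 · 10 ≈ 10^4 of the 1024² ≈ 10^6 routes, giving a load of roughly 10/1024, about 1%.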
7
Previous Works
Here is a table comparing several DHTs with respect to the properties we have discussed. Note that Viceroy has the same dilation and congestion as Chord and Tapestry, but only a constant node degree, while Chord needs O(log(n)).
8
Intuition
A route is a combination of links of appropriate sizes.
Chord: each node has ALL log(n) links.
Viceroy: each node has ONE of the long-range links; a link of length 1/2^k points to a node that has a link of length 1/2^(k+1).
Here is some intuition for how Viceroy can use fewer edges and still route efficiently. A route is a combination of links of different sizes. While in Chord each node has all log(n) links, in Viceroy each node has only one of the long-range links. To make sure the links can be combined in this way, a link of length 1/2^k points to a node that has a link of length 1/2^(k+1). This k corresponds to the level of a node, which we will see in the next slides.
9
A Butterfly Network
Each node has ONE of the long-range links; a link of length 1/2^k points to a node that has a link of length 1/2^(k+1), so nodes "share" each other's long links.
Routing: route to the root, route to the right group, route to the right level.
Path: O(log(n)). Degree: O(1).
[Diagram: nodes 000-111 arranged in levels 1-4.]
Each node has one of the long-range links.
10
A Viceroy network
[Diagram: nodes 0001-1111 on the ring, arranged in levels 1-3.]
Ideally, there should be log(n) levels, but there is no global counter; later we will see how a node can estimate log(n) locally.
In this slide we briefly walk through the different parts of a Viceroy network. Here are the nodes of a Viceroy network. They use binary strings as their ids, and the ids are also mapped to real numbers between 0 and 1. Each node is connected to its adjacent nodes. A node also has an attribute called its level: roughly, a node at level k has a link of length 1/2^k that points to a node in the next level. Ideally, the nodes are distributed over log(n) levels, but note that there is no global counter that counts the number of nodes in the network; later we will see how a node can estimate this number locally. So the nodes sit at different levels, and there are different kinds of links, whose definitions we give later.
11
Structure: Nodes
Id: a 128-bit binary string, u.
Level: a positive integer, u.level.
Order of ids: the string b1b2…bk is interpreted as the real number Σ_{i=1..k} b_i / 2^i.
Each node has a SUCCESSOR and a PREDECESSOR: SUCC(u), PRED(u). The SUCCESSOR is the minimal id among the greater ids.
Node u stores the keys k such that u ≤ k < SUCC(u).
Each node has a level, which is a positive integer, and ids are mapped to real numbers.
12
Structure: Nodes (cont.)
[Diagram: PRED(x), x, and SUCC(x) on the ring; the keys stored on x lie between x and SUCC(x).]
Lemma 2.1: Let n0 = 1/d(x, SUCC(x)). Then w.h.p. (i.e., with probability > 1 - 1/n^(1+ε)),
log(n) - log(log(n)) - O(1) < log(n0) ≤ 3·log(n).
Node x selects its level from 1…log(n0) uniformly at random.
Lemma 2.1 shows how to estimate the size of the network. Throughout the paper, by w.h.p. we mean that the probability is greater than 1 - 1/n^(1+ε), where ε is a positive number. Node x then selects its level from 1…log(n0) uniformly at random.
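As a small illustration of the id mapping and the level choice of Lemma 2.1, here is a Python sketch (the function names are mine, not the paper's; it assumes the successor is distinct from x):

```python
import math
import random

def id_to_real(bits: str) -> float:
    """Map a binary id b1 b2 ... bk to the real number sum_i b_i / 2^i in [0, 1)."""
    return sum(int(b) / 2 ** (i + 1) for i, b in enumerate(bits))

def choose_level(x: float, succ: float) -> int:
    """Estimate n0 = 1/d(x, SUCC(x)) and pick a level uniformly from 1..log(n0)."""
    d = (succ - x) % 1.0   # clockwise distance on the unit ring
    n0 = 1.0 / d           # by Lemma 2.1, log(n0) tracks log(n) w.h.p.
    return random.randint(1, max(1, math.floor(math.log2(n0))))
```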
13
Structure: Links
A node u at level k has six out-links:
2 x Short: SUCCESSOR and PREDECESSOR.
2 x Medium: Medium-left is the closest level-(k+1) node whose id matches u's first k bits, u.id[k], and is smaller than u.id; Medium-right is the analogous node with an id greater than u.id.
1 x Long: the closest level-(k+1) node with prefix u1…u_(k-1)(1-u_k)u_(k+1)…u_w, where w = log(n0) - log(log(n0)), i.e., u's first w bits with the k-th bit flipped.
1 x Parent: the closest level-(k-1) node.
Each node also keeps track of its in-bound links.
A node links to its successor and predecessor, and it has a long link of length about 1/2^k. As we have seen, a node sometimes wants to use a nearby node's long link; for this purpose it also keeps a parent link, which points to a level-(k-1) node. It also has two medium links to level-(k+1) nodes.
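Collecting the six out-links into a data structure gives something like the following sketch (field names are mine; the paper does not prescribe a representation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ViceroyNode:
    id_bits: str                                  # 128-bit binary id
    level: int                                    # chosen from 1..log(n0)
    successor: Optional["ViceroyNode"] = None     # short link
    predecessor: Optional["ViceroyNode"] = None   # short link
    medium_left: Optional["ViceroyNode"] = None   # closest level-(k+1) node, same k-bit prefix, smaller id
    medium_right: Optional["ViceroyNode"] = None  # closest level-(k+1) node, same k-bit prefix, larger id
    long: Optional["ViceroyNode"] = None          # level-(k+1) node about 1/2^k away
    parent: Optional["ViceroyNode"] = None        # closest level-(k-1) node
    inbound: List["ViceroyNode"] = field(default_factory=list)  # nodes pointing at us
```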
14
Structure: Links (example)
[Diagram: nodes 0001-1111 on levels 1-3, annotated with example links: short links to successor and predecessor; a medium link to a node matching x's first k bits (x[k]0*); a long link crossing over about 1/2^k, to a node matching u's first w bits except the k-th bit (11*); a parent link to a level-(k-1) node. One candidate link, labeled "Matches 1*", is marked "Wrong!".]
Here are examples of the links.
15
Routing: Algorithm
LOOKUP(x, y):
1. Initialization: set cur to x.
2. Proceed to root: while cur.level > 1, set cur = cur.parent.
3. Greedy search: if cur.id ≤ y < SUCC(cur).id, return cur. Otherwise, choose the link m of cur that minimizes d(m, y), move to m, and repeat.
Demo: the routing algorithm.
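The same algorithm as a Python sketch, reusing the ViceroyNode and id_to_real sketches above (the wraparound of the ring at 1.0 is ignored for brevity):

```python
def ring_distance(a: float, b: float) -> float:
    """Clockwise distance from a to b on the unit ring."""
    return (b - a) % 1.0

def lookup(x: "ViceroyNode", y: float) -> "ViceroyNode":
    cur = x
    # Phase 1: climb parent links until a level-1 node is reached.
    while cur.level > 1 and cur.parent is not None:
        cur = cur.parent
    # Phase 2: greedy search over all of cur's out-links.
    while True:
        if id_to_real(cur.id_bits) <= y < id_to_real(cur.successor.id_bits):
            return cur  # cur is responsible for key y
        links = [m for m in (cur.successor, cur.predecessor, cur.medium_left,
                             cur.medium_right, cur.long, cur.parent) if m]
        cur = min(links, key=lambda m: ring_distance(id_to_real(m.id_bits), y))
```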
16
Routing: Example
[Diagram: a route from x to y over nodes 0001-1111, levels 1-3.]
First route to a level-1 node, then do greedy routing. We can view the routing this way: node x needs to reach a node far away from itself but may not have a long enough link, so it asks a nearby node that has a long link to help. Consider another example, routing from 0001 to 0101: this time 0001's link is too long. By following short links, the message reaches a node that has a link of the appropriate length, and that node delivers it. The idea is that although each node has only one long link, its nearby neighbors, taken as a whole set, will very likely have links of every length.
17
One Observation
[Diagram: nodes 0001-1111 on levels 1-3.]
Here we can see that the second group, taken as a whole, has a link two groups long, a link one group long, and so on. Another example.
18
Routing: Analysis (1)
[Diagram: a route from x to y over nodes 0001-1111, levels 1-3.]
The routing can be divided into three parts. First, route within the cluster of x, which takes log(n) steps. Then travel among the clusters: since every cluster has links of every size, it takes log(n/log(n)) steps to reach the right cluster. Finally, within the right cluster, it takes log(n) steps to reach the destination.
19
Routing: Analysis (2)
Expected path length = O(log(n)):
log(n) to reach a level-1 node,
log(n) for traveling among clusters,
log(n) for the final local search.
This summarizes the previous slide.
20
Routing: Theorems
Theorem 4.4: The path length from x to y is O(log(n)) w.h.p.
The proof is based on several lemmas.
Lemma 4.1: For every node u with level u.level < log(n) - log(log(n)), the number of nodes between u and u.Medium-left (or Medium-right), if it exists, is at most 6·log²(n) w.h.p.
Although the expected path length is O(log(n)), some paths could be very long; this theorem says that such bad cases occur very rarely. The proof is based on several lemmas, which we will not discuss in detail tonight.
21
Routing: Theorems (2)
Lemma 4.2: In the greedy-search phase of a lookup of value y from node x, let the j-th greedy step v_j, for 1 ≤ j ≤ m, be such that v_j is more than O(log²(n)) nodes away from y, where m = log(n) - 2·log(log(n)) - log(3+ε). Then w.h.p. node v_j is reached over a Medium or Long link, and hence satisfies v_j.level = j and v_j[j] = y[j].
In other words, w.h.p. within m steps we are n/2^m = 6·log²(n) nodes away from the destination.
22
Routing: Theorems (3)
Lemma 4.3: Let v be a node that is O(log²(n)) nodes away from the target y. Then w.h.p., within O(log(n)) greedy steps the target y is reached from v.
Theorem 4.4: The total length of a route from x to y is O(log(n)) w.h.p.
Theorem 4.6: The expected load on every node is O(log(n)/n); the load on every node is O(log²(n)/n) w.h.p.
Theorem 4.7: Every node u has in-degree O(log(n)) w.h.p.
These theorems concern load balance and in-degrees (asymptotically).
23
Join: Algorithm
1. Choose identifier: select a random 128-bit string x1x2…x128.
2. Set up short links: invoke LOOKUP(x) and let x' be the resulting node. Insert x between x' and x'.SUCCESSOR.
3. Choose level: let k be the maximal number of matching prefix bits between x and either SUCC(x) or PRED(x); choose a level from 1…k.
4. Set parent link: if SUCC(x) has level x.level - 1, set x.parent to it. Otherwise, move to SUCC(x) and repeat.
5. Set long link: let p = x1…x_(k-1)(1-x_k)x_(k+1)…x_w. Invoke LOOKUP(p), but stop once a node at level x.level + 1 that matches p is reached.
Now we see how join works. These steps are taken from the paper; we will see an example.
24
Join: Algorithm (cont.)
6. Set medium links: denote p = x1x2…x_(x.level). If SUCC(x) has prefix p and level x.level + 1, set x.Medium-right to it; otherwise, move to SUCC(x) and repeat.
7. Set inbound links: denote p = x1x2…x_(x.level).
Set inbound Medium links: following SUCC links, as long as the successor y has prefix p and a level different from x.level, if y.level = x.level - 1, set y.Medium-left to x.
Set inbound Long links: following SUCC links, find y that has a prefix matching p and level equal to x.level. Take over any of y's inbound links that are closer to x than to y.
Set inbound Parent links: following PRED links, find each y such that y.level = x.level + 1; repeat until a node at the same level as x is met.
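Steps 1-4 of the join, condensed into a Python sketch on top of the structures above (ring splicing is simplified, and steps 5-7, the long, medium, and inbound links, are omitted; see the slide text for those):

```python
import random

def common_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def join(bootstrap: "ViceroyNode") -> "ViceroyNode":
    # 1. Choose a random 128-bit identifier.
    bits = "".join(random.choice("01") for _ in range(128))
    x = ViceroyNode(id_bits=bits, level=1)
    # 2. Short links: LOOKUP(x) finds the predecessor; splice x in after it.
    pred = lookup(bootstrap, id_to_real(bits))
    x.predecessor, x.successor = pred, pred.successor
    pred.successor.predecessor = x
    pred.successor = x
    # 3. Level: k = longest matching prefix with SUCC(x) or PRED(x).
    k = max(common_prefix_len(bits, x.successor.id_bits),
            common_prefix_len(bits, x.predecessor.id_bits))
    x.level = random.randint(1, max(1, k))
    # 4. Parent: walk SUCC links until a level-(x.level - 1) node appears
    #    (level-1 nodes keep no parent).
    if x.level > 1:
        cur = x.successor
        while cur is not x and cur.level != x.level - 1:
            cur = cur.successor
        x.parent = cur if cur is not x else None
    return x
```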
25
Join: Example
[Diagram: node x = 0111 joins the ring of nodes 0001-1111, levels 1-3; LOOKUP(x) locates its position.]
Set medium link (O(log²(n)) steps w.h.p.): let p = x1x2…xk (here 01). Following SUCC links, examine each node y: if y's first k bits differ from p, stop; if they equal p and y.level = k+1, set the Medium link to y; otherwise, move to SUCC(y).
Set long link: p = x1…x_(k-1)(1-x_k)…x_w; in this case, find a node matching 00* and stop at level k+1.
Set parent link: following SUCC links, find a node with level k-1.
Set inbound long links: following short links, find y such that y[k] = x[k] and y.level = x.level, and check y's inbound links.
To set the medium link, let p be x's prefix of length k; we follow the short links and examine the nodes one by one.
26
Join: Analysis
LOOKUP takes O(log(n)) messages w.h.p.
Travel along short links during the link-setting phase is O(log²(n)) w.h.p.: a Medium link is within 6·log²(n) nodes of x w.h.p., and similarly for the other links.
Theorem 5.1: A JOIN operation by a new node x incurs an expected O(log(n)) messages, and O(log²(n)) messages w.h.p. The expected number of nodes that change their state as a result of x's join is constant, and w.h.p. it is O(log(n)), because node x has O(log(n)) in-bound links w.h.p.
Similar results hold for LEAVE.
27
Bounding In-degrees
Theorem 4.7: Every node has expected constant in-degree, and O(log(n)) in-degree w.h.p.
The in-degree is the number of servers affected by a join/leave. How can we guarantee constant in-degree? The bucket solution: a background process that balances the assignment of levels.
28
Bucket Solution: Intuition
[Diagram: ~log(n) consecutive nodes around x, at levels k-1 and k.]
Node x can acquire log(n) in-degree when, for example, there are too many nodes at level k-1 and too few at level k, so all of them take x as their Medium-right. Ideally, among log(n) consecutive nodes there are some nodes at each level and not too many at any one level. The bucket solution improves the level-selection procedure to avoid such extremely unbalanced cases.
29
Bucket Solution
[Diagram: the ring of nodes 0001-1111 divided into buckets.]
The name space is divided into non-overlapping buckets. A bucket contains m nodes, where log(n) ≤ m ≤ c·log(n), for c > 2. Within a bucket, levels are NOT assigned randomly: for each 1 ≤ j ≤ log(n), there are between 1 and c nodes at level j in each bucket. This bounds the in-degree: In(x) < 7c (perhaps even 2c), a constant.
30
Maintaining Bucket Size
n can be accurately estimated.
When a bucket's size exceeds c·log(n), the bucket is split into two equal-size buckets. When a bucket's size drops below log(n), it is merged with a neighboring bucket; furthermore, if the merged bucket is larger than log(n)·(2c+2)/3, the new bucket is split into two buckets. Note that (c+1)/3 > 1 since c > 2, so the resulting buckets stay above the minimum size.
Buckets are organized into a ring, and a merge or split takes O(1) messages.
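The split/merge rules as a Python sketch (buckets as plain lists of nodes; the bucket ring and the level reassignment are elided; c > 2 is assumed):

```python
import math

def maintain_bucket(bucket: list, neighbor: list, n: int, c: float = 3.0) -> list:
    """Return the list of buckets that replaces `bucket` after one check."""
    log_n = math.log2(n)
    if len(bucket) > c * log_n:
        # Too large: split into two equal-size buckets.
        mid = len(bucket) // 2
        return [bucket[:mid], bucket[mid:]]
    if len(bucket) < log_n:
        # Too small: merge with a neighbor bucket...
        merged = bucket + neighbor
        if len(merged) > (2 * c + 2) / 3 * log_n:
            # ...and re-split if the merged bucket is itself too large.
            mid = len(merged) // 2
            return [merged[:mid], merged[mid:]]
        return [merged]
    return [bucket]
```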
31
Maintain Level Property
Node join/leave without merging or splitting: O(1).
Join: since the size is < c·log(n), choose a level that has fewer than c nodes.
Leave: if the leaving node is the only node at its level j, find another level with two nodes and reassign level j to one of them.
A bucket merge or split may require reassigning the levels of all nodes in the bucket(s): O(log(n)). Merging and splitting are expensive, but they do not happen often: after a merge or split, at least log(n)·(c-2)/3 JOIN/LEAVE operations must happen in this bucket before another merge or split of this bucket is performed. Amortized overhead = c/((c-2)/3) = O(1) for c > 2.
First we look at the case where a node join or leave does not trigger a bucket merge or split. If a node joins and no split happens, the bucket size is still less than c·log(n), so we can find a level with fewer than c nodes and assign that level to the new node. If a node leaves and it was the only node at its level, that level becomes empty; we need to find a level with at least two nodes and move one of them to the empty level. Since the bucket's size is greater than log(n), such a level is guaranteed to exist. After a merge or split, the level property may be completely violated, and we may need to reassign levels to all nodes, which takes O(log(n)) messages. That is expensive; the good news is that these expensive operations cannot happen too often, as shown below.
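The O(1) bookkeeping on join/leave, sketched with a per-bucket counter of nodes per level (names are mine; existence of the chosen levels follows from the size bounds above):

```python
import math
from collections import Counter

def level_on_join(levels: Counter, n: int, c: float = 3.0) -> int:
    """Assign the joiner a level that still has fewer than c nodes;
    one exists because the bucket size is below c*log(n) (no split yet)."""
    for j in range(1, max(1, math.floor(math.log2(n))) + 1):
        if levels[j] < c:
            levels[j] += 1
            return j
    raise RuntimeError("bucket should have been split before this point")

def level_on_leave(levels: Counter, left: int) -> None:
    """If the leaver emptied its level, refill it from a level with >= 2 nodes;
    one exists because the bucket size is above log(n) (no merge yet)."""
    levels[left] -= 1
    if levels[left] == 0:
        donor = next(j for j, cnt in levels.items() if cnt >= 2)
        levels[donor] -= 1
        levels[left] += 1
```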
32
Amortized Analysis
[Diagram: bucket-size axis from log(n) (merge threshold, red) to c·log(n) (split threshold, red); a new bucket's size lies between min((c/2)·log(n), ((c+1)/3)·log(n)) and max((c/2)·log(n), ((2c+2)/3)·log(n)) (green), at distances d1, d2 > ((c-2)/3)·log(n) from the thresholds.]
The blue bar is a new bucket; its size is within the two green lines. Only if it grows or shrinks to the red lines does it trigger a split or a merge. Since the distance from the green lines to the red lines is greater than ((c-2)/3)·log(n), splitting or merging rarely happens.
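As a worked check of the gaps d1 and d2 (arithmetic from the thresholds above; the slide states the bound without the derivation):

```latex
\[
\tfrac{c}{2}\log n - \log n = \tfrac{c-2}{2}\log n, \qquad
\tfrac{c+1}{3}\log n - \log n = \tfrac{c-2}{3}\log n, \qquad
c\log n - \tfrac{2c+2}{3}\log n = \tfrac{c-2}{3}\log n .
\]
```

So at least ((c-2)/3)·log(n) joins/leaves must hit a bucket between consecutive splits/merges, and dividing the O(log(n)) reassignment cost by this count gives O(1) amortized overhead for any c > 2.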
33
Fault Tolerance
Viceroy has no built-in support for fault tolerance: it requires graceful leaves, and leaves are NOT the same as failures. Performance is therefore sensitive to failures.
External techniques: thickening edges; state machine replication.
34
State Machine Replication
[Diagram: Viceroy nodes grouped into super-nodes; an old and a new configuration under SMR.]
I copied this slide from the authors' presentation. To my understanding, this works like super-nodes.
35
Related Works
De Bruijn graph based networks: Distance Halving, D2B, Koorde.
Others: Symphony (small-world model), Ulysses (butterfly-based; O(log(n)) degree, O(log(n)/log(log(n))) path length).
36
Summary
Constant out-degree.
Expected constant in-degree: O(log(n)) w.h.p., and O(1) with the bucket solution.
O(log(n)) path length w.h.p.
Expected load O(log(n)/n): O(log²(n)/n) w.h.p.
Weaknesses / possible improvements: not locality-aware, and no fault-tolerance support, due to the lack of flexibility of the butterfly network.
37
Questions? (Photo by Peter J. Bryant)