Viceroy: Scalable Emulation of Butterfly Networks for Distributed Hash Tables. By Dahlia Malkhi, Moni Naor & David Ratajczak, Nov. 11, 2003. Presented by Zhenlei Jia, Nov. 11, 2004

Acknowledgments: Some of the following slides are adapted from the slides created by the authors of the paper.

Outline: DHT Properties; Viceroy Structure; Routing Algorithm; Join/Leave; Bounding In-degree: Bucket Solution; Fault Tolerance; Summary

DHT. What's a DHT? It stores (key, value) pairs and supports Lookup and Join/Leave. Examples: CAN, Pastry, Tapestry, Chord, etc.
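To make the interface concrete, here is a minimal sketch of the operations a DHT node exposes; the class, the method names, and the in-memory dict are illustrative assumptions, not taken from any of the systems listed above.

```python
class DHTNode:
    """Illustrative sketch of a DHT node's interface (not any specific system)."""

    def __init__(self):
        self.store = {}  # the slice of the (key, value) space this node is responsible for

    def put(self, key, value):
        """Store a (key, value) pair on the node responsible for key."""
        self.route_to_owner(key).store[key] = value

    def lookup(self, key):
        """Find the node responsible for key via overlay routing and read the value."""
        return self.route_to_owner(key).store.get(key)

    def join(self, bootstrap):
        """Contact an existing node, take over part of the key space, set up links."""
        raise NotImplementedError

    def leave(self):
        """Hand keys and links back to neighbors before departing gracefully."""
        raise NotImplementedError

    def route_to_owner(self, key):
        """Overlay-specific routing (Chord, CAN, Pastry, Tapestry, Viceroy, ...)."""
        raise NotImplementedError
```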

DHT Properties. Dilation: efficient lookup, usually O(log(n)). Maintenance cost: support for a dynamic environment; control messages, affected servers. Degree: number of open connections; servers impacted by a node join/leave; heartbeat, graceful leave.

DHT Properties (cont.) Congestion: peers should share the routing load evenly. Load (of a node): the probability that it lies on a route with a random source and destination. If the path length is O(log(n)), then on average each node is on n^2 · O(log(n)) / n = O(n·log(n)) routes. Average load = O(n·log(n)) / n^2 = O(log(n))/n.
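As a quick numeric check of this estimate (a sketch; n is an arbitrary example size and the constants hidden in the O(·) terms are dropped):

```python
import math

n = 2 ** 16                                    # example network size
paths = n * n                                  # all (source, destination) pairs
hops_per_path = math.log2(n)                   # path length ~ log(n)
routes_per_node = paths * hops_per_path / n    # n * log(n) routes pass through an average node
avg_load = routes_per_node / paths             # log(n) / n: chance of sitting on a random route
print(routes_per_node, avg_load)               # 1048576.0 0.000244140625
```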

Previous Work

Intuition. A route is a combination of links of appropriate sizes. Chord: each node has ALL log(n) links. Viceroy: each node has ONE of the long-range links; a link of length 1/2^k points to a node that has a link of length 1/2^(k+1).

A Butterfly Network (figure: levels 1-4). Each node has ONE of the long-range links; a link of length 1/2^k points to a node that has a link of length 1/2^(k+1); nodes "share" each other's long links. Routing: 1. route to the root; 2. route to the right group; 3. route to the right level. Path: O(log(n)). Degree: O(1).

A Viceroy network (figure: levels 1-3). Ideally there should be log(n) levels, but there is no global counter; later we will see how a node can estimate log(n) locally.

Structure: Nodes. Node id: a 128-bit binary string u. Level: a positive integer, u.level. Order of ids: b_1 b_2 … b_k is interpreted as ∑_{i=1…k} b_i / 2^i. Each node has a SUCCESSOR and a PREDECESSOR: SUCC(u), PRED(u). Node u stores the keys k such that u ≤ k < SUCC(u).
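A small sketch of the id-to-[0,1) interpretation and the successor storage rule; the SHA-256 hashing and the helper names are assumptions for illustration, not the paper's construction.

```python
import hashlib

W = 128  # id length in bits

def to_point(name: str) -> float:
    """Interpret an id b1 b2 ... bW as sum_{i=1..W} b_i / 2^i, i.e. a point in [0, 1)."""
    bits = int.from_bytes(hashlib.sha256(name.encode()).digest()[: W // 8], "big")
    return bits / 2 ** W

def owner(key: float, ring: list[float]) -> float:
    """A key k is stored on the node u with u <= k < SUCC(u) (wrapping around the ring)."""
    ids = sorted(ring)
    below = [u for u in ids if u <= key]
    return below[-1] if below else ids[-1]  # nothing below k: the largest id owns the wrap-around arc

nodes = [to_point(f"node-{i}") for i in range(8)]
print(owner(to_point("some-key"), nodes))
```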

Structure: Nodes (figure: the ring over [0,1) with x, SUCC(x), PRED(x), and the keys stored on x). Lemma 2.1: let n_0 = 1/d(x, SUCC(x)); then w.h.p. (i.e., with probability > 1 - 1/n^(1+ε)), log(n) - log(log(n)) - O(1) < log(n_0) ≤ 3·log(n). Node x selects a level uniformly at random from 1 … log(n_0).
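Following Lemma 2.1, a joining node can estimate log(n) purely locally from the gap to its successor; a minimal sketch (ids as points on the unit ring, helper names assumed):

```python
import math
import random

def choose_level(my_id: float, succ_id: float) -> int:
    """Estimate n0 = 1/d(x, SUCC(x)) and draw a level uniformly from 1 .. log(n0)."""
    gap = (succ_id - my_id) % 1.0 or 1.0     # clockwise distance on the unit ring
    n0 = 1.0 / gap
    top = max(1, int(math.log2(n0)))         # w.h.p. within a constant factor of log(n)
    return random.randint(1, top)
```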

Structure: Links. A node u at level k has six out-links. 2 x Short: SUCCESSOR, PREDECESSOR. 2 x Medium: (left) the closest level-(k+1) node whose id matches u.id[k] and is smaller than u.id. 1 x Long: the closest level-(k+1) node with prefix u_1 … u_{k-1}(1-u_k) (?), i.e. u_1 … u_{k-1}(1-u_k)u_{k+1} … u_w*, where w = log(n_0) - log(log(n_0)). 1 x Parent: the closest level-(k-1) node. Each node also keeps track of its in-bound links.
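A sketch of the resulting constant-degree link record (field names are illustrative; the paper's notation differs):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ViceroyNode:
    id: float                                           # position on the unit ring
    level: int                                          # k
    successor: Optional["ViceroyNode"] = None           # short link
    predecessor: Optional["ViceroyNode"] = None         # short link
    medium_left: Optional["ViceroyNode"] = None         # nearby level-(k+1) node, one side
    medium_right: Optional["ViceroyNode"] = None        # nearby level-(k+1) node, other side
    long: Optional["ViceroyNode"] = None                # level-(k+1) node about 1/2^k away
    parent: Optional["ViceroyNode"] = None              # closest level-(k-1) node
    inbound: List["ViceroyNode"] = field(default_factory=list)  # nodes pointing at us
```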

Structure: Links (figure, levels 1-3): short link; parent link, to level k-1; medium link, matches x[k]0*; long link, crosses over about 1/2^k, matches u[1…w] except the k-th bit (11*); matches 1*. (One of the drawn links is marked "Wrong!" on the slide.)

Routing: Algorithm. LOOKUP(x, y): Initialization: set cur to x. Proceed to root: while cur.level > 1: cur = cur.parent. Greedy search: if cur.id ≤ y < SUCC(cur).id, return cur; otherwise, choose the m among cur's links that minimizes d(m, y), move to m, and repeat.
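A runnable sketch of the two-phase lookup on the node record above (simplified: tie-breaking and failure handling are omitted; `d` is the clockwise ring distance, an assumed helper):

```python
def d(a: float, b: float) -> float:
    """Clockwise distance from a to b on the unit ring."""
    return (b - a) % 1.0

def lookup(x: "ViceroyNode", y: float) -> "ViceroyNode":
    cur = x
    while cur.level > 1:                 # proceed-to-root phase: climb parent links to level 1
        cur = cur.parent
    while True:                          # greedy phase
        if d(cur.id, y) < d(cur.id, cur.successor.id):   # y lies in [cur, SUCC(cur)): cur owns it
            return cur
        links = [cur.successor, cur.predecessor, cur.medium_left,
                 cur.medium_right, cur.long, cur.parent]
        cur = min((m for m in links if m is not None), key=lambda m: d(m.id, y))
```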

Routing: Example (figure: a lookup from x to y across the levels).

One Observation (figure).

Routing: Analysis (1) (figure: the route from x to y across the levels).

Routing: Analysis (2). Expected path length = O(log(n)): log(n) to reach a level-1 node, log(n) for traveling among clusters, log(n) for the final local search.

Routing: Theorems. Theorem 4.4: the path length from x to y is O(log(n)) w.h.p. The proof is based on several lemmas. Lemma 4.1: for every node u with level u.level < log(n) - log(log(n)), the number of nodes between u and u.Medium-left (Medium-right), if it exists, is at most 6·log^2(n) w.h.p.

Routing: Theorems (2). Lemma 4.2: in the greedy search phase of a lookup of value Y from node x, let the j-th greedy step v_j, for 1 ≤ j ≤ m, be such that v_j is more than O(log^2(n)) nodes away from y. Then w.h.p. node v_j is reached over a Medium or Long link, and hence satisfies v_j.level = j and v_j[j] = Y[j]. With m = log(n) - 2·log(log(n)) - log(3+ε), w.h.p. within m steps we are n/2^m = 6·log^2(n) nodes away from the destination.

Routing: Theorems (3). Lemma 4.3: let v be a node that is O(log^2(n)) nodes away from the target y; then w.h.p. the target y is reached from v within O(log(n)) greedy steps. Theorem 4.4: the total length of a route from x to y is O(log(n)) w.h.p. Theorem 4.6: the expected load on every node is O(log(n)/n); the load on every node is O(log^2(n)/n) w.h.p. Theorem 4.7: every node u has in-degree O(log(n)) w.h.p.

Join: Algorithm. 1. Choose identifier: select a random 128-bit string x_1 x_2 … x_128. 2. Set up short links: invoke LOOKUP(x), let x' be the resulting node; insert x between x' and x'.SUCCESSOR. 3. Choose level: let k be the maximal number of matching prefix bits between x and either SUCC(x) or PRED(x); choose a level from 1 … k. 4. Set parent link: if SUCC(x) has level x.level-1, set x.parent to it; otherwise, move to SUCC(x) and repeat. 5. Set long link: p = x_1 … x_{k-1}(1-x_k)x_{k+1} … x_w; invoke LOOKUP(p), stopping once a node at level x.level+1 that matches p is reached.
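The level-choice step (step 3) can be sketched directly; ids are treated as bit strings here, which is an illustrative simplification:

```python
import random

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits on which two id strings agree."""
    k = 0
    for u, v in zip(a, b):
        if u != v:
            break
        k += 1
    return k

def choose_join_level(x: str, succ: str, pred: str) -> int:
    """Join step 3: k = maximal matching prefix with SUCC(x) or PRED(x); pick a level in 1..k."""
    k = max(common_prefix_len(x, succ), common_prefix_len(x, pred), 1)
    return random.randint(1, k)

print(choose_join_level("0110", "0111", "0100"))  # k = 3, so a level in 1..3
```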

Join: Algorithm (cont.) 6. Set medium links: denote p = x_1 x_2 … x_{x.level}. If SUCC(x) has prefix p and level x.level+1, set x.Medium-right to it; otherwise, move to SUCC(x) and repeat. 7. Set inbound links: denote p = x_1 x_2 … x_{x.level}. Set inbound Medium links: following SUCC links, so long as the successor y has prefix p and a level different from x.level, if y.level = x.level-1, set y.Medium-left to x. Set inbound Long links: following SUCC links, find a y whose prefix matches p and whose level equals x.level; take over any of y's inbound links that are closer to x than to y. Set inbound Parent links: following PRED links, find y such that y.level = x.level+1; repeat until a node at the same level as x is met.
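A sketch of the medium-link walk in step 6, using the stopping rule from the join example below (the `id_bits` attribute, a bit-string form of a node's id, is an assumed addition to the node record above):

```python
def set_medium_right(x: "ViceroyNode", x_bits: str) -> None:
    """Join step 6: follow SUCC links; stop when the prefix no longer matches,
    link when a node with the matching prefix at level x.level + 1 is found."""
    p = x_bits[: x.level]
    y = x.successor
    while y is not None and y is not x:           # at most one trip around the ring
        if not y.id_bits.startswith(p):           # left the region of matching ids: give up
            return
        if y.level == x.level + 1:                # matching prefix and the right level
            x.medium_right = y
            return
        y = y.successor
```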

Join: Example (figure, levels 1-3; the new node x joins via LOOKUP(x) at example id 0111). Set Medium link (O(log^2(n)) hops w.h.p.): with p = x_1 x_2 … x_k (here 01), if y[k] != p, stop; if y[k] = p and y.level = k+1, set the Medium link; otherwise move to succ(y). Set Parent link: following SUCC links, find a node at level k-1. Set Long link: with p = x_1 … x_{k-1}(1-x_k) … x_w, stop at level k+1 (in this example, find 00*). Set inbound Long links: following short links, find y such that y[k] = x[k] and y.level = x.level, and check y's inbound links.

Join: Analysis. LOOKUP takes O(log(n)) messages w.h.p. Travel over short links during the link-setting phase is O(log^2(n)) w.h.p.: a Medium link is within 6·log^2(n) nodes of x w.h.p., and similarly for the other links. Theorem 5.1: a JOIN operation by a new node x incurs an expected O(log(n)) messages, and O(log^2(n)) messages w.h.p. The expected number of nodes that change their state as a result of x's join is constant, and w.h.p. it is O(log(n)), because node x has in-degree O(log(n)) w.h.p. Similar results hold for LEAVE.

Bounding In-degree. Theorem 4.7: every node has expected constant in-degree, and O(log(n)) in-degree w.h.p. The in-degree is the number of servers affected by a join/leave. How can constant in-degree be guaranteed? The bucket solution: a background process that balances the assignment of levels.

Bucket Solution: Intuition (figure: levels k-1 and k). Node x can have log(n) in-degree when ~log(n) Medium-right links point at it: too many nodes at level k-1 and too few nodes at level k. The fix is to improve the level selection procedure.

Bucket Solution. The name space is divided into non-overlapping buckets. A bucket contains m nodes, where log(n) ≤ m ≤ c·log(n), for c > 2. Within a bucket, levels are NOT assigned randomly: for each 1 ≤ j ≤ log(n), there are 1 … c nodes at level j in each bucket. In(x) < 7c (?? 2c).

Maintaining Bucket Size. n can be accurately estimated. When the bucket size exceeds c·log(n), the bucket is split into two equal-size buckets. When the bucket size drops below log(n), it is merged with a neighbor bucket; furthermore, if the merged bucket is larger than (2c+2)/3 · log(n), the new bucket is split into two buckets (each half then still exceeds (c+1)/3 · log(n) > log(n), since c > 2). Buckets are organized into a ring, so a merge or split costs O(1) messages.
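A decision-only sketch of these split/merge rules (thresholds only; actual node movement and the estimation of n are elided, and c = 3 is just an example value with c > 2):

```python
import math

def bucket_action(size: int, neighbor_size: int, n: int, c: float = 3.0) -> str:
    """Return which maintenance step the rules above prescribe for a bucket."""
    log_n = math.log2(n)
    if size > c * log_n:
        return "split into two equal-size buckets"
    if size < log_n:
        if size + neighbor_size > (2 * c + 2) / 3 * log_n:
            return "merge, then re-split"     # each half still > ((c+1)/3)*log(n) > log(n)
        return "merge with neighbor"
    return "no action"
```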

Maintaining the Level Property. Node join/leave without merging or splitting: O(1). Join: the bucket size is < c·log(n), so choose a level that has fewer than c nodes. Leave: if the departing node is the only node at its level j, find another level that has two nodes and reassign level j to one of them. A bucket merge or split may require reassigning the levels of all nodes in the bucket(s): O(log(n)). Merging/splitting is expensive, but it does not happen often: after a merge or split, at least (c-2)/3 · log(n) JOIN/LEAVE operations must happen in this bucket before another merge or split of this bucket is performed. Amortized overhead = c/((c-2)/3) = O(1) for c > 2.
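A sketch of the in-bucket level bookkeeping on join and leave (the list-of-levels representation is an assumption for illustration):

```python
import math
import random
from collections import Counter

def level_on_join(levels: list[int], n: int, c: int = 3) -> int:
    """Join: the bucket has fewer than c*log(n) nodes, so some level holds
    fewer than c nodes; pick one such level."""
    counts = Counter(levels)
    free = [j for j in range(1, int(math.log2(n)) + 1) if counts[j] < c]
    return random.choice(free)

def reassignment_on_leave(levels: list[int], leaving_level: int):
    """Leave: if the departing node was alone at its level, move a node over from
    a level with at least two nodes. Returns (donor_level, vacated_level) or None."""
    counts = Counter(levels)
    if counts[leaving_level] > 1:
        return None                           # someone else still covers this level
    donors = [j for j, cnt in counts.items() if cnt >= 2]
    return (donors[0], leaving_level) if donors else None
```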

Amortized Analysis (figure: bucket size between log(n) and c·log(n)). d_1, d_2 > (c-2)/3; new bucket size: min(c/2 · log(n), (c+1)/3 · log(n)); max bucket size: max(c/2 · log(n), (2c+2)/3 · log(n)).

Fault Tolerance. Viceroy has no built-in support for fault tolerance: it requires graceful leaves, and leaves are NOT the same as failures; performance is sensitive to failures. External techniques: thickening edges, state machine replication.

State Machine Replication (figure: a super node realized via SMR over Viceroy nodes; old and new configurations).

Related Work. De Bruijn graph based networks: Distance Halving, D2B, Koorde. Others: Symphony (small-world model), Ulysses (butterfly-based; degree log(n), path length log(n)/log(log(n))).

Summary. Constant out-degree. Expected constant in-degree: O(log(n)) w.h.p., O(1) with the bucket solution. O(log(n)) path length w.h.p. Expected load log(n)/n; O(log^2(n)/n) w.h.p. Weaknesses/improvements: not locality aware; no fault-tolerance support, due to the lack of flexibility of the butterfly network.

Questions? (Photo by Peter J. Bryant)