LAND: Locality Aware Networks for Distributed Hash Tables Dahlia Malkhi The Hebrew University of Jerusalem Joint work with: Ittai Abraham and Oren Dobzinski.


2 Motivation Today’s Internet: –Many lightweight clients (web browsers) –Relatively simple servers (HTTP) The client-server paradigm suits a world of thin clients that lack bandwidth and computational power. Tomorrow’s Internet? –Most devices will have enough bandwidth and CPU to act as both a client and a server (peer). –Users will have an active network presence.

3 Challenges New distributed storage and retrieval services will face many challenges: Scalability – the number of users constantly increases. Dynamism – unlike today’s web servers, peers will be constantly joining and leaving. Congestion – access to data is non-uniform and may cause hot spots. Fault tolerance – availability and reliability of data. Efficiency – systems should be resource efficient and provide the best performance (more on this later).

4 Overlay networks and distributed data structures Hash tables: store and lookup by object id Quorum systems: global search Prefix lookup SQL Google ?

5 The Viceroy Project A project of DANSS (Distributed Computing, Networking and Secure Systems) Research Group in the Hebrew University, headed by Danny Dolev and Dahlia Malkhi. Aims to tackle challenges of future networks. Combines interdisciplinary domains such as distributed computing, graph theory, security and randomization. http://www.cs.huji.ac.il/labs/danss/viceroy/viceroy.html

6 Viceroy Overview –Viceroy, the first constant-degree distributed hash table [Malkhi, Naor, Ratajczak, PODC 02] –LAND, the first peer-to-peer network and lookup algorithm with worst-case constant distortion [Abraham, Malkhi, Dobzinski, SODA 2004] –A generic overlay network approach with implicit load balancing [Abraham, Awerbuch, Azar, Bartal, Malkhi, Pavlov, IPDPS 03] –A publish-subscribe mechanism for scale-free graphs based on probabilistic quorums [Abraham, Malkhi, DISC 03] –Small-world DHTs on planar metrics [Abraham, Malkhi, 2003] –An optimal asynchronous resource discovery scheme [Abraham, Dolev, PODC 03] –Investigation of user privacy and anonymity [Bickson, Malkhi, 2003] –An efficient, localized scheme for estimating the number of nodes in a dynamic network [Horowitz, Malkhi, IPL 2003]

7 Overlay Networks for Finding Nearest Copy of Data Nodes construct an overlay layer that enables new network architectures and services. A Content Addressable Network routes to the target node by examining the object’s id. If multiple copies of the same object exist, the closest copy should be accessed. Complexity measures for overlay networks: –Number of hops from source node to target node. –Degree of the overlay network. –Amount of additional memory needed per object. –Adaptability: number of nodes that change their state each time a peer joins/leaves the system. –Load on nodes related to the locating task.

Previous works

9 Locality Suppose a new DHT ensures each object will be found in 4 hops. So a lookup could begin in Boston and go to Brazil, New Zealand, France and finally New York. For some applications this is not a desired outcome.

10 Network models The Internet –Fully connected weighted graph –Weight = ping latency (?) Internet with geometric coordinates –Distance = geographic distance Mobile network –Geometric space with limited transmission range Arbitrary graphs

11 Distortion Let c(s, t) be the distance from s to t. Let s = x_1, x_2, …, x_k = t be the route from s to t. Distortion is the ratio between c(x_1, x_2) + … + c(x_{k-1}, x_k) and c(s, t).
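The definition above can be sketched in a few lines of Python. This is only an illustration: the Euclidean cost function and the three-point route are assumptions chosen for the example, not part of the talk.

```python
import math

def route_cost(route, c):
    """Sum of c over consecutive hops x_1 -> x_2 -> ... -> x_k."""
    return sum(c(a, b) for a, b in zip(route, route[1:]))

def distortion(route, c):
    """Ratio between the cost of the route and the direct distance c(s, t)."""
    s, t = route[0], route[-1]
    return route_cost(route, c) / c(s, t)

# Hypothetical example: nodes are points in the plane, c is Euclidean distance.
def euclid(p, q):
    return math.dist(p, q)

route = [(0, 0), (3, 4), (6, 0)]  # s -> x_2 -> t
print(distortion(route, euclid))  # (5 + 5) / 6
```

A direct route always has distortion 1; any detour pushes the ratio above 1.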

12 The Model Cost function c that forms a metric –c(x,y) ≥ 0 (non-negative), c(x,x)=0 (reflexive), c(x,y)=c(y,x) (symmetric) –Triangle inequality: c(x,y) + c(y,z) ≥ c(x,z) Minimal distance between peers is 1. N(x,r) denotes the set of nodes at distance <r from x. Growth Bounded Metric: Actually, assume uniform density first [Figure: concentric balls of radius r and 2r]

13 LAND origins and related work Based on the scheme of Plaxton, Rajaraman, Richa. “Accessing nearby copies of replicated objects in a distributed environment.” Theory of Computing Systems, 1999. PRR ensures that the expected distortion is constant. The Tapestry and Pastry DHTs are both based on the basic static PRR scheme. They enhance PRR by handling dynamic changes in the network. More...

14 LAND architecture A set of objects A. Objects can be stored on any node. –Multiple nodes can keep a replica of each object. Uniformly distributed hash function h(A). Uniformly selected node identifiers. Nodes keep transient routing information about objects. [Figure labels: “Resides here”, “Hashed-home here?”]

15 Identifiers and links Each node identifier has n=log(N) 2-bit digits. Node a_1 a_2 … a_n has n links –Link k ‘fixes’ the k’th digit –Connects to the closest node with identifier a_1 a_2 … a_{k-1} [0-3]* Link k property: –Found with probability 1/4^k –Expected to be found within a ball with 4^k nodes –Goes to distance 2^k

16 Publish and lookup Prefix-routing: fix one digit at a time. Publish object A at node t: –Leave reference “A;t” at each node en route Lookup A: route until a reference to A is found. Route distance: at most network diameter = 2^1 + 2^2 + … + 2^n
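The publish and lookup steps can be sketched as follows. This is a toy under stated assumptions, not the LAND implementation: nodes sit on a line (`pos`), ids are two base-4 digits, every name (`prefix_route`, `publish`, `lookup`, the node table) is hypothetical, and a missing prefix match simply stops the route where the full scheme would emulate a shadow node.

```python
def prefix_route(nodes, pos, start, target_id):
    """Yield the nodes visited when fixing one digit of target_id per step."""
    cur = start
    yield cur
    for k in range(1, len(target_id) + 1):
        matches = [v for v in nodes if nodes[v].startswith(target_id[:k])]
        if not matches:
            return  # the full scheme would emulate a shadow node here
        # Follow the link to the closest node that matches the first k digits.
        cur = min(matches, key=lambda v: abs(pos[v] - pos[cur]))
        yield cur

def publish(nodes, pos, refs, obj, obj_hash, holder):
    """Leave a reference 'obj; holder' at each node on the publish route."""
    for v in prefix_route(nodes, pos, holder, obj_hash):
        refs.setdefault(v, {})[obj] = holder

def lookup(nodes, pos, refs, obj, obj_hash, start):
    """Route toward obj_hash until some node holds a reference to obj."""
    path = []
    for v in prefix_route(nodes, pos, start, obj_hash):
        path.append(v)
        if obj in refs.get(v, {}):
            return path, refs[v][obj]
    return path, None

# Hypothetical network: node name -> id (base-4 digits) and position on a line.
nodes = {"a": "00", "b": "12", "c": "31", "d": "13", "e": "20"}
pos = {"a": 0, "b": 1, "c": 2, "d": 5, "e": 9}
refs = {}
publish(nodes, pos, refs, "Obj", "13", holder="a")
print(lookup(nodes, pos, refs, "Obj", "13", start="c"))
```

In this toy network the publish route from `a` is a, b, d, so a lookup starting at `c` already meets a reference at its first hop, `b`.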

17 Example of Publish: t w1 w2 w3 node t publishes Obj h(Obj)=101 t=*** w1=1** w2=10* w3=101

18 Problem 1: Link distance could be more than expected Solution: Emulate a shadow node –A shadow node that fixes digit k has links fixing digit k+1 –If a_1 a_2 … a_{k-1} d* is not found within distance 2^k, then look for a_1 a_2 … a_{k-1} d [0-3]* within 2^{k+1}

19 Expected O(n) number of shadow nodes –Probability of emulating link k is at most (1 - 1/4^k)^{4^k} < e^{-1} –B_{k+i}: number of shadow nodes with prefix k+i –E[B_{k+i}] = E[E[B_{k+i} | B_{k+i-1}]] = 2e^{-1} E[B_{k+i-1}] = (2/e)^i –Total expected number of shadow nodes starting with link k is constant

20 Problem 2: Unbounded distortion; if the target is very close, the price is too high Solution: ‘publish’ links –Step k in a publish route places a reference in appropriate nodes in a bigger neighborhood

21 Publish links Let k be the first index with 2^k ≥ c(s,t). Step k of the lookup route is at most 2^{k+1} away (denote x_k). Step k of the publish route is at most 2^{k+1} away (denote w_k). c(x_k, w_k) ≤ 3*2^{k+1}. Publish the k’th step to all nodes within distance 2^{(k+1)+2} whose identifier matches the k-prefix, so x_k will contain a reference to w_k.
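The bound on c(x_k, w_k) follows from the triangle inequality. The derivation below is a sketch that assumes "at most 2^{k+1} away" is measured from s for the lookup route and from t for the publish route.

```latex
\begin{align*}
c(x_k, w_k) &\le c(x_k, s) + c(s, t) + c(t, w_k) \\
            &\le 2^{k+1} + 2^{k} + 2^{k+1} = 5 \cdot 2^{k} \\
            &\le 3 \cdot 2^{k+1} \le 2^{(k+1)+2}.
\end{align*}
```

Under that reading, publishing within distance 2^{(k+1)+2} of w_k is enough to reach x_k.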

22 Example of Publish: t w1 w2 w3 node t publishes Obj h(Obj)=101 t=*** w1=1** w2=10* w3=101 For example: w1 publishes to nodes with id=1* within a distance proportional to distance(t, w1)

23 Lookup Algorithm [Figure: lookup route x, x0, x1, x2, x3, x4 meeting the publish route t, w1, w2, w3]

24 Distortion Route from s to x_k, plus distance from x_k to w_k, plus route from w_k to t ⇒ constant factor over the distance from s to t. Distortion can be made close to 1 by increasing the range of publish links.

25 Summary of LAND properties Guaranteed (small) constant distortion Expected logarithmic node-degree Simple analysis Amenable to dynamic deployment

26 Dynamic maintenance Find closest node to x: –From any node y, let S_n = {y} –Recursively, set S_{k-1} = closest node among incoming prefix-(k-1) links into S_k –Closest is S_1 Correctness: –Let s be closest to x; route from s to y –Step k is within 2^{k+1} distance –If S_k, the closest node to x with a k-prefix match to S_n, is within 2^{k+1} distance, then so is S_{k-1}

27 Neighbor finding From closest, route k steps, then backtrack incoming links two steps, to find prefix-(k-2) links Correctness: –Node k is 2^{k+1} distance away –Incoming links are 2^{k+2} away, covering 2^{k-2} away

28 Back to growth-bound model N(x,r) denotes the set of nodes at distance <r from x. Growth Bounded Metric: [Figure: concentric balls of radius r and 2r]

29 Setting node Identifier and Level. Set a radix B. Let M be such that B^M = N. Id has M digits of radix B. Denote A_i(x) as the ball around x with α B^i nodes (e^{-α} B < 1) –A_i(x) has a constant expected number of nodes with any specific length-i identifier.

30 Network links Each node has M initial routers. Router in level k has links to routers in level k+1: –router u with id a_1, a_2, …, a_k, a_{k+1}, …, a_M and level k maintains three types of links: 1. Neighbor – for each digit b in [0 … B-1], a link to the closest node with id beginning with a_1, a_2, a_3, …, a_k, b and level k+1 inside the ball A_{k+1}(u). 2. Publish – a link to all nodes with id beginning with a_1, a_2, a_3, …, a_k and level k+1 inside the ball A_{k+5}(u).

31 Example of Publish: t w1 w2 w3 Super-node t publishes Obj h(Obj)=abc t=*** w1=a** w2=ab* w3=abc For example: w1 publishes to routers with level 2 and id=ab* inside the ball A_6(w1)

32 Enforcing Locality with Shadow Nodes Recall: Neighbor – for each digit b in B, a link to the closest node with id a_1, a_2, a_3, …, a_k, b and level k+1. Want all neighbor links of a level k node u to be inside A_{k+1}(u). For any b, if no b’th neighbor is in A_{k+1}(u), then u emulates a shadow node v with id a_1, a_2, a_3, …, a_k, b and level k+1. Node u establishes all of this shadow router’s network links, including v’s neighbor links. Recursively, this process continues until all shadow nodes have all their links either close enough or emulated.

33 Variations and extensions Two-tier architecture, constant expected node degree Content Addressable Networks Fault Tolerance

34 Two-hop stretch-3 DHT Each node v has identifier h(v) –h() has sqrt(N) different values Node v has links to: –log(N)*sqrt(N) closest nodes, so one of each value w.h.p –All nodes u with h(u)=h(v) Routing from s to t in two hops: –Find node w with h(w)=h(t) –Find t Stretch: –c(s, w) + c(w, t) ≤ c(s, t) + 2c(s, t)
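The stretch bound can be spelled out with the triangle inequality. The derivation assumes that w is the node closest to s with h(w)=h(t) and that such a node is found among the links of s (which holds w.h.p. per the slide); since t itself has hash value h(t), w is no farther from s than t is.

```latex
\begin{align*}
c(s, w) &\le c(s, t) && \text{($w$ is the closest node to $s$ with $h(w) = h(t)$)} \\
c(w, t) &\le c(w, s) + c(s, t) \le 2\,c(s, t) && \text{(triangle inequality, symmetry)} \\
c(s, w) + c(w, t) &\le c(s, t) + 2\,c(s, t) = 3\,c(s, t).
\end{align*}
```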

35 Analysis: Balls Recall A_i(x) is the smallest ball around x with α B^i M nodes (e^{-α} B < 1). Suppose y in A_i(x) then: –A_i(y) in A_{i+1}(x) –A_i(x) in A_{i+1}(y) [Figure: nested balls A_i(x), A_{i+1}(x), A_i(y), A_{i+1}(y) around x and y]

36 Proof of A_i(y) in A_{i+1}(x) A_i(y) is contained in N(y, 2a_i(x)) because that ball contains A_i(x). N(y, 2a_i(x)) is contained in N(x, 4a_i(x)) by simple distances. N(x, 4a_i(x)) is contained in A_{i+1}(x) due to the growth restriction and the way we chose B. Proving A_i(x) in A_{i+1}(y) is very similar.

37 Growth of Balls Recall A_i(x) is the smallest ball around x with α B^i nodes (e^{-α} B < 1). Let a_i(x) denote the radius of A_i(x). a_{i+1}(x) ≤ maxgrow · a_i(x) and a_{i+1}(x) ≥ mingrow · a_i(x) [Figure: balls A_i(x) and A_{i+1}(x) with radii a_i(x) and a_{i+1}(x)]

38 Analysis: Distortion The initial node is x looking for Obj. x_0=s is the closest super node. w_0=t is the closest super node holding Obj. w_0, w_1, w_2, … is the sequence of nodes used to publish Obj. x_0, x_1, x_2, … is the sequence of nodes fixing the bits to reach h(Obj); node x_i has level i. x_k is the first node that has a reference to Obj published by node w_{k-1}. Need to bound the path x, x_0, x_1, x_2, …, x_k, w_{k-1}, w_{k-2}, …, w_2, w_1, w_0=t compared to c(x,t).

39 Analysis: Distortion For every i: –x_i in A_{i+1}(s) and w_i in A_{i+1}(t) –The path from s=x_0 to x_i is at most –Similarly for the path from t=w_0 to w_i If t in A_k(s) then x_k contains a reference to Obj Distortion is

40 Analysis: Expected Degree Expected number of virtual nodes emulated by a node is constant. Expected number of publish links is constant. Expected degree of regular nodes is constant. Expected degree of super-nodes is logarithmic.

41 Expected number of emulated nodes is The probability that a random node will be a neighbor link is 1/(B^{l+1} M). The probability that a neighbor link will be found inside A_{l+1}(u) is

42 Expected number of emulated nodes is Let b_{l+i} be the number of virtual nodes of level l+i. So b_l=1. E(b_{l+i} | b_{l+i-1}) = b_{l+i-1} B e^{-α}. Thus E(b_{l+i}) = E(b_{l+i-1}) B e^{-α}. By induction: E(b_{l+i}) = (B e^{-α})^i Number of virtual nodes:
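One way to finish the count, assuming the total sums the per-level expectations over all levels i ≥ 0 and using the condition e^{-α} B < 1 from the identifier setup, is a geometric series:

```latex
\begin{align*}
\sum_{i \ge 0} E(b_{l+i}) = \sum_{i \ge 0} \left(B e^{-\alpha}\right)^{i}
  = \frac{1}{1 - B e^{-\alpha}} = O(1).
\end{align*}
```

The sum is a constant that depends only on B and α, which matches the claim that a node emulates a constant expected number of virtual nodes.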

43 Expected number of publish links is The probability that a random node is a level j+1 publish link is 1/(B^j M). The probability that a random node is a level i node that emulates a level j+1 virtual node is 1/(B^i M) e^{-α(j+1-i)}. The total probability that a node is a publish link is bounded by:

44 Expected number of outgoing links is constant Expected number of publish links in A_{j+3} is: Thus, the expected number of references created during publish of an object is O(M)=O(log n).

45 Expected degree Theorem: Assuming all super-nodes are designated randomly with probability 1/M then the expected degree of all regular nodes is constant and the expected degree of super-nodes is O(M). Need to keep a 1:M ratio between regular nodes and super-nodes.

46 For every i, x_i in A_{i+1}(s) [w_i in A_{i+1}(t)] By induction. For i=0: s=x_0 in A_1(s). Assume x_{i-1} in A_i(s); if x_i is emulated by x_{i-1} then we are done. Otherwise A_i(x_{i-1}) in A_{i+1}(s). Since x_i in A_i(x_{i-1}) by construction, the induction step holds. Same proof for w_i in A_{i+1}(t).

47 The path from s=x_0 to x_i is at most By previous lemma c(x_{j-1}, x_j) ≤ 2a_{j+1}(s) Recall a_{i+1}(x) ≥ mingrow · a_i(x) so for j ≤ i, a_{j+1}(s) ≤ mingrow^{-(i-j)} a_{i+1}(s)
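Combining the two facts on this slide gives a geometric bound on the path length. The derivation below is a sketch that assumes mingrow > 1; the closed form is derived here from those two facts, so the constant may be written differently in the original analysis.

```latex
\begin{align*}
\sum_{j=1}^{i} c(x_{j-1}, x_j)
  &\le \sum_{j=1}^{i} 2\,a_{j+1}(s)
   \le 2\,a_{i+1}(s) \sum_{j=1}^{i} \mathit{mingrow}^{-(i-j)} \\
  &< 2\,a_{i+1}(s) \cdot \frac{\mathit{mingrow}}{\mathit{mingrow} - 1}.
\end{align*}
```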

48 If t in A_k(s) then x_k contains a reference to Obj t in A_k(s) so A_k(t) in A_{k+1}(s). w_{k-1} in A_k(t) so w_{k-1} in A_{k+1}(s). Thus A_{k+1}(s) in A_{k+2}(w_{k-1}). x_k in A_{k+1}(s) so x_k in A_{k+2}(w_{k-1}). Since w_{k-1} is a level k-1 node, it publishes to all level k nodes in A_{k+2}(w_{k-1}), including x_k.

49 Distortion Analysis Denote d = c(x,t) Part 1: from x to s ≤ d Part 2: from s=x_0 to x_k Part 3: from x_k to w_{k-1} Part 4: from w_{k-1} to w_0=t

50 Distortion Analysis Part 2 from s=x_0 to x_k Due to the triangle inequality c(s,t) ≤ 2d Recall: a_{i+1}(x) ≤ maxgrow · a_i(x) Since t in A_k(s) then a_{k-1}(s) < c(s,t) so a_{k+1}(s) < maxgrow^2 c(s,t)

51 Distortion Analysis Part 3 from x_k to w_{k-1} w_{k-1} in A_k(t), A_k(t) in A_{k+1}(s), x_k in A_{k+1}(s) So c(x_k, w_{k-1}) ≤ 2a_{k+1}(s) ≤ 2 maxgrow^2 c(s,t) thus

52 Distortion Analysis Part 4 from w_{k-1} to w_0=t A_k(t) in A_{k+1}(s)

A Generic Scheme for Building Overlay Networks in Adversarial Scenarios Ittai Abraham (HUJI), Baruch Awerbuch (JHU), Yossi Azar (TAU), Yair Bartal (HUJI), Dahlia Malkhi (HUJI), Elan Pavlov (HUJI)

55 Dynamic Model Suppose the set of nodes in the network is dynamically evolving. Peers in the DHT are constantly leaving and joining the system. Join: a new node wants to join the system; it initially has access to an existing node. Leave: a node departs from the system; this departure can be either graceful (performing any necessary cleanup operation) or sudden. A low-degree overlay network helps reduce overhead in the event of join/leave. This process may cause network imbalance.

56 Coping with Imbalance Solution 1 [FS 01, FSGKS 02]: Assume population is always in between n and ½ n (censorship). Solution 2 [Chord 01, CAN 01]: Execute periodic global overhaul operations for rebalancing. Problem: global operations are costly and may totally shut down the service. Impractical for large systems. Solution 3 [Pastry 01, Tapestry 01]: Assume population change is a random process that maintains the initial randomness. Problems: –DHT systems may have hot spots, and many nodes entering may use the same access node. –Failures tend to be correlated. –A malicious adversary may try to disrupt the network by causing imbalance.

57 Load Balancing Against an Adversary In this work we allow an adversary to adaptively choose: –The order of join and leave events. –For leave events: which node to remove. –For join events: what is the access node of the newly added node. Against such an adversary we employ load balancing upon arrival and departure. After each event the overlay network executes protocols for rebalancing the network.

58 Problem Statement Devise an overlay network and join leave protocols, with the following properties: –Efficient decentralized routing. –Low cost for rebalancing join and leave events against an adversary.

59 Generic Solution for Child-Neighbor Commutative Families Consider a set of graphs G_1, G_2, G_3, … with a mapping p_i from G_{i+1} onto G_i. Denote the child function c_i(u)={v | p_i(v)=u}. Denote the neighbor function n(u)={v | (u,v) ∈ E}. Child-Neighbor commutative property: for every u, n(c(u))=c(n(u)).

60 The Hypercube as an example The hypercube G_i has 2^i nodes; each node’s id is a binary string of length i. A node in G_i has links to the i nodes whose id differs from its own in exactly one bit. The child function is c(x)={x0,x1}. Example of the Child-Neighbor commutative property for node 10 in G_2: n(10)={11,00}, c(n(10))={111,110,001,000}; c(10)={100,101}, n(c(10))={000,110,101,001,111,100}.
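The example for node 10 can be checked mechanically. The sketch below uses hypothetical helper names (`neighbors`, `children`, `n_of_set`, `c_of_set`) and represents ids as bit strings.

```python
def neighbors(x):
    """Hypercube neighbors: ids that differ from x in exactly one bit."""
    flip = {"0": "1", "1": "0"}
    return {x[:i] + flip[b] + x[i + 1:] for i, b in enumerate(x)}

def children(x):
    """Child function c(x) = {x0, x1}."""
    return {x + "0", x + "1"}

def n_of_set(xs):
    """Apply n to every id in xs and take the union."""
    return set().union(*(neighbors(x) for x in xs))

def c_of_set(xs):
    """Apply c to every id in xs and take the union."""
    return set().union(*(children(x) for x in xs))

u = "10"
print(sorted(c_of_set(neighbors(u))))  # ['000', '001', '110', '111']
print(sorted(n_of_set(children(u))))   # ['000', '001', '100', '101', '110', '111']
```

In this computation n(c(10)) contains every element of c(n(10)) plus the twins 100 and 101, which appear because each twin is a hypercube neighbor of the other.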

61 The Dynamic Graph For this talk we focus on the Dynamic Hypercube (the paper also covers de Bruijn and butterfly). Start with two nodes with ids 1 and 0. Split: change the node with id x into nodes x0, x1. Merge: change two twin nodes x0, x1 into a node with id x. Not all of the nodes in n(x) exist in the network. Edges: a node x connects to all nodes whose id is a prefix of an id in n(x), or has an id in n(x) as a prefix. For example: node 110 would link to all nodes in the dynamic graph whose id is a prefix or an extension of an id in {010, 100, 111}, for instance 01, 100101, 1000, and 111.
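The split operation and the prefix-based edge rule can be sketched as follows. The function names (`split`, `links`) and the sequence of splits are assumptions for illustration; the splits mirror the example on the next slide.

```python
def neighbors(x):
    """Hypercube neighbors n(x): ids that differ from x in exactly one bit."""
    flip = {"0": "1", "1": "0"}
    return {x[:i] + flip[b] + x[i + 1:] for i, b in enumerate(x)}

def split(nodes, x):
    """Replace node x by its two children x0 and x1."""
    nodes.remove(x)
    nodes.update({x + "0", x + "1"})

def links(x, nodes):
    """Current nodes whose id is a prefix of, or extends, some id in n(x)."""
    targets = neighbors(x)
    return {
        v for v in nodes
        if v != x and any(t.startswith(v) or v.startswith(t) for t in targets)
    }

nodes = {"0", "1"}
for x in ["1", "10", "0", "11"]:  # the splits from the example slide
    split(nodes, x)

print(sorted(nodes))                # ['00', '01', '100', '101', '110', '111']
print(sorted(links("110", nodes)))  # ['01', '100', '111']
```

Node 110 reaches 01 because 01 is a prefix of its hypercube neighbor 010, which has not been created yet; the relation is symmetric, since 01 in turn links to 110 as an extension of its neighbor 11.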

62 Example of split Operations on the Dynamic Hypercube Start with 0,1. Split 1 to 10,11. Split 10 to 100,101. Split 0 to 00,01. Split 11 to 110,111. [Figure: the dynamic hypercube after each split, with node labels 0, 1, 10, 11, 00, 01, 100, 101, 110, 111]

63 Tree View of Dynamic Graphs Leaves of the tree represent current nodes. Inner nodes in the tree represent nodes that were split. Example: merge of 000, 001 into 00. [Figure: binary trees over ids 000 … 111 before and after the merge]

64 Dynamic Graphs Goal: efficient routing (logarithmic in the number of nodes). Level of a node – number of bits in its identifier (distance from the root). Global gap – difference between the smallest level and the biggest level. Local gap of a node – difference in the levels of the node’s neighbors. We show that with a logarithmic global gap, efficient (logarithmic) routing is possible. A constant local gap implies a logarithmic global gap.

65 Deterministic Balancing Strategy Goal: maintain a local gap of 1. Node addition: starting from the access node and using the network links, reach a node with the smallest level and split it. Node removal: starting from the removed node, reach a node with the biggest level and merge it with its twin. If the local gap was 1, it will remain 1. Since the global gap is log n, at most log n nodes need to be examined.

66 Randomized Balancing Strategy Goal: maintain a global gap of O(1) w.h.p. Node addition: choose log n different locations, reach them using the network routing, and split the node with the smallest level. Node removal: choose log n different locations and merge the node with the biggest level. This process is similar to throwing balls into bins by choosing log n different bins, adding into the least full bin, and removing from the most full. Requires examining log n nodes per operation. Analysis: M. Mitzenmacher, A. Richa and R. Sitaraman. The power of two random choices: a survey of techniques and results.
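The balls-into-bins analogy can be simulated directly. The sketch below models only the analogy, not the overlay itself: the bin count, the fixed seed, and the helper names (`choices`, `insert`, `remove`) are assumptions for illustration.

```python
import math
import random

def choices(bins, rng):
    """Pick about log2(n) distinct bin indices uniformly at random."""
    d = max(1, math.ceil(math.log2(len(bins))))
    return rng.sample(range(len(bins)), d)

def insert(bins, rng):
    """Add a ball to the least full of the sampled bins."""
    i = min(choices(bins, rng), key=lambda j: bins[j])
    bins[i] += 1
    return i

def remove(bins, rng):
    """Remove a ball from the most full of the sampled bins, if it has one."""
    i = max(choices(bins, rng), key=lambda j: bins[j])
    if bins[i] == 0:
        return None
    bins[i] -= 1
    return i

rng = random.Random(0)  # fixed seed so runs are repeatable
bins = [0] * 64
for _ in range(640):
    insert(bins, rng)
removed = sum(remove(bins, rng) is not None for _ in range(100))
print(sum(bins), max(bins) - min(bins))  # total load and the resulting gap
```

The printed gap between the fullest and emptiest bin plays the role of the global gap in the dynamic graph; the survey cited on the slide gives the actual analysis.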

67 In the paper Dynamic de Bruijn, and butterfly networks with logarithmic routing and constant degree. Logarithmic number of messages sent and constant number of nodes that change state per leave/join event. Generalize to any Child-Neighbor commutative family of graphs (like grids).

68 Related Work M. Naor and U. Wieder present a generic way to emulate continuous graphs by discrete graphs [MW - SPAA 03]. Rapidly mixing random walks on a dynamic de Bruijn network are used to establish probabilistic quorums in dynamic settings [A, Malkhi – DISC 03]. Independently, other de Bruijn based P2P networks: [MW], [D2B: Fraigniaud, Gauron], [Koorde: Kaashoek, Karger]

69 Open Questions Deal with multiple, parallel joins and leaves without waiting till the network rebalances. Use online analysis to find the rate in which the network remains balanced. Find ways to avoid network partitions. Find other graphs that are applicable to the generic construction (random graphs).

70 Questions? Email: ittaia@cs.huji.ac.il

71 Lookup Algorithm [Figure: lookup route x, x0, s=x1, x2, x3, x4 meeting the publish route t, w1, w2, w3]
