1 Structured P2P overlay
2 Outline
Introduction
Chord: I. Stoica, R. Morris, and D. Karger, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", SIGCOMM 2001
CAN: S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A Scalable Content-Addressable Network", SIGCOMM 2001
YAPPERS: P. Ganesan, Q. Sun, and H. Garcia-Molina, "YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology", INFOCOM 2003
Pastry: A. Rowstron and P. Druschel, "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems", 18th IFIP/ACM Int'l Conf. on Distributed Systems Platforms
3 Introduction: P2P Overlay Networks
+: costs (bandwidth, maintenance) are distributed over the peers
-: additional cost of traffic overhead
[Figure: peers form an overlay network on top of the underlying TCP/IP network]
4 Distributed Hash Tables (DHTs)
A DHT is a structured overlay that offers extreme scalability and a hash-table-like lookup interface.
The peers contain the buckets of the hash table: each peer is responsible for a set of hashed IDs.
[Figure: keys hash(23) and hash(42) mapped onto Peers A, B, and C]
5 Structured Overlay Networks: Example
Principle: index references are placed on the peers responsible for their hashed identifiers.
1. Publish: node A (the provider) indexes its object at the responsible peer B; the object reference is routed to B.
2. Routing / lookup: node C looking for the object sends a query into the network; the query is routed to the responsible node B, which replies to C with the contact information of A (the link to the object).
3. P2P communication: C retrieves the object directly from A.
6 Peer Neighborhood
Routing table (who do I know): PeerA: Peer-ID, IP address; PeerB: Peer-ID, IP address; PeerC: Peer-ID, IP address
Content (what do I have): File "abc", File "xyz", ...
Distributed index (what do I know): File "foo": IP address; File "bar": IP address; ...
[Figure: Peers A, B, and C, each holding these three tables]
7 Two important issues
Load balancing
Preserving neighbor-table consistency
8 Chord A Scalable Peer-to-peer Lookup Service for Internet Applications Prepared by Ali Yildiz
9 What is Chord?
In short: a peer-to-peer lookup system. Given a key (data item), it maps the key onto a node (peer). Uses consistent hashing to assign keys to nodes. Solves the problem of locating a key in a collection of distributed nodes. Maintains routing information as nodes join and leave the system.
10 What is Chord? – Addressed Problems
Load balance: a distributed hash function spreads keys evenly over the nodes.
Decentralization: Chord is fully distributed, no node is more important than any other; this improves robustness.
Scalability: lookup cost grows logarithmically with the number of nodes, so even very large systems are feasible.
Availability: Chord automatically adjusts its internal tables to ensure that the node responsible for a key can always be found.
11 What is Chord? – Example Application
The highest layer provides a file-like interface to the user, including user-friendly naming and authentication. This file system maps operations to lower-level block operations. Block storage uses Chord to identify the node responsible for storing a block, and then talks to the block storage server on that node.
[Figure: client and server stacks, each layered as File System over Block Store over Chord]
12 Consistent Hashing
The consistent hash function assigns each node and each key an m-bit identifier, using SHA-1 as a base hash function. A node's identifier is defined by hashing the node's IP address. A key identifier is produced by hashing the key (Chord does not define this; it depends on the application).
ID(node) = hash(IP, Port)
ID(key) = hash(key)
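A minimal Python sketch of this identifier assignment, assuming SHA-1 output truncated to m bits; the function name chord_id and the example address are illustrative, not part of Chord's specification.

```python
import hashlib

M = 6  # identifier-space size in bits (2**M positions on the ring)

def chord_id(value: str, m: int = M) -> int:
    """Hash an arbitrary string (an "IP:port" or a key) onto the m-bit Chord ring."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

node_id = chord_id("192.0.2.1:4000")   # ID(node) = hash(IP, Port)
key_id  = chord_id("some-file-name")   # ID(key)  = hash(key)
```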
13 Consistent Hashing
In an m-bit identifier space there are 2^m identifiers. Identifiers are ordered on an identifier circle modulo 2^m; this identifier ring is called the Chord ring. Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space. This node is the successor node of key k, denoted successor(k).
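The successor rule can be illustrated with a small sketch. The successor() helper below is hypothetical and assumes global knowledge of all node IDs; it only demonstrates the assignment rule, not how Chord actually finds the node.

```python
import bisect

def successor(k: int, node_ids: list[int], m: int = 3) -> int:
    """First node ID that is equal to or follows k on the 2**m ring (wrapping to the smallest ID)."""
    ids = sorted(node_ids)
    i = bisect.bisect_left(ids, k % (2 ** m))
    return ids[i] if i < len(ids) else ids[0]

# The 3-bit example from the next slide: nodes 0, 1, 3 and keys 1, 2, 6.
assert successor(1, [0, 1, 3]) == 1
assert successor(2, [0, 1, 3]) == 3
assert successor(6, [0, 1, 3]) == 0
```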
14 Consistent Hashing – Successor Nodes
[Figure: 3-bit identifier circle (identifiers 0-7) with nodes 0, 1, and 3 and keys 1, 2, and 6]
successor(1) = 1, successor(2) = 3, successor(6) = 0
15 Consistent Hashing For m = 6, # of identifiers is 64. The following Chord ring has 10 nodes and stores 5 keys. The successor of key 10 is node 14.
16 Consistent Hashing – Join and Departure When a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n. When node n leaves the network, all of its assigned keys are reassigned to n’s successor.
17 Consistent Hashing – Node Join
[Figure: a node joins the 3-bit identifier circle; the keys it is now responsible for move from its successor to it]
18 Consistent Hashing – Node Departure
[Figure: a node leaves the 3-bit identifier circle; its keys are reassigned to its successor]
19 Consistent Hashing
When node 26 joins the network: [Figure: the m = 6 ring from the earlier example after node 26 joins]
20 A Simple Key Lookup
A very small amount of routing information suffices to implement consistent hashing in a distributed environment. If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order: queries for a given identifier are passed around the circle via these successor pointers until they encounter the node that contains the key.
21 A Simple Key Lookup
Pseudocode for finding a successor:
// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
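A runnable Python sketch of this linear lookup, using a hypothetical Node class that holds only a successor pointer; the in_half_open helper implements the wrap-around interval test on the identifier circle.

```python
def in_half_open(x: int, a: int, b: int, m: int) -> bool:
    """True if x lies in the ring interval (a, b] modulo 2**m (the whole ring when a == b)."""
    size = 2 ** m
    if a == b:
        return True
    return 0 < (x - a) % size <= (b - a) % size

class Node:
    """Minimal Chord node with only a successor pointer: lookups take O(N) hops."""
    def __init__(self, node_id: int, m: int = 6):
        self.id = node_id
        self.m = m
        self.successor: "Node" = self  # a single-node ring points at itself

    def find_successor(self, key_id: int) -> "Node":
        # Walk the ring one successor pointer at a time until the key's interval is found.
        if in_half_open(key_id, self.id, self.successor.id, self.m):
            return self.successor
        return self.successor.find_successor(key_id)
```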
22 A Simple Key Lookup The path taken by a query from node 8 for key 54:
23 Scalable Key Location
To accelerate lookups, Chord maintains additional routing information: finger tables. Each node n maintains a routing table with up to m entries (m is the number of bits in the identifiers), called the finger table. The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle: s = successor(n + 2^(i-1)). s is called the i-th finger of node n, denoted n.finger(i).
24 Scalable Key Location – Finger Tables
[Figure: 3-bit ring with nodes 0, 1, and 3; keys 1, 2, and 6 are stored at nodes 1, 3, and 0 respectively]
Finger table of node 0: start = 0+2^0, 0+2^1, 0+2^2 = 1, 2, 4; successors = 1, 3, 0
Finger table of node 1: start = 1+2^0, 1+2^1, 1+2^2 = 2, 3, 5; successors = 3, 3, 0
Finger table of node 3: start = 3+2^0, 3+2^1, 3+2^2 = 4, 5, 7; successors = 0, 0, 0
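A short sketch that rebuilds these tables from the definition s = successor(n + 2^(i-1)), reusing the successor() helper sketched earlier; the assertions reproduce the three finger tables above.

```python
def build_finger_table(n: int, node_ids: list[int], m: int = 3) -> list[int]:
    """finger[i] = successor(n + 2**(i-1)) for i = 1..m (1-indexed, as on the slides)."""
    return [successor((n + 2 ** (i - 1)) % (2 ** m), node_ids, m) for i in range(1, m + 1)]

assert build_finger_table(0, [0, 1, 3]) == [1, 3, 0]
assert build_finger_table(1, [0, 1, 3]) == [3, 3, 0]
assert build_finger_table(3, [0, 1, 3]) == [0, 0, 0]
```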
25 Scalable Key Location – Finger Tables A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node. The first finger of n is the immediate successor of n on the circle.
26 Scalable Key Location – Example query
[Figure: the path taken by a query for key 54 starting at node 8, now using finger tables]
27 Scalable Key Location – A characteristic
Since each node has finger entries at power-of-two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between itself and the target identifier. From this intuition follows a theorem:
Theorem: with high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
28 Node Joins and Stabilizations
The most important invariant is the successor pointer. If the successor pointers are kept up to date, which is sufficient to guarantee correctness of lookups, then the finger tables can always be verified and repaired. Each node therefore runs a "stabilization" protocol periodically in the background to update its successor pointer and finger table.
29 Node Joins and Stabilizations “Stabilization” protocol contains 6 functions: create() join() stabilize() notify() fix_fingers() check_predecessor()
30 Node Joins – join() When node n first starts, it calls n.join(n’), where n’ is any known Chord node. The join() function asks n’ to find the immediate successor of n. join() does not make the rest of the network aware of n.
31 Node Joins – join()
// create a new Chord ring.
n.create()
  predecessor = nil;
  successor = n;

// join a Chord ring containing node n'.
n.join(n')
  predecessor = nil;
  successor = n'.find_successor(n);
32 Scalable Key Location – find_successor()
Pseudocode:
// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    n' = closest_preceding_node(id);
    return n'.find_successor(id);

// search the local table for the highest predecessor of id
n.closest_preceding_node(id)
  for i = m downto 1
    if (finger[i] ∈ (n, id))
      return finger[i];
  return n;
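A Python sketch of the scalable lookup, extending the hypothetical Node class from the earlier sketch with a finger table; the fallback to the plain successor walk when no closer finger is found is an illustrative simplification of the pseudocode above.

```python
def in_open(x: int, a: int, b: int, m: int) -> bool:
    """True if x lies strictly inside the ring interval (a, b) modulo 2**m."""
    size = 2 ** m
    if a == b:
        return x != a
    return 0 < (x - a) % size < (b - a) % size

class FingerNode(Node):
    """Extends the linear-lookup Node sketch with an m-entry finger table."""
    def __init__(self, node_id: int, m: int = 6):
        super().__init__(node_id, m)
        self.finger: list["FingerNode"] = [self] * m

    def closest_preceding_node(self, key_id: int) -> "FingerNode":
        # Scan the fingers from farthest to nearest for one strictly between n and the key.
        for f in reversed(self.finger):
            if in_open(f.id, self.id, key_id, self.m):
                return f
        return self

    def find_successor(self, key_id: int) -> "FingerNode":
        if in_half_open(key_id, self.id, self.successor.id, self.m):
            return self.successor
        n_prime = self.closest_preceding_node(key_id)
        if n_prime is self:  # no closer finger known; fall back to the plain successor walk
            return self.successor.find_successor(key_id)
        return n_prime.find_successor(key_id)
```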
33 Node Joins – stabilize() stabilize() notifies node n’s successor of n’s existence, giving the successor the chance to change its predecessor to n. The successor does this only if it knows of no closer predecessor than n.
34 Node Joins – Join and Stabilization
Initially succ(n_p) = n_s and pred(n_s) = n_p; node n joins between n_p and n_s.
n joins: predecessor = nil; n acquires n_s as its successor via some n'.
n runs stabilize: n notifies n_s that it is the new predecessor; n_s acquires n as its predecessor.
n_p runs stabilize: n_p asks n_s for its predecessor (now n); n_p acquires n as its successor; n_p notifies n, and n acquires n_p as its predecessor.
All predecessor and successor pointers are now correct (pred(n_s) = n, succ(n_p) = n). Fingers still need to be fixed, but old fingers will still work.
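A sketch of join(), stabilize(), and notify() matching the sequence above, building on the FingerNode sketch (and assuming every node in the ring is a StabilizingNode); the method names follow the slides, but the code itself is illustrative.

```python
class StabilizingNode(FingerNode):
    """Finger-table node with a predecessor pointer and the periodic stabilization calls."""
    def __init__(self, node_id: int, m: int = 6):
        super().__init__(node_id, m)
        self.predecessor: "StabilizingNode | None" = None

    def join(self, n_prime: "StabilizingNode") -> None:
        # Ask any known node n' for our successor; the rest of the ring learns about us later.
        self.predecessor = None
        self.successor = n_prime.find_successor(self.id)

    def stabilize(self) -> None:
        # Ask the successor for its predecessor; adopt it if it sits between us and the successor.
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id, self.m):
            self.successor = x
        self.successor.notify(self)

    def notify(self, n_prime: "StabilizingNode") -> None:
        # n_prime believes it may be our predecessor; accept it if it is closer than the current one.
        if self.predecessor is None or in_open(n_prime.id, self.predecessor.id, self.id, self.m):
            self.predecessor = n_prime
```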
35 Node Joins – fix_fingers()
Each node periodically calls fix_fingers() to make sure its finger table entries are correct. This is how new nodes initialize their finger tables, and how existing nodes incorporate new nodes into their finger tables.
36 Node Joins – fix_fingers()
// called periodically; refreshes finger table entries.
n.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(n + 2^(next-1));

// checks whether predecessor has failed.
n.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;
37 Node Failures
The key step in failure recovery is maintaining correct successor pointers. To help achieve this, each node maintains a successor list of its r nearest successors on the ring. Successor lists are stabilized as follows: node n reconciles its list with its successor s by copying s's successor list, removing its last entry, and prepending s to it. If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
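The reconciliation rule (copy the successor's list, drop its last entry, prepend the successor) fits in one line; the node IDs in the example below are hypothetical.

```python
def reconcile_successor_list(s_id: int, s_list: list[int], r: int) -> list[int]:
    """Node n's new successor list: its successor s prepended to s's list, truncated to r entries."""
    return ([s_id] + s_list[:-1])[:r]

# Example: successor 14 reports the list [21, 32, 38]; with r = 3, node n keeps [14, 21, 32].
assert reconcile_successor_list(14, [21, 32, 38], 3) == [14, 21, 32]
```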
38 Chord – The Math
Every node is responsible for about K/N keys (N nodes, K keys). When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from the joining or leaving node). Lookups need O(log N) messages. To re-establish routing invariants and finger tables after a node joins or leaves, only O(log^2 N) messages are required.
39 A Scalable, Content-Addressable Network
Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker (ACIRI, U.C. Berkeley, Tahoe Networks)
40 Content-Addressable Network (CAN)
CAN: an Internet-scale hash table
Interface: insert(key,value); value = retrieve(key)
Properties: scalable, operationally simple, good performance
41-45 CAN: basic idea
[Animation: a hash table of (K,V) pairs is spread across the peer nodes; an insert(K1,V1) request is routed through the overlay and stored at the responsible node, and a later retrieve(K1) is routed to that same node]
46 CAN: solution
Virtual Cartesian coordinate space; the entire space is partitioned amongst all the nodes, and every node "owns" a zone in the overall space.
Abstraction: CAN stores data at "points" in the space and routes from one "point" to another; a point corresponds to the node that owns the enclosing zone.
47-51 CAN: simple example
[Animation: nodes 1, 2, 3, and 4 join one after another, each taking ownership of a zone, until the 2-d coordinate space is partitioned among all the nodes]
52-58 CAN: simple example
node I::insert(K,V):
  (1) a = h_x(K), b = h_y(K)
  (2) route (K,V) toward the point (a,b)
  (3) the node owning the zone containing (a,b) stores (K,V)
node J::retrieve(K):
  (1) a = h_x(K), b = h_y(K)
  (2) route "retrieve(K)" to (a,b); the request reaches the node storing (K,V)
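A sketch of the two hash functions h_x and h_y that map a key to a point, assuming a unit-square coordinate space and SHA-1 with two different salts; the "x:"/"y:" prefixes and the function name are an arbitrary illustrative choice.

```python
import hashlib

def point_for_key(key: str) -> tuple[float, float]:
    """(a, b) = (h_x(K), h_y(K)): map a key to a point in the unit square."""
    hx = int.from_bytes(hashlib.sha1(b"x:" + key.encode()).digest(), "big")
    hy = int.from_bytes(hashlib.sha1(b"y:" + key.encode()).digest(), "big")
    return hx / 2 ** 160, hy / 2 ** 160  # SHA-1 digests are 160 bits

# insert(K,V): route (K,V) toward point_for_key(K); the node owning that zone stores the pair.
# retrieve(K): the same hashes yield the same point, so the request reaches the same node.
```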
59 Data stored in the CAN is addressed by name (i.e., key), not location (i.e., IP address).
60 CAN: routing table
61 CAN: routing
[Figure: a message is routed from the node at (x,y) toward the destination point (a,b) through intermediate zones]
62 CAN: routing
A node only maintains state for its immediate neighboring nodes.
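A minimal 2-d CAN sketch to make the idea concrete: each node owns a rectangular zone, knows only its neighbors, and forwards greedily toward the destination point. The CanNode class, the zone representation, and the use of zone centers in the distance test are simplifying assumptions; insert/retrieve reuse the point_for_key helper sketched above.

```python
import math

class CanNode:
    """Hypothetical 2-d CAN node: owns a rectangular zone and knows only its immediate neighbors."""
    def __init__(self, x_lo: float, x_hi: float, y_lo: float, y_hi: float):
        self.zone = (x_lo, x_hi, y_lo, y_hi)
        self.neighbors: list["CanNode"] = []
        self.store: dict = {}

    def owns(self, p: tuple[float, float]) -> bool:
        x, y = p
        x_lo, x_hi, y_lo, y_hi = self.zone
        return x_lo <= x < x_hi and y_lo <= y < y_hi

    def center(self) -> tuple[float, float]:
        x_lo, x_hi, y_lo, y_hi = self.zone
        return ((x_lo + x_hi) / 2, (y_lo + y_hi) / 2)

    def route(self, p: tuple[float, float]) -> "CanNode":
        # Greedy forwarding: hand the message to the neighbor whose zone center is closest to p.
        if self.owns(p):
            return self
        nxt = min(self.neighbors, key=lambda nb: math.dist(nb.center(), p))
        return nxt.route(p)

def insert(entry: CanNode, key: str, value) -> None:
    entry.route(point_for_key(key)).store[key] = value   # point_for_key from the earlier sketch

def retrieve(entry: CanNode, key: str):
    return entry.route(point_for_key(key)).store.get(key)
```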
63-67 CAN: node insertion
1) The new node contacts a bootstrap node and discovers some node "I" already in the CAN.
2) The new node picks a random point (p,q) in the coordinate space.
3) I routes to (p,q) and discovers node J, the current owner of that point.
4) J's zone is split in half; the new node owns one half.
68 CAN: node insertion
Inserting a new node affects only a single other node and its immediate neighbors.
69 CAN: node failures
Need to repair the space:
recover the database: soft-state updates; use replication and rebuild the database from replicas
repair routing: takeover algorithm
70 CAN: takeover algorithm
Simple failures: know your neighbors' neighbors; when a node fails, one of its neighbors takes over its zone.
More complex failure modes: simultaneous failure of multiple adjacent nodes; use scoped flooding to discover neighbors; hopefully a rare event.
71 CAN: node failures
Only the failed node's immediate neighbors are required for recovery.
72 Evaluation Scalability Low-latency Load balancing Robustness
73 CAN: scalability
For a uniformly partitioned space with n nodes and d dimensions, per node: the number of neighbors is 2d, and the average routing path is (d/4)·n^(1/d) hops. Simulations show that these results hold in practice. The network can therefore be scaled without increasing per-node state (compare Chord: log(n) neighbors with log(n) hops).
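A quick check of these formulas with illustrative network sizes:

```python
def can_cost(n: int, d: int) -> tuple[int, float]:
    """Per-node neighbor count (2d) and average routing path length (d/4) * n**(1/d)."""
    return 2 * d, (d / 4) * n ** (1 / d)

print(can_cost(65_536, 2))   # (4, 128.0): 64K nodes in 2 dimensions
print(can_cost(65_536, 10))  # (20, ~7.6): more dimensions -> more state, far fewer hops
```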
74 CAN: low-latency
Problem: latency stretch = (CAN routing delay) / (IP routing delay); application-level routing may lead to high stretch.
Solution: increase the number of dimensions; heuristics such as RTT-weighted routing and multiple nodes per zone (peer nodes).
75 CAN: low-latency
[Plot: latency stretch vs. number of nodes (16K-131K), #dimensions = 2, with and without heuristics]
76 CAN: low-latency
[Plot: latency stretch vs. number of nodes (16K-131K), #dimensions = 10, with and without heuristics]
77 CAN: load balancing
Two pieces:
Dealing with hot-spots (popular (key,value) pairs): nodes cache recently requested entries; an overloaded node replicates popular entries at its neighbors.
Uniform coordinate-space partitioning: uniformly spreads the (key,value) entries and uniformly spreads out the routing load.
78 Uniform Partitioning
Added check at join time: pick a zone, check the neighboring zones, and split the largest of these zones.
79-82 Routing resilience
[Animation: a message is routed from a source node to a destination node across the coordinate space]
83 Routing resilience
Node X::route(D):
  if X cannot make progress toward D:
    check whether any neighbor of X can make progress
    if yes, forward the message to one such neighbor
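A sketch of this recovery rule on top of the CanNode class above; the failed attribute and the give-up behaviour when no live neighbor makes progress are simplifying assumptions.

```python
import math

def route_resilient(node: "CanNode", p: tuple[float, float]) -> "CanNode | None":
    """Greedy CAN routing that skips failed neighbors (extends the CanNode sketch above)."""
    if node.owns(p):
        return node
    my_dist = math.dist(node.center(), p)
    alive = [nb for nb in node.neighbors if not getattr(nb, "failed", False)]
    # Forward to a live neighbor that makes progress toward p; otherwise give up at this hop.
    closer = [nb for nb in alive if math.dist(nb.center(), p) < my_dist]
    if not closer:
        return None  # a full CAN would invoke a further recovery mechanism here (not shown)
    return route_resilient(min(closer, key=lambda nb: math.dist(nb.center(), p)), p)
```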
84 Routing resilience
[Animation: the message is forwarded via a neighbor that can still make progress]
85 Routing resilience
[Plot: Pr(successful routing) vs. number of dimensions; CAN size = 16K nodes, Pr(node failure) = 0.25]
86 Routing resilience
[Plot: Pr(successful routing) vs. Pr(node failure); CAN size = 16K nodes, #dimensions = 10]
87 Summary
CAN: an Internet-scale hash table
Scalability: O(d) per-node state
Low-latency routing: simple heuristics help a lot
Robustness: decentralized, can route around trouble
88 YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology
Qixiang Sun, Prasanna Ganesan, Hector Garcia-Molina (Stanford University)
89 Problem Where is X?
90 Problem (2)
1. Search  2. Node join/leave  3. Register/remove content
[Figure: nodes A, B, and C in the overlay]
91 Background: Gnutella-style
join: anywhere in the overlay
register: do nothing
search: flood the overlay
92 Background (2): Distributed hash table (DHT), e.g., Chord or CAN
join: a unique location in the overlay
register: place a pointer at a unique node
search: route towards that unique node
93 Background (3)
Gnutella-style: + simple, + local control, + robust, + arbitrary topology; - inefficient, - disturbs many nodes
DHT: + efficient search; - restricted overlay, - difficulty with dynamism
94 Motivation Best of both worlds Gnutella’s local interactions DHT-like efficiency
95-96 Partition Nodes
Given any overlay, first partition the nodes into buckets (colors) based on a hash of each node's IP address.
[Animation: the nodes of an arbitrary overlay are assigned colors]
97 Partition Nodes (2)
Around each node there is at least one node of each color; this may require backup color assignments.
[Figure: nodes X and Y and their surrounding colored neighborhoods]
98 Register Content
Partition the content space into buckets (colors) and register pointers at "nearby" nodes: node Z registers red content locally and registers blue content at a nearby blue node. The nodes around Z form a small hash table!
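A sketch of the color assignment and the registration choice, assuming a fixed number of buckets and hash-based colors; NUM_COLORS, the function names, and the "pick the first match" rule are illustrative, and the backup assignment is not modeled.

```python
import hashlib

NUM_COLORS = 4  # hypothetical number of buckets

def color_of(value: str) -> int:
    """Bucket (color) of a node's IP or of a content key: hash modulo the number of colors."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big") % NUM_COLORS

def pick_register_target(content_key: str, my_ip: str, neighborhood_ips: list[str]) -> str:
    """Choose a nearby node of the key's color at which to register the content pointer."""
    target = color_of(content_key)
    candidates = [ip for ip in [my_ip] + neighborhood_ips if color_of(ip) == target]
    if not candidates:
        return my_ip  # stand-in for YAPPERS' backup color assignment (not modeled here)
    return candidates[0]  # "pick any one" of the matching nodes; here simply the first
```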
99 Searching Content
Start at a "nearby" node of the query's color, then search the other nodes of the same color.
[Figure: nodes U, V, W, X, Y, and Z]
100 Searching Content (2)
This gives a smaller overlay for each color, searched with a Gnutella-style flood.
Fan-out = degree of the nodes in the smaller overlay
101 Recap
node join: anywhere in the overlay
register: content at nearby node(s) of the appropriate color
search: start at a nearby node of the search color, then flood the nodes of the same color
102 Design Issues How to build a small hash table around each node, i.e., assign colors? How to connect nodes of the same color?
103 Neighbors
IN(A): immediate neighborhood of node A
EN(A): extended neighborhood of node A
F(A): frontier of node A
104 Small-scale Hash Table
Small = all nodes within h hops (e.g., h = 2: the immediate neighborhood IN)
Consistent across overlapping hash tables; stable when nodes enter/leave
[Figure: nodes A, B, C, and X with overlapping neighborhoods]
105 Small-scale Hash Table (2)
Fixed number of buckets (colors); a node's bucket (color) is determined by the hash of its IP address.
Multiple nodes of the same color: pick any one of them to store the key.
No node of a color: backup assignment.
106 Searching the Overlay
Find another node of the same color in a "nearby" hash table (all nodes within h hops); to do this, a node needs to track all nodes within 2h+1 hops.
[Figure: nodes A, B, C and a frontier node]
107 Searching the Overlay (2)
For a color C and each frontier node v: 1. determine which nodes v might contact to search for color C; 2. contact these nodes.
Theorem: regardless of the starting node, one can search all nodes of any given color.
108 Recap
Hybrid approach: around each node, act like a hash table; flood the relevant nodes in the entire network.
What do we gain? Respect the original overlay; efficient search for popular data; avoid disturbing nodes unnecessarily.
109 Brief Evaluation
Using a 24,702-node Gnutella snapshot as the underlying overlay, we study:
the number of nodes contacted per query when searching the entire network
the trade-off of the hybrid approach when flooding the entire network
110 Nodes Searched per Query Limited by the number of nodes “nearby”
111 Trade-off
Fan-out = degree of each colored node when flooding "nearby" nodes of the same color.
Average fan-out: vanilla: 835; with heuristics: 82.
Good for searching nearby nodes quickly; bad for searching the entire network.
112 Overloading a Node
A node may have many colors even if it has a large neighborhood.
[Figure: nodes A and X]
113 Enhancements
Prune fringe nodes (low-connectivity nodes).
Biased backup node assignment: node X can assign a backup color to node Y if and only if α·|IN(X)| > |IN(Y)|, where α controls the relative sizes of the neighborhoods. This forbids a node with a small immediate neighborhood from assigning backup colors to a node with a large immediate neighborhood.
114 Conclusion
Does YAPPERS work?
YES: respects the original overlay; searches efficiently in a small area; disturbs fewer nodes than Gnutella; handles node arrival/departure better than a DHT.
NO: large fan-out (vanilla flooding won't work).
115 For More Information
A short position paper advocating locally-organized P2P systems: http://dbpubs.stanford.edu/pub/2002-60
Other P2P work at Stanford: http://www-db.stanford.edu/peers