Peer-to-Peer and Distributed Hash Tables CS 271
Distributed Hash Tables Challenge: design and implement a robust and scalable distributed system composed of inexpensive, individually unreliable computers in unrelated administrative domains. (Partial thanks to Idit Keidar.)
Searching for distributed data Goal: make billions of objects (e.g., music files) available to millions of concurrent users. We need a distributed data structure to keep track of objects on different sites, mapping objects to locations. Basic operations: Insert(key), Lookup(key).
Searching [Diagram: a client issues Lookup("title") across the Internet to nodes N1-N6; a publisher stores Key="title", Value=MP3 data... There are thousands of nodes, and the set of nodes may change.]
Simple Solution: First There Was Napster Centralized server/database for lookup; only the file sharing is peer-to-peer, the lookup is not. Launched in 1999, peaked at 1.5 million simultaneous users, and shut down in July 2001. Pros: low overhead, no duplicates, the server knows all nodes that have a certain file, 100% lookup success. Cons: single point of failure; load on the central DB; requires heavy infrastructure; if the central DB is partitioned, the success rate is no longer 100%.
Napster: Publish [Diagram: a peer at 123.2.21.23 announces "I have X, Y, and Z!"; the central server records insert(X, 123.2.21.23), ...]
Napster: Search [Diagram: a client queries the central server "Where is file A?"; the server replies search(A) --> 123.2.0.18; the client then fetches the file directly from 123.2.0.18.]
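A minimal sketch of a Napster-style centralized directory (the class and method names below are illustrative, not Napster's actual protocol): the server keeps an in-memory map from keys to the peers that published them; publish and lookup are plain client-server calls, and only the fetch itself is peer-to-peer.

    # Hypothetical sketch of a centralized lookup directory (Napster-style).
    from collections import defaultdict

    class CentralDirectory:
        def __init__(self):
            # key -> set of peer addresses that have published it
            self.index = defaultdict(set)

        def publish(self, key, peer_addr):
            # A peer announces "I have <key>" to the central server.
            self.index[key].add(peer_addr)

        def lookup(self, key):
            # Returns every peer known to hold the key (empty list if none).
            return list(self.index.get(key, ()))

    directory = CentralDirectory()
    directory.publish("file A", "123.2.0.18")
    print(directory.lookup("file A"))   # ['123.2.0.18'] -- client fetches from that peer

The sketch makes the tradeoff visible: lookups are one round trip and always succeed while the server is up, but the directory is a single point of failure and carries all the load.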
Overlay Networks A virtual structure imposed over the physical network (e.g., the Internet): a graph with hosts as nodes and some chosen edges. A node does not need to know all other nodes. [Diagram: a hash function maps both keys and node ids into the same identifier space of the overlay network.]
Unstructured Approach: Gnutella Build a decentralized, unstructured overlay. Each node has several neighbors and holds several keys in its local database. When asked to find a key X: check the local database; if X is known, return it; if not, ask your neighbors (flooding). Use a limiting threshold for propagation; in practice, limit the TTL of the flood (see the sketch below).
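A minimal sketch of TTL-limited flooding over an unstructured overlay, assuming a toy in-memory graph (the Node class and its fields are illustrative, not Gnutella's wire protocol): a query is answered locally if possible, otherwise forwarded to neighbors until the TTL runs out.

    # Illustrative sketch of Gnutella-style flooding with a TTL limit.
    class Node:
        def __init__(self, name):
            self.name = name
            self.neighbors = []      # other Node objects in the overlay
            self.local_keys = set()  # keys held in the local database

        def search(self, key, ttl, visited=None):
            """Return the name of a node holding `key`, or None."""
            visited = visited if visited is not None else set()
            if self.name in visited:
                return None           # already asked; avoid re-flooding
            visited.add(self.name)
            if key in self.local_keys:
                return self.name      # found in the local database
            if ttl == 0:
                return None           # flood horizon reached
            for nbr in self.neighbors:
                hit = nbr.search(key, ttl - 1, visited)
                if hit is not None:
                    return hit        # reply travels back along the query path
            return None

    # Tiny overlay: a - b - c, where c holds "file A".
    a, b, c = Node("a"), Node("b"), Node("c")
    a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
    c.local_keys.add("file A")
    print(a.search("file A", ttl=2))   # 'c'
    print(a.search("file A", ttl=1))   # None -- TTL too small to reach c

Real Gnutella floods queries to all neighbors in parallel; the sequential traversal above is only to keep the sketch short, but the TTL cutoff and reverse-path reply are the same idea.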
Gnutella: Search [Diagram: the query "Where is file A?" floods from neighbor to neighbor; the node that has file A sends the reply back along the query path.]
Structured vs. Unstructured The examples described so far are unstructured: there is no systematic rule for how edges are chosen; each node "knows some" other nodes. Any node can store any data, so the searched-for data might reside at any node. Structured overlay: the edges are chosen according to some rule, data is stored at a pre-defined place, and tables define the next hop for lookup.
Hashing A data structure supporting the operations void insert(key, item) and item search(key). The implementation uses a hash function to map keys to array cells. Expected search time is O(1), provided that there are few collisions. A small sketch follows.
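A minimal sketch of the single-machine version, assuming separate chaining for collisions (the bucket count and names are arbitrary): the hash function maps a key to an array cell, giving expected O(1) insert and search.

    # Minimal chained hash table: hash(key) picks the bucket (array cell).
    class HashTable:
        def __init__(self, num_buckets=64):
            self.buckets = [[] for _ in range(num_buckets)]

        def _cell(self, key):
            return hash(key) % len(self.buckets)

        def insert(self, key, item):
            self.buckets[self._cell(key)].append((key, item))

        def search(self, key):
            for k, item in self.buckets[self._cell(key)]:
                if k == key:
                    return item       # expected O(1) with few collisions
            return None

    table = HashTable()
    table.insert("title", "MP3 data...")
    print(table.search("title"))      # 'MP3 data...'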
Distributed Hash Tables (DHTs) Nodes store the table entries; lookup(key) returns the location of the node currently responsible for that key. Same spec as a hash table, but now distributed. We will mainly discuss Chord [Stoica, Morris, Karger, Kaashoek, and Balakrishnan, SIGCOMM 2001]. Other examples: CAN (Berkeley), Tapestry (Berkeley), Pastry (Microsoft Cambridge), etc.
CAN [Ratnasamy et al.] Map nodes and keys to coordinates in a multi-dimensional Cartesian space; each node owns a zone of the space. Route greedily along the shortest Euclidean path from the source toward the key's coordinates. For d dimensions, routing takes O(d*n^(1/d)) hops.
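A minimal sketch of CAN's greedy step, assuming each node simply knows its neighbors' zone centers (the function names and the 2-D toy coordinates are illustrative): at every hop the query is forwarded to the neighbor whose coordinates are closest, in Euclidean distance, to the point the key hashes to.

    # Illustrative greedy next-hop choice in a CAN-like coordinate space.
    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def next_hop(my_coord, neighbor_coords, key_point):
        """Forward toward the neighbor closest to the key's point,
        or stop (return None) if this node is already the closest."""
        best = min(neighbor_coords, key=lambda c: euclidean(c, key_point))
        if euclidean(my_coord, key_point) <= euclidean(best, key_point):
            return None       # this node's zone is (locally) closest to the key
        return best

    # Node at (0.2, 0.2) with two neighbors, routing toward key point (0.9, 0.8).
    print(next_hop((0.2, 0.2), [(0.6, 0.3), (0.1, 0.6)], (0.9, 0.8)))  # (0.6, 0.3)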
Chord Logical Structure (MIT) m-bit ID space (2^m IDs), usually m = 160. Nodes are organized in a logical ring according to their IDs. [Diagram: ring with nodes N1, N8, N10, N14, N21, N30, N38, N42, N48, N51, N56.]
DHT: Consistent Hashing Keys and nodes hash into the same circular ID space. A key is stored at its successor: the node with the next higher ID. [Diagram: circular ID space with nodes N32, N90, N105 and keys K5, K20, K80; each key sits at the next node clockwise, e.g., K80 at N90. Animation thanks to CMU.]
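A minimal sketch of the successor rule, assuming node and key IDs have already been hashed into the same space (the function names are illustrative): keep the node IDs sorted and binary-search for the first node ID at or after the key, wrapping around the ring.

    # Consistent hashing: a key lives on its successor node on the ring.
    import bisect

    def successor(sorted_node_ids, key_id):
        """First node whose ID is >= key_id, wrapping around the circle."""
        i = bisect.bisect_left(sorted_node_ids, key_id)
        return sorted_node_ids[i % len(sorted_node_ids)]

    nodes = sorted([32, 90, 105])
    print(successor(nodes, 80))    # 90  -> K80 is stored at N90
    print(successor(nodes, 120))   # 32  -> wraps around past the largest ID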
Consistent Hashing Guarantees For any set of N nodes and K keys: a node is responsible for at most (1 + ε)K/N keys; when an (N + 1)st node joins or leaves, responsibility for O(K/N) keys changes hands.
DHT: Chord Basic Lookup Each node knows only its successor; a query for key 80 is routed around the circle, one node at a time, until it reaches the node responsible for K80 ("N90 has K80"). Each hop just needs to make progress without overshooting. Initialization and robustness are discussed later; the next question is speed. [Diagram: ring with N10, N32, N60, N90, N105, N120; the lookup "Where is key 80?" starts at N10 and walks node to node until N90, which holds K80.]
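A minimal sketch of this baseline, assuming each node record holds only its own ID and its successor's ID (a toy ring, not Chord's real messages): the lookup hops successor to successor until the key falls between a node and its successor, so it takes O(N) hops in the worst case.

    # Baseline Chord lookup when each node knows only its successor: O(N) hops.
    def in_interval(x, left, right):
        """True if x lies in the half-open ring interval (left, right]."""
        if left < right:
            return left < x <= right
        return x > left or x <= right   # the interval wraps around 0

    def lookup(succ, start, key_id):
        """succ maps node id -> successor id; returns (owner, hop count)."""
        node, hops = start, 0
        while not in_interval(key_id, node, succ[node]):
            node = succ[node]           # make progress, never overshoot
            hops += 1
        return succ[node], hops

    ring = {10: 32, 32: 60, 60: 90, 90: 105, 105: 120, 120: 10}
    print(lookup(ring, 10, 80))   # (90, 2): N10 -> N32 -> N60, owner is N90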
DHT: Chord "Finger Table" Small tables, but multi-hop lookup. Table entries hold an IP address and a Chord ID; they let a node navigate in ID space and route queries closer to the key's successor. Entry i in the finger table of node n is the first node that succeeds or equals n + 2^i; in other words, the ith finger points 1/2^(m-i) of the way around the ring. Tables of size O(log N), lookups in O(log N) hops (see the sketch below). [Diagram: node N80 with fingers pointing 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128 of the way around the ring.]
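A minimal sketch of building a finger table from a known node set, assuming m-bit IDs and the successor rule from earlier (the function names are illustrative): finger i of node n points to the first node at or after (n + 2^i) mod 2^m.

    # Build node n's finger table: finger[i] = successor((n + 2^i) mod 2^m).
    import bisect

    def successor(sorted_node_ids, target):
        i = bisect.bisect_left(sorted_node_ids, target)
        return sorted_node_ids[i % len(sorted_node_ids)]

    def finger_table(n, sorted_node_ids, m):
        return [successor(sorted_node_ids, (n + 2**i) % 2**m) for i in range(m)]

    nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
    print(finger_table(8, nodes, m=6))
    # [14, 14, 14, 21, 32, 42]: fingers for targets 8+1, 8+2, 8+4, 8+8, 8+16, 8+32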
DHT: Chord Join Assume an identifier space [0..8) (m = 3). Node n1 joins. Succ. Table of n1 (i: id+2^i -> succ): 0: 2 -> 1, 1: 3 -> 1, 2: 5 -> 1. [Diagram: ring with positions 0-7; only node 1 is present, so all fingers point back to 1.]
DHT: Chord Join Node n2 joins. Succ. Table of n1 (i: id+2^i -> succ): 0: 2 -> 2, 1: 3 -> 1, 2: 5 -> 1. Succ. Table of n2: 0: 3 -> 1, 1: 4 -> 1, 2: 6 -> 1. [Diagram: ring with positions 0-7; nodes 1 and 2 are present.]
DHT: Chord Join Nodes n0 and n6 join. Succ. Table of n0 (i: id+2^i -> succ): 0: 1 -> 1, 1: 2 -> 2, 2: 4 -> 6. Succ. Table of n1: 0: 2 -> 2, 1: 3 -> 6, 2: 5 -> 6. Succ. Table of n6: 0: 7 -> 0, 1: 0 -> 0, 2: 2 -> 2. Succ. Table of n2: 0: 3 -> 6, 1: 4 -> 6, 2: 6 -> 6. [Diagram: ring with positions 0-7; nodes 0, 1, 2, 6 are present.]
DHT: Chord Join Nodes: n1, n2, n0, n6; items: f7, f1. Each item is stored at the successor of its id: f7 at n0, f1 at n1. Succ. Table of n0 (items: f7): 0: 1 -> 1, 1: 2 -> 2, 2: 4 -> 6. Succ. Table of n1 (items: f1): 0: 2 -> 2, 1: 3 -> 6, 2: 5 -> 6. Succ. Table of n6: 0: 7 -> 0, 1: 0 -> 0, 2: 2 -> 2. Succ. Table of n2: 0: 3 -> 6, 1: 4 -> 6, 2: 6 -> 6. [Diagram: ring with positions 0-7.]
DHT: Chord Routing Upon receiving a query for item id, a node: checks whether it stores the item locally; if not, it forwards the query to the largest node in its successor table that does not exceed id (a sketch of this rule follows). [Diagram: same ring and successor tables as on the previous slide; query(7) is forwarded node to node until it reaches n0, which stores f7.]
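A minimal sketch of this forwarding rule on the toy 3-bit ring above, assuming each node knows its successor table and its locally stored items (a simplification of Chord's full find_successor/closest-preceding-finger machinery; all comparisons are done modularly, i.e., clockwise on the ring): a node answers if it holds the item, otherwise it hands the item to its successor when the item lies just ahead, and otherwise forwards to the largest table entry that does not pass the item's id.

    # Toy Chord routing on a 3-bit ring (ids 0-7), following the slide's rule.
    M = 8   # size of the id space (2^3)

    tables = {0: [1, 2, 6], 1: [2, 6, 6], 2: [6, 6, 6], 6: [0, 0, 2]}  # succ columns
    items  = {0: {7}, 1: {1}, 2: set(), 6: set()}                       # f7 at n0, f1 at n1

    def clockwise(a, b):
        """Clockwise distance from id a to id b on the ring."""
        return (b - a) % M

    def route(node, item_id):
        path = [node]
        while item_id not in items[node]:
            succ = tables[node][0]          # entry i=0 is the node's successor
            if clockwise(node, item_id) <= clockwise(node, succ):
                nxt = succ                  # the item lies between us and our successor
            else:
                # largest table entry that does not pass item_id going clockwise
                preceding = [s for s in tables[node]
                             if clockwise(node, s) <= clockwise(node, item_id)]
                nxt = max(preceding, key=lambda s: clockwise(node, s))
            path.append(nxt)
            node = nxt
        return path

    print(route(1, 7))   # [1, 6, 0]: the query hops to n6, then to n0, which holds f7
    print(route(0, 1))   # [0, 1]:   n0 hands the query directly to its successor n1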
Chord Data Structures Finger table (the first finger is the successor) and a predecessor pointer. What if each node knew all other nodes? O(1) routing, but expensive updates.
Routing Time Node n looks up a key stored at node p. Suppose p is in n's ith interval: p ∈ ((n + 2^(i-1)) mod 2^m, (n + 2^i) mod 2^m]. Then n contacts f = finger[i]. The interval is not empty, so f ∈ ((n + 2^(i-1)) mod 2^m, (n + 2^i) mod 2^m] as well. Hence f is at least 2^(i-1) away from n, while p is at most 2^(i-1) away from f: the distance to the target is halved at each hop. [Diagram: points n, n + 2^(i-1), f = finger[i], p, n + 2^i along the ring.] m steps in the worst case; O(log N) w.h.p. under load-balance assumptions.
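The same argument in compact form, using the slide's notation and writing d(x, y) for the clockwise ring distance from x to y (an added shorthand, not on the original slide):

    \begin{align*}
    p &\in \big((n + 2^{i-1}) \bmod 2^m,\; (n + 2^i) \bmod 2^m\big] \\
    f = \mathrm{finger}[i] &\in \big((n + 2^{i-1}) \bmod 2^m,\; (n + 2^i) \bmod 2^m\big] \\
    d(n, f) &\ge 2^{i-1} \ge \tfrac{1}{2}\, d(n, p) \qquad (\text{since } d(n, p) \le 2^i) \\
    d(f, p) &= d(n, p) - d(n, f) \le \tfrac{1}{2}\, d(n, p)
    \end{align*}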
Routing Time Assuming uniform node distribution around the circle, the number of nodes in the search space is halved at each step, so the expected number of steps is log N. Note that m = 160, while for 1,000,000 nodes log N ≈ 20.
P2P Lessons Decentralized architecture: avoid centralization. Flooding can work. Logical overlay structures provide strong performance guarantees. Churn is a problem. Useful in many distributed contexts.