Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications (SIGCOMM 2001)
Lecture slides by Dr. Yingwu Zhu
Motivation
How to find data in a distributed file sharing system?
o Lookup is the key problem
[Figure: a Publisher stores Key="LetItBe", Value=MP3 data on one of the nodes N1-N5; a Client issues Lookup("LetItBe") — but to whom?]
Centralized Solution
o Central server (Napster)
o Requires O(M) state
o Single point of failure
[Figure: the Publisher registers Key="LetItBe", Value=MP3 data with a central DB; the Client's Lookup("LetItBe") goes to the DB, which points at the right node among N1-N5]
Centralized: Napster
Simple centralized scheme. How to find a file?
o Use a centralized index database
o Clients contact the index node at initialization to upload their list of content
To locate a file:
o Send a query to the central index node
o Get back the list of peer locations storing the file
o Fetch the file directly from one of those peers
Distributed Solution (1)
o Flooding (Gnutella, Morpheus, etc.)
o Worst case O(N) messages per lookup
[Figure: the Client's Lookup("LetItBe") floods from node to node until it reaches the Publisher holding Key="LetItBe", Value=MP3 data]
Flooding: Gnutella
o Find a node to get onto the system
o Connect to a few other nodes for resilience
How to find a file?
o Broadcast the request to all neighbors
o On receiving a request, a node that doesn't have the file re-broadcasts it to its other neighbors
o A node that has the file passes its info back to the requester
o Transfers are then done with HTTP directly between the query-originating node and the content-owner node
Flooding: Gnutella
o Advantages: totally decentralized, highly robust
o Disadvantages: not scalable; can flood the whole network with requests (TTL scoping can limit large-scale flooding)
Flooding: FastTrack (aka Kazaa)
o Modifies the Gnutella protocol into a two-level hierarchy: a hybrid of Gnutella and Napster
Supernodes
o Nodes that have better connections to the Internet
o Act as temporary indexing servers for other nodes
o Help improve the stability of the network
Standard nodes
o Connect to supernodes and report their list of files
o Allows slower nodes to participate
Search
o Broadcast (Gnutella-style) search across the supernodes
Disadvantages
o Still keeps a centralized registration step
Distributed Solution (2)
o Routed messages (Freenet, Tapestry, Chord, CAN, etc.)
o Only exact matches
[Figure: the Client's Lookup("LetItBe") is routed hop by hop toward the node responsible for the key]
What is Chord? What does it do?
o In short: a peer-to-peer lookup service
o Solves the problem of locating a data item in a collection of distributed nodes, considering frequent node arrivals and departures
o The core operation in most p2p systems is efficient location of data items
o Supports just one operation: given a key, it maps the key onto a node
  Lookup(key) → IP address
Chord Properties
o Efficient: O(log N) messages per lookup "with high probability" (N is the total number of servers)
o Scalable: O(log N) state per node
o Robust: survives massive changes in membership
o Fully decentralized
o Self-organizing
Chord IDs
o m-bit identifier space for both keys and nodes
o Key identifier = SHA-1(key), e.g. Key="LetItBe" → ID=54
o Node identifier = SHA-1(IP address), e.g. a node's IP → ID=56
o Both are uniformly distributed
o How to map key IDs to node IDs?
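The ID scheme can be sketched as follows. This is a minimal sketch, assuming we truncate SHA-1 to m bits so the small ring in these slides works out; real Chord uses the full 160-bit digest, and the example values 54/56 above are just the slides' illustration, not what this hash would actually produce.

```python
import hashlib

M = 7  # identifier bits; the slides use a 7-bit circular ID space

def chord_id(value: str) -> int:
    """Hash a key or an IP address onto the m-bit identifier circle."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

key_id = chord_id("LetItBe")    # key identifier  = SHA-1(key)        mod 2^m
node_id = chord_id("10.0.0.1")  # node identifier = SHA-1(IP address) mod 2^m
```

Because SHA-1 output is effectively uniform, both key and node IDs end up uniformly spread over the circle, which is what the load-balancing results later rely on.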
Consistent Hashing [Karger 97]
o Circular 7-bit ID space
o A key is stored at its successor: the node with the next-higher ID
[Figure: the ring, with the node whose IP hashes to ID=56 storing Key="LetItBe" (ID=54)]
Consistent Hashing [Karger 97]
Where is "LetItBe"? → lookup(54) on the circular 7-bit ID space
o If every node knows of every other node:
o requires global information
o Routing tables are large: O(N)
o Lookups are fast: O(1)
Chord: Basic Lookup
o Every node knows only its successor in the ring
o Lookup requires O(N) time
Acceleration of Lookups
Lookups are accelerated by maintaining additional routing information:
o Each node maintains a routing table with (at most) m entries (where 2^m is the size of the identifier space), called the finger table
o The i-th entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle (clarification on next slide)
o s = successor(n + 2^(i-1)) (all arithmetic mod 2^m)
o s is called the i-th finger of node n, denoted s = n.finger(i)
Acceleration of Lookups
o Every node knows m other nodes in the ring
o It stores them in a "finger table" with m entries, or "fingers"
o Entry i in the finger table of node n points to:
  n.finger[i] = successor(n + 2^(i-1)) (i.e. the first node ≥ n + 2^(i-1))
o n.finger[i] is referred to as the i-th finger of n
[Figure: fingers of node N80 pointing at N96, N118, … around the ring]
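The finger-table definition can be sketched directly. This assumes a global view of the ring for illustration (a real node fills its fingers via lookups, not by scanning all nodes); the node IDs below come from the m = 6 example on the scalable-key-location slide.

```python
def build_finger_table(n, nodes, m):
    """finger[i] = successor(n + 2^(i-1)) on the m-bit identifier circle."""
    ring = sorted(nodes)

    def successor(ident):
        ident %= 2 ** m              # all arithmetic is mod 2^m
        for node in ring:
            if node >= ident:
                return node
        return ring[0]               # wrap around the circle

    return [successor(n + 2 ** (i - 1)) for i in range(1, m + 1)]
```

For the ring N8, N14, N21, N32, N38, N42, N48, N51, N56 with m = 6, `build_finger_table(8, ...)` yields [14, 14, 14, 21, 32, 42]: nearby fingers repeat, while the last fingers jump halfway across the circle.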
Lookup Example
o What about looking up key 17 (choose N21 or N14)?
o 17 is more than N8+8 = 16 but less than N21
o Actually choose N14 (the finger table may not be accurate)
o Only the successor of each node is assumed to be correct
What does each node maintain?
o Each node maintains a successor and a predecessor
o Only the successor is assumed to always be valid (well, the predecessor too, but it is not used in searching)
Lookup algorithm
o finger[4] points to where node n thinks key n + 2^(4-1) should be stored
o When looking up id, return finger[3] even if n + 2^(4-1) < id: you cannot assume there are no nodes between n + 2^(4-1) and finger[4]
[Figure: node n with finger[2], finger[3], finger[4] marked on the circle, and id falling between n + 2^(4-1) and finger[4]]
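The routing rule above can be sketched in Python. A global-view helper `successor()` stands in for real RPCs, and the node IDs are taken from the m = 6 scalable-key-location example; this is an illustrative sketch, not the paper's pseudocode.

```python
M = 6
NODES = sorted([8, 14, 21, 32, 38, 42, 48, 51, 56])  # example ring

def successor(ident):
    """First node on the circle whose ID is >= ident (mod 2^M)."""
    ident %= 2 ** M
    return next((x for x in NODES if x >= ident), NODES[0])

def between(x, a, b, inclusive_right):
    """Circular interval test: x in (a, b] or (a, b)."""
    if a < b:
        return a < x <= b if inclusive_right else a < x < b
    return x > a or (x <= b if inclusive_right else x < b)

def closest_preceding_finger(n, key):
    """Largest finger of n that strictly precedes key on the circle."""
    fingers = [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]
    for f in reversed(fingers):              # scan farthest finger first
        if between(f, n, key, inclusive_right=False):
            return f
    return n

def find_successor(n, key, path=None):
    """Route toward key's owner; returns (owner, list of hops)."""
    path = (path or []) + [n]
    succ = successor(n + 1)                  # n's immediate successor
    if between(key, n, succ, inclusive_right=True):
        return succ, path                    # key lives on our successor
    nxt = closest_preceding_finger(n, key)
    if nxt == n:                             # no closer finger exists
        return succ, path
    return find_successor(nxt, key, path)
```

Running `find_successor(8, 54)` reproduces the later example: the lookup hops N8 → N42 → N51 and ends at N56, the successor of N51.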
Comments on lookup Two important characteristics of finger tables Each node stores info about only a small number of others A node does not generally have enough info to directly determine the successor of an arbitrary key k
Scalable key location (m = 6)
Ring: N8, N14, N21, N32, N38, N42, N48, N51, N56; key K54
Finger table of N8:
  N8+1 → N14, N8+2 → N14, N8+4 → N14, N8+8 → N21, N8+16 → N32, N8+32 → N42
Finger table of N42:
  N42+1 → N48, N42+2 → N48, N42+4 → N48, N42+8 → N51, N42+16 → N8
Lookup(54): N8 → N42 → N51; the successor of N51 holds K54
o When node n forwards to finger f on the way to the node p holding the key: the distance between n and f is at least 2^(i-1), and the distance between f and p is at most 2^(i-1)
Faster Lookups: O(log N) hops with high probability
[Figure: Lookup(K19) routed across a ring of nodes N5, N10, N20, N32, N60, N80, N99, N110; K19 is stored at its successor N20]
Joining the Ring
o Three-step process:
  o Initialize all fingers of the new node
  o Update successor and predecessor pointers
  o Transfer keys from the successor to the new node
o Less aggressive mechanism (lazy finger update):
  o Initialize only the finger to the successor node
  o Periodically verify the immediate successor and predecessor
  o Periodically refresh finger table entries
Joining the Ring - Step 1
o Initialize the new node's finger table
o Locate any node p already in the ring
o Ask node p to look up the fingers of the new node N36
o Return the results to N36
[Figure: N36 asks p to Lookup(37, 38, 40, …, 100, 164) on a ring containing N5, N20, N40, N60, N80, N99]
Joining the Ring - Step 2
o Update successor and predecessor pointers
o The successor is identified by finger[1]
o Ask the successor for its predecessor
o Tell your new successor that you are its predecessor
o Tell the predecessor of your successor that you are its successor
[Figure: N36 splices itself into the ring between N20 and N40]
Joining the Ring - Step 3
o Transfer keys from the successor node to the new node
o Only keys in the new node's range are transferred
[Figure: N36 joins between N20 and N40; N40 held K30 and K38, and K30, which now falls in N36's range, is copied from N40 to N36]
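The three join steps can be sketched as a single-process toy: direct object references stand in for RPCs, there is no stabilization protocol, and names like `Node` and `join` are illustrative, not the paper's API.

```python
class Node:
    """Toy ring node; direct references stand in for RPC calls."""

    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = self
        self.keys = {}

    def find_successor(self, ident):
        """Walk the ring (O(N), no fingers) to the node owning ident."""
        node = self
        while not in_half_open(ident, node.id, node.successor.id):
            node = node.successor
        return node.successor

    def join(self, existing):
        # Step 1: initialize the successor pointer via a lookup.
        self.successor = existing.find_successor(self.id)
        # Step 2: splice into the ring between predecessor and successor.
        self.predecessor = self.successor.predecessor
        self.predecessor.successor = self
        self.successor.predecessor = self
        # Step 3: take over keys in (predecessor.id, self.id] from the successor.
        for k in [k for k in self.successor.keys
                  if in_half_open(k, self.predecessor.id, self.id)]:
            self.keys[k] = self.successor.keys.pop(k)

def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

# Demo matching the slide: N40 holds K30 and K38; N36 joins and takes K30.
n20, n40 = Node(20), Node(40)
n20.successor = n20.predecessor = n40
n40.successor = n40.predecessor = n20
n40.keys = {30: "mp3", 38: "mp3"}
n36 = Node(36)
n36.join(n20)
```

After the join, K30 lives on N36 while K38 stays on N40, exactly the split shown in the figure.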
Handling Failures
o Failure of nodes might cause incorrect lookups
o Example: N80 doesn't know its correct successor, so Lookup(90) fails
o Correct successor fingers are enough for correctness
[Figure: ring with N10, N80, N85, N102, N113, N120; nodes just after N80 have failed, stranding Lookup(90)]
Handling Failures
o Use a successor list: each node knows its r immediate successors
o After a failure, a node will know its first live successor
o Correct successors guarantee correct lookups
o The guarantee holds with some probability
o Can choose r to make the probability of lookup failure arbitrarily small
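Under the simplifying assumption that nodes fail independently, each with probability p, a lookup can only be stranded here when all r successors are down at once, which happens with probability p^r. A tiny helper shows how to size r for a target failure probability; the helper name is our own, and the paper itself suggests r = O(log N).

```python
import math

def successors_needed(p, target):
    """Smallest r with p**r <= target, i.e. the chance that ALL r
    immediate successors are down at once (independent failures)."""
    return math.ceil(math.log(target) / math.log(p))

r = successors_needed(0.5, 1e-6)  # even with half the nodes down,
                                  # r = 20 keeps the failure chance at 1e-6
```

Because p^r shrinks geometrically, modest successor lists buy very strong guarantees, which is why the failure simulations later show so few failed lookups.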
Simulation Setup
o Packet-level simulation
o Packet delays drawn from an exponential distribution with mean 50 ms
o Iterative lookup procedure
o Stabilization protocol invoked every two minutes on average
o 24-bit identifiers used
Load Balancing
Parameters:
o 10,000 nodes
o 100,000 to 1,000,000 keys, in increments of 100,000
o Each experiment performed 20 times
Load Balancing
Modification
o To provide more uniform coverage of the identifier space, allocate a set of virtual nodes to each real node
o Allocate r virtual nodes per real node, with r = 1, 2, 5, 10, and 20
Modification
Tradeoff
o r times the space to store routing tables
o For N = 1,000,000 and r = 20, about 400 entries are needed per table
Path Length
o In theory, O(log N) messages are needed per lookup
o Simulated with 2^k nodes and 100 × 2^k keys
o On average, 100 keys per node
Path Length
Path Length (N = 2^12)
Node Failures
o 10,000-node network
o 1,000,000 keys
o A fraction p of the nodes fail
Node Failures
[Figure: failed lookups as a function of the fraction p of failed nodes]
Summary
o Basic operation: lookup(key) → IP address
o Deterministic node position and data placement
o In an N-node network:
  o O(1/N) fraction of keys moved when an N-th node joins or leaves
  o O(log N) nodes maintained in the routing table
  o O(log N) messages per lookup
  o O((log N)^2) messages for updates when a node joins or leaves
Questions?