Distributed Lookup Systems: Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications; A Scalable Content-Addressable Network
Motivation: How to find data in a distributed system? (Figure: a publisher stores Key="LetItBe", Value=MP3 data somewhere among nodes N1..N5 on the Internet; a client issues Lookup("LetItBe") and must find it.)
Applications: Peer-to-peer systems (Napster, Gnutella, Groove, FreeNet, …); large-scale storage management systems (Publius, OceanStore, PAST, Farsite, CFS, …); mirroring (web caching); any wide-area name resolution system.
Outline Types of solutions Evaluation Criteria CAN and Chord basic idea insert/retrieve join/leave recovery from failures
Centralized Solution: a central server with a database (Napster). (Figure: the publisher registers Key="LetItBe", Value=MP3 data with the central DB; the client sends Lookup("LetItBe") to the DB over the Internet.)
Distributed Solution (1): Flooding (Gnutella, Morpheus, etc.). (Figure: the client's Lookup("LetItBe") is flooded across the nodes until the publisher is reached.) Worst case: O(m) messages per lookup.
Distributed Solution (2): Routed messages (Freenet, Tapestry, Chord, CAN, etc.). (Figure: the client's Lookup("LetItBe") is routed hop by hop toward the node holding Key="LetItBe", Value=MP3 data.)
Routing Challenges: define a useful nearness metric; keep the hop count small; keep the routing table a reasonable size; stay robust despite rapid changes in membership.
Evaluation: Scalability (routing path length, per-node state); Latency; Load balancing; Robustness (routing fault tolerance, data availability).
Content Addressable Network: basic idea: an Internet-scale hash table. Interface: insert(key, value); value = retrieve(key). The table is partitioned among many individual nodes.
CAN - Solution: a virtual d-dimensional Cartesian coordinate space, dynamically partitioned into zones, each "owned" by one node; a key is mapped to a point P; (key, value) is stored at the node that owns the zone containing P.
CAN - Example: node I::insert(K,V): (1) a = hx(K), b = hy(K); (2) route (K,V) toward the point (a,b); (3) the node owning the zone containing (a,b) stores (K,V).
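A minimal sketch of this insert path in Python, assuming a 2-d unit square and hypothetical helpers hx, hy, and route (the names follow the slide, not any real CAN implementation):

```python
import hashlib

def hx(key: str) -> float:
    # Hash the key to an x coordinate in [0, 1); the "x"/"y" salts are an illustrative choice.
    return int(hashlib.sha1(b"x" + key.encode()).hexdigest(), 16) / 2**160

def hy(key: str) -> float:
    # Hash the key to a y coordinate in [0, 1).
    return int(hashlib.sha1(b"y" + key.encode()).hexdigest(), 16) / 2**160

def insert(key, value, route):
    # Step 1: map the key to a point (a, b); step 2: route (key, value) toward (a, b).
    # 'route' is assumed to deliver the pair to the node owning the zone containing
    # (a, b), which then stores it (step 3).
    a, b = hx(key), hy(key)
    route((a, b), (key, value))

# Example with a stand-in route function that just prints its arguments:
insert("LetItBe", b"MP3 data", lambda point, item: print(point, item))
```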
CAN - Routing: each node maintains state for its neighbors; a message carries the destination coordinates; greedy forwarding to the neighbor whose coordinates are closest to the destination.
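A sketch of the greedy rule, assuming each node knows a representative point for each neighbor (here the center of its zone); the names are illustrative:

```python
import math

def next_hop(neighbor_coords, dst):
    # Forward to the neighbor whose coordinates are closest (Euclidean) to the destination.
    # 'neighbor_coords' maps a neighbor id to an (x, y) point.
    return min(neighbor_coords, key=lambda n: math.dist(neighbor_coords[n], dst))

# Example: routing toward dst = (0.8, 0.2) picks neighbor "B".
print(next_hop({"A": (0.1, 0.9), "B": (0.7, 0.3), "C": (0.4, 0.5)}, (0.8, 0.2)))
```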
CAN - Node Insertion: 1) the new node discovers some node "I" already in CAN; 2) the new node picks a random point (p,q) in the space; 3) I routes to (p,q) and discovers node J, the owner of that zone; 4) J's zone is split in half; the new node owns one half.
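A sketch of step 4 under the assumption of axis-aligned rectangular zones; the choice of split axis is an illustrative detail, not taken from the slide:

```python
def split_zone(zone, split_along_x=True):
    # Split an axis-aligned zone ((xmin, xmax), (ymin, ymax)) in half.
    # Returns (half J keeps, half the new node owns).
    (xmin, xmax), (ymin, ymax) = zone
    if split_along_x:
        mid = (xmin + xmax) / 2
        return ((xmin, mid), (ymin, ymax)), ((mid, xmax), (ymin, ymax))
    mid = (ymin + ymax) / 2
    return ((xmin, xmax), (ymin, mid)), ((xmin, xmax), (mid, ymax))

# Example: splitting the whole unit square gives the left and right halves.
j_keeps, new_owns = split_zone(((0.0, 1.0), (0.0, 1.0)))
print(j_keeps, new_owns)
```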
CAN – Node Failure: need to repair the space. Recover the database: soft-state updates; use replication and rebuild the database from replicas. Repair routing: takeover algorithm.
CAN – Takeover Algorithm: simple failures: know your neighbors' neighbors; when a node fails, one of its neighbors takes over its zone. More complex failure modes: simultaneous failure of multiple adjacent nodes; scoped flooding to discover neighbors; hopefully a rare event.
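A sketch of the simple-failure case, with an assumed tie-break (not stated on the slide): among the failed node's live neighbors, the one currently owning the smallest zone takes over.

```python
def zone_volume(zone):
    # Area of an axis-aligned 2-d zone ((xmin, xmax), (ymin, ymax)).
    (xmin, xmax), (ymin, ymax) = zone
    return (xmax - xmin) * (ymax - ymin)

def takeover_candidate(neighbor_zones):
    # Assumed rule: the neighbor with the smallest zone takes over the failed
    # node's zone. 'neighbor_zones' maps neighbor id -> zone.
    return min(neighbor_zones, key=lambda n: zone_volume(neighbor_zones[n]))

# Example: "B" owns the smaller zone, so "B" takes over.
print(takeover_candidate({"A": ((0.0, 0.5), (0.0, 1.0)),
                          "B": ((0.5, 1.0), (0.0, 0.5))}))
```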
CAN - Evaluation: Scalability: per node, the number of neighbors is 2d; the average routing path is (d/4)·n^(1/d) hops; the network can be scaled without increasing per-node state.
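A quick check of that path-length formula (the node counts below are just illustrative):

```python
def expected_hops(d: int, n: int) -> float:
    # Average CAN routing path length from the slide: (d/4) * n**(1/d) hops.
    return (d / 4) * n ** (1 / d)

# With n = 1,000,000 nodes: d = 2 gives ~500 hops, d = 10 gives ~10 hops.
print(expected_hops(2, 10**6), expected_hops(10, 10**6))
```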
CAN – Design Improvements: increase dimensions and realities (reduce path length); heuristics (reduce per-CAN-hop latency): RTT-weighted routing, multiple nodes per zone (peer nodes), deterministically replicated entries.
CAN - Weaknesses: impossible to perform a fuzzy search; susceptible to malicious activity; must maintain coherence of all the indexed data (network overhead, efficient distribution); still relatively high routing latency; poor performance without the design improvements above.
Chord: based on consistent hashing for the key-to-node mapping. Standard hashing, e.g., x -> ax + b (mod p), provides good balance across bins; consistent hashing additionally ensures that a small change in the bucket set does not induce a total remapping of items to buckets.
Chord IDs: m-bit identifier space for both keys and nodes. Key identifier = SHA-1(key), e.g., Key="LetItBe" -> ID=60. Node identifier = SHA-1(IP address), e.g., IP="198.10.10.1" -> ID=123. Both are uniformly distributed.
Chord – Consistent Hashing: a key is stored at its successor, the node with the next-higher ID. (Figure: circular 7-bit ID space with nodes N32, N90, N123 and keys K5, K20, K60, K101; K60 = "LetItBe" is stored at N90.)
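A minimal sketch of the successor rule on the ring, assuming node IDs are already computed (the SHA-1 step from the previous slide is left out):

```python
from bisect import bisect_left

def successor(node_ids, key_id):
    # The node responsible for key_id is the first node with ID >= key_id,
    # wrapping around the ring if necessary. 'node_ids' must be sorted.
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]

# Nodes from the slide's figure: N32, N90, N123 on a 7-bit ring.
nodes = [32, 90, 123]
print(successor(nodes, 60))   # K60  -> N90
print(successor(nodes, 101))  # K101 -> N123
print(successor(nodes, 5))    # K5   -> N32
```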
Chord – Basic Lookup: every node knows its successor in the ring, so a lookup simply walks the ring. (Figure: N10 asks "Where is 'LetItBe'?"; Hash("LetItBe") = K60; the query passes N32 and N55 until N90 answers "N90 has K60".)
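A sketch of this successor-by-successor walk, assuming the ring is represented by a dict of successor pointers:

```python
def in_half_open(x, a, b):
    # True if x lies in the circular interval (a, b] of the ID space.
    if a == b:
        return True
    return a < x <= b if a < b else (x > a or x <= b)

def basic_lookup(start, key_id, succ):
    # Follow successor pointers from 'start' until key_id falls between a node
    # and its successor; that successor is responsible for the key.
    # 'succ' maps node id -> successor id (assumed complete and correct).
    n = start
    while not in_half_open(key_id, n, succ[n]):
        n = succ[n]
    return succ[n]

# Ring from the figure: N10 -> N32 -> N55 -> N90 -> N123 -> N10.
succ = {10: 32, 32: 55, 55: 90, 90: 123, 123: 10}
print(basic_lookup(10, 60, succ))  # K60 ("LetItBe") -> N90
```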
Chord – “Finger Tables”: every node knows m other nodes in the ring, at exponentially increasing distances; finger i points to the successor of n + 2^i. (Figure: N80's fingers target 80 + 2^0, 80 + 2^1, …, 80 + 2^6, landing on nodes such as N96, N112, and N16.)
Chord – “Finger Tables”: N32's finger table: [33,33] -> N40; [34,35] -> N40; [36,39] -> N40; [40,47] -> N40; [48,63] -> N52; [64,95] -> N70; [96,31] -> N102. (Ring: N32, N40, N52, N60, N70, N79, N80, N85, N102, N113.)
Chord – Lookup Algorithm: forward the query to the closest preceding finger, i.e., the finger whose interval contains the key, and repeat until the key's immediate successor is reached. (Figure: the same ring and N32 finger table as on the previous slide.)
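A compact sketch of the finger-based lookup; node state is represented by plain dicts, and the RPC structure of real Chord is omitted:

```python
def strictly_between(x, a, b):
    # True if x lies in the open circular interval (a, b).
    if a == b:
        return x != a
    return a < x < b if a < b else (x > a or x < b)

def closest_preceding_finger(n, key_id, fingers):
    # Scan n's fingers from farthest to nearest and return the first one
    # that lies strictly between n and the key on the ring.
    for f in reversed(fingers[n]):
        if strictly_between(f, n, key_id):
            return f
    return n

def find_successor(n, key_id, fingers, succ):
    # Hop toward the key via fingers until it lies in (n, successor(n)].
    while not (key_id == succ[n] or strictly_between(key_id, n, succ[n])):
        nxt = closest_preceding_finger(n, key_id, fingers)
        n = succ[n] if nxt == n else nxt   # fall back to the successor if no finger helps
    return succ[n]

# Ring and N32's finger table from the slides above (only N32's fingers are
# needed for this one-hop example).
succ = {32: 40, 40: 52, 52: 60, 60: 70, 70: 79, 79: 80, 80: 85,
        85: 102, 102: 113, 113: 32}
fingers = {32: [40, 40, 40, 40, 52, 70, 102]}
print(find_successor(32, 60, fingers, succ))  # key 60 is stored at N60
```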
Chord - Node Insertion: a three-step process: (1) initialize all fingers of the new node; (2) update fingers of existing nodes; (3) transfer keys from the successor to the new node. A less aggressive mechanism (lazy finger update): initialize only the finger to the successor node; periodically verify the immediate successor and predecessor; periodically refresh finger table entries.
Joining the Ring – Step 1: initialize the new node's finger table: locate any node p in the ring and ask p to look up the fingers of the new node. (Figure: new node N36 issues Lookup(37, 38, 40, …, 100, 164) through the ring N5, N20, N40, N60, N80, N99.)
Joining the Ring – Step 2: update fingers of existing nodes: the new node calls an update function on existing nodes, and an existing node can recursively update the fingers of other nodes.
Joining the Ring – Step 3: transfer keys from the successor node to the new node. (Figure: copy keys 21..36 from N40 to N36; K30 moves to N36, while K38 stays at N40.)
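A sketch of step 3, assuming keys live in per-node dicts and the new node's predecessor is known; whether the successor also keeps a replica is outside this sketch:

```python
def in_half_open(x, a, b):
    # True if x lies in the circular interval (a, b] of the ID space.
    return (a < x <= b) if a < b else (x > a or x <= b)

def transfer_keys(successor_store, new_id, pred_id):
    # Move the keys the new node is now responsible for, those in (pred_id, new_id],
    # out of the successor's store and return them.
    moved = {k: v for k, v in successor_store.items() if in_half_open(k, pred_id, new_id)}
    for k in moved:
        del successor_store[k]
    return moved

# Example from the slide: N36 joins with predecessor N20 and successor N40.
n40_keys = {30: "K30 data", 38: "K38 data"}
n36_keys = transfer_keys(n40_keys, 36, 20)
print(n36_keys)  # {30: 'K30 data'} -- K30 now lives at N36
print(n40_keys)  # {38: 'K38 data'} -- K38 stays at N40
```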
Chord – Handling Failures: use a successor list: each node knows its r immediate successors, so after a failure it will know the first live successor; correct successors guarantee correct lookups. The guarantee holds with some probability; r can be chosen to make the probability of lookup failure arbitrarily small.
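A sketch of the successor-list fallback; 'is_alive' stands in for whatever liveness probe (e.g., a periodic ping) the implementation uses:

```python
def first_live_successor(successor_list, is_alive):
    # Return the first live node in this node's list of r immediate successors,
    # or None if all r of them appear to have failed.
    for node in successor_list:
        if is_alive(node):
            return node
    return None

# Example with r = 3: the immediate successor N32 is down, so lookups fall through to N55.
print(first_live_successor([32, 55, 90], lambda n: n != 32))  # -> 55
```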
Chord - Evaluation: Efficient: O(log N) messages per lookup, where N is the total number of servers. Scalable: O(log N) state per node. Robust: survives massive changes in membership, assuming no malicious participants.
Chord - Weaknesses: NOT that simple (compared to CAN). Member joining is complicated: the aggressive mechanism requires too many messages and updates, and there is no analysis of convergence for the lazy finger mechanism. Key management is mixed between layers: the upper layer does insertion and handles node failures, while Chord transfers keys when a node joins (no leave mechanism!). The routing table grows with the number of members in the group. Worst-case lookups can be slow.
Summary: both systems are fully distributed and scalable, with efficient lookup and robustness; simplicity (?); both are susceptible to malicious activity.
How are these related to OSD? Very similar if data is "public"; if data is "private", only a few locations are available for storing it. Does OSD help (make it easier) with peer-to-peer computing? Any more comments?