Distributed Hash Tables
CPE 401 / 601 Computer Network Systems
Modified from Ashwin Bharambe, Robert Morris, Krishna Gummadi and Jinyang Li
Peer-to-peer systems
- Symmetric in node functionality: anybody can join and leave; everybody gives and takes
- Great potential: huge aggregate network capacity, storage, etc.; resilient to single points of failure and attack
- E.g., Napster, Gnutella, BitTorrent, Skype, Joost
The lookup problem
[Figure: a publisher does Put(key="title", value=file data...) and a client does Get(key="title") across 1000s of nodes on the Internet; the set of nodes may change]
- This lookup problem is at the heart of all DHTs
Centralized lookup (Napster)
[Figure: publisher registers SetLoc("title", N4) with a central DB; client sends Lookup("title") to the DB and fetches the file from N4]
- Simple, but O(N) state at the central server and a single point of failure
- O(N) state is hard to keep up to date
- Not scalable; the central service needs $$$
Improved #1: Hierarchical lookup
- Perform hierarchical lookup (like DNS)
- More scalable than a central site
- But the root of the hierarchy is still a single point of vulnerability
Flooded queries (Gnutella)
[Figure: client floods Lookup("title") to its neighbors, which forward it until it reaches the publisher holding key="title"]
- Robust, but worst case O(N) messages per lookup
- Too much overhead traffic; limited in scale
- No guarantee of results
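To see where the worst-case O(N) message cost comes from, here is a minimal sketch of TTL-limited query flooding; the toy overlay, the data structures, and the TTL value are illustrative assumptions, not Gnutella's actual protocol.

```python
# Minimal sketch of TTL-limited query flooding (Gnutella-style).
# The overlay graph and TTL value are illustrative assumptions.
from collections import deque

def flood_lookup(overlay, start, key, ttl=4):
    """Breadth-first flood of a lookup; returns nodes that hold `key`
    and the total number of query messages sent."""
    hits, messages = [], 0
    seen = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        node, hops_left = frontier.popleft()
        if key in overlay[node]["store"]:
            hits.append(node)
        if hops_left == 0:
            continue
        for neighbor in overlay[node]["neighbors"]:
            if neighbor not in seen:          # avoid re-flooding the same node
                seen.add(neighbor)
                messages += 1                 # one query message per edge crossed
                frontier.append((neighbor, hops_left - 1))
    return hits, messages

# Tiny example overlay: in the worst case every node is contacted -> O(N) messages.
overlay = {
    "N3": {"neighbors": ["N6", "N7"], "store": set()},
    "N6": {"neighbors": ["N3", "N8"], "store": set()},
    "N7": {"neighbors": ["N3", "N9"], "store": set()},
    "N8": {"neighbors": ["N6"], "store": {"title"}},   # publisher
    "N9": {"neighbors": ["N7"], "store": set()},
}
print(flood_lookup(overlay, "N3", "title"))   # (['N8'], 4)
```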
Improved #2: super peers
- E.g., Kazaa, Skype
- Lookups flood only the super peers
- Still no guarantee of lookup results
DHT approach: routing
- Structured: put(key, value); get(key); maintain routing state for fast lookup
- Symmetric: all nodes have identical roles
- Many protocols: Chord, Kademlia, Pastry, Tapestry, CAN, Viceroy
Routed queries (Freenet, Chord, etc.)
[Figure: the client's Lookup("title") is routed hop by hop to the publisher holding key="title"]
- The key acts as a consistent rendezvous point between publisher and client
- Challenge: can we make it robust? Keep state small? Actually find stuff in a changing system?
Hash Table
- Name-value pairs (or key-value pairs)
  - E.g., "Mehmet Hadi Gunes" and mgunes@unr.edu
  - E.g., "http://cse.unr.edu/" and the Web page
  - E.g., "HitSong.mp3" and "12.78.183.2"
- Hash table: data structure that associates keys with values; lookup(key) returns the value
Distributed Hash Table
- Hash table spread over many nodes, distributed over a wide area
- Main design goals:
  - Decentralization: no central coordinator
  - Scalability: efficient even with a large # of nodes
  - Fault tolerance: tolerate nodes joining/leaving
Distributed hash table (DHT)
[Figure: layered view — a distributed application calls put(key, data) / get(key) on the DHT; the DHT calls lookup(key) on the lookup service, which returns the IP address of the responsible node]
- The application may be distributed over many nodes
- The DHT distributes data storage over many nodes
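As a sketch of this layering (the class and method names are hypothetical, chosen only for illustration), an application-level put/get can be built directly on top of a lookup(key) -> node service:

```python
# Sketch of the DHT layering: application-level put/get on top of lookup(key) -> node.
# All names here (LookupService, DHT, the node dicts) are illustrative, not a real API.

class LookupService:
    """Maps a key to the IP address of the node responsible for it."""
    def __init__(self, addresses):
        self.addresses = sorted(addresses)

    def lookup(self, key):
        # Toy placement rule: hash the key onto the sorted address list.
        return self.addresses[hash(key) % len(self.addresses)]

class DHT:
    """Distributes storage over many nodes; the application just calls put/get."""
    def __init__(self, addresses):
        self.lookup_service = LookupService(addresses)
        self.stores = {addr: {} for addr in addresses}   # stand-ins for remote peers

    def put(self, key, data):
        node = self.lookup_service.lookup(key)           # who is responsible?
        self.stores[node][key] = data                    # store it there

    def get(self, key):
        node = self.lookup_service.lookup(key)
        return self.stores[node].get(key)

# Usage: three "nodes" identified by IP address.
dht = DHT(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
dht.put("HitSong.mp3", "12.78.183.2")
print(dht.get("HitSong.mp3"))                            # 12.78.183.2
```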
DHT: basic idea
[Figure: many nodes, each storing (key, value) pairs; neighboring nodes are "connected" at the application level]
- Operation: take a key as input; route messages to the node holding that key
- insert(K1, V1) is routed to the responsible node, which stores (K1, V1)
- retrieve(K1) is routed the same way and returns V1
Distributed Hash Table
- Two key design decisions:
  - How do we map names onto nodes?
  - How do we route a request to that node?
Hash Functions
- Hashing: transform the key into a number, and use the number to index an array
- Example hash function: Hash(x) = x mod 101, mapping to 0, 1, ..., 100
- Challenges:
  - What if there are more than 101 nodes? Fewer?
  - Which nodes correspond to each hash value?
  - What if nodes come and go over time?
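To make the last challenge concrete, a small experiment (the key range and node counts are assumed for illustration) shows that with Hash(x) = x mod N, growing the system from 101 to 102 nodes remaps almost every key:

```python
# Why plain "mod N" hashing breaks when nodes come and go:
# changing N from 101 to 102 remaps almost every key.
keys = range(10_000)

before = {k: k % 101 for k in keys}   # Hash(x) = x mod 101
after  = {k: k % 102 for k in keys}   # one node added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.1%} of keys changed nodes")   # ~99% move
```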
Consistent Hashing
- "View" = subset of hash buckets that are visible
  - For this conversation, a view is O(n) neighbors
  - But we don't need strong consistency on views
- Desired features:
  - Balanced: in any one view, load is equal across buckets
  - Smoothness: little impact on hash bucket contents when buckets are added/removed
  - Spread: regardless of views, only a small set of hash buckets may hold an object
  - Load: across views, the # of objects assigned to a hash bucket is small
Fundamental Design Idea I: Consistent Hashing
- Map keys and nodes to an identifier space; responsibility is assigned implicitly
[Figure: identifier space 0000000000 ... 1111111111 with nodes A, B, C, D and a key placed on it]
- Mapping is performed using hash functions (e.g., SHA-1)
- Spread nodes and keys uniformly throughout the space
Fundamental Design Idea II: Prefix / Hypercube Routing
[Figure: a message routed hop by hop from a source node to a destination node]
Definition of a DHT
- Hash table supports two operations: insert(key, value) and value = lookup(key)
- Distributed: map hash buckets to nodes
- Requirements:
  - Uniform distribution of buckets
  - Cost of insert and lookup should scale well
  - Amount of local state (routing table size) should scale well
DHT ideas
- Scalable routing needs a notion of "closeness"
- Ring structure [Chord]: distance measured clockwise around the identifier ring
Different notions of "closeness"
- Tree-like structure [Pastry, Kademlia, Tapestry]
[Figure: identifiers such as 01100100, 01100101, 01100110, 01100111 grouped under prefixes 011000**, 011001**, 011010**, 011011**]
- Distance measured as the length of the longest matching prefix with the lookup key
- Why different choices? What are the criteria for a good notion of closeness?
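A minimal sketch of this tree-like notion of closeness, using longest-common-prefix length as the distance and a greedy forwarding step; the bit strings and the forwarding rule are simplified assumptions, not any particular protocol's exact algorithm:

```python
# Tree-like closeness (Pastry/Kademlia/Tapestry-style): "closer" = longer shared prefix.
def prefix_len(a, b):
    """Length of the longest common prefix of two ID strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id, neighbors, key):
    """Greedy step: forward to a neighbor sharing a strictly longer
    prefix with the key than we do (illustrative rule only)."""
    best = max(neighbors, key=lambda n: prefix_len(n, key))
    return best if prefix_len(best, key) > prefix_len(my_id, key) else my_id

neighbors = ["10000000", "00100000", "01000000", "01101000", "01100111"]
print(next_hop("01100101", neighbors, "01100110"))   # '01100111' shares 7 bits
```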
Different notions of "closeness"
- Cartesian space [CAN]
[Figure: the unit square from (0,0) to (1,1) partitioned into rectangular zones, one per node]
- Distance measured as geometric distance
Chord
- Map nodes and keys to identifiers using randomizing hash functions
- Arrange them on an identifier circle
[Figure: identifier circle with a point x = 010110110, its predecessor pred(x) = 010110000, and its successor succ(x) = 010111110]
Consistent Hashing
- Construction:
  - Assign each of C hash buckets to random points on a mod 2^n circle; hash key size = n
  - Map each object to a random position on the circle
  - Hash of object = closest clockwise bucket
[Figure: circle with buckets at positions 4, 8, 12, 14]
- Desired features:
  - Balanced: no bucket is responsible for a large number of objects
  - Smoothness: adding a bucket does not cause major movement among existing buckets
  - Spread and load: only a small set of buckets lie near an object
Consistent Hashing
- Large, sparse identifier space (e.g., 128 bits)
  - Hash a set of keys x uniformly into the large id space
  - Hash nodes into the id space as well
- Id space represented as a ring of identifiers 0 ... 2^128 - 1
- Hash(name) -> object_id; Hash(IP_address) -> node_id
Where to Store a (Key, Value) Pair?
- Map keys in a load-balanced way: store the key at one or more nodes whose identifiers are "close" to the key, where distance is measured in the id space
- Advantages: even distribution; few changes as nodes come and go...
Hash a Key to Successor
- Successor: the node with the next-highest ID
[Figure: circular ID space where N10 holds K5, K10; N32 holds K11, K30; N60 holds K33, K40, K52; N80 holds K65, K70; N100 holds K100]
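A minimal consistent-hashing sketch along the lines of these slides: node addresses and key names are hashed with SHA-1 onto a 2^160 ring, and each key is assigned to its clockwise successor. The helper names are assumptions made for illustration.

```python
# Consistent hashing sketch: SHA-1 ids on a ring, key -> clockwise successor.
import hashlib
from bisect import bisect_right

RING = 2 ** 160                          # SHA-1 identifier space

def chash(s):
    """Hash a name or IP address to a point on the ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % RING

def successor(node_ids, key_id):
    """Node with the next-highest id (wrapping around the circle)."""
    ids = sorted(node_ids)
    i = bisect_right(ids, key_id)
    return ids[i % len(ids)]             # wrap to the smallest id if needed

nodes = {chash(ip): ip for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]}
for name in ["HitSong.mp3", "http://cse.unr.edu/", "Mehmet Hadi Gunes"]:
    owner = nodes[successor(nodes.keys(), chash(name))]
    print(f"{name!r} is stored at {owner}")
```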
Joins and Leaves of Nodes
- Maintain a circularly linked list around the ring
- Every node has a predecessor and a successor
How to Find the Nearest Node?
- Need to find the closest node
  - To determine who should store a (key, value) pair
  - To direct a future lookup(key) query to that node
- Strawman solution: walk through the circular linked list of nodes in the ring; O(n) lookup time with n nodes
- Alternative solution: jump further around the ring using a "finger" table of additional overlay links
Links in the Overlay Topology
- Trade-off between # of hops and # of neighbors
- E.g., log(n) for both, where n is the number of nodes: overlay links 1/2, 1/4, 1/8, ... of the way around the ring
- Each hop traverses at least half of the remaining distance
Chord "Finger Table" Accelerates Lookups
[Figure: node N80 keeps fingers 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128 of the way around the ring]
- O(log N) fingers means we don't have to track all nodes, and finger info is easy to maintain
Chord lookups take O(log N) hops
[Figure: Lookup(K19) starting at N80 follows fingers around the ring toward the key]
- Note: each finger points to the first node at or after the finger's position
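The following is a simplified, single-process sketch of Chord-style finger tables and greedy lookup (small integer IDs, no joins, stabilization, or RPC); it is meant to illustrate the O(log N) routing idea, not to reproduce the full protocol.

```python
# Simplified Chord sketch: finger tables and recursive lookup on a small ring.
# Omits joins, stabilization, and real RPC; ids are small ints for readability.
from bisect import bisect_left

M = 7                                   # id space of size 2^7 = 128
node_ids = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(ident):
    """First node at or after `ident` on the ring (mod 2^M)."""
    i = bisect_left(node_ids, ident % (2 ** M))
    return node_ids[i % len(node_ids)]

def finger_table(n):
    """Finger i points at successor(n + 2^i): links 1/128, ..., 1/4, 1/2 of the ring."""
    return [successor(n + 2 ** i) for i in range(M)]

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def lookup(start, key, hops=0):
    """Route greedily via fingers; returns (node storing key, hop count)."""
    if in_interval(key, start, successor(start + 1)):
        return successor(start + 1), hops
    fingers = finger_table(start)
    # Closest preceding finger: the farthest finger that does not overshoot the key.
    nxt = next((f for f in reversed(fingers) if in_interval(f, start, key)), fingers[0])
    return lookup(nxt, key, hops + 1)

print(lookup(32, 19))    # key 19 is stored at node 21, reached in a couple of hops
```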
Successor Lists Ensure Robust Lookup
[Figure: each node on the ring stores its next three successors, e.g., N5 stores 10, 20, 32 and N110 stores 5, 10, 20]
- Each node remembers r successors
- Lookup can skip over dead nodes to find blocks
- All r successors have to fail before there is a problem; the list ensures we find the actual current successor
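A tiny sketch of the failover rule (the node names and liveness set are illustrative, and liveness would really be detected via timeouts): a node falls back to the first live entry in its successor list.

```python
# Sketch: with r successors cached, a lookup can skip over dead nodes.
successor_list = {              # each node remembers its next r = 3 successors
    "N5":  ["N10", "N20", "N32"],
    "N10": ["N20", "N32", "N40"],
    "N20": ["N32", "N40", "N60"],
}
alive = {"N5", "N20", "N32", "N40", "N60"}     # N10 has failed

def next_live_successor(node):
    """First entry in the successor list that still responds."""
    for succ in successor_list[node]:
        if succ in alive:
            return succ
    raise RuntimeError("all r successors failed")  # ring connectivity lost

print(next_live_successor("N5"))    # N20: the lookup skips the dead N10
```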
Nodes Coming and Going
- Small changes when nodes come and go
- Only affects the keys mapped to the node that comes or goes
Chord Self-organization
- Node join: set up log(n) fingers, at O(log^2 n) cost
- Node leave: maintain the successor list for ring connectivity; update successor lists and finger pointers
Chord lookup algorithm properties
- Interface: lookup(key) -> IP address
- Efficient: O(log N) messages per lookup, where N is the total number of servers
- Scalable: O(log N) state per node
- Robust: survives massive failures
- Simple to analyze
Plaxton Trees
- Motivation: access nearby copies of replicated objects
- Time-space trade-off: space = routing table size, time = access hops
Plaxton Trees - Algorithm
1. Assign labels to objects and nodes using randomizing hash functions
[Figure: an object labeled 9AE4 and a node labeled 247B]
Plaxton Trees - Algorithm
2. Each node knows about other nodes with varying prefix matches
[Figure: node 247B keeps neighbors with prefix matches of length 0 (e.g., 1..., 3...), length 1 (e.g., 23.., 25..), length 2 (e.g., 246., 248.), and length 3 (e.g., 247A, 247C)]
Plaxton Trees - Object Insertion and Lookup
- Given an object, route successively towards nodes with greater prefix matches
[Figure: from node 247B toward object 9AE4 via nodes 9F1., 9A76, 9AE2]
- Store the object at each of these locations
- log(n) steps to insert or locate an object
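A small sketch of Plaxton-style insertion under simplifying assumptions (hand-picked labels, and one prefix level improved per hop rather than a real per-digit routing table):

```python
# Sketch of Plaxton-style insertion: route toward nodes that match the object's
# label in ever-longer prefixes, leaving a pointer at each hop.
# The node set and labels are illustrative; real systems resolve one digit per hop.
nodes = ["247B", "9F10", "9A76", "9AE2", "9AE4"]

def match(a, b):
    """Length of the longest common prefix of two hex labels."""
    return next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), min(len(a), len(b)))

def insert_path(start, obj):
    """Greedily hop to a node with a strictly longer prefix match until stuck."""
    path, current = [start], start
    while True:
        better = [n for n in nodes if match(n, obj) > match(current, obj)]
        if not better:
            return path                      # current node is the object's "root"
        current = min(better, key=lambda n: match(n, obj))   # improve one level per hop
        path.append(current)

# Pointers to object 9AE4 are stored at every node on this path.
print(insert_path("247B", "9AE4"))           # ['247B', '9F10', '9A76', '9AE2', '9AE4']
```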
Plaxton Trees - Why is it a tree?
[Figure: routing paths for the same object from different nodes (e.g., 247B, 9F1., 9A76, 9AE2) converge, forming a tree rooted at the object's root node]
Pastry
- Based directly upon Plaxton Trees; exports a DHT interface
- Stores an object only at the node whose ID is closest to the object ID
- In addition to the main routing table, maintains a leaf set of nodes: the closest L nodes (in ID space), typically L = 2^(b+1)
State and Neighbor Assignment in Pastry DHT
[Figure: binary tree over IDs 000 ... 111; nodes are the leaves]
- Nodes are leaves in a tree
- log N neighbors, in sub-trees of varying heights
Routing in Pastry DHT
[Figure: routing from node 010 to node 111 across the ID tree]
- At each step, route to the sub-tree containing the destination
CAN
- Map nodes and keys to coordinates in a multi-dimensional Cartesian space; each node owns a zone
[Figure: a source routes toward the zone containing the key]
- Routing follows the shortest Euclidean path
- For d dimensions, routing takes O(d n^(1/d)) hops
State Assignment in CAN
- The key space is a virtual d-dimensional Cartesian space
[Figure: as nodes 1, 2, 3, 4 join, the space is repeatedly split so that each node owns one zone]
CAN Topology and Route Selection
[Figure: source S routes toward the destination point (a,b)]
- Route by forwarding to the neighbor "closest" to the destination
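A sketch of this greedy rule on a 2-D key space; the zones, their centers, and the neighbor lists below are hard-coded assumptions purely for illustration.

```python
# Sketch of CAN-style greedy routing: forward to whichever neighbor's zone
# center is geometrically closest to the destination point.
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

zones = {                      # node -> center of its zone in the unit square
    "A": (0.25, 0.75), "B": (0.75, 0.75),
    "C": (0.25, 0.25), "D": (0.625, 0.375), "E": (0.625, 0.125),
}
neighbors = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D", "E"],
    "D": ["B", "C", "E"], "E": ["C", "D"],
}

def route(start, dest):
    """Greedy hops until no neighbor is closer to `dest` than the current node."""
    path, current = [start], start
    while True:
        best = min(neighbors[current], key=lambda n: dist(zones[n], dest))
        if dist(zones[best], dest) >= dist(zones[current], dest):
            return path                     # current zone is closest to dest
        current = best
        path.append(current)

print(route("A", (0.7, 0.1)))               # ['A', 'C', 'E']
```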
Symphony
- Similar to Chord: same mapping of nodes and keys onto a ring
- k long-distance links are constructed probabilistically: a link spanning distance x is chosen with probability P(x) = 1/(x ln n)
- Expected routing guarantee: O((1/k) log^2 n) hops
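As a sketch of how such links could be drawn, inverse-transform sampling of the harmonic pdf p(x) = 1/(x ln n) on [1/n, 1] gives x = n^(u-1) for u uniform in [0, 1]; the network size and link budget below are assumed values.

```python
# Sketch of Symphony's probabilistic long links: each node draws k link
# distances x from the harmonic pdf p(x) = 1/(x ln n) on [1/n, 1].
# Inverse-transform sampling: x = n^(u - 1) for u uniform in [0, 1].
import random

def draw_link_distances(n, k):
    """Return k ring distances (as fractions of the ring) for long-range links."""
    return [n ** (random.random() - 1) for _ in range(k)]

n, k = 2 ** 16, 4                      # assumed network size and link budget
random.seed(1)
for x in draw_link_distances(n, k):
    print(f"link spans {x:.6f} of the ring")   # mostly short, occasionally long
```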
SkipNet
- Previous designs distribute data uniformly throughout the system
  - Good for load balancing, but my data could end up stored in Timbuktu!
  - Many organizations want stricter control over data placement
- What about the routing path? Should a Microsoft-to-Microsoft end-to-end path pass through Sun?
SkipNet Content and Path Locality
- Basic idea: probabilistic skip lists
[Figure: nodes arranged by height]
- Each node chooses a height at random: height h is chosen with probability 1/2^h
SkipNet Content and Path Locality
[Figure: skip-list levels over nodes machine1.berkeley.edu, machine1.unr.edu, machine2.unr.edu]
- Nodes are lexicographically sorted by name, which gives content and path locality
- Still an O(log n) routing guarantee!
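A small sketch of the height rule (node names reused from the figure; real SkipNet derives levels from random membership vectors):

```python
# Sketch of the probabilistic skip-list idea behind SkipNet: each node picks a
# height h with probability 1/2^h (flip a fair coin until it comes up tails).
import random

def pick_height(max_height=32):
    """Return h >= 1, where P(height = h) = 1/2^h."""
    h = 1
    while h < max_height and random.random() < 0.5:
        h += 1
    return h

random.seed(7)
nodes = sorted(["machine1.berkeley.edu", "machine1.unr.edu", "machine2.unr.edu"])
heights = {name: pick_height() for name in nodes}   # lexicographic order preserved
print(heights)      # about half the nodes reach each successive level
```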
What can DHTs do for us?
- Distributed object lookup based on object ID
- Decentralized file systems: CFS, PAST, Ivy
- Application-layer multicast: Scribe, Bayeux, SplitStream
- Databases: PIER
Where are we now?
- Many DHTs offer efficient and relatively robust routing
- Unanswered questions:
  - Node heterogeneity
  - Network-efficient overlays vs. structured overlays: a conflict of interest!
  - What happens with a high user churn rate?
  - Security
Are DHTs a panacea?
- A useful primitive, but there is tension between network-efficient construction and uniform key-value distribution
- Does every non-distributed application use only hash tables? Many rich data structures cannot be built on top of hash tables alone; exact-match lookups are not enough
- Does any P2P file-sharing system use a DHT?