1
Distributed Hash Tables
CPE 401 / 601 Computer Network Systems
Modified from slides by Ashwin Bharambe, Robert Morris, Krishna Gummadi, and Jinyang Li
2
Peer to peer systems
Symmetric in node functionality: anybody can join and leave; everybody gives and takes
Great potential: huge aggregate network capacity, storage, etc.
Resilient to single points of failure and attack
E.g., Napster, Gnutella, BitTorrent, Skype, Joost
3
The lookup problem
[Diagram: a publisher does Put(key=“title”, value=file data…) and a client elsewhere on the Internet does Get(key=“title”), with many intermediate nodes N1…N6]
1000s of nodes, and the set of nodes may change
This lookup problem is at the heart of all DHTs
4
Centralized lookup (Napster)
[Diagram: the publisher registers SetLoc(“title”, N4) with a central DB; the client sends Lookup(“title”) to the DB and is pointed to N4, which holds key=“title” and the file data]
Simple, but O(N) state at the directory and a single point of failure
O(N) state is hard to keep up to date
Not scalable; the central service needs $$$
5
Improved #1: Hierarchical lookup
Performs hierarchical lookup (like DNS)
More scalable than a central site
Root of the hierarchy is still a single point of vulnerability
6
Flooded queries (Gnutella)
[Diagram: the client floods Lookup(“title”) to its neighbors, which re-flood it until the node holding key=“title” is reached]
Robust, but worst case O(N) messages per lookup
Too much overhead traffic; limited in scale
No guarantee of results
7
Improved #2: super peers
E.g., Kazaa, Skype
Lookups only flood among super peers
Still no guarantee of lookup results
8
DHT approach: routing
Structured: put(key, value); get(key); nodes maintain routing state for fast lookup
Symmetric: all nodes have identical roles
Many protocols: Chord, Kademlia, Pastry, Tapestry, CAN, Viceroy
9
Routed queries (Freenet, Chord, etc.)
[Diagram: the client's Lookup(“title”) is routed hop by hop through intermediate nodes to the node storing key=“title” and the file data]
The key acts as a consistent rendezvous point between publisher and client
Challenges: can we make it robust? Keep per-node state small? Actually find stuff in a changing system?
10
Hash Table
Name-value pairs (or key-value pairs)
E.g., “Mehmet Hadi Gunes” and …
E.g., “…” and the Web page
E.g., “HitSong.mp3” and “…”
Hash table: a data structure that associates keys with values
lookup(key) returns the associated value
11
Distributed Hash Table
A hash table spread over many nodes, distributed over a wide area
Main design goals:
Decentralization: no central coordinator
Scalability: efficient even with a large # of nodes
Fault tolerance: tolerate nodes joining and leaving
12
Distributed hash table (DHT)
[Diagram: a distributed application calls put(key, data) and get(key) → data on the DHT layer; the DHT layer calls lookup(key) → node IP address on the lookup service running at each node]
The application may be distributed over many nodes
The DHT distributes data storage over many nodes
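As a rough illustration of this layering (a sketch, not the API of any particular system), the example below separates a lookup service that maps a key to a node from a DHT layer that calls put/get on whatever node the lookup returns. All class and function names, the 7-bit id space, and the sample node IDs are invented for the example.

```python
import bisect
import hashlib

def key_id(key: str, bits: int = 7) -> int:
    """Hash a key into a small, flat identifier space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** bits)

class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.store = {}                      # this node's share of the data

class LookupService:
    """Toy lookup layer: lookup(key) -> the node responsible for the key."""
    def __init__(self, nodes):
        self.nodes = sorted(nodes, key=lambda n: n.node_id)
        self.ids = [n.node_id for n in self.nodes]

    def lookup(self, key: str) -> Node:
        i = bisect.bisect_left(self.ids, key_id(key)) % len(self.nodes)
        return self.nodes[i]                 # first node at or after the key id

class DHT:
    """Toy DHT layer: put/get route through the lookup service to a node."""
    def __init__(self, lookup_service: LookupService):
        self.lookup = lookup_service.lookup

    def put(self, key, data):
        self.lookup(key).store[key] = data

    def get(self, key):
        return self.lookup(key).store.get(key)

dht = DHT(LookupService([Node(i) for i in (10, 32, 60, 80, 100)]))
dht.put("HitSong.mp3", b"...")
print(dht.get("HitSong.mp3"))
```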
13
DHT: basic idea
[Diagram: many nodes across the Internet, each storing a share of the (K, V) pairs]
14
DHT: basic idea
[Same diagram] Neighboring nodes are “connected” at the application level
15
DHT: basic idea
Operation: take a key as input; route messages to the node holding that key
16
DHT: basic idea
insert(K1, V1): the request is routed toward the node responsible for K1
17
DHT: basic idea
insert(K1, V1), continued: the request travels hop by hop through the overlay
18
DHT: basic idea
The pair (K1, V1) ends up stored at the node responsible for K1
19
DHT: basic idea
retrieve(K1): the request is routed the same way and returns V1
20
Distributed Hash Table
Two key design decisions:
How do we map names onto nodes?
How do we route a request to that node?
21
Hash Functions
Hashing: transform the key into a number, and use the number to index an array
Example hash function: Hash(x) = x mod 101, mapping to 0, 1, …, 100
Challenges:
What if there are more than 101 nodes? Fewer?
Which nodes correspond to each hash value?
What if nodes come and go over time?
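A quick illustration of why plain modular hashing struggles when the number of nodes changes: with Hash(x) = x mod 101, shrinking to 100 buckets remaps almost every key. The SHA-1 step that turns a string key into a number is just one possible choice for this sketch.

```python
import hashlib

def to_number(key: str) -> int:
    """Turn an arbitrary key into an integer (SHA-1 is just one choice here)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def bucket(key: str, num_buckets: int) -> int:
    return to_number(key) % num_buckets      # Hash(x) = x mod num_buckets

keys = [f"file-{i}.mp3" for i in range(1000)]

# With plain modular hashing, going from 101 to 100 buckets remaps
# almost every key: the "nodes come and go" problem.
moved = sum(bucket(k, 101) != bucket(k, 100) for k in keys)
print(f"{moved} of {len(keys)} keys change buckets")   # typically ~99%
```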
22
Consistent Hashing
“View” = the subset of hash buckets that are visible to a client
For this discussion, a view is O(n) neighbors, but we do not need strong consistency on views
Desired features:
Balance: in any one view, load is spread evenly across buckets
Smoothness: little change to bucket contents when buckets are added or removed
Spread: regardless of views, an object maps to only a small set of buckets
Load: across views, the # of objects assigned to any one bucket is small
23
Fundamental Design Idea - I
Consistent hashing: map keys and nodes to the same identifier space; responsibility is assigned implicitly
[Diagram: nodes A, B, C, D and a key placed along the identifier space]
Mapping is performed using hash functions (e.g., SHA-1)
Spreads nodes and keys uniformly throughout the space
24
Fundamental Design Idea - II
Prefix / hypercube routing
[Diagram: a route from a source node to a destination node through the overlay]
25
Definition of a DHT
Hash table supports two operations: insert(key, value) and value = lookup(key)
Distributed: map hash buckets to nodes
Requirements:
Uniform distribution of buckets
Cost of insert and lookup should scale well
Amount of local state (routing table size) should scale well
26
DHT ideas
Scalable routing needs a notion of “closeness”
Ring structure [Chord]: distance measured as clockwise distance on the identifier circle
27
Different notions of “closeness”
Tree-like structure [Pastry, Kademlia, Tapestry]
[Diagram: node IDs 011000**, 011001**, 011010**, 011011** arranged as leaves of a tree]
Distance measured by the length of the longest prefix shared with the lookup key
Why different choices? What are the criteria for a good notion of closeness?
28
Different notions of “closeness”
Cartesian space [CAN]
[Diagram: the unit square from (0,0) to (1,1) split into rectangular zones, e.g., (0, 0.5)–(0.5, 1), (0.5, 0.5)–(1, 1), (0, 0)–(0.5, 0.5), (0.75, 0)–(1, 0.5)]
Distance measured as geometric (Euclidean) distance
29
Chord
Map nodes and keys to identifiers using randomizing hash functions
Arrange them on an identifier circle
[Diagram: identifier circle showing a key x with its successor succ(x) and predecessor pred(x)]
30
Consistent Hashing
Construction:
Assign each of C hash buckets to random points on a mod 2^n circle, where n is the hash key size
Map each object to a random position on the circle
Hash of object = closest clockwise bucket
[Diagram: circle with buckets at positions 4, 8, 12, 14]
Desired features:
Balance: no bucket is responsible for a large number of objects
Smoothness: adding a bucket does not cause major movement of objects among existing buckets
Spread and load: only a small set of buckets lie near any given object
31
Consistent Hashing
Large, sparse identifier space (e.g., 128 bits)
Hash keys uniformly into the large id space; hash nodes into the same id space
Id space represented as a ring covering 0 to 2^128 − 1
Hash(name) → object_id
Hash(IP_address) → node_id
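A minimal sketch of this mapping, assuming SHA-1 (as the earlier slide suggests) reduced to a 128-bit ring; the sample object name and node address are made up for the example.

```python
import hashlib

ID_BITS = 128
RING = 2 ** ID_BITS

def ring_id(value: str) -> int:
    """Hash an arbitrary string onto the identifier ring [0, 2^128)."""
    digest = hashlib.sha1(value.encode()).digest()   # 160 bits, truncated below
    return int.from_bytes(digest, "big") % RING

object_id = ring_id("HitSong.mp3")        # Hash(name)       -> object_id
node_id = ring_id("192.168.1.7:4000")     # Hash(IP_address) -> node_id
print(hex(object_id), hex(node_id))
```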
32
Where to Store (Key, Value) Pair?
Map keys in a load-balanced way
Store the key at one or more nodes whose identifiers are “close” to the key, where distance is measured in the id space
Advantages: even distribution; few changes as nodes come and go…
Hash(name) → object_id
Hash(IP_address) → node_id
33
Hash a Key to Successor
Successor: the node with the next-highest ID in the circular ID space
[Diagram: nodes N10, N32, N60, N80, N100 on the ring; K5 and K10 map to N10; K11 and K30 to N32; K33, K40, K52 to N60; K65 and K70 to N80; K100 to N100]
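A small sketch of the successor rule using the node and key IDs from this diagram: bisect finds the first node ID at or after the key ID, wrapping around the ring.

```python
import bisect

node_ids = [10, 32, 60, 80, 100]            # N10, N32, N60, N80, N100 (sorted)

def successor(key_id: int) -> int:
    """Return the node with the next-highest ID (wrapping around the ring)."""
    i = bisect.bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]

for k in (5, 10, 11, 30, 33, 40, 52, 65, 70, 100):
    print(f"K{k} -> N{successor(k)}")
# K5, K10 -> N10; K11, K30 -> N32; K33, K40, K52 -> N60; K65, K70 -> N80; K100 -> N100
```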
34
Joins and Leaves of Nodes
Maintain a circular linked list around the ring
Every node has a predecessor and a successor
[Diagram: pred ← node → succ]
35
How to Find the Nearest Node?
Need to find the closest node:
To determine who should store a (key, value) pair
To direct a future lookup(key) query to that node
Strawman solution: walk through the circular linked list of nodes in the ring; O(n) lookup time with n nodes
Alternative solution: jump further around the ring using a “finger” table of additional overlay links
36
Links in the Overlay Topology
Trade-off between # of hops and # of neighbors
E.g., log(n) for both, where n is the number of nodes
E.g., overlay links that span 1/2, 1/4, 1/8, … of the way around the ring
Each hop traverses at least half of the remaining distance
37
Chord “Finger Table” Accelerates Lookups
[Diagram: node N80 with finger pointers reaching 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
O(log N) fingers means a node does not have to track all other nodes
Finger information is easy to maintain
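A sketch of how such a finger table could be built, following the Chord rule finger[i] = successor((n + 2^i) mod 2^m); the 7-bit id space and the node IDs reuse the toy ring from the earlier slides.

```python
import bisect

M = 7                                        # toy id space of 2^7 = 128 identifiers
node_ids = sorted([10, 32, 60, 80, 100])     # live nodes on the ring

def successor(ident: int) -> int:
    """First node whose ID is >= ident, wrapping around the ring."""
    i = bisect.bisect_left(node_ids, ident % 2 ** M)
    return node_ids[i % len(node_ids)]

def finger_table(n: int):
    """finger[i] points to successor(n + 2^i), for i = 0 .. M-1."""
    return [successor((n + 2 ** i) % 2 ** M) for i in range(M)]

print(finger_table(80))   # N80's fingers: successors of 81, 82, 84, 88, 96, 112, 16
```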
38
Chord lookups take O(log N) hops
[Diagram: Lookup(K19) is forwarded along finger pointers (e.g., via N80 and N60), halving the remaining distance at each hop]
Note that fingers point to the first node at or after each finger position
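A rough sketch of the lookup loop on the same toy ring: at each hop we jump to the finger that most closely precedes the key, so the remaining distance shrinks by at least half per hop. This is a simplified, iterative version, not the full Chord protocol.

```python
import bisect

M = 7                                          # 7-bit toy id space
node_ids = sorted([10, 32, 60, 80, 100])

def successor(ident: int) -> int:
    i = bisect.bisect_left(node_ids, ident % 2 ** M)
    return node_ids[i % len(node_ids)]

def dist(a: int, b: int) -> int:
    """Clockwise distance from a to b on the ring."""
    return (b - a) % 2 ** M

def lookup(start: int, key: int):
    """Follow fingers that precede the key until we reach the key's successor."""
    current, hops = start, 0
    while successor(key) != current:
        fingers = [successor((current + 2 ** i) % 2 ** M) for i in range(M)]
        preceding = [f for f in fingers
                     if f != current and dist(current, f) <= dist(current, key)]
        # jump to the closest preceding finger, or straight to the successor
        current = min(preceding, key=lambda f: dist(f, key)) if preceding \
            else successor(key)
        hops += 1
    return current, hops

print(lookup(80, 19))   # (32, 2): K19 is resolved at N32 in two hops from N80
```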
39
Successor Lists Ensure Robust Lookup
[Diagram: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; each node's successor list holds its next three nodes, e.g., N5 → 10, 20, 32 and N99 → 110, 5, 10]
Each node remembers r successors
All r successors have to fail before there is a problem
The list ensures we find the actual current successor; lookups can skip over dead nodes to find blocks
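A small sketch of how a successor list lets a lookup skip dead nodes, assuming each node keeps r = 3 successors (as in the diagram) and simply returns its first live entry.

```python
# Successor lists for part of the ring above: node -> its next r = 3 successors.
successor_list = {
    5: [10, 20, 32],
    10: [20, 32, 40],
    20: [32, 40, 60],
    32: [40, 60, 80],
}

def next_alive_successor(node: int, dead: set) -> int:
    """Skip over dead nodes: return the first live entry in the successor list."""
    for succ in successor_list[node]:
        if succ not in dead:
            return succ
    raise RuntimeError("all r successors failed at the same time")

print(next_alive_successor(5, dead={10}))        # -> 20
print(next_alive_successor(5, dead={10, 20}))    # -> 32
```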
40
Nodes Coming and Going
Small changes when nodes come and go: only the keys mapped to the node that joins or leaves are affected
Hash(name) → object_id
Hash(IP_address) → node_id
41
Chord Self-organization
Node join: set up log(n) fingers, at O(log^2 n) cost
Node leave: maintain the successor list for ring connectivity; update successor lists and finger pointers
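A compact sketch of the periodic maintenance idea behind joins and leaves (in the spirit of Chord's stabilize/notify, but heavily simplified): each node repeatedly checks whether its successor's predecessor should become its new successor, and notifies its successor of itself. Node IDs and the two-node example are made up.

```python
def between(x: int, a: int, b: int) -> bool:
    """True if id x lies strictly between a and b going clockwise on the ring."""
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    """Minimal ring node sketching periodic stabilization (not full Chord)."""
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.successor = self        # a brand-new ring contains only this node
        self.predecessor = None

    def join(self, existing: "Node"):
        """Join via any existing node: adopt its view of our successor."""
        node = existing
        while not between(self.node_id, node.node_id, node.successor.node_id) \
                and self.node_id != node.successor.node_id:
            node = node.successor    # linear walk; real Chord would use fingers
        self.successor = node.successor

    def stabilize(self):
        """Periodic: adopt successor's predecessor if it is closer, then notify."""
        p = self.successor.predecessor
        if p is not None and between(p.node_id, self.node_id, self.successor.node_id):
            self.successor = p
        self.successor.notify(self)

    def notify(self, candidate: "Node"):
        """candidate thinks it is our predecessor; accept it if it is closer."""
        if self.predecessor is None or between(
                candidate.node_id, self.predecessor.node_id, self.node_id):
            self.predecessor = candidate

a, b = Node(10), Node(32)
b.join(a)
for _ in range(2):                   # a couple of stabilization rounds
    a.stabilize()
    b.stabilize()
print(a.successor.node_id, b.successor.node_id)   # 32 10
```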
42
Chord lookup algorithm properties
Interface: lookup(key) → IP address
Efficient: O(log N) messages per lookup, where N is the total number of servers
Scalable: O(log N) state per node
Robust: survives massive failures
Simple to analyze
43
Plaxton Trees
Motivation: access nearby copies of replicated objects
Time-space trade-off: space = routing table size; time = access hops
44
Plaxton Trees - Algorithm
1. Assign labels to objects and nodes using randomizing hash functions
[Diagram: an object labeled 9AE4 and a node labeled 247B (hexadecimal digit strings)]
45
Plaxton Trees - Algorithm
2. Each node knows about other nodes with varying prefix matches
[Diagram: routing table of node 247B; it knows neighbors 1… and 3… (prefix match of length 0), 23… and 25… (length 1), 246… and 248… (length 2), and 247A and 247C (length 3)]
46
Plaxton Trees Object Insertion and Lookup
Given an object, route successively toward nodes with greater prefix matches
[Diagram: starting from node 247B, object 9AE4 is routed via nodes 9F1…, 9A76, and 9AE2, each sharing a longer prefix with the object's label]
Store the object at each of these locations
log(n) steps to insert or locate an object
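A sketch of the prefix-routing step with hexadecimal labels like those in the figure: each hop moves to a known node whose label shares a longer prefix with the object's label. The routing tables below are invented for illustration and do not reproduce the figure exactly.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two labels."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(start: str, obj: str, routing_tables: dict) -> list:
    """Greedily route toward nodes sharing ever-longer prefixes with obj."""
    path, current = [start], start
    while shared_prefix_len(current, obj) < len(obj):
        candidates = routing_tables[current]
        nxt = max(candidates, key=lambda n: shared_prefix_len(n, obj))
        if shared_prefix_len(nxt, obj) <= shared_prefix_len(current, obj):
            break                              # no closer node known; stop here
        path.append(nxt)
        current = nxt
    return path

# Hypothetical routing tables in the spirit of the figure.
routing_tables = {
    "247B": ["1F3A", "9A76", "23C0", "247A"],
    "9A76": ["247B", "9AE2", "9F11"],
    "9AE2": ["9A76", "9AE4"],
    "9AE4": [],
}
print(route("247B", "9AE4", routing_tables))   # ['247B', '9A76', '9AE2', '9AE4']
```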
47
Plaxton Trees - Why is it a tree?
[Diagram: copies of the object stored at nodes 9AE2, 9A76, 9F1…, and 247B; the routes toward them form a tree]
48
Pastry
Based directly upon Plaxton trees; exports a DHT interface
Stores an object only at the node whose ID is closest to the object ID
In addition to the main routing table, maintains a leaf set: the closest L nodes (in ID space), typically L = 2(b + 1)
49
State and Neighbor Assignment in Pastry DHT
[Diagram: binary tree with leaves 000, 001, 010, 011, 100, 101, 110, 111]
Nodes are leaves in a tree
log N neighbors, one in each sub-tree of varying height
50
Routing in Pastry DHT
[Diagram: source 010 routes to destination 111 among leaves 000 … 111]
Route to the sub-tree containing the destination
51
CAN
Map nodes and keys to coordinates in a multi-dimensional Cartesian space
[Diagram: the space is split into zones; a source routes a lookup toward the zone holding the key]
Routing follows the shortest Euclidean path
For d dimensions, routing takes O(d n^(1/d)) hops
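A sketch of greedy CAN-style forwarding in d = 2 dimensions: each hop goes to the neighbor whose zone center is closest, in Euclidean distance, to the key's coordinates. The zone centers and neighbor lists are made up for the example.

```python
import math

def euclid(p, q):
    return math.dist(p, q)

# Hypothetical 2-D zones identified by their centers, with their neighbor zones.
neighbors = {
    (0.25, 0.25): [(0.75, 0.25), (0.25, 0.75)],
    (0.75, 0.25): [(0.25, 0.25), (0.75, 0.75)],
    (0.25, 0.75): [(0.25, 0.25), (0.75, 0.75)],
    (0.75, 0.75): [(0.75, 0.25), (0.25, 0.75)],
}

def route(source, key_point):
    """Forward greedily to the neighbor closest to the key's coordinates."""
    path, current = [source], source
    while True:
        best = min(neighbors[current], key=lambda z: euclid(z, key_point))
        if euclid(best, key_point) >= euclid(current, key_point):
            return path                       # current zone owns the point
        path.append(best)
        current = best

print(route((0.25, 0.25), (0.8, 0.9)))   # [(0.25, 0.25), (0.25, 0.75), (0.75, 0.75)]
```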
52
State Assignment in CAN
[Diagram: node 1 owns the entire key space]
The key space is a virtual d-dimensional Cartesian space
53
State Assignment in CAN
[Diagram: node 2 joins and the space is split between nodes 1 and 2]
The key space is a virtual d-dimensional Cartesian space
54
State Assignment in CAN
[Diagram: node 3 joins, splitting an existing zone; the space is now divided among nodes 1, 2, and 3]
The key space is a virtual d-dimensional Cartesian space
55
State Assignment in CAN
[Diagram: node 4 joins; the space is now divided among nodes 1, 2, 3, and 4]
The key space is a virtual d-dimensional Cartesian space
56
State Assignment in CAN
Key space is a virtual d-dimensional Cartesian space
57
CAN Topology and Route Selection
[Diagram: source S routes toward the point (a, b)]
Route by forwarding to the neighbor “closest” to the destination
58
Symphony
Similar to Chord in its mapping of nodes and keys
Each node's k long links are constructed probabilistically: a link spanning distance x is chosen with probability density p(x) = 1/(x ln n)
Expected routing guarantee: O((1/k) log^2 n) hops
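A sketch of how such link distances could be drawn on a ring normalized to [0, 1): inverting the CDF of p(x) = 1/(x ln n) over [1/n, 1] gives x = n^(u-1) for u uniform in [0, 1). The ring size and link count below are arbitrary example values.

```python
import random

def symphony_link_distance(n: int) -> float:
    """Sample a link distance from the harmonic pdf p(x) = 1/(x ln n), x in [1/n, 1]."""
    u = random.random()            # uniform in [0, 1)
    return n ** (u - 1)            # inverse-CDF sampling

n, k = 2 ** 20, 4                  # ring of ~1M positions, k long links per node
links = [symphony_link_distance(n) for _ in range(k)]
print(links)                       # mostly short distances, occasionally long ones
```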
59
SkipNet
Previous designs distribute data uniformly throughout the system
Good for load balancing, but my data could end up stored in Timbuktu!
Many organizations want stricter control over data placement
What about the routing path? Should a Microsoft-to-Microsoft end-to-end path pass through Sun?
60
SkipNet Content and Path Locality
Basic idea: probabilistic skip lists
[Diagram: nodes along the bottom, with towers of varying height]
Each node chooses a height at random, picking height h with probability 1/2^h
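A sketch of the random-height rule: keep flipping a fair coin until it comes up tails, which yields height h with probability 1/2^h.

```python
import random

def skip_list_height() -> int:
    """Return h >= 1 with probability 1/2^h (geometric: coin flips until tails)."""
    h = 1
    while random.random() < 0.5:
        h += 1
    return h

heights = [skip_list_height() for _ in range(10_000)]
print(sum(h == 1 for h in heights) / len(heights))   # ~0.5, as expected
```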
61
SkipNet Content and Path Locality
[Diagram: skip list over nodes machine1.berkeley.edu, machine1.unr.edu, machine2.unr.edu]
Nodes are lexicographically sorted by name
Still an O(log n) routing guarantee!
62
What can DHTs do for us?
Distributed object lookup, based on object ID
Decentralized file systems: CFS, PAST, Ivy
Application-layer multicast: Scribe, Bayeux, SplitStream
Databases: PIER
63
Where are we now?
Many DHTs offer efficient and relatively robust routing
Unanswered questions:
Node heterogeneity
Network-efficient overlays vs. structured overlays: a conflict of interest!
What happens under a high user churn rate?
Security
64
Are DHTs a panacea?
A useful primitive, but:
Tension between network-efficient construction and uniform key-value distribution
Does every non-distributed application use only hash tables? Many rich data structures cannot be built on top of hash tables alone; exact-match lookups are not enough
Does any P2P file-sharing system use a DHT?