A Scalable, Content-Addressable Network Sylvia Ratnasamy (UC Berkeley dissertation, 2002), Paul Francis, Mark Handley, Richard Karp, Scott Shenker Slides by the authors above, with additions/modifications by Michael Isaacs (UCSB) and Tony Hannan (Georgia Tech)
Outline Introduction Design Design Improvements Evaluation Related Work Conclusion
Internet-scale hash tables Hash tables – essential building block in software systems Internet-scale distributed hash tables – equally valuable to large-scale distributed systems: peer-to-peer systems – Napster, Gnutella, Groove, FreeNet, MojoNation… large-scale storage management systems – Publius, OceanStore, PAST, Farsite, CFS… mirroring on the Web
Content-Addressable Network (CAN) CAN: Internet-scale hash table Interface – insert (key, value) – value = retrieve (key) Properties – scalable – operationally simple – good performance Related systems: Chord, Plaxton/Tapestry...
Problem Scope provide the hashtable interface scalability robustness performance
Not in scope security anonymity higher level primitives – keyword searching – mutable content
Outline Introduction Design Design Improvements Evaluation Related Work Conclusion
CAN: basic idea [figure: a hash table of (K,V) pairs spread across many nodes; insert(K1,V1) stores the pair at some node, and retrieve(K1) later returns it]
CAN: solution Virtual Cartesian coordinate space Entire space is partitioned amongst all the nodes – every node “owns” a zone in the overall space Abstraction – can store data at “points” in the space – can route from one “point” to another Point = node that owns the enclosing zone
CAN: simple example [figure: a 2-d coordinate space partitioned into zones as nodes 1, 2, 3 join] node I::insert(K,V): (1) a = h_x(K), b = h_y(K) (2) route (K,V) to point (a,b) (3) the node owning (a,b) stores (K,V)
CAN: simple example node J::retrieve(K): (1) a = h_x(K), b = h_y(K) (2) route “retrieve(K)” to (a,b); the owner of (a,b) returns the stored (K,V)
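A minimal sketch of the hashing step above, assuming SHA-1 as the hash and a unit square as the coordinate space (the slides specify neither); insert and retrieve both map the key to the same point:

import hashlib

def h(key: str, axis: str) -> float:
    # Hash the key onto one axis of the assumed unit square [0,1) x [0,1).
    digest = hashlib.sha1((axis + ":" + key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def key_to_point(key: str):
    # (a, b) = (h_x(K), h_y(K)); the node owning the zone containing (a, b) stores (K, V).
    return h(key, "x"), h(key, "y")

a, b = key_to_point("K1")                # insert(K1, V1) routes (K1, V1) toward (a, b)
print(key_to_point("K1") == (a, b))      # retrieve(K1) recomputes the same point -> True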
Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
CAN: routing table
CAN: routing [figure: a message is forwarded greedily through neighboring zones from the node at (a,b) toward the destination point (x,y)]
CAN: routing A node only maintains state for its immediate neighboring nodes
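Greedy next-hop selection can be sketched as follows; the table of neighbor zone centers is an assumed simplification (the real routing table stores each neighbor's IP address and zone bounds):

import math

def next_hop(neighbors, dest):
    # Forward to the neighbor whose zone center is closest to the destination point.
    # `neighbors` maps a neighbor id to the (x, y) center of its zone.
    return min(neighbors, key=lambda n: math.dist(neighbors[n], dest))

print(next_hop({"A": (0.2, 0.5), "B": (0.7, 0.4), "C": (0.5, 0.9)}, (0.8, 0.2)))   # -> B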
CAN: node insertion 1) The new node discovers some node “I” already in the CAN (via a bootstrap node) 2) The new node picks a random point (p,q) in the space 3) I routes to (p,q) and discovers node J, the owner of that zone 4) J’s zone is split in half; the new node owns one half
CAN: node insertion Inserting a new node affects only a single other node and its immediate neighbors
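Step 4 (splitting J's zone) might look like the sketch below; the zone representation and the split-along-the-longer-side rule are assumptions, and a real implementation would also hand over the relevant (K,V) pairs and update neighbor lists:

def split_zone(zone):
    # zone = ((x_lo, y_lo), (x_hi, y_hi)); split in half along its longer side.
    (x_lo, y_lo), (x_hi, y_hi) = zone
    if (x_hi - x_lo) >= (y_hi - y_lo):
        mid = (x_lo + x_hi) / 2
        return ((x_lo, y_lo), (mid, y_hi)), ((mid, y_lo), (x_hi, y_hi))
    mid = (y_lo + y_hi) / 2
    return ((x_lo, y_lo), (x_hi, mid)), ((x_lo, mid), (x_hi, y_hi))

j_half, new_half = split_zone(((0.0, 0.0), (0.5, 1.0)))   # J keeps j_half, the new node owns new_half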
CAN: node failures Need to repair the space – recover the database: state updates use replication; rebuild the database from replicas – repair routing: takeover algorithm
CAN: takeover algorithm Nodes periodically send update messages to their neighbors An absent update message indicates a failure – Each node sends a recover message to the takeover node – The takeover node takes over the failed node's zone
CAN: takeover node The takeover node is the node whose VID is numerically closest to the failed node's VID The VID is a binary string encoding the path from the root to the node's zone in the binary partition tree
CAN: VID predecessor / successor To handle multiple neighbors failing simultaneously, each node maintains its predecessor and successor in the VID ordering (idea borrowed from Chord). This guarantees that the takeover node can always be found.
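One way to realize “numerically closest VID” is to read each VID as a binary fraction, as in this sketch (the encoding is an assumption; the candidates would come from the node's known neighbors plus its VID predecessor and successor):

def vid_value(vid: str) -> float:
    # Read a VID (root-to-zone path in the binary partition tree) as a binary fraction in [0, 1).
    return sum(int(bit) / 2 ** (i + 1) for i, bit in enumerate(vid))

def takeover_node(failed_vid, candidate_vids):
    # Pick the candidate whose VID is numerically closest to the failed node's VID.
    target = vid_value(failed_vid)
    return min(candidate_vids, key=lambda v: abs(vid_value(v) - target))

print(takeover_node("0110", ["0111", "0010", "1100"]))   # -> 0111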
CAN: node failures Only the failed node's immediate neighbors are required for recovery
Basic Design recap Basic CAN – completely distributed – self-organizing – nodes only maintain state for their immediate neighbors
Outline Introduction Design Design Improvements Evaluation Related Work Conclusion
Design Improvement Objectives Reduce latency of CAN routing – Reduce CAN path length – Reduce per-hop latency (IP path length/latency per CAN hop) Improve CAN robustness – data availability – routing
Design Improvement Tradeoffs Routing performance and system robustness vs. increased per-node state and system complexity
Design Improvement Techniques 1. Increase number of dimensions in coordinate space 2. Multiple coordinate spaces (realities) 3. Better CAN routing metrics (measure & consider time) 4. Overloading coordinate zones (peer nodes) 5. Multiple hash functions 6. Geographic-sensitive construction of CAN overlay network (distributed binning) 7. More uniform partitioning (when new nodes enter) 8. Caching & Replication techniques for “hot-spots” 9. Background load balancing algorithm
1. Increase dimensions
2. Multiple realities Multiple (r) coordinate spaces Each node maintains a different zone in each reality Contents of hash table replicated on each reality Routing chooses neighbor closest in any reality (can make large jumps towards target) Advantages: – Greater data availability – Fewer hops
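A sketch of next-hop selection with r realities, assuming each node keeps one neighbor table per reality (helper names are illustrative):

import math

def next_hop_any_reality(neighbor_tables, dest):
    # neighbor_tables[r] maps a neighbor id to its zone center in reality r.
    # Pick the (reality, neighbor) pair closest to the destination, allowing large jumps.
    best = None
    for r, table in enumerate(neighbor_tables):
        for node, center in table.items():
            d = math.dist(center, dest)
            if best is None or d < best[0]:
                best = (d, r, node)
    return best[1], best[2]   # (reality to forward in, neighbor to forward to)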
Multiple realities
Multiple dimensions vs realities
3. Time-sensitive routing Each node measures round-trip time (RTT) to each of its neighbors Routing message is forwarded to the neighbor with the maximum ratio of progress to RTT Advantage: – reduces per-hop latency (25-40% depending on # dimensions)
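The progress-to-RTT rule could be sketched like this, with assumed inputs (each neighbor's zone center plus its measured RTT):

import math

def rtt_weighted_next_hop(current, neighbors, dest):
    # neighbors maps id -> (zone center, measured RTT in ms).
    # Forward to the neighbor with the best ratio of progress toward dest to RTT.
    def score(n):
        center, rtt = neighbors[n]
        return (math.dist(current, dest) - math.dist(center, dest)) / rtt
    making_progress = [n for n in neighbors if score(n) > 0]
    return max(making_progress, key=score) if making_progress else None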
4. Zone overloading Each node maintains a list of “peers” in the same zone A zone is not split until it holds MAXPEERS nodes Only the lowest-RTT node in each neighboring zone is kept as a neighbor Data is replicated or split across peers Advantages – Reduced per-hop latency – Reduced path length (sharing zones has the same effect as reducing the total number of nodes) – Improved fault tolerance (a zone is vacant only when all its nodes fail)
Zone overloading [figure: results averaged over systems of 2^8 to 2^18 nodes]
5. Multiple hash functions A (key, value) pair is mapped to k points (and nodes) Parallel query execution along the k paths, or choose the closest of the k points in the space Advantages: – improved data availability – reduced path length
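Mapping one key to k points could be done as in this sketch (the per-function salting and SHA-1 are assumptions); a query can then be issued toward all k points in parallel or toward the closest one:

import hashlib

def points_for_key(key: str, k: int = 3):
    # One point per hash function h_1 ... h_k; each point typically lands on a different node.
    points = []
    for i in range(k):
        d = hashlib.sha1(f"h{i}:{key}".encode()).digest()
        x = int.from_bytes(d[:8], "big") / 2**64
        y = int.from_bytes(d[8:16], "big") / 2**64
        points.append((x, y))
    return points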
Multiple hash functions
6. Geographically distributed binning Distributed binning of CAN nodes based on their relative distances from m IP landmarks (such as root DNS servers) m! partitions of the coordinate space A new node joins the partition associated with its landmark ordering Topologically close nodes will be assigned to the same partition Nodes in the same partition divide the space up randomly Disadvantage: the coordinate space is no longer uniformly populated – background load balancing could alleviate this problem
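The landmark ordering that defines a node's bin could be computed as in this sketch (the landmark names and RTT values are made up for illustration):

def landmark_bin(rtts):
    # Order the m landmarks by measured RTT; nodes sharing an ordering
    # (one of the m! possible bins) join the same partition of the space.
    return tuple(sorted(rtts, key=rtts.get))

print(landmark_bin({"L1": 80, "L2": 15, "L3": 45}))   # -> ('L2', 'L3', 'L1')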
Distributed binning
7. More uniform partitioning When adding a node, split the largest-volume zone among the occupant node and its neighbors Advantage – 85% of nodes have the average volume, as opposed to 40% without, and the largest zone volume drops from 8x to 2x the average
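A sketch of the largest-volume rule, with an assumed zone representation:

def volume(zone):
    (x_lo, y_lo), (x_hi, y_hi) = zone
    return (x_hi - x_lo) * (y_hi - y_lo)

def zone_to_split(zones):
    # zones maps the occupant node and its neighbors to their zones;
    # the new node is handed to whichever of them owns the largest volume.
    return max(zones, key=lambda n: volume(zones[n]))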
8. Caching & Replication A node can cache values of keys it has recently requested A node can push out “hot” keys to its neighbors
9. Load balancing Merge and split zones to balance loads
Outline Introduction Design Design Improvements Evaluation Related Work Conclusion
Evaluation Metrics 1. Path length 2. Neighbor state 3. Latency 4. Volume 5. Routing fault tolerance 6. Data availability
CAN: scalability For a uniformly partitioned space with n nodes and d dimensions – per node, the number of neighbors is 2d – the average routing path is O(d · n^(1/d)) hops Simulations show that the above results hold in practice (Transit-Stub topologies from the GT-ITM generator) Can scale the network without increasing per-node state
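Plugging example numbers into the two expressions above (constants omitted, so these are rough orders of magnitude, not simulation results):

def can_scaling(n, d):
    # Per-node neighbors (2d) and O(d * n^(1/d)) routing hops for n nodes, d dimensions.
    return 2 * d, d * n ** (1 / d)

for d in (2, 4, 8):
    neighbors, hops = can_scaling(n=2**20, d=d)
    print(f"d={d}: {neighbors} neighbors, ~{hops:.0f} hops")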
CAN: latency latency stretch = (CAN routing delay) / (IP routing delay) To reduce latency, apply design improvements: – increase dimensions (d) – multiple nodes per zone (peer nodes) (p) – multiple “realities”, i.e. coordinate spaces (r) – multiple hash functions (k) – RTT-weighted routing – distributed binning – uniform partitioning – caching & replication (for “hot spots”)
Overall improvement
CAN: Robustness Completely distributed – no single point of failure Not exploring database recovery Resilience of routing – can route around trouble
Routing resilience [figure: a message routed from source to destination; as nodes fail, the message is routed around the failed zones]
Routing resilience node X::route(D) If X cannot make progress toward D – check whether any neighbor of X can make progress – if yes, forward the message to one such neighbor
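A sketch of one routing step with this fallback; the data passed in (each live neighbor's position plus that neighbor's own neighbor positions) is an assumed simplification of the real per-node state:

import math

def route_step(my_pos, live_neighbors, dest):
    # live_neighbors maps id -> (position, list of that node's own neighbor positions).
    my_dist = math.dist(my_pos, dest)
    closer = [n for n, (pos, _) in live_neighbors.items() if math.dist(pos, dest) < my_dist]
    if closer:   # normal greedy case: some live neighbor makes progress
        return min(closer, key=lambda n: math.dist(live_neighbors[n][0], dest))
    for n, (pos, their_nbrs) in live_neighbors.items():
        if any(math.dist(p, dest) < my_dist for p in their_nbrs):
            return n   # detour via a neighbor that can itself make progress
    return None        # no live route found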
Outline Introduction Design Design Improvements Evaluation Related Work Conclusion
Related Algorithms Distance Vector (DV) and Link State (LS) in IP routing – require widespread dissemination of topology – not well suited to dynamic topology like file- sharing Hierarchical routing algorithms – too centralized, single points of stress Other DHTs – Plaxton, Tapestry, Pastry, Chord
Related Systems DNS OceanStore (uses Tapestry) Publius: Web-publishing system with high anonymity P2P File sharing systems – Napster, Gnutella, KaZaA, Freenet
Outline Introduction Design Design Improvements Evaluation Related Work Conclusion
Conclusion: Purpose CAN – an Internet-scale hash table – potential building block in Internet applications
Conclusion: Scalability O(d) per-node state O(d · n^(1/d)) path length For d = (log n)/2, path length and per-node state are O(log n), like other DHTs
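A quick check of that claim (logarithms base 2):

d \cdot n^{1/d} = \frac{\log_2 n}{2} \cdot n^{2/\log_2 n} = \frac{\log_2 n}{2} \cdot 2^{2} = 2\log_2 n = O(\log n),
\qquad 2d = \log_2 n = O(\log n).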
Conclusion: Latency With the main design improvements, latency is less than twice the IP path latency
Conclusion: Robustness Decentralized, can route around trouble With certain design improvements (replication), high data availability
Weaknesses / Future Work Security – Denial of service attacks hard to combat because a malicious node can act as a malicious server as well as a malicious client Mutable content Search