
Nick McKeown CS244: Lecture 17 Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications [Stoica et al 2001] Thanks to Ion Stoica

About the author Professor Ion Stoica, UC Berkeley

Hashing Q: How does a single-node hash table work? Q: When might we use one?

Hashing Single-node hash table: –key = hash(name) –put(key, value) –get(key) → value (Figure: name “cat_photo.jpg” → hash(name) → key 0x2bd5f802bddeb12 → stored value.)

Hashing In a memory with n locations: address = hash(name) mod n. (Figure: “cat_photo.jpg” → key 0x2bd5f802bddeb12 → address 0xeb12, i.e., key mod n.)

Hashing In a node with m memories (stores), each with n locations: store = hash(name) mod m, address = hash(name) mod n. (Figure: “cat_photo.jpg” → key 0x2bd5f802bddeb12; key mod m picks one of stores 1..m, key mod n picks the address within it.)
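To make the hashing progression above concrete, here is a minimal Python sketch (mine, not from the slides): hash a name to a fixed-size key, pick one of m stores with mod m, and pick a slot within that store with mod n. The store count, slot count, and the cat_photo.jpg example are purely illustrative.

```python
import hashlib

M_STORES = 4        # number of stores/nodes (illustrative)
N_SLOTS = 1 << 16   # locations per store (illustrative)

def key_for(name: str) -> int:
    """hash(name) -> fixed-size integer key."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def place(name: str):
    """Return (store index, address within that store) for a name."""
    key = key_for(name)
    store = key % M_STORES      # which of the m stores
    address = key % N_SLOTS     # which of the n locations inside it
    return store, address

print(place("cat_photo.jpg"))   # e.g. (2, 47890); the exact numbers depend on the hash
```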

Scaling hashing Q: If I need to distribute the values over m computers (“nodes”), why don’t I just make m and n large enough to hold m × n values? Q: How would it work with m nodes each with its own name or IP address? Q: When would it not work?

Why people care Motivated by Napster, Gnutella, Freenet Four simultaneous papers –Chord (Stoica et al., >12,000 citations) –CAN (Ratnasamy et al.) –Pastry (Rowstron et al.) –Tapestry (Zhao et al.)

Key Value Storage Handle huge volumes of data Store (key, value) tuples Simple interface –put(key, value); // insert/write “value” associated with “key” –value = get(key); // get/read data associated with “key” Used sometimes as a simpler but more scalable “database”

Key Values: Examples Amazon: –Key: customerID –Value: customer profile (e.g., buying history, credit card, …) Facebook, Twitter: –Key: UserID –Value: user profile (e.g., posting history, photos, friends, …) iCloud/iTunes: –Key: movie/song name –Value: movie, song

System Examples Amazon –Dynamo: internal key value store used to power Amazon.com (shopping cart) –Simple Storage System (S3) BigTable/HBase/Hypertable: distributed, scalable data storage Cassandra: “distributed data management system” (developed by Facebook) Memcached: in-memory key-value store for small chunks of arbitrary data (strings, objects) eDonkey/eMule: peer-to-peer sharing system …

Key Value Store Also called a Distributed Hash Table (DHT) Main idea: partition the set of key-value pairs across many machines

Challenges Fault Tolerance: handle machine failures without losing data and without degradation in performance Scalability: –Need to scale to thousands of machines –Need to allow easy addition of new machines Consistency: maintain data consistency in face of node failures and message losses Heterogeneity (if deployed as peer-to-peer systems): –Latency: ms to secs –Bandwidth: kb/s to Gb/s …

Key Questions put(key, value): where to store new (key, value)? get(key): where to find value associated with “key”? And, do the above while providing –Fault Tolerance –Scalability –Consistency

Directory-Based Architecture Have a node (the master/directory) maintain the mapping between keys and the machines (nodes) that store the values associated with those keys. (Figure: the master/directory maps K5→N2, K14→N3, K105→N50; put(K14, V14) is sent to the master, which forwards it to N3, where V14 is stored.)

Directory-Based Architecture Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with those keys. (Figure: get(K14) is sent to the master/directory, which fetches V14 from N3 and returns it to the requester.)

Directory-Based Architecture Having the master relay the requests → recursive query. Another method: iterative query –Return the node to the requester and let the requester contact that node directly. (Figure: put(K14, V14) — the master returns N3, and the requester sends the put to N3 itself.)

Directory-Based Architecture Having the master relay the requests → recursive query. Another method: iterative query –Return the node to the requester and let the requester contact that node directly. (Figure: get(K14) — the master returns N3, and the requester fetches V14 from N3 itself.)

Discussion: Iterative vs. Recursive Query Recursive query – Advantages: faster, as the master/directory is typically closer to the nodes; easier to maintain consistency, as the master/directory can serialize puts()/gets(). Disadvantages: scalability bottleneck, as all values go through the master/directory. Iterative query – Advantages: more scalable. Disadvantages: slower; harder to enforce data consistency. (Figure: the same get(K14) traced recursively and iteratively.)
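A minimal sketch (mine, not from the slides) of the two query styles, using in-process dictionaries as stand-ins for the master/directory and the storage nodes; the node and key names mirror the figure but are otherwise arbitrary.

```python
# Toy directory-based store: "master" maps keys to node names, and each
# node is just an in-process dict standing in for a storage server.
nodes = {"N2": {"K5": "V5"}, "N3": {"K14": "V14"}, "N50": {"K105": "V105"}}
master = {"K5": "N2", "K14": "N3", "K105": "N50"}

def get_recursive(key):
    """Recursive query: the master relays the request and returns the value."""
    node = master[key]
    return nodes[node][key]          # value flows back through the master

def get_iterative(key):
    """Iterative query: the master only names the node; the client contacts it."""
    node = master[key]               # round trip 1: "who stores this key?"
    return nodes[node][key]          # round trip 2: client reads from that node

print(get_recursive("K14"), get_iterative("K14"))   # V14 V14
```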

Fault Tolerance Replicate each value on several nodes Usually, place replicas on different racks in a datacenter to guard against rack failures. (Figure: the directory maps K14→{N1, N3}; put(K14, V14) is forwarded to both N1 and N3.)

Fault Tolerance Again, we can have –Recursive replication (previous slide) –Iterative replication (this slide) (Figure: the master returns the replica set {N1, N3}, and the requester sends put(K14, V14) to each replica itself.)

Fault Tolerance Or we can use a recursive query with iterative replication… (Figure: put(K14, V14) is relayed through the master, with replication to the second replica done iteratively.)

Scalability Storage: use more nodes Number of requests: –Can serve requests from all nodes on which a value is stored in parallel –Master can replicate a popular value on more nodes Master/directory scalability: –Replicate it –Partition it, so different keys are served by different masters/directories Q: How do you partition?

Scalability: Load Balancing Directory keeps track of storage availability at each node –Preferentially insert new values on nodes with more storage available What happens when a new node is added? –Cannot insert only new values on the new node. Why? –Move values from heavily loaded nodes to the new node What happens when a node fails? –Need to replicate values from the failed node to other nodes

Consistency Need to make sure that a value is replicated correctly How do we know a value has been replicated on every node? –Wait for acknowledgements from every node What happens if a node fails during replication? –Pick another node and try again What happens if a node is slow? –Slow down the entire put()? Pick another node? In general, with multiple replicas –Slow puts and fast gets

Consistency (cont’d) With concurrent updates (i.e., puts to the same key) we may need to make sure that the updates happen in the same order at every replica. (Figure: put(K14, V14’) and put(K14, V14’’) reach N1 and N3 in reverse order — what does get(K14) return? Undefined!)

Consistency (cont’d) Large variety of consistency models: –Atomic consistency (linearizability): reads/writes (gets/puts) to replicas appear as if there were a single underlying replica (single system image) Think: “one update at a time” –Eventual consistency: given enough time, all updates will propagate through the system One of the weakest forms of consistency; used by many systems in practice –And many others: causal consistency, sequential consistency, strong consistency, …

Quorum Consensus Improve put() and get() operation performance 1. Define a replica set of size N 2. put() waits for acknowledgements from at least W replicas 3. get() waits for responses from at least R replicas 4. W+R > N Why does it work? –There is at least one node that contains the update Why might you use W+R > N+1?

Quorum Consensus Example N=3, W=2, R=2. Replica set for K14: {N1, N3, N4}. Assume the put() on N3 fails. (Figure: put(K14, V14) is acknowledged by N1 and N4, so the write succeeds with W=2 ACKs.)

Quorum Consensus Example Now, issuing get() to any two of the three replicas will return the answer. (Figure: get(K14) to N3 returns nil, get(K14) to N4 returns V14; with R=2, at least one response contains V14.)
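A hedged sketch of the W/R quorum rule using the example’s parameters (N=3, W=2, R=2). The replica names, the simple version counter, and the simulated failure of the put on N3 are illustrative; real systems use richer versioning (e.g., vector clocks).

```python
N, W, R = 3, 2, 2          # parameters from the example
assert W + R > N           # read and write quorums must intersect

replicas = {"N1": {}, "N3": {}, "N4": {}}   # replica set for K14

def quorum_put(key, value, version, failed=()):
    """Write to all replicas; the put succeeds once at least W acknowledge."""
    acks = 0
    for name, store in replicas.items():
        if name in failed:
            continue                      # e.g. the put() to N3 is lost
        store[key] = (version, value)
        acks += 1
    return acks >= W

def quorum_get(key, respondents):
    """Ask R replicas; return the value with the highest version seen."""
    answers = [replicas[n][key] for n in respondents if key in replicas[n]]
    # W + R > N guarantees at least one respondent holds the latest write.
    return max(answers)[1]

print(quorum_put("K14", "V14", version=1, failed={"N3"}))  # True: 2 ACKs >= W
print(quorum_get("K14", ["N3", "N4"]))                     # V14 (N4 has it even though N3 does not)
```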

Scaling Up Directory Challenge: –The directory contains a number of entries equal to the number of (key, value) tuples in the system –Can be hundreds of billions of entries! Solution: consistent hashing Associate to each node a unique id in a uni-dimensional space 0..2^m−1 –Partition this space across the machines –Assume keys are in the same uni-dimensional space –Each (Key, Value) is stored at the node with the smallest ID larger than the Key

Key to Node Mapping Example ID space 0..63 (m = 6): Node 8 maps keys [5, 8]; Node 15 maps keys [9, 15]; Node 20 maps keys [16, 20]; … Node 4 maps keys [59, 4], wrapping around the ring. (Figure: ring labeled 0..63; (K14, V14) is stored at node 15.)
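A sketch (mine) of the key-to-node rule above: each key is assigned to its successor, the first node ID at or clockwise after the key, wrapping around the ring. The node IDs are the ones from the example.

```python
import bisect

M = 6                                        # ID space 0 .. 2**M - 1 = 0 .. 63
nodes = sorted([4, 8, 15, 20, 35, 44, 58])   # node IDs from the example ring

def successor(key_id):
    """First node whose ID is >= key_id, wrapping around the ring."""
    i = bisect.bisect_left(nodes, key_id % (1 << M))
    return nodes[i % len(nodes)]             # wrap: keys above 58 map to node 4

print(successor(14))   # 15 -> node 15 maps keys [9, 15]
print(successor(59))   # 4  -> node 4 maps keys [59, 4] (wraps past 63)
```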

Scaling Up Directory With consistent hashing, the directory contains only a number of entries equal to the number of nodes –Much smaller than the number of tuples But: every query still needs to contact the directory Solution: distributed directory (a.k.a. lookup) service: –Given a key, find the node storing that key Key idea: route the request from node to node until reaching the node storing the request’s key Key advantage: totally distributed –No single point of failure; no hot spot

Chord: Distributed Lookup (Directory) Service Properties –Each node needs to know about only O(log(M)) other nodes, where M is the total number of nodes –Guarantees that a tuple is found in O(log(M)) steps

Lookup Each node maintains a pointer to its successor Route a packet (Key, Value) to the node responsible for its ID using successor pointers E.g., node 4 issues lookup(37): the request is forwarded around the ring until it reaches node 44, which is responsible for Key=37

Stabilization Procedure Periodic operation performed by each node n to maintain its successor when new nodes join the system: n.stabilize() x = succ.pred; if (x ∈ (n, succ)) succ = x; // if x is a better successor, update succ.notify(n); // n tells its successor about itself n.notify(n’) if (pred = nil or n’ ∈ (pred, n)) pred = n’; // if n’ is a better predecessor, update
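The two routines above written out as a small runnable Python sketch. This is my own rendering of the slide’s pseudocode, with the caveat that real Chord runs these as periodic RPCs between separate machines, not method calls inside one process. The interval test handles wrap-around on the ring.

```python
def between(x, a, b):
    """True if x lies in the ring interval (a, b), exclusive, with wrap-around."""
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self       # a one-node ring points at itself
        self.pred = None

    def stabilize(self):
        """Periodic: adopt succ.pred if it is a closer successor, then notify."""
        x = self.succ.pred
        if x is not None and between(x.id, self.id, self.succ.id):
            self.succ = x                 # x is a better successor
        self.succ.notify(self)            # tell the successor about ourselves

    def notify(self, n):
        """Called by a node that believes it may be our predecessor."""
        if self.pred is None or between(n.id, self.pred.id, self.id):
            self.pred = n                 # n is a better predecessor
```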

Joining Operation Node with id=50 joins the ring. Node 50 needs to know at least one node already in the system; assume the known node is 15. (Figure: initial state: node 44 has succ=58, pred=35; node 58 has succ=4, pred=44; node 50 has succ=nil, pred=nil.)

Joining Operation n=50 sends join(50) to node 15; the join is forwarded to n=44, which returns node 58. n=50 updates its successor to 58. (Figure: node 50 now has succ=58, pred=nil.)

Joining Operation n=50 executes stabilize(); n’s successor (58) returns x = 44. (Each of the following slides shows the stabilize()/notify() pseudocode from the previous slide alongside the ring figure.)

Joining Operation n=50 executes stabilize() → x = 44, succ = 58: since 44 ∉ (50, 58), node 50 keeps its successor 58.

Joining Operation n=50 executes stabilize() → x = 44, succ = 58. n=50 then sends notify(50) to its successor (58).

Joining Operation n=58 processes notify(50): pred = 44, n’ = 50.

Joining Operation n=58 processes notify(50): pred = 44, n’ = 50. Since 50 ∈ (44, 58), n=58 sets pred = 50.

Joining Operation n=44 runs stabilize(); n’s successor (58) returns x = 50.

Joining Operation n=44 runs stabilize() → x = 50, succ = 58.

Joining Operation n=44 runs stabilize() → x = 50, succ = 58. Since 50 ∈ (44, 58), n=44 sets succ = 50.

Joining Operation n=44 runs stabilize() and sends notify(44) to its new successor, node 50.

Joining Operation n=50 processes notify(44): pred = nil.

Joining Operation n=50 processes notify(44): pred = nil, so n=50 sets pred = 44.

Joining Operation (cont’d) This completes the joining operation! (Figure: final state: node 44 has succ=50; node 50 has succ=58, pred=44; node 58 has pred=50.)
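The walkthrough can be replayed with the hypothetical Node class and between() helper from the stabilization sketch a few slides back: node 50 sets its successor to 58 (the result of the join), and a round or two of stabilize() splices it between 44 and 58. This is only a simulation of the trace above, not code from the slides.

```python
# Ring fragment from the walkthrough: 44 -> 58 (-> 4 ...), plus joining node 50.
n44, n50, n58 = Node(44), Node(50), Node(58)
n44.succ, n44.pred = n58, Node(35)
n58.succ, n58.pred = Node(4), n44

n50.succ = n58                 # join(50) returned node 58
# Round 1: 58 adopts pred=50, 44 adopts succ=50, 50 adopts pred=44.
# Round 2: nothing changes (the ring is already consistent).
for _ in range(2):
    n50.stabilize()
    n44.stabilize()

print(n44.succ.id, n50.pred.id, n50.succ.id, n58.pred.id)   # 50 44 58 50
```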

Achieving Efficiency: finger tables Say m=7. The i-th entry at a peer with id n is the first peer with id ≥ (n + 2^i) mod 2^m. (Figure: finger table ft[i] at node 80; for example, (80 + 2^6) mod 2^7 = 16.)
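A sketch of building such a finger table. The ring membership below is illustrative (it is not taken from the figure); only the rule ft[i] = successor((n + 2^i) mod 2^m) comes from the slide.

```python
M = 7                                     # 7-bit ID space: 0 .. 127
ring = sorted([16, 32, 45, 80, 96, 112])  # illustrative node IDs (an assumption)

def successor(ident):
    """First node on the ring with ID >= ident, wrapping around."""
    for n in ring:
        if n >= ident:
            return n
    return ring[0]

def finger_table(n):
    """Entry i points to successor((n + 2**i) mod 2**M)."""
    return [successor((n + (1 << i)) % (1 << M)) for i in range(M)]

print(finger_table(80))
# [96, 96, 96, 96, 96, 112, 16] -- e.g. (80 + 2**6) mod 2**7 = 16, whose successor is node 16
```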

Fault Tolerance for Lookup Service To improve robustness each node maintains the k (> 1) immediate successors instead of only one successor In the pred() reply message, node A can send its k-1 successors to its predecessor B Upon receiving pred() message, B can update its successor list by concatenating the successor list received from A with its own list If k = log(M), lookup operation works with high probability even if half of nodes fail, where M is number of nodes in the system
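A small sketch (an assumption about the bookkeeping, not the paper’s code) of how node B might refresh its successor list from the reply of its immediate successor A, keeping k entries as described above.

```python
K = 4   # successor-list size; the slide suggests k = log(M)

def refresh_successor_list(my_successor, successors_of_successor):
    """Concatenate [succ] + succ's k-1 successors, keep the first K unique IDs."""
    seen, result = set(), []
    for n in [my_successor] + list(successors_of_successor):
        if n not in seen:
            seen.add(n)
            result.append(n)
    return result[:K]

print(refresh_successor_list(58, [4, 8, 15]))   # [58, 4, 8, 15]
```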

Storage Fault Tolerance Replicate tuples on successor nodes Example: replicate (K14, V14) on nodes 20 and 35, the successors of node 15 (the node responsible for K14). (Figure: V14 stored at nodes 15, 20, and 35.)

Storage Fault Tolerance If node 15 fails, no reconfiguration is needed –We still have two replicas –All lookups will be correctly routed We will need to add a new replica on the next successor, node 44. (Figure: V14 remains on nodes 20 and 35.)

Iterative vs. Recursive Lookup Iteratively: –Example: node 40 issues query(31) and contacts each hop itself Recursively: –Example: node 40 issues query(31) and the query is forwarded hop by hop, with the answer returned back

Conclusions: Key Value Store Very large scale storage systems Two operations –put(key, value) –value = get(key) Challenges –Fault tolerance → replication –Scalability → serve get()s in parallel; replicate/cache hot tuples –Consistency → quorum consensus to improve put() performance

Conclusions: Chord Very scalable distributed lookup protocol Each node needs to know about only O(log(M)) other nodes, where M is the total number of nodes Guarantees that a tuple is found in O(log(M)) steps Highly resilient: works with high probability even if half of the nodes fail

Chord Analysis

Chord Discussion OpenDHT: a public DHT service Primary complaint: Too Slow! 99th latency percentile measured in seconds Median latency sometimes over ½ second Q: What’s going on here? [Rhea et al., “Fixing the Embarrassing Slowness of OpenDHT on PlanetLab”]

DHT Performance Problems A few arbitrarily slow nodes cause the problem Disk reads can take tens of seconds Computations can be very slow on overloaded nodes Inter-node latencies can be high Q: What is the paradox here? Q: How can it be fixed?

Distributed Hash Tables Q: Why use this interface? put(index, value) get(index) → value Q: What are its shortcomings? Q: What are the potential uses?

Proposed Uses for DHTs 1. Global file systems [OceanStore, CFS, PAST, Pastiche, UsenetDHT] 2. Naming services [Chord-DNS, Twine, SFR] 3. DB query processing [PIER, Wisc] 4. Internet-scale data structures [PHT, Cone, SkipGraphs] 5. Communication services [i3, MCAN, Bayeux, VRR] 6. Event notification [Scribe, Herald] 7. File sharing [OverNet]

Background: DHTs Q: What properties would we like a DHT to have? 1. Decentralized: no central authority 2. Scalable: low network traffic overhead 3. Fast: find items quickly 4. Fault-tolerant & dynamic: nodes fail, new nodes join 5. General-purpose: flexible naming 6. Secure

The Lookup Problem (Figure: nodes N1 through N5; one node does put(key, value), another asks value = get(key); which node should it contact?)

Napster (Figure: a central database; the node holding the value registers SetLoc(key, N4) with the DB, a client asks the DB for the location and then fetches value = get(key) from N4.)

Gnutella (Figure: search(key) is flooded from node to node; the node holding the key replies with its IP address.)

Chord Q: What is Chord’s contribution?

Chord goals “The Chord project explored how to build scalable, robust distributed systems using peer-to-peer ideas. The basis for much of our work is the Chord distributed hash lookup primitive. Chord is completely decentralized and symmetric, and can find data using only log(N) messages, where N is the number of nodes in the system. Chord's lookup mechanism is provably robust in the face of frequent node failures and re-joins.”

Consistent Hashing [Karger ’97] A key is stored at its successor: the node with the next higher ID. (Figure: circular 7-bit ID space with nodes N1, N32, N90, N105 and keys K5, K20, K80.) Q: What are the consequences of how keys are distributed?

Basic Lookup (Figure: on the same ring, the query “Where is key 80?” is forwarded along successors until the answer “N90 has key 80” comes back.)

Basic Lookup Lookup(my-id, key-id): n = my successor; if my-id < n < key-id: call Lookup(key-id) on node n; else: return my successor // done (Correctness depends only on successors.) Q: How does Chord accelerate the lookup?
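The successor-only lookup above, written as a runnable sketch (mine, independent of the earlier sketches). The half-open interval test makes explicit the wrap-around that the slide’s my-id < n < key-id shorthand glosses over; the ring of four nodes matches the figure.

```python
def in_interval(x, a, b):
    """x in the half-open ring interval (a, b], with wrap-around."""
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self

    def lookup(self, key_id):
        """Walk successor pointers until the key falls between us and our successor."""
        if in_interval(key_id, self.id, self.succ.id):
            return self.succ            # successor is responsible for key_id
        return self.succ.lookup(key_id) # otherwise forward one hop clockwise

# Build the 7-bit ring from the figure: N1 -> N32 -> N90 -> N105 -> N1
nodes = {i: Node(i) for i in (1, 32, 90, 105)}
ids = sorted(nodes)
for a, b in zip(ids, ids[1:] + ids[:1]):
    nodes[a].succ = nodes[b]

print(nodes[1].lookup(80).id)   # 90 -- "N90 has key 80"
```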

Lookup with Fingers Finger i points to the successor of n + 2^i. (Fingers accelerate queries without affecting correctness.)

Lookup with Finger Tables Lookup(my-id, key-id): look in the local finger table for the highest node n s.t. my-id < n < key-id; if n exists: call Lookup(key-id) on node n; else: return my successor // done Q: Fingers accelerate lookups; do they affect the correctness of the lookup?
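And the finger-table variant: forward to the highest finger that lies strictly between us and the key, otherwise answer with our successor. Again a sketch under the same assumptions as the previous block (a small in-process ring instead of RPCs; the node IDs are the figure’s).

```python
def between(x, a, b):
    """x strictly inside the ring interval (a, b), with wrap-around."""
    return (a < x < b) if a < b else (x > a or x < b)

class FingerNode:
    def __init__(self, ident):
        self.id = ident
        self.succ = self
        self.fingers = []            # filled in once the ring is known

    def lookup(self, key_id):
        # The highest finger strictly between my id and the key jumps farthest
        # without overshooting; otherwise my successor is the answer.
        for f in reversed(self.fingers):
            if between(f.id, self.id, key_id):
                return f.lookup(key_id)
        return self.succ             # done

M = 7
ids = [1, 32, 90, 105]
nodes = {i: FingerNode(i) for i in ids}

def successor_of(x):
    for i in sorted(ids):
        if i >= x % (1 << M):
            return nodes[i]
    return nodes[min(ids)]

for i in ids:
    nodes[i].succ = successor_of(i + 1)
    nodes[i].fingers = [successor_of(i + (1 << k)) for k in range(M)]

print(nodes[1].lookup(80).id)   # 90, now reached in fewer hops
```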

Consistency during “joins” stabilize(): run periodically by each node; fixes successor pointers (and, via notify(), predecessor pointers).

Consistency Key consistency: all nodes agree on who holds a particular key Data consistency: all nodes agree on the value associated with a key Reachability consistency: all nodes are connected in the ring and can be reached in a lookup (Chord aims to provide eventual reachability consistency, upon which key and data consistency are built.)

Controversy over correctness “Using Lightweight Modeling To Understand Chord”, P. Zave, ACM CCR, April 2012. Claims eventual reachability is not guaranteed; counter-examples suggest simultaneous joins can lead to misordering in the ring.

Background DHTs Chord’s Interface: key = hash(data); lookup(key) → IP address; send-RPC(IP address, PUT, key, value); send-RPC(IP address, GET, key) → value
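A sketch of how an application might layer put/get on top of the lookup primitive listed above. The rpc() and lookup() callables and the toy single-node stand-ins are placeholders (assumptions), standing in for real network RPCs and a real Chord ring.

```python
import hashlib

def chord_key(data: bytes) -> int:
    """key = hash(data), reduced to the ring's 160-bit ID space."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big") % (1 << 160)

def put(lookup, rpc, value: bytes) -> int:
    """Store value under hash(value); return the key for later gets."""
    key = chord_key(value)
    addr = lookup(key)                 # lookup(key) -> IP address of responsible node
    rpc(addr, "PUT", key, value)       # send-RPC(IP, PUT, key, value)
    return key

def get(lookup, rpc, key: int) -> bytes:
    addr = lookup(key)
    return rpc(addr, "GET", key)       # send-RPC(IP, GET, key) -> value

# Toy single-node stand-ins, just to exercise the shape of the interface.
store = {}
fake_lookup = lambda key: "10.0.0.1"
def fake_rpc(addr, op, key, value=None):
    if op == "PUT":
        store[key] = value
    else:
        return store[key]

k = put(fake_lookup, fake_rpc, b"hello")
print(get(fake_lookup, fake_rpc, k))   # b'hello'
```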

Background DHTs (Figure: layered architecture; the application calls put(key, value) and get(key) → data on the DHT, which uses the lookup service, lookup(key) → node IP address, running across the nodes.)

Chord Analysis

Chord Discussion OpenDHT: a public DHT service Primary complaint: Too Slow! 99th latency percentile measured in seconds Median latency sometimes over ½ second Q: What’s going on here? [Rhea et al., “Fixing the Embarrassing Slowness of OpenDHT on PlanetLab”]

DHT Performance Problems A few arbitrarily slow nodes cause the problem Disk reads can take tens of seconds Computations can be very slow on overloaded nodes Inter-node latencies can be high Q: What is the paradox here? Q: How can it be fixed?