Beyond Theory: DHTs in Practice CS 194: Distributed Systems Sean C. Rhea April 18, 2005 In collaboration with: Dennis Geels, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy, Timothy Roscoe, Scott Shenker, Ion Stoica, and Harlan Yu

Talk Outline Bamboo: a churn-resilient DHT –Churn resilience at the lookup layer [USENIX’04] –Churn resilience at the storage layer [Cates’03], [Unpublished] OpenDHT: the DHT as a service –Finding the right interface [IPTPS’04] –Protecting against overuse [Under Submission] Future work

Making DHTs Robust: The Problem of Membership Churn In a system with 1,000s of machines, some machines are failing or recovering at all times This process is called churn Without repair, the quality of the overlay network degrades over time A significant problem in deployed peer-to-peer systems

How Bad is Churn in Real Systems?
Authors   Systems Observed     Session Time
SGG02     Gnutella, Napster    50% < 60 minutes
CLL02     Gnutella, Napster    31% < 10 minutes
SW02      FastTrack            50% < 1 minute
BSV03     Overnet              50% < 60 minutes
GDS03     Kazaa                50% < 2.4 minutes
[Figure: timeline of node arrivals and departures illustrating session time versus lifetime]
An hour is an incredibly short MTTF!

Refresher: DHT Lookup/Routing [Figure: put(k1, v1) and get(k1) are routed through the overlay to the node storing the pair (k1, v1)] Put and get must find the same machine

Can DHTs Handle Churn? A Simple Test Start 1,000 DHT processes on an 80-CPU cluster –Real DHT code, emulated wide-area network –Models cross traffic and packet loss Churn nodes at some rate Every 10 seconds, each machine asks: “Which machine is responsible for key k?” –Use several machines per key to check consistency –Log results, process them after the test

Test Results In Tapestry (the OceanStore DHT), the overlay partitions –Leads to a very high level of inconsistency –Worked great in simulations, but not on a more realistic network And the problem isn’t limited to Tapestry: [Figure: corresponding results for FreePastry and MIT Chord]

The Bamboo DHT Forget about comparing Chord, Pastry, and Tapestry –Too many differing factors –Hard to isolate the effects of any one feature Instead, implement a new DHT called Bamboo –Same overlay structure as Pastry –Implements many of the features of other DHTs –Allows testing of individual features independently

How Bamboo Handles Churn (Overview) 1. Routes around suspected failures quickly –Abnormal latencies indicate failure or congestion –Route around them before we can tell the difference 2. Recovers failed neighbors periodically –Keeps network load independent of churn rate –Prevents overlay-induced positive feedback cycles 3. Chooses neighbors for network proximity –Minimizes routing latency in the non-failure case –Allows for shorter timeouts

Bamboo Basics: Partition Key Space Each node in the DHT will store some (k,v) pairs Given a key space K, e.g. [0, 2^160): –Choose an identifier for each node, id_i ∈ K, uniformly at random –A pair (k,v) is stored at the node whose identifier is closest to k
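
As a concrete illustration of the slide above, here is a minimal Python sketch of closest-identifier placement on a 160-bit ring; the node names, helper functions, and use of SHA-1 are illustrative assumptions, not Bamboo's actual code.

```python
import hashlib

KEY_SPACE = 2 ** 160  # assumed 160-bit identifier space (SHA-1 sized)

def to_id(name: bytes) -> int:
    """Hash an arbitrary name (node address or application key) into the key space."""
    return int.from_bytes(hashlib.sha1(name).digest(), "big")

def ring_distance(a: int, b: int) -> int:
    """Distance between two identifiers on the circular key space."""
    d = abs(a - b)
    return min(d, KEY_SPACE - d)

def responsible_node(key: int, node_ids: list[int]) -> int:
    """The node whose identifier is closest to the key stores the (k, v) pair."""
    return min(node_ids, key=lambda nid: ring_distance(nid, key))

# Example: three hypothetical nodes and one key.
nodes = [to_id(b"node-a"), to_id(b"node-b"), to_id(b"node-c")]
k = to_id(b"some-application-key")
print(hex(responsible_node(k, nodes)))
```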

Bamboo Basics: Build Overlay Network Each node has two sets of neighbors Immediate neighbors in the key space (the leaf set) –Important for correctness Long-hop neighbors –Allow puts/gets in O(log n) hops

Bamboo Basics: Route Puts/Gets Thru Overlay Route greedily, always making progress toward the key [Figure: a get(k) request forwarded hop by hop to the node responsible for k]
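
A minimal sketch of the greedy routing step above, reusing ring_distance from the previous sketch; the neighbor tables and function names are illustrative, and real Bamboo routes by matching prefix digits rather than raw ring distance.

```python
def next_hop(my_id: int, key: int, neighbors: list[int]) -> int | None:
    """One greedy step: pick a neighbor closer to the key than we are,
    or return None if we are already the closest (i.e., responsible)."""
    best = min(neighbors + [my_id], key=lambda nid: ring_distance(nid, key))
    return None if best == my_id else best

def route(start: int, key: int, tables: dict[int, list[int]]) -> int:
    """Forward hop by hop, always making progress toward the key."""
    current = start
    while True:
        hop = next_hop(current, key, tables[current])
        if hop is None:
            return current      # this node is responsible for the key
        current = hop
```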

Routing Around Failures Under churn, neighbors may have failed To detect failures, acknowledge each hop [Figure: per-hop ACKs on the path toward k]

Routing Around Failures If we don’t receive an ACK, resend through a different neighbor [Figure: a timeout on one hop triggers resending through an alternate neighbor]

Computing Good Timeouts Must compute timeouts carefully –If too long, they increase put/get latency –If too short, we get a message explosion

Computing Good Timeouts Chord errs on the side of caution –Very stable, but gives long lookup latencies

Computing Good Timeouts Keep a past history of latencies –Exponentially weighted mean and variance Use them to compute timeouts for new requests –timeout = mean + 4 × variance When a timeout occurs –Mark the node “possibly down”: don’t use it for now –Re-route through an alternate neighbor
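
A minimal sketch of the per-neighbor timeout calculation on this slide; the smoothing gains and the default used before any samples arrive are illustrative assumptions in the style of TCP's RTO estimator.

```python
class TimeoutEstimator:
    """Exponentially weighted mean and variance of observed round-trip
    latencies, giving timeout = mean + 4 * variance as on the slide."""

    def __init__(self, alpha: float = 0.125, beta: float = 0.25):
        self.mean = None       # smoothed latency estimate (seconds)
        self.var = 0.0         # smoothed mean deviation (seconds)
        self.alpha = alpha     # gain for the mean (assumed value)
        self.beta = beta       # gain for the deviation (assumed value)

    def observe(self, sample: float) -> None:
        """Fold one measured latency into the running estimates."""
        if self.mean is None:
            self.mean, self.var = sample, sample / 2
        else:
            self.var = (1 - self.beta) * self.var + self.beta * abs(sample - self.mean)
            self.mean = (1 - self.alpha) * self.mean + self.alpha * sample

    def timeout(self) -> float:
        """Timeout to use for the next request to this neighbor."""
        if self.mean is None:
            return 1.0         # conservative default before any samples (assumption)
        return self.mean + 4 * self.var
```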

Timeout Estimation Performance

Recovering From Failures Can’t route around failures forever –Will eventually run out of neighbors Must also find new nodes as they join –Especially important if they’re our immediate predecessors or successors [Figure: a newly joined node takes over part of a node’s old range of responsibility in the key space]

Recovering From Failures Obvious algorithm: reactive recovery –When a node stops sending acknowledgements, notify other neighbors of potential replacements –Similar techniques for the arrival of new nodes [Figure: after B fails, C tells its other neighbors “B failed, use D” and “B failed, use A”]

The Problem with Reactive Recovery What if B is alive, but the network is congested? –C still perceives a failure due to dropped ACKs –C starts recovery, further congesting the network –More ACKs are likely to be dropped –Creates a positive feedback cycle

The Problem with Reactive Recovery What if B is alive, but the network is congested? This was the problem with Pastry –Combined with poor congestion control, it caused the network to partition under heavy churn

Periodic Recovery Every period, each node sends its neighbor list to each of its neighbors (“my neighbors are A, B, D, and E”) –Breaks the feedback loop –Converges in a logarithmic number of periods
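
A minimal sketch of the periodic exchange described above; the transport callback, message format, and period are illustrative assumptions, and a real implementation would do more with the lists it receives than this merge.

```python
import threading

RECOVERY_PERIOD = 30.0   # seconds between exchanges (illustrative value)

def start_periodic_recovery(my_id: int, neighbors: set[int], send) -> None:
    """Every period, push our full neighbor list to each neighbor,
    independent of whether any failures have been observed."""
    def one_round():
        message = {"from": my_id, "neighbors": sorted(neighbors)}
        for n in list(neighbors):
            send(n, message)                        # assumed transport callback
        threading.Timer(RECOVERY_PERIOD, one_round).start()
    one_round()

def handle_recovery_message(neighbors: set[int], my_id: int, message: dict) -> None:
    """Merge a received neighbor list; failed entries age out elsewhere."""
    neighbors.update(n for n in message["neighbors"] if n != my_id)
    neighbors.add(message["from"])
```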

Periodic Recovery Performance Reactive recovery is expensive under churn Excess bandwidth use leads to long latencies

Proximity Neighbor Selection (PNS) For each neighbor, there may be many candidates –Choosing the closest one with the right prefix is called PNS [Figure: candidate neighbors grouped by prefix 0…, 10…, 110…, 111…] Tapestry has sophisticated algorithms for PNS –Provably nearest neighbors under some assumptions –Nearest neighbors give constant-stretch routing –But a reasonably complicated implementation Can we do better?

How Important is PNS? Only need the leaf set for correctness –Must know the predecessor and successor to determine which keys a node is responsible for Any filled routing table gives efficient lookups –Need one neighbor that shares no prefix, one that shares one bit, etc., but that’s all Insight: treat PNS as an optimization only –Find the initial neighbor set using lookup

PNS by Random Sampling Already looking for new neighbors periodically –Because of periodic recovery Can use the results for random sampling –Every period, find a potential replacement with a lookup –Compare its latency with that of the existing neighbor –If better, swap
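
A minimal sketch of the random-sampling optimization described above; measure_latency and random_lookup stand in for Bamboo's ping machinery and routine lookups, and the routing-table representation is an assumption.

```python
def pns_round(routing_table: dict[str, int], measure_latency, random_lookup) -> None:
    """One sampling round: for each routing-table slot (a key prefix), find a
    candidate neighbor via a routine lookup, compare latencies, and keep the
    nearer of the candidate and the current entry."""
    for prefix, current in routing_table.items():
        candidate = random_lookup(prefix)          # any node matching this prefix
        if candidate is None or candidate == current:
            continue
        if measure_latency(candidate) < measure_latency(current):
            routing_table[prefix] = candidate      # swap in the closer neighbor
```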

PNS Results Random sampling is almost as good as everything else –24% latency improvement for free –42% improvement for 40% more bandwidth –Compare to the 68%–84% improvement from using good timeouts

Talk Outline Bamboo: a churn-resilient DHT –Churn resilience at the lookup layer –Churn resilience at the storage layer OpenDHT: the DHT as a service –Finding the right interface –Protecting against overuse Future work

A Reliable Storage Layer Don’t just want to do lookup, also want to do –DHash’s put/get: store (key, value) pairs –Tapestry’s publish/route: store (key, pointer) pairs Problem statement: keep data and pointers available despite churn

Why Storage Is Tricky The claim: the DHT replicates within the leaf set –A pair (k,v) is stored by the node closest to k and replicated on its successor and predecessor Why is this hard?

DHTs Meet Epidemics Candidate algorithm: –For each (k,v) stored locally, compute SHA(k.v) –Every period, pick a random leaf-set neighbor –Ask the neighbor for all of its hashes –For each unrecognized hash, ask for the key and value This is an epidemic algorithm –All m members will have all (k,v) pairs within log m periods –But as is, the cost is O(C), where C = disk size (although the constant is pretty small)
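
A minimal sketch of the candidate algorithm on this slide; the neighbor object's all_hashes() and fetch() calls are assumed RPCs, not a real Bamboo interface.

```python
import hashlib

def item_hash(key: bytes, value: bytes) -> bytes:
    """SHA of key.value, the per-item summary exchanged between neighbors."""
    return hashlib.sha1(key + b"." + value).digest()

def anti_entropy_round(local_store: dict[bytes, bytes], neighbor) -> None:
    """One epidemic round: fetch the neighbor's hash list, then pull any
    (key, value) pairs we do not already hold. Cost is O(C) per round,
    since every hash is shipped every time."""
    known = {item_hash(k, v) for k, v in local_store.items()}
    for h in neighbor.all_hashes():
        if h not in known:
            k, v = neighbor.fetch(h)
            local_store[k] = v
```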

Merkle Trees Merkle trees are an efficient summary technique –Interior nodes are the secure hashes of their children –E.g., I = SHA(A.B), N = SHA(K.L), etc. [Figure: a Merkle tree with leaves A–H, interior nodes I–N, and root R]

Merkle Trees Merkle trees are an efficient summary technique –If the top node is signed and distributed, this signature can later be used to verify any individual block using only O(log n) nodes, where n = number of leaves –E.g., to verify block C, need only R, M, N, I, J, C, and D
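
A minimal sketch of building a Merkle tree and producing the O(log n) verification path described above; it assumes SHA-1, a power-of-two number of blocks, and list-based levels purely for brevity.

```python
import hashlib

def sha(*parts: bytes) -> bytes:
    """Secure hash of the concatenation of its arguments (e.g., SHA(A.B))."""
    return hashlib.sha1(b".".join(parts)).digest()

def build_levels(blocks: list[bytes]) -> list[list[bytes]]:
    """Bottom-up levels of the tree: hashed leaves first, root last.
    Assumes len(blocks) is a power of two for simplicity."""
    level = [sha(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [sha(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def verification_path(levels: list[list[bytes]], index: int) -> list[bytes]:
    """The O(log n) sibling hashes needed to check one block against the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])   # the other child of this node's parent
        index //= 2
    return path

def verify(block: bytes, index: int, path: list[bytes], root: bytes) -> bool:
    """Recompute hashes up the tree and compare against the signed root."""
    h = sha(block)
    for sibling in path:
        h = sha(h, sibling) if index % 2 == 0 else sha(sibling, h)
        index //= 2
    return h == root

blocks = [bytes([c]) * 4 for c in b"ABCDEFGH"]       # eight toy blocks
levels = build_levels(blocks)
root = levels[-1][0]
print(verify(blocks[2], 2, verification_path(levels, 2), root))   # True
```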

Using Merkle Trees as Summaries Improvement: use a Merkle tree to summarize the keys –B gets the tree root from A; if it is the same as the local root, done –Otherwise, recurse down the tree to find the differences New cost is O(d log C) –d = number of differences, C = size of disk [Figure: A’s and B’s trees, each summarizing the key space [0, 2^160) and its halves [0, 2^159) and [2^159, 2^160)]

Using Merkle Trees as Summaries Still too costly: –If A is down for an hour, then comes back, the changes will be randomly scattered throughout the tree Solution: order values by time instead of by hash –Localizes values to one side of the tree [Figure: trees ordered by time over [0, 2^64), split into [0, 2^63) and [2^63, 2^64); recent changes cluster on one side]

PlanetLab Deployment Been running Bamboo / OpenDHT on PlanetLab since April 2004 Constantly run a put/get test –Every second, put a value (with a TTL) –The DHT stores 8 replicas of each value –Every second, get some previously put value (that hasn’t expired) Tests both routing correctness and replication algorithms (the latter not discussed here)

Excellent Availability Only 28 of 7 million values lost in 3 months –Where “lost” means unavailable for a full hour On Feb. 7, 2005, lost 60 of 190 nodes in 15 minutes to a PlanetLab kernel bug, yet lost only one value

Talk Outline Bamboo: a churn-resilient DHT –Churn resilience at the lookup layer –Churn resilience at the storage layer OpenDHT: the DHT as a service –Finding the right interface –Protecting against overuse Future work

A Small Sample of DHT Applications Distributed storage systems –CFS, HiveCache, PAST, Pastiche, OceanStore, Pond Content distribution networks / web caches –Bslash, Coral, Squirrel Indexing / naming systems –Chord-DNS, CoDoNS, DOA, SFR Internet query processors –Catalogs, PIER Communication systems –Bayeux, i3, MCAN, SplitStream

Questions: How many DHTs will there be? Can all applications share one DHT?

Benefits of Sharing a DHT Amortizes costs across applications –Maintenance bandwidth, connection state, etc. Facilitates “bootstrapping” of new applications –Working infrastructure already in place Allows for statistical multiplexing of resources –Takes advantage of spare storage and bandwidth Facilitates upgrading existing applications –“Share” the DHT between application versions

Challenges in Sharing a DHT Robustness –Must be available 24/7 Shared interface design –Should be general, yet easy to use Resource allocation –Must protect against malicious/over-eager users Economics –What incentives are there to provide resources?

The DHT as a Service [Figure: many client applications put and get key/value pairs through a single shared OpenDHT deployment] What is this interface?

It’s not lookup() [Figure: a client issues lookup(k) and learns which node is responsible for k] What does this node do with it? Challenges: 1. Distribution 2. Security

How are DHTs Used? 1. Storage –CFS, UsenetDHT, PKI, etc. 2. Rendezvous –Simple: chat, instant messenger –Load balanced: i3 –Multicast: RSS aggregation, white board –Anycast: Tapestry, Coral

What about put/get? Works easily for storage applications Easy to share –No upcalls, so no code distribution or security complications But does it work for rendezvous? –Chat? Sure: put(my-name, my-IP) –What about the others?
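
To make the chat example concrete, here is a minimal sketch of rendezvous over a bare put/get interface; DHTClient is an in-memory stand-in for illustration only, not OpenDHT's actual client API, and the name and address values are made up.

```python
class DHTClient:
    """In-memory stand-in for a put/get DHT (illustrative only)."""

    def __init__(self):
        self._store: dict[bytes, list[bytes]] = {}

    def put(self, key: bytes, value: bytes, ttl: int = 3600) -> None:
        """Store a value under a key; a real put would also honor the TTL."""
        self._store.setdefault(key, []).append(value)

    def get(self, key: bytes) -> list[bytes]:
        """Return every value currently stored under the key."""
        return list(self._store.get(key, []))

# Chat-style rendezvous: publish my address under my name, look up a buddy's.
dht = DHTClient()
dht.put(b"alice", b"192.0.2.10:5222")
for addr in dht.get(b"alice"):
    print("connect to", addr.decode())
```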

Recursive Distributed Rendezvous Idea: prove an equivalence between lookup and put/get –We know we can implement put/get on top of lookup –Can we implement lookup on top of put/get? It turns out we can –The algorithm is called Recursive Distributed Rendezvous (ReDiR)

ReDiR Goal: implement two functions using put/get: –join(namespace, node) –node = lookup(namespace, identifier) [Figure: a tree of levels L0, L1, L2 rooted at H(namespace); as nodes A, B, C, D, and E join, each registers itself under the intervals containing its hash]

ReDiR Join cost: –Worst case: O(log n) puts and gets –Average case: O(1) puts and gets

ReDiR [Figure: lookup(namespace, k) does gets on the intervals containing H(k), looking for a successor of H(k) at each level and continuing to other levels when an interval holds no successor]

ReDiR Lookup cost: –Worst case: O(log n) gets –Average case: O(1) gets
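
Below is a deliberately simplified sketch of the ReDiR idea, using an in-memory dictionary as a stand-in for the shared DHT's put/get: join registers a node in the interval containing its hash at every level of a small tree, and lookup searches from the finest level upward for the registered node whose hash most closely follows the identifier. The real ReDiR registers at only O(1) levels on average and sizes the tree to the namespace, so treat this as a sketch of the structure, not the published algorithm.

```python
import hashlib
from collections import defaultdict

LEVELS = 3                      # depth of the tree (illustrative; not tuned)
SPACE = 2 ** 160                # identifier space, as elsewhere in the talk

store = defaultdict(list)       # stand-in for the shared DHT's put/get store

def dht_put(key: bytes, value: bytes) -> None:
    store[key].append(value)

def dht_get(key: bytes) -> list[bytes]:
    return list(store[key])

def h(x: bytes) -> int:
    return int.from_bytes(hashlib.sha1(x).digest(), "big")

def interval(level: int, point: int) -> int:
    """Index of the interval containing `point` when the space is split
    into 2**level equal pieces at this level."""
    return point * (2 ** level) // SPACE

def redir_key(namespace: bytes, level: int, point: int) -> bytes:
    """The DHT key naming one interval of one level of a namespace's tree."""
    return b"%s/%d/%d" % (namespace, level, interval(level, point))

def join(namespace: bytes, node: bytes) -> None:
    """Simplified join: register the node at every level (the real ReDiR
    stops early, which is what makes joins O(1) puts on average)."""
    for level in range(LEVELS):
        dht_put(redir_key(namespace, level, h(node)), node)

def lookup(namespace: bytes, identifier: int) -> bytes | None:
    """Simplified lookup: at each level, get the interval containing the
    identifier and look for a successor; fall back to coarser levels."""
    for level in range(LEVELS - 1, -1, -1):
        members = dht_get(redir_key(namespace, level, identifier))
        successors = [m for m in members if h(m) >= identifier]
        if successors:
            return min(successors, key=h)
    everyone = dht_get(redir_key(namespace, 0, 0))     # level 0 spans the space
    return min(everyone, key=h) if everyone else None  # wrap around the ring

# Nodes A through E join the namespace; lookups return the hash-wise successor.
for name in (b"A", b"B", b"C", b"D", b"E"):
    join(b"my-app", name)
print(lookup(b"my-app", h(b"some-client-key")))
```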

ReDiR Performance (On PlanetLab)

OpenDHT Service Model Storage applications: –Just use put/get Rendezvous applications: –You provide the nodes –We provide cheap, scalable rendezvous

Talk Outline Bamboo: a churn-resilient DHT –Churn resilience at the lookup layer –Churn resilience at the storage layer OpenDHT: the DHT as a service –Finding the right interface –Protecting against overuse Future work

Protecting Against Overuse Must protect system resources against overuse –Resources include network, CPU, and disk –Network and CPU are straightforward –Disk is harder: usage persists long after the requests Hard to distinguish malice from eager usage –Don’t want to hurt eager users if utilization is low The number of active users changes over time –Quotas are inappropriate

Fair Storage Allocation Our solution: give each client a fair share –Will define “fairness” in a few slides Limits the strength of malicious clients –They are only as powerful as they are numerous Protect storage on each DHT node separately –Must protect each subrange of the key space –Rewards clients that balance their key choices

The Problem of Starvation Fair shares change over time –They decrease as system load increases [Figure: timeline in which client 1 arrives and fills 50% of the disk, client 2 arrives and fills 40%, and when client 3 arrives its maximum share is only 10%: starvation!]

Preventing Starvation Simple fix: add a time-to-live (TTL) to puts –put(key, value) becomes put(key, value, ttl) –(A different approach is used by Palimpsest.) Prevents long-term starvation –Eventually all puts will expire Can still get short-term starvation [Figure: timeline in which client A arrives and fills the entire disk; client B arrives and asks for space; B starves until A’s values start expiring]

Preventing Starvation Stronger condition: be able to accept r_min bytes/sec of new data at all times This is non-trivial to arrange! [Figure: committed storage versus time: already-accepted puts expire over time while the space reserved for future puts grows with slope r_min; a candidate put of a given size and TTL is acceptable only if the sum stays below the maximum capacity at all future times, otherwise it is a violation]

Preventing Starvation Formalize the graphical intuition: f(τ) = B(t_now) − D(t_now, t_now + τ) + r_min × τ, where B(t_now) is the number of bytes currently stored and D(t_now, t_now + τ) is the number of bytes that will expire within the next τ seconds To accept a put of size x and TTL l: f(τ) + x < C for all 0 ≤ τ < l Can track the value of f efficiently with a tree –Leaves represent the inflection points of f –Adding a put and shifting time are O(log n), where n = number of puts
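
A minimal sketch of the acceptance test formalized above; it checks f at its inflection points using a sorted list of expirations rather than the balanced tree mentioned on the slide, and all parameter values are illustrative.

```python
import bisect

class StorageAdmission:
    """Admission-control sketch for the condition on the slide: accept a put
    of (size, ttl) only if f(tau) + size < capacity for all 0 <= tau < ttl,
    where f(tau) is the bytes still committed tau seconds from now plus an
    r_min * tau reserve for future puts."""

    def __init__(self, capacity: float, r_min: float):
        self.capacity = capacity
        self.r_min = r_min
        self.expirations = []                      # sorted (expire_time, size)

    def try_accept(self, now: float, size: float, ttl: float) -> bool:
        stored = sum(s for t, s in self.expirations if t > now)
        # f rises linearly between expirations and drops at each one, so its
        # supremum over [0, ttl) is approached just before each expiration in
        # (now, now + ttl] and just before now + ttl itself.
        events = sorted({t for t, _ in self.expirations if now < t <= now + ttl}
                        | {now + ttl})
        expired = 0.0                              # bytes freed strictly before the event
        for t in events:
            f_left = stored - expired + self.r_min * (t - now)
            if f_left + size >= self.capacity:
                return False                       # would violate the condition
            expired += sum(s for te, s in self.expirations if te == t)
        bisect.insort(self.expirations, (now + ttl, size))
        return True
```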

Fair Storage Allocation [Figure: per-client put queues feed an admission loop: if a client’s queue is full, reject the put; otherwise enqueue it; repeatedly select the most under-represented client, wait until its put can be accepted without violating r_min, then store the value and send an accept message to the client] The Big Decision: the definition of “most under-represented”

Defining “Most Under-Represented” Not just sharing disk, but disk over time –A 1-byte put for 100 seconds is the same as a 100-byte put for 1 second –So the units are bytes × seconds; call them commitments Equalize the total commitments granted? –No: leads to starvation –A fills the disk, B starts putting, A starves for up to the max TTL [Figure: timeline in which client A fills the disk, client B arrives and asks for space, and once B’s total commitments catch up with A’s, A starves]

Defining “Most Under-Represented” Instead, equalize the rate of commitments granted –Service granted to one client depends only on the others putting “at the same time” [Figure: timeline in which B catches up with A and then A and B share the available rate] Mechanism inspired by Start-time Fair Queuing –Maintain a virtual time v(t) –Each put gets a start time S(p_c^i) and a finish time F(p_c^i): F(p_c^i) = S(p_c^i) + size(p_c^i) × ttl(p_c^i), S(p_c^i) = max(v(A(p_c^i)) − δ, F(p_c^{i−1})) for a small constant δ, and v(t) = the maximum start time of all accepted puts
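
A minimal sketch of the virtual-clock bookkeeping described above; DELTA stands in for the constant offset in the start-time rule, the method names are illustrative, and a real scheduler would use the returned start times to pick the most under-represented client's queued put.

```python
class FairAllocator:
    """Start-time-fair-queuing style accounting: each accepted put is charged
    size * ttl commitments, and the put with the smallest start time (from the
    most under-represented client) should be served first."""

    DELTA = 0.0                          # offset from the slide (assumed value)

    def __init__(self):
        self.virtual_time = 0.0          # v(t): max start time of accepted puts
        self.last_finish = {}            # F(p_c^{i-1}) per client c

    def admit(self, client: str, size: float, ttl: float) -> float:
        """Assign S and F to this client's next put, advance the virtual
        clock, and return the start time (lower means higher priority)."""
        start = max(self.virtual_time - self.DELTA,
                    self.last_finish.get(client, 0.0))
        self.last_finish[client] = start + size * ttl       # F = S + size * ttl
        self.virtual_time = max(self.virtual_time, start)   # v(t) tracks max S
        return start

# Client A has been putting heavily; a newly arrived B gets priority over A.
alloc = FairAllocator()
for _ in range(3):
    alloc.admit("A", size=1000, ttl=3600)
b_first = alloc.admit("B", size=1000, ttl=3600)
a_next = alloc.admit("A", size=1000, ttl=3600)
print(b_first < a_next)   # True: B's put starts earlier, so B is served first
```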

FST Performance

Talk Outline Bamboo: a churn-resilient DHT –Churn resilience at the lookup layer –Churn resilience at the storage layer OpenDHT: the DHT as a service –Finding the right interface –Protecting against overuse Future work

Future Work: Throughput High DHT throughput remains a challenge –Each put/get can go to a different destination node Only one existing solution (STP) –Assumes the client’s access link is the bottleneck Have complete control of the DHT routers –Can do fancy congestion control: maybe ECN? Have many available paths –Take advantage of them for higher throughput: mTCP?

Future Work: Upcalls OpenDHT makes a great common substrate for: –Soft-state storage –Naming and rendezvous Many P2P applications also need to: –Traverse NATs –Redirect packets within the infrastructure (as in i3) –Refresh puts while intermittently connected All of these can be implemented with upcalls –Who provides the machines that run the upcalls?

Future Work: Upcalls We don’t want to add upcalls to the core DHT –Keep the main service simple, fast, and robust Can we build a separate upcall service? –Some other set of machines organized with ReDiR –Security: can only accept incoming connections, can’t write to local storage, etc. This should be enough to implement –NAT traversal, a re-put service –Some (most?) packet redirection What about more expressive security policies?

For more information, see