Designing a DHT for low latency and high throughput
Robert Vollmann, 30.11.2004
P2P Information Systems

Overview
I. Introduction to DHTs
II. Technical Background
III. Latency
IV. Throughput
V. Summary

Distributed Hash Tables
Hash table
- Data structure that uses a hash function to map keys to hash values, which determine where the data is stored
- Allows quick access to keys through a lookup algorithm
Distributed hash table
- The hash table is distributed over all participating nodes
- A lookup algorithm determines the node responsible for a specific key (a minimal sketch follows below)
Requirements
- Find data
- High availability
- Balance load
- Move data fast
Problems
- Link capacity of nodes
- Physical distance between nodes
- Network congestion and packet loss
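A minimal sketch of the key-to-node mapping, assuming Chord-style consistent hashing on a 160-bit identifier ring; the node addresses and the `responsible_node` helper are illustrative, not part of the original system:

```python
import hashlib
from bisect import bisect_left

def node_id(address: str) -> int:
    """Map a node address to a 160-bit ring identifier via SHA-1."""
    return int.from_bytes(hashlib.sha1(address.encode()).digest(), "big")

def responsible_node(key: int, ring: list) -> int:
    """The node responsible for a key is the first node ID at or after the
    key on the ring (its successor), wrapping past the highest ID."""
    i = bisect_left(ring, key)
    return ring[i % len(ring)]

# Hypothetical four-node ring
ring = sorted(node_id(a) for a in ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
key = int.from_bytes(hashlib.sha1(b"some-block").digest(), "big")
print(hex(responsible_node(key, ring)))
```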

Technical Background
DHash++
- Values are mapped to keys using the SHA-1 hash function (sketched below)
- Stores key/value pairs (so-called blocks) on different nodes
- Uses Chord and Vivaldi
Chord
- Lookup protocol that finds keys in O(log N) hops
Vivaldi
- Decentralized network coordinate system that computes and maintains synthetic coordinates, which are used to predict inter-node latencies
- No additional traffic, because the synthetic coordinates piggy-back on DHash++'s communication patterns
Hardware
- Test-bed of 180 hosts connected via Internet2, DSL, cable, or T1
Additional testing
- Simulation with a delay matrix of measured delays between 2,048 DNS servers
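Since a block's key is the SHA-1 hash of its value, a fetched block can be verified by re-hashing it; a small sketch (function names are mine):

```python
import hashlib

def block_key(value: bytes) -> bytes:
    """Content hashing: a block's key is the SHA-1 digest of its value."""
    return hashlib.sha1(value).digest()

def verify(key: bytes, value: bytes) -> bool:
    """A fetched block is authentic iff re-hashing it reproduces the key
    it was looked up under."""
    return hashlib.sha1(value).digest() == key
```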

Low latency: data layout
DHash++ is designed for read-heavy applications that demand low latency and high throughput.
Examples
SFR (Semantic Free Referencing system)
- Designed to replace DNS (Domain Name System)
- Uses the DHT to store small data records representing name bindings
UsenetDHT
- Distributed version of Usenet that splits large articles into small blocks to achieve load balancing
DHash++
- Can be seen as a network storage system with a shared global infrastructure
- Uses small blocks of 8 KB
- Distributes blocks randomly via the hash function (sketched below)
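A sketch of this data layout, assuming each 8 KB block is keyed by its own SHA-1 digest (the splitting helper is illustrative); because hash output is effectively uniform, the blocks of one large object land on many different nodes:

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # DHash++ uses small 8 KB blocks

def split_into_blocks(data: bytes) -> dict:
    """Split an object into 8 KB blocks keyed by SHA-1; uniformly random
    keys spread the blocks (and the read load) across the nodes."""
    blocks = {}
    for off in range(0, len(data), BLOCK_SIZE):
        chunk = data[off:off + BLOCK_SIZE]
        blocks[hashlib.sha1(chunk).digest()] = chunk
    return blocks
```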

Low latency: recursive vs. iterative
Iterative lookup
- The originator sends a lookup query to each successive node in the lookup path
- Can detect node failures
But
- Must wait for each response before proceeding

Low latency: recursive vs. iterative
Recursive lookup
- Each node forwards the query directly to the next node
- Fewer queries, hence less congestion
But
- The originator cannot directly detect failed nodes
A toy sketch contrasting the two styles follows below.
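A toy, single-process sketch of the two lookup styles; the four-node ring, `next_hop`, and hop counting are illustrative, and a real node would route via its finger table over real RPCs:

```python
RING = [25, 50, 75, 100]                     # toy node IDs; node n owns keys (prev, n]
SUCC = dict(zip(RING, RING[1:] + RING[:1]))  # successor pointers around the ring

def owns(node, key):
    """Node n owns key k if n is the smallest node ID >= k (toy, keys in 1..100)."""
    return node == min(n for n in RING if n >= key)

def next_hop(node, key):
    return SUCC[node]  # toy routing; real Chord picks the closest preceding finger

def iterative_lookup(origin, key):
    """Iterative: the originator drives every hop itself and must wait for
    each node's reply before it can contact the next one."""
    node, hops = origin, 0
    while not owns(node, key):
        node, hops = next_hop(node, key), hops + 1  # one full round-trip per hop
    return node, hops

def recursive_lookup(node, key, hops=0):
    """Recursive: each node forwards the query straight onward; only the
    final owner answers the originator, so fewer packets cross the network."""
    if owns(node, key):
        return node, hops
    return recursive_lookup(next_hop(node, key), key, hops + 1)

print(iterative_lookup(25, 80))  # -> (100, 3)
print(recursive_lookup(25, 80))  # -> (100, 3), but with direct forwarding
```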

Low latency: recursive vs. iterative
Left figure: simulation of 20,000 lookups from random hosts for random keys
- A recursive lookup takes about 0.6 times as long as an iterative one
Right figure: 1,000 lookups in the test-bed
- Confirms the simulation result
Trade-off: DHash++ uses recursive routing, but switches to iterative routing after persistent link failures

Low latency: proximity neighbor selection
Idea
- Choose nearby nodes to decrease latency
Realisation
- ID-space range of the i-th finger table entry of node a: [a + 2^i, a + 2^(i+1)), with arithmetic mod 2^160
- Plain Chord points each finger table entry at the first available node in this range
- Instead, obtain the latencies of the first x available nodes in this ID-space range from the successor list
- Route lookups through the node with the lowest latency (see the sketch below)
What is a suitable value for x?
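A sketch of the selection step, assuming the candidate nodes in the finger's ID range and a latency estimator (e.g. a Vivaldi-predicted RTT) are already available:

```python
def choose_finger(candidates, latency_to, x=16):
    """Proximity neighbor selection: among the first x available nodes whose
    IDs fall in the finger's ID-space range, point the finger at the one with
    the lowest predicted latency. DHash++ settles on x = 16 (next slide)."""
    sample = candidates[:x]
    return min(sample, key=latency_to)
```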

Low latency: proximity neighbor selection
Left figure: simulation of lookups from random hosts for random keys
- 1 to 16 samples: latency decreases sharply
- 16 to 2,048 samples: latency barely decreases further
Right figure: lookups in the test-bed
- Confirms the decreased lookup latency
 DHash++ uses x = 16 samples

Low latency: erasure coding vs. replication
Erasure coding
- A data block is split into l fragments
- Any m different fragments suffice to reconstruct the block
- Redundant storage of data
Replication
- Each node stores the entire block
- Special case of coding with m = 1, where l is the number of replicas
- Redundant information is spread over fewer nodes
Comparison of both methods
- r = l / m is the amount of redundancy
- With p the probability that the node holding a given fragment is up, the probability that a block is available is $\sum_{i=m}^{l} \binom{l}{i}\, p^i (1-p)^{l-i}$ (evaluated numerically below)
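The availability formula can be evaluated directly; a sketch comparing replication and coding at the same redundancy r = 2, with an assumed per-node availability p = 0.9:

```python
from math import comb

def block_availability(p, l, m):
    """P(block available) = sum_{i=m}^{l} C(l,i) p^i (1-p)^(l-i):
    at least m of the l fragments must sit on reachable nodes."""
    return sum(comb(l, i) * p**i * (1 - p)**(l - i) for i in range(m, l + 1))

p = 0.9                                   # assumed single-node availability
print(block_availability(p, l=2, m=1))    # replication, r = 2:    ~0.99
print(block_availability(p, l=14, m=7))   # erasure coding, r = 2: ~0.99998
```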

Low latency: erasure coding vs. replication
Replication
- Only slightly lower latency than erasure coding for the same r, if r is high
- Less congestion than erasure coding, because fewer fragments have to be distributed
Erasure coding
- Higher availability of blocks
- More choice among servers, because there are more fragments
DHash++ uses erasure coding with m = 7 and l = 14

Low latency: integration
Remember proximity neighbor selection
- Offers little advantage in the last lookup steps, where the candidate ranges become small
Also remember the last step of the lookup procedure
- The originator contacts a key's direct predecessor to obtain its successor list
- But the full successor list is not necessarily needed
Why?
- The successor list has length s
- l successors store fragments of the block
- Only m fragments are needed
- So the s − m closest predecessors of a key already have successor lists containing at least m of the key's successors
(A toy sketch of this early-termination test follows below.)
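A toy sketch of the early-termination test (integer IDs, ring wraparound ignored, successor-list contents assumed):

```python
def can_stop_early(successor_list, key, m):
    """Integration of lookup and fetch: a hop may end the lookup as soon as its
    successor list already names >= m of the key's successors, i.e. enough
    fragment holders to reconstruct the block, instead of routing all the way
    to the key's direct predecessor."""
    holders = [n for n in successor_list if n >= key]  # the key's successors (toy)
    return len(holders) >= m
```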

Low latency: integration
d: the number of nodes in the successor list that hold a needed fragment
Choosing d trades lookup time against fetch latency
- Large d: more hops for the lookup, but more choice among nodes
- Small d: fewer hops, but higher fetch latency
Optimum: d = l

High throughput: overview
Requirements for a DHT
- Send and receive data in parallel
- Congestion control, to avoid packet loss and re-transmissions
- Recovery from packet loss
- Difficulty: data is spread over a large set of servers
 An efficient transport protocol is needed
First possibility: use an existing transport protocol
TCP (Transmission Control Protocol)
- Provides congestion control, but
- Optimal congestion control and timeout estimation take some time to converge
- Imposes start-up latency
- The number of simultaneous connections is limited
Approach taken by DHash++
- A small number of TCP connections to neighbours
- All communication is routed over these neighbours

High throughput: STP
Second possibility: design an alternative transport protocol
STP (Striped Transport Protocol)
- A single instance receives data from, and transmits data directly to, other nodes
- No per-destination state; decisions are based on recent network behaviour and synthetic coordinates (Vivaldi, sketched below)
Remote procedure call (RPC)
- Calls a procedure located on another host on the network
- Example: a lookup calls a procedure on a host to retrieve a finger table entry
Round-trip time (RTT)
- The time it takes to send a packet to a host and receive a response
- Used to measure delay on a network
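A sketch of RTT prediction from Vivaldi coordinates, assuming plain Euclidean coordinates measured in milliseconds (Vivaldi variants may add a height component):

```python
import math

def predicted_rtt(coord_a, coord_b):
    """STP keeps no per-destination RTT history; instead it predicts an RPC's
    round-trip time as the distance between two nodes' synthetic coordinates."""
    return math.dist(coord_a, coord_b)

print(predicted_rtt((12.0, -3.5), (40.0, 18.0)))  # toy coordinates, in ms
```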

High throughput: STP
Congestion window control
- At most w simultaneous RPCs in flight
- A new RPC is issued only when an old one has finished
- AIMD: if an RPC is answered, w is increased by 1/w; on a timeout, w is decreased by w/2 (a sketch follows below)
Retransmit timers
- TCP predicts the next RTT from the deviation of past RTTs
- STP in general does not send repeated RPCs to the same node
 It must predict the RTT before sending, and therefore uses Vivaldi
Retransmit policy
- No direct resending when a timer expires
- Instead, the application (DHash++) is notified
 On a lookup: send to the next-closest finger
 On a fetch: query the successor that is next-closest in predicted latency
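A minimal sketch of the window rule just described (class and method names are mine):

```python
class STPWindow:
    """AIMD over outstanding RPCs rather than bytes: at most int(w) RPCs may
    be in flight; each answered RPC grows w by 1/w, each timeout shrinks it."""

    def __init__(self):
        self.w = 1.0                            # current window of simultaneous RPCs

    def can_send(self, in_flight: int) -> bool:
        return in_flight < int(self.w)          # issue a new RPC only if room remains

    def on_reply(self):
        self.w += 1.0 / self.w                  # additive increase per answered RPC

    def on_timeout(self):
        self.w = max(1.0, self.w - self.w / 2)  # decrease by w/2, i.e. halve the window
```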

High throughput: TCP vs. STP
- Sequence of single random fetches among 24 nodes
- Median fetch time: 192 ms with STP, 447 ms with TCP
- On average, 3 hops to complete a lookup
But
- A TCP fetch passes through 3 TCP connections consecutively
- An STP fetch uses a single direct STP exchange

High throughput: TCP vs. STP
- Simultaneous fetches from different nodes
- Median throughput: 133 KB/s with TCP, 261 KB/s with STP
- Same reason for the result as on the previous slide

Summary
Different design decisions
- Recursive routing  reduces the number of packets sent
- Proximity neighbor selection  prefers nearby nodes
- Erasure coding  increases the availability of data
- Integration  reduces the number of lookup hops
Result: lookup and fetch latency reduced by up to a factor of 2
Alternative transport protocol STP
- Fitted to the needs of DHash++
- Direct connections between nodes
 Higher throughput than TCP
 Lower latency than TCP
Result: latency further reduced and throughput improved by a factor of 2

Thanks for Your Attention