1
Peer-to-Peer Systems and Distributed Hash Tables
COS 518: Advanced Computer Systems Lecture 15 Daniel Suo [Credit: All slides copied wholesale from Kyle Jamieson and Mike Freedman. Selected content adapted from B. Karp, R. Morris]
2
Today Peer-to-Peer Systems Distributed Hash Tables
Napster, Gnutella, BitTorrent, challenges Distributed Hash Tables The Chord Lookup Service Concluding thoughts on DHTs, P2P We’ll begin today with peer-to-peer systems from 15 years back, like Napster and Gnutella. Then we’ll talk about a new data structure that scaled well, called the Distributed Hash Table. We finish with Chord, a lookup service that maps individual data items onto nodes in a peer-to-peer system. Eventually: how P2P influenced industrial large-scale distributed systems.
3
What is a Peer-to-Peer (P2P) system?
A distributed system architecture: no centralized control, nodes roughly symmetric in function, a large number of unreliable nodes. [Diagram: nodes talking directly to one another across the Internet.] Users’ computers talk directly to each other to implement the service, in contrast to the client-server model where users’ clients talk to central servers. EXAMPLES: Skype, video and music players, file sharing. SEGUE: So why might this be a good idea?
4
Why might P2P be a win? High capacity for services through parallelism: Many disks Many network connections Many CPUs Absence of a centralized server or servers may mean: Less chance of service overload as load increases Easier deployment A single failure won’t wreck the whole system System as a whole is harder to attack From a performance perspective, the promise of peer-to-peer computing was to leverage thousands to millions of nodes across the Internet to increase the capacity of services. >>> As more nodes join, the system gets more resources. Contrast with client-server, where more users means the fixed server resources are more heavily taxed. >>> Nodes can join/leave just by talking to a random peer already in the system, so there is no need to set up a server and provision infrastructure. >>> >>> Harder to attack: not as simple as hacking a single server or server group.
5
P2P adoption Successful adoption in some niche areas –
Client-to-client (legal, illegal) file sharing Popular data but owning organization has no money Digital currency: no natural single owner (Bitcoin) Voice/video telephony: user to user anyway Issues: Privacy and control So, the result has been successful adoption in some niche areas. 1. Networks like Napster music-sharing (ca. 1999) and Gnutella began this in the illegal domain for music and movies, but more recently we’ve seen the adoption of BitTorrent for rapidly disseminating files among users. 2. Electronic currencies like Bitcoin have no central authority, so they use a P2P network. 3. Up until 2012 Skype used P2P to route calls between users, but this has since changed to hosted Linux boxes for security reasons.
6
Example: Classic BitTorrent
User clicks on download link Gets torrent file with content hash, IP addr of tracker User’s BitTorrent (BT) client talks to tracker Tracker tells it list of peers who have file User’s BT client downloads file from one or more peers User’s BT client tells tracker it has a copy now, too User’s BT client serves the file to others for a while E.g., let’s look at how a popular cooperative file-sharing system, BitTorrent, works. Say you’re the user and want to download the latest Linux kernel. You click on the torrent download link. This gets a torrent file that contains hashes of chunks of the file, and the IP address of a tracker node. The tracker node is a web server that coordinates the group of peers. The client sends an HTTP request to the tracker telling it which file it wants, then the tracker responds with a list of peers who have the file. Download the file from those peers, in parallel. Note that only metadata, not file data, is exchanged with the centralized server, and only during the transient lifetime of a download. The client then switches roles and acts as a “server” peer, serving the file to others. This provides huge download bandwidth, without an expensive server or network links.
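To make the tracker’s role concrete, here is a minimal sketch of the bookkeeping a classic tracker performs: an in-memory map from each torrent’s infohash to the peers currently in its swarm. This is illustrative Python, not the real BitTorrent wire protocol (which uses HTTP announce requests and bencoded responses); the `Tracker` class and `announce` method are assumptions for illustration.

```python
# Minimal sketch of a classic BitTorrent tracker's bookkeeping (illustrative only).
from collections import defaultdict

class Tracker:
    def __init__(self):
        # infohash -> set of (ip, port) pairs of peers that have (part of) the file
        self.swarms = defaultdict(set)

    def announce(self, infohash, peer_addr):
        """A peer reports itself and receives the current list of other peers."""
        others = [p for p in self.swarms[infohash] if p != peer_addr]
        self.swarms[infohash].add(peer_addr)   # the caller can now serve chunks too
        return others

# Usage: the first peer gets an empty list; later peers learn about earlier ones.
tracker = Tracker()
print(tracker.announce("abc123", ("10.0.0.1", 6881)))  # []
print(tracker.announce("abc123", ("10.0.0.2", 6881)))  # [('10.0.0.1', 6881)]
```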
7
The lookup problem
[Diagram: publisher N4 does put(“Pacific Rim.mp4”, [content]); a client at N1 does get(“Pacific Rim.mp4”) across nodes N1–N6 connected over the Internet.]
At the core of all peer-to-peer systems is the following problem. Some publisher does a put associating some key, say “Pacific Rim.mp4”, with some value. Now some client comes along and does a get, looking for the information about the file. How does the client find out where that information went? Non-trivial: there are thousands of nodes in a P2P network failing and coming back up, so the set of nodes that have the movie may change over time.
8
Centralized lookup (Napster)
[Diagram: publisher N4 calls SetLoc(“Pacific Rim.mp4”, IP address of N4) on a central DB; client N1 calls Lookup(“Pacific Rim.mp4”) and then fetches key=“Pacific Rim.mp4”, value=[content] directly from N4.] Napster (1999) was one of the first music-sharing networks; its solution was a single centralized database to store all the location information. >>> The database holds the IP address of the publisher, e.g. N4 for “Pacific Rim.mp4”. >>> The client does a lookup on the DB, then can contact N4 directly to get the content. >>> SEGUE: So now we’ve solved our rendezvous problem, but at the expense of the database having to keep a lot of state, which makes that state hard to keep up to date. It’s also a single point of failure, and a legal target, which is exactly what happened to Napster in the end. Simple, but O(N) state and a single point of failure
9
Flooded queries (original Gnutella)
[Diagram: client N1 floods Lookup(“Pacific Rim.mp4”) to its neighbors, which forward it across N2–N6 until it reaches publisher N4, which holds key=“Pacific Rim.mp4”, value=[content].] The opposite end of the spectrum: abandon any structure altogether and just flood queries throughout the entire network. Everyone maintains a small number of connected peers, and a query gets flooded to all connected peers until it finds the content. This is the approach the original Gnutella took in 2000, but it doesn’t scale. >>> SEGUE: It’s robust because there’s no central database to fail, so it’s truly P2P. But a lookup might require a number of messages on the order of the number of peers in the network, so it didn’t scale. Robust, but O(N = number of peers) messages per lookup
10
Routed DHT queries (Chord)
[Diagram: client N1’s Lookup(H(audio data)) is routed hop by hop through N2–N6 to publisher N4, which holds key=H(audio data), value=[content].] We’ve seen two ends of the spectrum: centralized lookup (central point of failure) and flooding (performance problems). >>> So what we’d like is to route queries between peers instead, in an efficient way, and make the whole system scale with reasonable state at each node. Can we make it robust, with reasonable state and a reasonable number of hops?
11
Today Peer-to-Peer Systems Distributed Hash Tables
The Chord Lookup Service Concluding thoughts on DHTs, P2P So that's exactly what distributed hash tables and Chord work together to accomplish.
12
What is a DHT (and why)? Local hash table: key = Hash(name)
put(key, value) / get(key) → value. Service: constant-time insertion and lookup. Suppose we want to associate some KEY with a VALUE. Recall that a single-node hash table stores data under a key that is the hash of a name. It provides PUT to store a value under a key, and GET to fetch the value for a key. >>> SEGUE: If you ask how to accomplish this using millions of hosts on the Internet, the answer is the Distributed Hash Table. How can I do (roughly) this across millions of hosts on the Internet? Distributed Hash Table (DHT)
13
What is a DHT (and why)? Distributed Hash Table: key = hash(data)
lookup(key) → IP addr (Chord lookup service); send-RPC(IP address, put, key, data); send-RPC(IP address, get, key) → data. Partitioning data in truly large-scale distributed systems: tuples in a global database engine, data blocks in a global file system, files in a P2P file-sharing system. So in a distributed hash table you run the data through a hash function to get its key, then Chord tells you the IP address of the server that should store that content. Issue RPCs to that server to put/get the content. Consistency guarantees: if we have a put followed by a get, then the get probably reflects the put, but there is no strict guarantee of this.
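A rough sketch of how an application composes these pieces; `chord_lookup` and `rpc` are hypothetical stand-ins for the Chord lookup service and an RPC mechanism, not real APIs.

```python
import hashlib

def dht_key(data: bytes) -> int:
    """Key = hash(data), as on the slide (here SHA-1, interpreted as an integer)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

# Hypothetical helpers: chord_lookup(key) -> IP address of the responsible node,
# rpc(ip, op, *args) -> sends the operation to that node. Both are assumptions.
def put(chord_lookup, rpc, data: bytes):
    key = dht_key(data)
    ip = chord_lookup(key)            # lookup(key) -> IP addr
    rpc(ip, "put", key, data)         # send-RPC(IP address, put, key, data)
    return key

def get(chord_lookup, rpc, key: int) -> bytes:
    ip = chord_lookup(key)
    return rpc(ip, "get", key)        # send-RPC(IP address, get, key) -> data
```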
14
Cooperative storage with a DHT
[Layer diagram: Distributed application → put(key, data) / get(key) → data → Distributed hash table (DHash) → lookup(key) → node IP address → (Chord) lookup service, running on many nodes.] So the big picture is this. The app wants to use a hash table abstraction to put/get data. The core software is in two layers: DHash and Chord. DHash fetches blocks, distributes blocks over the servers, and maintains replicated copies. DHash uses the Chord lookup service to locate the servers responsible for blocks. The app may be distributed over many nodes; the DHT distributes data storage over many nodes.
15
BitTorrent over DHT BitTorrent can use a DHT instead of (or alongside) a tracker. BT clients use the DHT: Key = file content hash (“infohash”) Value = IP address of a peer willing to serve the file Can store multiple values (i.e. IP addresses) for a key Client does: get(infohash) to find other clients willing to serve put(infohash, my-ipaddr) to identify itself as willing So now, going back to BitTorrent, we can make it resilient to tracker failure by having all clients join the same DHT. WHY NOT the Chord lookup service alone: we need multiple values per key.
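A hedged sketch of this multi-value usage, with a plain in-memory dictionary standing in for the DHT (the `dht_put`/`dht_get` names are made up for illustration).

```python
from collections import defaultdict

# A plain dictionary stands in for the DHT; each key maps to a *set* of values,
# since many peers may register under the same infohash.
dht = defaultdict(set)

def dht_put(infohash: str, my_ipaddr: str):
    """put(infohash, my-ipaddr): announce that we are willing to serve this file."""
    dht[infohash].add(my_ipaddr)

def dht_get(infohash: str):
    """get(infohash): find other clients willing to serve the file."""
    return sorted(dht[infohash])

dht_put("deadbeef", "10.0.0.1")
dht_put("deadbeef", "10.0.0.2")
print(dht_get("deadbeef"))   # ['10.0.0.1', '10.0.0.2']
```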
16
Why might DHT be a win for BitTorrent?
The DHT comprises a single giant tracker, less fragmented than many separate trackers (one per system), so peers are more likely to find each other. Maybe a classic tracker is also too exposed to legal and other attacks.
17
Why the put/get DHT interface?
API supports a wide range of applications DHT imposes no structure/meaning on keys Key/value pairs are persistent and global Can store keys in other DHT values And thus build complex data structures
18
Why might DHT design be hard?
Decentralized: no central authority Scalable: low network traffic overhead Efficient: find items quickly (latency) Dynamic: nodes fail, new nodes join
19
Today Peer-to-Peer Systems Distributed Hash Tables
The Chord Lookup Service Basic design Integration with DHash DHT, performance
20
Chord lookup algorithm properties
Interface: lookup(key) → IP address Efficient: O(log N) messages per lookup N is the total number of servers Scalable: O(log N) state per node Robust: survives massive failures Simple to analyze
21
Chord identifiers Key identifier = SHA-1(key)
Node identifier = SHA-1(IP address) SHA-1 distributes both uniformly How does Chord partition data? i.e., map key IDs to node IDs Chord wants to evenly partition the data onto servers, so it first hashes both the key and the IP address of every server using the same hash function. SEGUE: The cleverness is in how Chord partitions the data.
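A minimal sketch of deriving both kinds of identifiers with the same hash function, using Python’s hashlib; reducing the digest to an m-bit ring position is an assumption for illustration.

```python
import hashlib

M = 7  # identifier bits; the examples in these slides use a 7-bit circular space

def chord_id(text: str, m: int = M) -> int:
    """SHA-1 the input and reduce it to an m-bit identifier on the ring."""
    digest = int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")
    return digest % (2 ** m)

key_id  = chord_id("Pacific Rim.mp4")   # key identifier  = SHA-1(key)
node_id = chord_id("128.112.7.15")      # node identifier = SHA-1(IP address)
print(key_id, node_id)
```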
22
Consistent hashing [Karger ‘97]
Circular 7-bit ID space. [Diagram: identifiers such as Key 5 and Node 105 placed on the ring.] Chord uses a technique called consistent hashing, invented by David Karger and colleagues in 1997. All identifiers live in a single circular space. A key is stored at its successor: the node with the next-higher ID. SEGUE: So let’s look at how nodes find each other in the Chord ring.
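A small sketch of the placement rule “key is stored at its successor”, using a sorted list of node IDs and wrapping around the circular space; the node set is hypothetical.

```python
import bisect

def successor(node_ids, key_id):
    """Return the node with the next-higher (or equal) ID, wrapping around the circle."""
    ring = sorted(node_ids)
    i = bisect.bisect_left(ring, key_id)
    return ring[i % len(ring)]          # wrap past the top of the ID space

nodes = [32, 90, 105]                   # hypothetical node IDs on a 7-bit (0..127) ring
print(successor(nodes, 80))             # 90:  key 80 is stored at node 90
print(successor(nodes, 110))            # 32:  wraps around; key 110's successor is node 32
```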
23
Chord: Successor pointers
Each node is connected to its successor by a direct pointer to the successor’s IP address that’s stored locally. This is called the successor pointer. [Diagram: ring with N32, N60, N90 and key K80.]
24
Basic lookup. [Diagram: N10 asks “Where is K80?”; the query travels around the ring of N10, N32, N60, N90, N105, N120, and the answer “N90 has K80” is returned.]
When one node receives a query, it can forward the query to its successor, so the query moves around the ring. >>> When it reaches the node that has the key, that node replies directly to the node that asked; we keep the identity of the querying node in the query.
25
Simple lookup algorithm
Lookup(key-id)
  succ ← my successor
  if my-id < succ < key-id   // next hop
    call Lookup(key-id) on succ
  else                       // done
    return succ
Correctness depends only on successors. So the lookup algorithm is called on a certain node in the Chord ring. It tests whether the IDs are in the order [its own ID, then successor, then key]; if so, it forwards the query to the successor. (The ‘<’ comparisons are taken modulo the size of the ring.) Otherwise the order is [current, key, successor], and the answer is the successor. The algorithm always undershoots to the predecessor, so it never misses the real successor. SEGUE: n.successor must be correct! Otherwise we may skip over the responsible node, and get(k) won’t see data inserted by put(k). Now, how about speed?
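The subtle part is that ‘<’ is interpreted on the circular ID space. Here is a sketch of that interval test, plus a simulated walk over successor pointers; the `Node` class, `between` helper, and node IDs are illustrative, not Chord’s actual code.

```python
def between(a, x, b, m=7):
    """True if x lies in the half-open arc (a, b] walking clockwise on a 2^m ring."""
    size = 2 ** m
    return (x - a) % size <= (b - a) % size and x != a

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = None   # set after the ring is built

    def lookup(self, key_id):
        # If the key falls in (my-id, successor-id], my successor is responsible.
        if between(self.id, key_id, self.successor.id):
            return self.successor
        return self.successor.lookup(key_id)   # otherwise forward around the ring

# Build a tiny ring: 32 -> 60 -> 90 -> 105 -> back to 32.
ids = [32, 60, 90, 105]
nodes = [Node(i) for i in ids]
for cur, nxt in zip(nodes, nodes[1:] + nodes[:1]):
    cur.successor = nxt

print(nodes[0].lookup(80).id)    # 90: K80 is stored at N90
print(nodes[0].lookup(110).id)   # 32: wraps around past N105
```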
26
Improving performance
Problem: Forwarding through successor is slow Data structure is a linked list: O(n) Idea: Can we make it more like a binary search? Need to be able to halve distance at each step >>> Can we make it more like a binary search to speed up performance?
27
“Finger table” allows log N-time lookups
The data structure that Chord uses to do this is called a finger table. Finger table entries contain the IP address and Chord ID of the target server. Nice property: one of the fingers takes you roughly half-way to the target, so log N-time lookups with log N hops. SEGUE: So for example… [Diagram: N80 with fingers pointing 1/64, 1/32, 1/16, 1/8, … of the way around the ring.]
28
Finger i Points to Successor of n+2^i
[Diagram: N80’s fingers at 1/64, 1/32, 1/16, 1/8, … of the ring; key K112.] Let’s see how finger tables let us route to this key, K112. It will be stored at N120, so the ¼ finger table entry will take us almost there without overshooting it.
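A sketch of how such a finger table could be computed for a known set of node IDs (illustrative only; real Chord learns its fingers via lookups rather than from a global membership list).

```python
def build_finger_table(n, node_ids, m=7):
    """finger[i] points to the successor of n + 2^i (mod 2^m), for i = 0 .. m-1."""
    ring = sorted(node_ids)
    table = []
    for i in range(m):
        target = (n + 2 ** i) % (2 ** m)
        # first node whose ID is >= target, wrapping around the ring
        succ = next((x for x in ring if x >= target), ring[0])
        table.append(succ)
    return table

nodes = [32, 60, 80, 90, 105, 120]       # hypothetical node IDs
print(build_finger_table(80, nodes))
# Fingers 0..6 of N80 are the successors of 81, 82, 84, 88, 96, 112, 16:
# -> [90, 90, 90, 90, 105, 120, 32]
```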
29
Implication of finger tables
A binary lookup tree rooted at every node Threaded through other nodes' finger tables This is better than simply arranging the nodes in a single tree Every node acts as a root So there's no root hotspot No single point of failure But a lot more state in total So if you think about the finger tables at all the nodes taken together… SEGUE: So here’s the detailed lookup algorithm with finger tables.
30
Lookup with finger table
Lookup(key-id)
  look in local finger table for
    highest n: my-id < n < key-id
  if n exists
    call Lookup(key-id) on node n   // next hop
  else
    return my successor             // done
Still undershoots to the predecessor, so it never misses the real successor. The lookup procedure isn’t inherently log(N); the finger table causes it to be.
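A sketch of one hop of this rule, choosing the highest (closest-preceding) finger strictly between my ID and the key; the finger values reuse the hypothetical N80 table from the previous sketch.

```python
def strictly_between(a, x, b, m=7):
    """True if x lies strictly inside the clockwise arc (a, b) on a 2^m ring."""
    size = 2 ** m
    return 0 < (x - a) % size < (b - a) % size

def finger_lookup_step(my_id, key_id, fingers, successor_id):
    """One hop: pick the highest finger n with my-id < n < key-id, else we are done."""
    best = None
    for f in fingers:
        if strictly_between(my_id, f, key_id):
            if best is None or strictly_between(best, f, key_id):
                best = f               # f is closer to the key than the current best
    if best is not None:
        return ("forward to", best)
    return ("done, answer is my successor", successor_id)

# N80's (hypothetical) fingers from the previous sketch; look up K19 starting at N80.
fingers_of_80 = [90, 90, 90, 90, 105, 120, 32]
print(finger_lookup_step(80, 19, fingers_of_80, 90))   # ('forward to', 120)
```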
31
Lookups Take O(log N) Hops
Maybe note that fingers point to the first relevant node. [Diagram: Lookup(K19) hops across the ring via finger pointers among N32, N60, N80, N99.]
32
An aside: Is log(n) fast or slow?
For a million nodes, it’s 20 hops If each hop takes 50 milliseconds, lookups take a second If each hop has 10% chance of failure, it’s a couple of timeouts So in practice log(n) is better than O(n) but not great SEGUE: So we’ll come back to this point when we discuss Chord’s impact at the end.
33
Joining: Linked list insert
Joining is essentially a distributed linked-list insertion. The new node N36 knows another node in the ring (N25). First, the new node looks itself up, starting at that node: 1. Lookup(36). That returns N40. [Diagram: N36 joins near N40, which currently stores K30 and K38.]
34
Join (2): 2. N36 sets its own successor pointer (to N40). [Diagram: N25, N36, N40; K30 and K38 still at N40.]
35
Join (3): 3. Copy keys 26..36 from N40 to N36. [Diagram: K30 is now stored at N36; K38 remains at N40.]
SEGUE: So now let’s see how N36 gets fully incorporated into the Chord ring.
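A sketch of these three join steps, reusing the `Node` class and `between` helper from the simple-lookup sketch above; the `join` function and the `store` dictionary standing in for per-node storage are assumptions for illustration.

```python
def join(new_node, existing_node, store):
    """Sketch of the three join steps. `store` maps node-id -> {key-id: value} and
    stands in for each node's local storage."""
    # 1. The new node looks itself up, starting at any node already in the ring.
    succ = existing_node.lookup(new_node.id)
    # 2. The new node sets its own successor pointer.
    new_node.successor = succ
    # 3. Keys the new node is now responsible for move over from the successor.
    for k in list(store[succ.id]):
        if between(succ.id, k, new_node.id):   # k no longer belongs to succ
            store[new_node.id][k] = store[succ.id].pop(k)

# Usage, mirroring the slides: a ring with N25 and N40; N36 joins via N25.
n25, n40 = Node(25), Node(40)
n25.successor, n40.successor = n40, n25
store = {25: {}, 36: {}, 40: {30: "K30 data", 38: "K38 data"}}
n36 = Node(36)
join(n36, n25, store)
print(n36.successor.id, store[36], store[40])   # 40 {30: 'K30 data'} {38: 'K38 data'}
```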
36
Notify messages maintain predecessors
Periodic notify messages that each node sends to its successor maintain the predecessor pointers. When node N36 sends N40 a notify, N40 checks its current predecessor (N25) and decides that N36 should be its predecessor instead. If N25 later sends N40 a notify, it is a no-op. [Diagram: N25 and N36 each send notify to N40.]
37
Stabilize message fixes successor
You send a stabilize message to the node you THINK is your successor. [Diagram: N25 sends stabilize to N40; N40 replies “My predecessor is N36,” so N25 corrects its successor pointer from N40 (✘) to N36 (✔).]
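A sketch of how notify and stabilize repair the pointers after N36 joins, again reusing the `Node` class and `between` helper from the simple-lookup sketch; the `RingNode` class is an illustrative extension, not Chord’s actual code.

```python
class RingNode(Node):                      # extends the Node sketch from earlier
    def __init__(self, node_id):
        super().__init__(node_id)
        self.predecessor = None

    def stabilize(self):
        """Ask my successor for its predecessor; adopt it as successor if it sits between us."""
        x = self.successor.predecessor
        if x is not None and x is not self.successor and between(self.id, x.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)        # then tell my (possibly new) successor about me

    def notify(self, candidate):
        """Adopt `candidate` as predecessor if I have none or it is closer than the current one."""
        if self.predecessor is None or between(self.predecessor.id, candidate.id, self.id):
            self.predecessor = candidate

# N36 has just joined with successor N40; N25 still believes its successor is N40.
n25, n36, n40 = RingNode(25), RingNode(36), RingNode(40)
n25.successor, n36.successor, n40.successor = n40, n40, n25
n40.predecessor = n25
n36.stabilize()      # notify: N40 adopts N36 as its predecessor
n25.stabilize()      # N25 hears "my predecessor is N36" and corrects its successor
print(n25.successor.id, n40.predecessor.id)   # 36 36
```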
38
Joining: Summary Predecessor pointer allows link to new node
Update finger pointers in the background Correct successors produce correct lookups
39
Failures may cause incorrect lookup
N10 is looking up key K90. No problem, until the lookup gets to N80, which knows of no live node before K90. There might be a replica of K90 at N113, but N80 can’t find it to make progress. [Diagram: Lookup(K90) reaches N80; N80’s successor N85 has failed.] N80 does not know its correct successor, so the lookup is incorrect.
40
Successor lists Each node stores a list of its r immediate successors
After failure, will know first live successor Correct successors guarantee correct lookups Guarantee is with some probability So rather than just one immediate successor, each node keeps track of its r immediately following successor nodes. SEGUE: So, how do we choose r?
41
Choosing successor list length
Assume one half of the nodes fail. P(successor list all dead) = (½)^r, i.e., P(this node breaks the Chord ring). Depends on independent failure. A successor list of size r = O(log N) makes this probability 1/N: low for large N.
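Spelling out the arithmetic behind the last bullet (a standard calculation under the stated independence assumption):

```latex
% Each of the r successor-list entries is dead independently with probability 1/2,
% so with r = \log_2 N entries:
\Pr\big[\text{entire successor list dead}\big]
  = \left(\tfrac{1}{2}\right)^{r}
  = \left(\tfrac{1}{2}\right)^{\log_2 N}
  = \frac{1}{N}.
```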
42
Lookup with fault tolerance
Lookup(key-id)
  look in local finger table and successor list for
    highest n: my-id < n < key-id
  if n exists
    call Lookup(key-id) on node n   // next hop
    if call failed,
      remove n from finger table and/or successor list
      return Lookup(key-id)
  else
    return my successor             // done
So the lookup algorithm with fault tolerance becomes this. Look in the successor list as well. If the call fails, we remove the node from the finger table and/or successor list, and restart the lookup.
43
Today Peer-to-Peer Systems Distributed Hash Tables
The Chord Lookup Service Basic design Integration with DHash DHT, performance
44
The DHash DHT Builds key/value storage on Chord
Replicates blocks for availability Stores k replicas at the k successors after the block on the Chord ring Caches blocks for load balancing Client sends copy of block to each of the servers it contacted along the lookup path Authenticates block contents SEGUE: Then to make it secure for a peer-to-peer setting, it authenticates block contents.
45
DHash data authentication
Two types of DHash blocks: Content-hash: key = SHA-1(data) Public-key: key is a cryptographic public key, data are signed by the corresponding private key (Chord File System example.) So let’s take a closer look at how DHash authenticates that the data it stores is from some known identity, i.e. some private key. DHash servers verify the chain of signatures rooted at the public key before accepting put(key, value). Clients verify the result of get(key) through the same chain. The whole thing is read-only; you need to overwrite the root block to update.
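A sketch of the two verification checks; `verify_signature` is a hypothetical stand-in for a real signature-verification routine, and only the content-hash check is exercised below.

```python
import hashlib

def check_content_hash_block(key: bytes, data: bytes) -> bool:
    """Content-hash block: anyone can verify that key = SHA-1(data)."""
    return hashlib.sha1(data).digest() == key

def check_public_key_block(pubkey: bytes, data: bytes, signature: bytes,
                           verify_signature) -> bool:
    """Public-key block: the key is a public key and the data must carry a valid
    signature from the matching private key. `verify_signature(pubkey, data, sig)`
    is a hypothetical stand-in for a real signature-verification routine."""
    return verify_signature(pubkey, data, signature)

data = b"some file block"
key = hashlib.sha1(data).digest()
print(check_content_hash_block(key, data))          # True
print(check_content_hash_block(key, b"tampered"))   # False
```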
46
DHash replicates blocks at r successors
[Diagram: Block 17 and its replicas placed at consecutive successor nodes on a ring of N5, N10, N20, N40, N50, N60, N68, N80, N99, N110.] Nodes with replicas are not likely to be near each other in the underlying network. Replicas are easy to find if the successor fails. Hashed node IDs ensure independent failure.
47
Experimental overview
Quick lookup in large systems Low variation in lookup costs Robust despite massive failure Goal: Experimentally confirm theoretical results
48
Chord lookup cost is O(log N)
Actually, the lookup cost is ½ log(N). Error bars: one standard deviation, so the variation in the lookup costs is not too high. [Graph: average messages per lookup vs. number of nodes; the constant is 1/2.]
49
Failure experiment setup
Start 1,000 Chord servers Each server’s successor list has 20 entries Wait until they stabilize Insert 1,000 key/value pairs Five replicas of each Stop X% of the servers, immediately make 1,000 lookups So to see how well the successor list and data replication handles failure, we’ll look at a large scale experiment with 1,000 servers.
50
Massive failures have little impact
(1/2)^6 is 1.6%. [Graph: failed lookups (percent) vs. failed nodes (percent).] This is before stabilization starts after we fail X% of the servers, so all lookup failures are attributable to loss of all 6 replicas.
51
Today Peer-to-Peer Systems Distributed Hash Tables
The Chord Lookup Service Basic design Integration with DHash DHT, performance Concluding thoughts on DHT, P2P
52
DHTs: Impact Original DHTs (CAN, Chord, Kademlia, Pastry, Tapestry) were proposed around 2001. The following 5-6 years saw a proliferation of DHT-based applications: filesystems (e.g., CFS, Ivy, OceanStore, Pond, PAST), naming systems (e.g., SFR, Beehive), DB query processing and distributed databases (e.g., PIER), and content distribution systems (e.g., Coral).
53
Why don’t all services use P2P?
High latency and limited bandwidth between peers (compared with connections between servers in a datacenter cluster) User computers are less reliable than managed servers Lack of trust in peers’ correct behavior Securing DHT routing is hard, unsolved in practice So, returning to P2P systems, part of the problem is still the lookup problem. But the other part is that users’ computers are fundamentally less reliable and less trusted than managed servers. This is exacerbated by the fact that users’ computers have flaky network connections and might be switched off, broken, etc., so they are hardly as reliable as managed servers. Some P2P services are open, meaning anyone can join; these can be vulnerable to certain kinds of attacks.
54
DHTs in retrospect Seem promising for finding data in large P2P systems Decentralization seems good for load, fault tolerance But: the security problems are difficult But: churn is a problem, particularly if log(N) is big So DHTs have not had the impact that many hoped for
55
What DHTs got right Consistent hashing
Elegant way to divide a workload across machines Very useful in clusters: actively used today in Amazon Dynamo and other systems Replication for high availability, efficient recovery after node failure Incremental scalability: “add nodes, capacity increases” Self-management: minimal configuration Unique trait: no single server to shut down/monitor
56
Pre-reading: Bayou paper (on website)
Wednesday topic: Eventual Consistency Pre-reading: Bayou paper (on website) 11:59 PM Wednesday: Assignment 2 Deadline On Wednesday we will look at a different type of consistency that relaxes the strong consistency that RAFT and Paxos provide.