John Kubiatowicz Electrical Engineering and Computer Sciences

Slides:



Advertisements
Similar presentations
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
Advertisements

Peer to Peer and Distributed Hash Tables
Pastry Peter Druschel, Rice University Antony Rowstron, Microsoft Research UK Some slides are borrowed from the original presentation by the authors.
Peter Druschel, Rice University Antony Rowstron, Microsoft Research UK
EECS 262a Advanced Topics in Computer Systems Lecture 21 Chord/Tapestry November 7 th, 2012 John Kubiatowicz and Anthony D. Joseph Electrical Engineering.
CHORD – peer to peer lookup protocol Shankar Karthik Vaithianathan & Aravind Sivaraman University of Central Florida.
Chord A Scalable Peer-to-peer Lookup Service for Internet Applications Ion Stoica, Robert MorrisDavid, Liben-Nowell, David R. Karger, M. Frans Kaashoek,
CHORD: A Peer-to-Peer Lookup Service CHORD: A Peer-to-Peer Lookup Service Ion StoicaRobert Morris David R. Karger M. Frans Kaashoek Hari Balakrishnan Presented.
Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M
Chord: A scalable peer-to- peer lookup service for Internet applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashock, Hari Balakrishnan.
Robert Morris, M. Frans Kaashoek, David Karger, Hari Balakrishnan, Ion Stoica, David Liben-Nowell, Frank Dabek Chord: A scalable peer-to-peer look-up.
Robert Morris, M. Frans Kaashoek, David Karger, Hari Balakrishnan, Ion Stoica, David Liben-Nowell, Frank Dabek Chord: A scalable peer-to-peer look-up protocol.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Robert Morris Ion Stoica, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Ion StoicaRobert Morris David Liben-NowellDavid R. Karger M. Frans KaashoekFrank.
Pastry Peter Druschel, Rice University Antony Rowstron, Microsoft Research UK Some slides are borrowed from the original presentation by the authors.
1 Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Robert Morris Ion Stoica, David Karger, M. Frans Kaashoek, Hari Balakrishnan.
Topics in Reliable Distributed Systems Lecture 2, Fall Dr. Idit Keidar.
Handling Churn in a DHT USENIX Annual Technical Conference June 29, 2004 Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz UC Berkeley and.
Introduction to Peer-to-Peer (P2P) Systems Gabi Kliot - Computer Science Department, Technion Concurrent and Distributed Computing Course 28/06/2006 The.
Positive Feedback Loops in DHTs or Be Careful How You Simulate January 13, 2004 Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz From “Handling.
Spring 2003CS 4611 Peer-to-Peer Networks Outline Survey Self-organizing overlay network File system on top of P2P network Contributions from Peter Druschel.
Distributed Lookup Systems
Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Lecture 2: Peer-to-Peer.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek and Hari alakrishnan.
Structure Overlay Networks and Chord Presentation by Todd Gardner Figures from: Ion Stoica, Robert Morris, David Liben- Nowell, David R. Karger, M. Frans.
Topics in Reliable Distributed Systems Fall Dr. Idit Keidar.
1 CS 194: Distributed Systems Distributed Hash Tables Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering and Computer.
Peer To Peer Distributed Systems Pete Keleher. Why Distributed Systems? l Aggregate resources! –memory –disk –CPU cycles l Proximity to physical stuff.
Or, Providing Scalable, Decentralized Location and Routing Network Services Tapestry: Fault-tolerant Wide-area Application Infrastructure Motivation and.
1 Peer-to-Peer Networks Outline Survey Self-organizing overlay network File system on top of P2P network Contributions from Peter Druschel.
Tapestry: A Resilient Global-scale Overlay for Service Deployment Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John.
Tapestry An off-the-wall routing protocol? Presented by Peter, Erik, and Morten.
CSE 461 University of Washington1 Topic Peer-to-peer content delivery – Runs without dedicated infrastructure – BitTorrent as an example Peer.
Tapestry GTK Devaroy (07CS1012) Kintali Bala Kishan (07CS1024) G Rahul (07CS3009)
1 Reading Report 5 Yin Chen 2 Mar 2004 Reference: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, david.
Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications Xiaozhou Li COS 461: Computer Networks (precept 04/06/12) Princeton University.
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
Presentation 1 By: Hitesh Chheda 2/2/2010. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT Laboratory for Computer Science.
1 More on Plaxton routing There are n nodes, and log B n digits in the id, where B = 2 b The neighbor table of each node consists of - primary neighbors.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan Presented.
SIGCOMM 2001 Lecture slides by Dr. Yingwu Zhu Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.
Peer to Peer A Survey and comparison of peer-to-peer overlay network schemes And so on… Chulhyun Park
1 Secure Peer-to-Peer File Sharing Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, Hari Balakrishnan MIT Laboratory.
Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Lecture 2: Distributed Hash.
Peer to Peer Network Design Discovery and Routing algorithms
Tapestry : An Infrastructure for Fault-tolerant Wide-area Location and Routing Presenter : Lee Youn Do Oct 5, 2005 Ben Y.Zhao, John Kubiatowicz, and Anthony.
LOOKING UP DATA IN P2P SYSTEMS Hari Balakrishnan M. Frans Kaashoek David Karger Robert Morris Ion Stoica MIT LCS.
CS 347Notes081 CS 347: Parallel and Distributed Data Management Notes 08: P2P Systems.
Nick McKeown CS244 Lecture 17 Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications [Stoica et al 2001]
CS 425 / ECE 428 Distributed Systems Fall 2015 Indranil Gupta (Indy) Peer-to-peer Systems All slides © IG.
Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications * CS587x Lecture Department of Computer Science Iowa State University *I. Stoica,
Peer-to-Peer Information Systems Week 12: Naming
Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M
Pastry Scalable, decentralized object locations and routing for large p2p systems.
The Chord P2P Network Some slides have been borrowed from the original presentation by the authors.
Distributed Hash Tables
A Scalable Peer-to-peer Lookup Service for Internet Applications
Controlling the Cost of Reliability in Peer-to-Peer Overlays
(slides by Nick Feamster)
EE 122: Peer-to-Peer (P2P) Networks
DHT Routing Geometries and Chord
Building Peer-to-Peer Systems with Chord, a Distributed Lookup Service
P2P Systems and Distributed Hash Tables
MIT LCS Proceedings of the 2001 ACM SIGCOMM Conference
Consistent Hashing and Distributed Hash Table
P2P: Distributed Hash Tables
A Scalable Peer-to-peer Lookup Service for Internet Applications
Peer-to-Peer Information Systems Week 12: Naming
#02 Peer to Peer Networking
Simultaneous Insertions in Tapestry
Presentation transcript:

EECS 262a Advanced Topics in Computer Systems Lecture 21 Chord/Tapestry April 11th, 2016 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs262 CS252 S05

Today’s Papers Today: Peer-to-Peer Networks Thoughts? Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications, Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, Hari Balakrishnan, Appears in Proceedings of the IEEE/ACM Transactions on Networking, Vol. 11, No. 1, pp. 17-32, February 2003 Tapestry: A Resilient Global-scale Overlay for Service Deployment, Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John D. Kubiatowicz.  Appears in IEEE Journal on Selected Areas in Communications, Vol 22, No. 1, January 2004 Today: Peer-to-Peer Networks Thoughts?

Peer-to-Peer: Fully equivalent components Peer-to-Peer has many interacting components View system as a set of equivalent nodes “All nodes are created equal” Any structure on system must be self-organizing Not based on physical characteristics, location, or ownership

Research Community View of Peer-to-Peer Old View: A bunch of flakey high-school students stealing music New View: A philosophy of systems design at extreme scale Probabilistic design when it is appropriate New techniques aimed at unreliable components A rethinking (and recasting) of distributed algorithms Use of Physical, Biological, and Game-Theoretic techniques to achieve guarantees

Early 2000: Why the hype??? File Sharing: Napster (+Gnutella, KaZaa, etc) Is this peer-to-peer? Hard to say. Suddenly people could contribute to active global network High coolness factor Served a high-demand niche: online jukebox Anonymity/Privacy/Anarchy: FreeNet, Publis, etc Libertarian dream of freedom from the man (ISPs? Other 3-letter agencies) Extremely valid concern of Censorship/Privacy In search of copyright violators, RIAA challenging rights to privacy Computing: The Grid Scavenge numerous free cycles of the world to do work Seti@Home most visible version of this Management: Businesses Businesses have discovered extreme distributed computing Does P2P mean “self-configuring” from equivalent resources? Bound up in “Autonomic Computing Initiative”?

The lookup problem N3 N2 N1 ? N6 N4 N5 Key=“title” Value=MP3 data… Internet (CyberSpace!) ? Client Publisher Lookup(“title”) N6 N4 1000s of nodes. Set of nodes may change… N5

Centralized lookup (Napster) Client Lookup(“title”) N4 Publisher@ N8 DB SetLoc(“title”,N4) Key=“title” Value=MP3 data… N7 N9 N6 O(N) state means its hard to keep the state up to date. Simple, but O(N) state and a single point of failure

Flooded queries (Gnutella) Client Lookup(“title”) N4 Publisher@ Key=“title” Value=MP3 data… N8 N6 N7 N9 Robust, but worst case O(N) messages per lookup

Routed queries (Freenet, Chord, Tapestry, etc.) Publisher@ N4 N5 Client Key=“title” Value=MP3 data… Lookup(“title”) N6 N8 Challenge: can we make it robust? Small state? Actually find stuff in a changing system? Consistent rendezvous point, between publisher and client. N7 Can be O(log N) messages per lookup (or even O(1)) Potentially complex routing state and maintenance.

Chord IDs Key identifier = 160-bit SHA-1(key) Node identifier = 160-bit SHA-1(IP address) Both are uniformly distributed Both exist in the same ID space How to map key IDs to node IDs? By “key” I usually mean “key identifier” Explain I Results

Consistent hashing [Karger 97] Key 5 K5 Node 105 N105 K20 Circular 160-bit ID space N32 N90 Ids live in a single circular space. K80 A key is stored at its successor: node with next higher ID

Basic lookup N120 N10 N105 N32 K80 N90 N60 “Where is key 80?” “N90 has K80” N32 K80 N90 Just need to make progress, and not overshoot. Will talk about initialization later. And robustness. Now, how about speed? N60

Simple lookup algorithm Lookup(my-id, key-id) n = my successor if my-id < n < key-id call Lookup(id) on node n // next hop else return my successor // done Correctness depends only on successors Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.

“Finger table” allows log(N)-time lookups ¼ ½ 1/8 1/16 1/32 1/64 1/128 N80 Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ …

Finger i points to successor of n+2i 112 ¼ ½ 1/8 1/16 1/32 1/64 1/128 Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ … N80

Lookup with fingers Lookup(my-id, key-id) look in local finger table for highest node n s.t. my-id < n < key-id if n exists call Lookup(id) on node n // next hop else return my successor // done Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.

Lookups take O(log(N)) hops Lookup(K19) N80 Maybe note that fingers point to the first relevant node. N60

Joining: linked list insert 1. Lookup(36) K30 K38 N40

Join (2) N25 2. N36 sets its own successor pointer N36 K30 K38 N40

Join (3) N25 3. Copy keys 26..36 from N40 to N36 N36 K30 K30 K38 N40

Join (4) Update finger pointers in the background 4. Set N25’s successor pointer N36 K30 K30 K38 N40 Update finger pointers in the background Correct successors produce correct lookups

Failures might cause incorrect lookup No problem until lookup gets to a node which knows of no node < key. There’s a replica of K90 at N113, but we can’t find it. N80 doesn’t know correct successor, so incorrect lookup

Solution: successor lists Each node knows r immediate successors After failure, will know first live successor Correct successors guarantee correct lookups Guarantee is with some probability For many systems, talk about “leaf set” The leaf set is a set of nodes around the “root” node that can handle all of the data/queries that the root nodes might handle When node fails: Leaf set can handle queries for dead node Leaf set queried to retreat missing data Leaf set used to reconstruct new leaf set

Lookup with Leaf Set Assign IDs to nodes Map hash values to node with closest ID Leaf set is successors and predecessors All that’s needed for correctness Routing table matches successively longer prefixes Allows efficient lookups Lookup ID Source 0… 10… 110… 111… Response Each node is assigned an ID randomly IDs are arranged into a ring by numerical order, top ID wraps around to 0 Each node maintains two sets of neighbors, its leaf set and routing table Leaf set is immediate predecessor and successor Routing table neighbors resolve successively longer matching prefixes of node’s own ID Lookup queries are routed greedily to node with closest matching ID (numerically, not lexigraphically) Response is sent directly to querying node Lookup handled differently in other DHTs; see Gummadi et al.’s SIGCOMM paper for details

Is this a good paper? What were the authors’ goals? What about the evaluation/metrics? Did they convince you that this was a good system/approach? Were there any red-flags? What mistakes did they make? Does the system/approach meet the “Test of Time” challenge? How would you review this paper today?

Decentralized Object Location and Routing: (DOLR) The core of Tapestry Routes messages to endpoints Both Nodes and Objects Virtualizes resources objects are known by name, not location

Routing to Data, not endpoints Routing to Data, not endpoints! Decentralized Object Location and Routing GUID1 DOLR Here we have a person that wants to listen to this mp3, they submit a GUID to the infrastructure. And get back fragments that can be verified and used to reconstruct the block. GUID2 GUID1

DOLR Identifiers ID Space for both nodes and endpoints (objects) : 160-bit values with a globally defined radix (e.g. hexadecimal to give 40-digit IDs) Each node is randomly assigned a nodeID Each endpoint is assigned a Globally Unique IDentifier (GUID) from the same ID space Typically done using SHA-1 Applications can also have IDs (application specific), which are used to select an appropriate process on each node for delivery

DOLR API PublishObject(OG, Aid) UnpublishObject(OG, Aid) RouteToObject(OG, Aid) RouteToNode(N, Aid, Exact)

Node State Each node stores a neighbor map similar to Pastry Each level stores neighbors that match a prefix up to a certain position in the ID Invariant: If there is a hole in the routing table, there is no such node in the network For redundancy, backup neighbor links are stored Currently 2 Each node also stores backpointers that point to nodes that point to it Creates a routing mesh of neighbors

Routing Mesh

Routing Every ID is mapped to a root An ID’s root is either the node where nodeID = ID or the “closest” node to which that ID routes Uses prefix routing (like Pastry) Lookup for 42AD: 4*** => 42** => 42A* => 42AD If there is an empty neighbor entry, then use surrogate routing Route to the next highest (if no entry for 42**, try 43**)

Basic Tapestry Mesh Incremental Prefix-based Routing 4 2 3 1 NodeID 0xEF34 NodeID 0xEF34 0xEF31 0xEFBA 0x0921 0xE932 0xEF37 0xE324 0xEF97 0xEF32 0xFF37 0xE555 0xE530 0xEF44 0x0999 0x099F 0xE399 0xEF40

Object Publication A node sends a publish message towards the root of the object At each hop, nodes store pointers to the source node Data remains at source. Exploit locality without replication (such as in Pastry, Freenet) With replicas, the pointers are stored in sorted order of network latency Soft State – must periodically republish

Object Location Client sends message towards object’s root Each hop checks its list of pointers If there is a match, the message is forwarded directly to the object’s location Else, the message is routed towards the object’s root Because pointers are sorted by proximity, each object lookup is directed to the closest copy of the data

Use of Mesh for Object Location Getting Locality in the mesh. Objects belong to root sharing same ID

Node Insertions A insertion for new node N must accomplish the following: All nodes that have null entries for N need to be alerted of N’s presence Acknowledged muliticast from the “root” node of N’s ID to visit all nodes with the common prefix N may become the new root for some objects. Move those pointers during the muliticast N must build its routing table All nodes contacted during muliticast contact N and become its neighbor set Iterative nearest neighbor search based on neighbor set Nodes near N might want to use N in their routing tables as an optimization Also done during iterative search

Node Deletions Voluntary Involuntary Backpointer nodes are notified, which fix their routing tables and republish objects Involuntary Periodic heartbeats: detection of failed link initiates mesh repair (to clean up routing tables) Soft state publishing: object pointers go away if not republished (to clean up object pointers) Discussion Point: Node insertions/deletions + heartbeats + soft state republishing = network overhead. Is it acceptable? What are the tradeoffs?

Tapestry Architecture OceanStore, etc deliver(), forward(), route(), etc. Tier 0/1: Routing, Object Location Connection Mgmt TCP, UDP Prototype implemented using Java

Experimental Results (I) 3 environments Local cluster, PlanetLab, Simulator Micro-benchmarks on local cluster Message processing overhead Proportional to processor speed - Can utilize Moore’s Law Message throughput Optimal size is 4KB

Experimental Results (II) Routing/Object location tests Routing overhead (PlanetLab) About twice as long to route through overlay vs IP Object location/optimization (PlanetLab/Simulator) Object pointers significantly help routing to close objects Network Dynamics Node insertion overhead (PlanetLab) Sublinear latency to stabilization O(LogN) bandwidth consumption Node failures, joins, churn (PlanetLab/Simulator) Brief dip in lookup success rate followed by quick return to near 100% success rate Churn lookup rate near 100%

Object Location with Tapestry RDP (Relative Delay Penalty) Under 2 in the wide area More trouble in local area – (why?) Optimizations: More pointers (in neighbors, etc) Detect wide-area links and make sure that pointers on exit nodes to wide area

Stability under extreme circumstances (May 2003: 1.5 TB over 4 hours) DOLR Model generalizes to many simultaneous apps

Possibilities for DOLR? Original Tapestry Could be used to route to data or endpoints with locality (not routing to IP addresses) Self adapting to changes in underlying system Pastry Similarities to Tapestry, now in nth generation release Need to build locality layer for true DOLR Bamboo Similar to Pastry – very stable under churn Other peer-to-peer options Coral: nice stable system with course-grained locality Chord: very simple system with locality optimizations

Is this a good paper? What were the authors’ goals? What about the evaluation/metrics? Did they convince you that this was a good system/approach? Were there any red-flags? What mistakes did they make? Does the system/approach meet the “Test of Time” challenge? How would you review this paper today?

Final topic: Churn (Optional Bamboo paper) Chord is a “scalable protocol for lookup in a dynamic peer-to-peer system with frequent node arrivals and departures” -- Stoica et al., 2001 Authors Systems Observed Session Time SGG02 Gnutella, Napster 50% < 60 minutes CLL02 31% < 10 minutes SW02 FastTrack 50% < 1 minute BSV03 Overnet GDS03 Kazaa 50% < 2.4 minutes So we want to focus on lookup, but why churn? At the very least, we know that churn is prevalent in deployed P2P systems, and it’s pretty bad Many other DHT applications might see lighter churn, or may not work at all under churn, but why not shoot for the moon? After all, DHTs were once proposed as being able to help build better P2P systems, and can only assume this includes file-sharing systems, the one clearly successful P2P application

A Simple lookup Test Start up 1,000 DHT nodes on ModelNet network Emulates a 10,000-node, AS-level topology Unlike simulations, models cross traffic and packet loss Unlike PlanetLab, gives reproducible results Churn nodes at some rate Poisson arrival of new nodes Random node departs on every new arrival Exponentially distributed session times Each node does 1 lookup every 10 seconds Log results, process them after test

Early Test Results Tapestry had trouble under this level of stress Worked great in simulations, but not as well on more realistic network Despite sharing almost all code between the two! Problem was not limited to Tapestry consider Chord: Built OceanStore on Tapestry, worked well as long as there were no failures, but not so well otherwise Decided to generate a DHT benchmark as a metric of success Once started benchmarking with node failures and arrivals (called churn), nothing worked, not Tapestry, not Chord, and not Pastry! Started building Bamboo to handle churn At the same time, other DHTs were improving, too. See the paper for their current performance numbers. Rather that study benchmark results, decided to focus on why some implementations did well while others didn’t. Used Bamboo because we had it handy, and we understood the code To limit the scope of the study, decided to focus only on lookup functionality At any rate, correct, quick lookups are necessary to get other functionality to handle churn

Handling Churn in a DHT Forget about comparing different impls. Too many differing factors Hard to isolate effects of any one feature Implement all relevant features in one DHT Using Bamboo (similar to Pastry) Isolate important issues in handling churn Recovering from failures Routing around suspected failures Proximity neighbor selection

Reactive Recovery: The obvious technique For correctness, maintain leaf set during churn Also routing table, but not needed for correctness The Basics Ping new nodes before adding them Periodically ping neighbors Remove nodes that don’t respond Simple algorithm After every change in leaf set, send to all neighbors Called reactive recovery

The Problem With Reactive Recovery Under churn, many pings and change messages If bandwidth limited, interfere with each other Lots of dropped pings looks like a failure Respond to failure by sending more messages Probability of drop goes up We have a positive feedback cycle (squelch) Can break cycle two ways Limit probability of “false suspicions of failure” Recovery periodically

Periodic Recovery Periodically send whole leaf set to a random member Breaks feedback loop Converges in O(log N) Back off period on message loss Makes a negative feedback cycle (damping)

Conclusions/Recommendations Avoid positive feedback cycles in recovery Beware of “false suspicions of failure” Recover periodically rather than reactively Route around potential failures early Don’t wait to conclude definite failure TCP-style timeouts quickest for recursive routing Virtual-coordinate-based timeouts not prohibitive PNS can be cheap and effective Only need simple random sampling