Peter Druschel, Rice University Antony Rowstron,


Scalable peer-to-peer substrates: A new foundation for distributed applications? Peter Druschel, Rice University Antony Rowstron, Microsoft Research Cambridge, UK Collaborators: Miguel Castro, Anne-Marie Kermarrec, MSR Cambridge Y. Charlie Hu, Sitaram Iyer, Animesh Nandi, Atul Singh, Dan Wallach, Rice University

Outline Background Pastry Pastry proximity routing PAST SCRIBE Conclusions

Background
Peer-to-peer systems: distribution, decentralized control, self-organization, symmetry (communication, node roles)

Peer-to-peer applications
Pioneers: Napster, Gnutella, FreeNet
File sharing: CFS, PAST [SOSP’01]
Network storage: FarSite [Sigmetrics’00], OceanStore [ASPLOS’00], PAST [SOSP’01]
Web caching: Squirrel [PODC’02]
Event notification/multicast: Herald [HotOS’01], Bayeux [NOSSDAV’01], CAN-multicast [NGC’01], SCRIBE [NGC’01], SplitStream [submitted]
Anonymity: Crowds [CACM’99], Onion routing [JSAC’98]
Censorship-resistance: Tangler [CCS’02]

Common issues
Organize and maintain the overlay network (node arrivals, node failures)
Resource allocation/load balancing
Resource location
Network proximity routing
Idea: provide a generic p2p substrate

Architecture
[Figure: layered architecture, top to bottom]
P2p application layer: event notification, network storage, ?
P2p substrate (self-organizing overlay network): Pastry
Internet: TCP/IP

Structured p2p overlays One primitive: route(M, X): route message M to the live node with nodeId closest to key X nodeIds and keys are from a large, sparse id space
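
The route(M, X) primitive can be illustrated with a small sketch of what "closest" means on a circular id space. This is not Pastry's routing (it looks at every node at once); it only shows the target that routing is supposed to reach. The 128-bit id width and the names `circular_distance` and `closest_node` are assumptions made for the illustration.

```python
# Illustrative only: find the node that route(M, X) should end at,
# assuming a 128-bit circular id space and global knowledge of live nodes.
ID_BITS = 128
ID_SPACE = 1 << ID_BITS


def circular_distance(a: int, b: int) -> int:
    """Distance between two ids, allowing wrap-around past 2^128 - 1."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def closest_node(key: int, live_nodes: list[int]) -> int:
    """Return the live nodeId closest to key (ties broken by smaller id)."""
    return min(live_nodes, key=lambda n: (circular_distance(n, key), n))


if __name__ == "__main__":
    nodes = [10, 1 << 127, ID_SPACE - 5]
    print(closest_node(3, nodes))   # 10 is 7 away, ID_SPACE - 5 is 8 away
    print(closest_node(1, nodes))   # wrap-around makes ID_SPACE - 5 closer
```

A real overlay reaches the same node without global knowledge, using only per-node routing state.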

Distributed Hash Tables (DHT)
[Figure: nodes in a P2P overlay network, each holding key/value pairs (k1,v1 through k6,v6)]
Operations: insert(k,v), lookup(k)
The p2p overlay maps keys to nodes
Completely decentralized and self-organizing
Robust, scalable
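
A toy, in-process sketch of the insert(k,v) / lookup(k) operations may help make the key-to-node mapping concrete. Everything here is an assumption for illustration: the class name `ToyDHT`, hashing keys with SHA-1 truncated to 128 bits, and keeping all node stores in one dictionary instead of on separate machines.

```python
import hashlib

ID_BITS = 128
ID_SPACE = 1 << ID_BITS


def key_id(key: str) -> int:
    """Hash a string key into the id space (assumed: SHA-1, top 128 bits)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")


class ToyDHT:
    """All 'nodes' live in one process; each has its own key/value store."""

    def __init__(self, node_ids):
        self.stores = {n: {} for n in node_ids}

    def _responsible(self, key: str) -> int:
        kid = key_id(key)

        def dist(n: int) -> int:
            d = abs(n - kid)
            return min(d, ID_SPACE - d)

        # The node whose id is closest to the key's id owns the key.
        return min(self.stores, key=lambda n: (dist(n), n))

    def insert(self, key: str, value) -> None:
        self.stores[self._responsible(key)][key] = value

    def lookup(self, key: str):
        return self.stores[self._responsible(key)].get(key)


if __name__ == "__main__":
    dht = ToyDHT([key_id(f"node{i}") for i in range(8)])
    dht.insert("k1", "v1")
    print(dht.lookup("k1"))
```

Because both operations derive the responsible node from the key alone, any participant can find a value without a central directory.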

Why structured p2p overlays?
Leverage pooled resources (storage, bandwidth, CPU)
Leverage resource diversity (geographic, ownership)
Leverage existing shared infrastructure
Scalability
Robustness
Self-organization

Outline Background Pastry Pastry proximity routing PAST SCRIBE Conclusions

Pastry: Related work
Chord [Sigcomm’01], CAN [Sigcomm’01], Tapestry [TR UCB/CSD-01-1141], PNRP [unpub.], Viceroy [PODC’02], Kademlia [IPTPS’02], Small World [Kleinberg ’99, ’00], Plaxton Trees [Plaxton et al. ’97]

Pastry: Object distribution
Consistent hashing [Karger et al. ’97]
128-bit circular id space, from 0 to 2^128 - 1
nodeIds (uniform random)
objIds (uniform random)
Invariant: the node with the numerically closest nodeId maintains the object
Notes: Each node has a randomly assigned 128-bit nodeId in a circular namespace. Basic operation: a message with key X, sent by any Pastry node, is delivered to the live node with nodeId closest to X in at most log_16 N steps (barring node failures). Pastry uses a form of generalized hypercube routing, where the routing tables are initialized and updated dynamically.

Pastry: Object insertion/lookup
A message with key X is routed to the live node with nodeId closest to X
Problem: a complete routing table is not feasible
[Figure: Route(X) on the circular id space, 0 to 2^128 - 1]

Pastry: Routing
Tradeoff: O(log N) routing table size, O(log N) message forwarding steps

Pastry: Routing table (# 65a1fcx)
[Figure: routing table of node 65a1fcx, rows 0 to 3 shown; the table has log_16 N rows]
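
One way to read the routing table is that a remote node's slot is determined by the prefix it shares with the local nodeId. The sketch below assumes equal-length lowercase hexadecimal ids (base 16, matching the log_16 N rows) and uses hypothetical helper names `shared_prefix_len` and `table_slot`; the example ids are made up.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits that two ids have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def table_slot(local_id: str, other_id: str):
    """Return (row, column) where other_id would sit in local_id's table.

    Row = length of the shared prefix; column = value of other_id's first
    differing digit. Returns None for the local node itself.
    """
    if local_id == other_id:
        return None
    row = shared_prefix_len(local_id, other_id)
    col = int(other_id[row], 16)
    return row, col


if __name__ == "__main__":
    print(table_slot("65a1fc", "d13da3"))  # no shared prefix: row 0, column 0xd
    print(table_slot("65a1fc", "65a212"))  # shares "65a": row 3, column 2
```

Only one node per slot needs to be stored, which is why the table stays small even when N is large.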

Pastry: Routing
Properties: log_16 N steps, O(log N) state
[Figure: Route(d46a1c) in the nodeId space; nodes shown: 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1]

Pastry: Leaf sets
Each node maintains IP addresses of the nodes with the L/2 numerically closest larger and smaller nodeIds, respectively.
Uses: routing efficiency/robustness; fault detection (keep-alive); application-specific local coordination
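
The leaf set definition can be sketched as a selection over the sorted ring of nodeIds. The function name `leaf_set`, the use of small integers as ids, and the requirement that the overlay has more than L nodes are all assumptions made to keep the example short.

```python
def leaf_set(local_id: int, node_ids: list[int], L: int):
    """Return (smaller, larger): the L/2 nearest ids on each side of local_id.

    Ids are treated as points on a ring, so neighbours wrap around the ends.
    """
    ring = sorted(set(node_ids) | {local_id})
    n = len(ring)
    half = L // 2
    if n <= 2 * half:
        raise ValueError("sketch assumes more than L nodes in the overlay")
    i = ring.index(local_id)
    smaller = [ring[(i - j) % n] for j in range(1, half + 1)]
    larger = [ring[(i + j) % n] for j in range(1, half + 1)]
    return smaller, larger


if __name__ == "__main__":
    ids = [10, 20, 30, 40, 50, 60]
    print(leaf_set(10, ids, 4))  # smaller side wraps around to 60 and 50
```

In a running system each node would keep the IP addresses of these neighbours and probe them with keep-alive messages.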

Pastry: Routing procedure
if (destination D is within range of our leaf set)
    forward to numerically closest member
else
    let l = length of shared prefix
    let d = value of l-th digit in D’s address
    if (R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node that
        (a) shares at least as long a prefix
        (b) is numerically closer than this node
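
The routing procedure above can be sketched as a single next-hop decision. This is a simplified reading, not the authors' implementation: ids are equal-length lowercase hex strings, the routing table is a dictionary keyed by (row, digit), the leaf-set range check ignores wrap-around at the ends of the id space, and the function name `next_hop` and the example state are made up for illustration.

```python
def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(local_id, key, leaf_set, routing_table):
    """Return the node to forward to, or local_id if the message stays here."""

    def value(node_id):
        return int(node_id, 16)

    def dist(node_id):
        return abs(value(node_id) - value(key))

    # Case 1: key falls within the range covered by the leaf set.
    members = list(leaf_set) + [local_id]
    if min(map(value, members)) <= value(key) <= max(map(value, members)):
        return min(members, key=lambda n: (dist(n), n))

    # Case 2: use the routing table entry R[l][d] if it exists.
    l = shared_prefix_len(local_id, key)
    d = int(key[l], 16)
    entry = routing_table.get((l, d))
    if entry is not None:
        return entry

    # Case 3 (rare): any known node with an equally long prefix that is closer.
    known = set(leaf_set) | set(routing_table.values())
    candidates = [
        n for n in known
        if shared_prefix_len(n, key) >= l and dist(n) < dist(local_id)
    ]
    if candidates:
        return min(candidates, key=lambda n: (dist(n), n))
    return local_id


if __name__ == "__main__":
    print(next_hop("65a1fc", "d46a1c", ["65a212"], {(0, 13): "d13da3"}))
```

Each hop either lands in the leaf set or extends the shared prefix by at least one digit, which is where the logarithmic hop count comes from.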

Pastry: Performance
Integrity of overlay/message delivery: guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
Number of routing hops:
No failures: < log_16 N expected, 128/b + 1 max
During failure recovery: O(N) worst case, average case much better
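
A quick calculation shows what the hop bounds mean in practice. The slide does not define b; the sketch assumes b = 4, so that ids are read in base 2^b = 16, which matches the log_16 N figure. The function names are hypothetical.

```python
import math


def expected_hop_bound(n_nodes: int, b: int = 4) -> float:
    """log base 2^b of N: the expected-case bound on routing hops."""
    return math.log(n_nodes, 2 ** b)


def max_hops(id_bits: int = 128, b: int = 4) -> int:
    """Worst case without failures: one hop per digit, plus one."""
    return id_bits // b + 1


if __name__ == "__main__":
    for n in (5_000, 100_000):
        print(n, round(expected_hop_bound(n), 2))
    print(max_hops())
```

With these assumptions the bound is roughly 3.07 hops for 5,000 nodes and 4.15 for 100,000 nodes, against a no-failure maximum of 33.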

Pastry: Self-organization Initializing and maintaining routing tables and leaf sets Node addition Node departure (failure)

Pastry: Node addition
New node: d46a1c
[Figure: Route(d46a1c) in the nodeId space; nodes shown: 65a1fc, d13da3, d462ba, d467c4, d471f1]

Node departure (failure)
Leaf set members exchange keep-alive messages
Leaf set repair (eager): request set from farthest live node in set
Routing table repair (lazy): get table from peers in the same row, then higher rows

Pastry: Experimental results
Prototype implemented in Java
Run on an emulated network and on a deployed testbed (currently ~25 sites worldwide)

Pastry: Average # of hops L=16, 100k random queries

Pastry: # of hops (100k nodes) L=16, 100k random queries

Pastry: # routing hops (failures)
Average hops per lookup: 2.73 with no failure, 2.96 with failures, 2.74 after routing table repair
L=16, 100k random queries, 5k nodes, 500 failures

Outline Background Pastry Pastry proximity routing PAST SCRIBE Conclusions

Pastry: Proximity routing
Assumption: a scalar proximity metric (e.g. ping delay, # IP hops); a node can probe its distance to any other node
Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.
Locality-related route qualities: distance traveled; likelihood of locating the nearest replica
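
The proximity invariant amounts to a constrained choice: among the nodes eligible for a routing table slot, keep the one that is nearest by the proximity metric. The sketch below assumes lowercase hex ids, a dictionary of already-measured distances, and a hypothetical function name `pick_entry`; the ids and distances are made up.

```python
def pick_entry(local_id, row, col, candidates, proximity):
    """Choose the routing table entry for slot (row, col).

    Eligible nodes share the first `row` digits with local_id and have digit
    `col` at position `row`. Among them, the nearest by proximity wins.
    `proximity` maps nodeId -> measured distance (e.g. ping delay in ms).
    """
    prefix = local_id[:row] + format(col, "x")
    eligible = [c for c in candidates if c.startswith(prefix) and c != local_id]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (proximity[c], c))


if __name__ == "__main__":
    known = ["d13da3", "d4213f", "65a212"]
    delay_ms = {"d13da3": 40, "d4213f": 12, "65a212": 3}
    print(pick_entry("65a1fc", 0, 0xD, known, delay_ms))  # nearest "d..." node
```

Early rows have many eligible nodes, so the chosen entry can be very close; later rows have few, so hops tend to get longer as the prefix match grows.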

Pastry: Routes in proximity space
[Figure: Route(d46a1c) shown in both the proximity space and the nodeId space; nodes shown: 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1]

Pastry: Distance traveled L=16, 100k random queries, Euclidean proximity space

Pastry: Locality properties
1) Expected distance traveled by a message in the proximity space is within a small constant of the minimum
2) Routes of messages sent by nearby nodes with the same keys converge at a node near the source nodes
3) Among the k nodes with nodeIds closest to the key, a message is likely to reach the node closest to the source node first

Pastry: Node addition
New node: d46a1c
[Figure: Route(d46a1c) shown in both the proximity space and the nodeId space; nodes shown: 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1]

Pastry delay vs IP delay
GATech topology, 0.5M hosts, 60K nodes, 20K random messages

Pastry: API
route(M, X): route message M to the node with nodeId numerically closest to X
deliver(M): deliver message M to the application
forwarding(M, X): message M is being forwarded towards key X
newLeaf(L): report a change in leaf set L to the application
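
The API splits into one call the application makes (route) and three upcalls the substrate makes into the application. A minimal sketch of that shape follows; the class names `PastryApplication`, `SingleNodeStub` and `Recorder` are hypothetical, the Python method names are adapted from the slide, and the stub models an overlay with exactly one node, so every message is delivered locally and forwarding is never invoked.

```python
class PastryApplication:
    """Upcalls an application would implement (names adapted from the slide)."""

    def deliver(self, message):
        raise NotImplementedError

    def forwarding(self, message, key):
        pass  # optional hook: message is passing through on its way to key

    def new_leaf(self, leaf_set):
        pass  # optional hook: the local leaf set changed


class SingleNodeStub:
    """Stand-in for the substrate when the overlay has exactly one node."""

    def __init__(self, node_id, app):
        self.node_id = node_id
        self.app = app
        self.leaf_set = []

    def route(self, message, key):
        # With one node, this node is closest to every key: deliver locally.
        self.app.deliver(message)

    def set_leaf_set(self, leaf_set):
        self.leaf_set = list(leaf_set)
        self.app.new_leaf(self.leaf_set)


class Recorder(PastryApplication):
    """Example application that just records the upcalls it receives."""

    def __init__(self):
        self.events = []

    def deliver(self, message):
        self.events.append(("deliver", message))

    def new_leaf(self, leaf_set):
        self.events.append(("newLeaf", tuple(leaf_set)))


if __name__ == "__main__":
    app = Recorder()
    node = SingleNodeStub("65a1fc", app)
    node.route("hello", "d46a1c")
    node.set_leaf_set(["d13da3"])
    print(app.events)
```

Applications such as PAST and SCRIBE are built by putting their logic in these upcalls.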

Pastry: Security
Secure nodeId assignment
Secure node join protocols
Randomized routing
Byzantine fault-tolerant leaf set membership protocol

Pastry: Summary
Generic p2p overlay network
Scalable, fault resilient, self-organizing, secure
O(log N) routing steps (expected)
O(log N) routing table size
Network proximity routing

Outline Background Pastry Pastry proximity routing PAST SCRIBE Conclusions

PAST: Cooperative, archival file storage and distribution
Layered on top of Pastry
Strong persistence
High availability
Scalability
Reduced cost (no backup)
Efficient use of pooled resources

PAST API
Insert: store replicas of a file at k diverse storage nodes
Lookup: retrieve the file from a nearby live storage node that holds a copy
Reclaim: free the storage associated with a file
Files are immutable

PAST: File storage
[Figure: Insert fileId]
Notes: PAST file storage is mapped onto the Pastry overlay network by maintaining the invariant that replicas of a file are stored on the k nodes that are numerically closest to the file’s numeric fileId. During an insert operation, an insert request for the file is routed using the fileId as the key. The node closest to fileId replicates the file on the k-1 next nearest nodes in the namespace.

PAST: File storage
Storage invariant: file “replicas” are stored on the k nodes with nodeIds closest to fileId (k is bounded by the leaf set size)
[Figure: Insert fileId with k=4]
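
The storage invariant can be sketched as picking the k nodeIds nearest to the fileId on the ring. As before, this uses global knowledge for clarity, whereas a real node would consult its leaf set; the 128-bit id space, the function name `replica_nodes`, and the small integer ids are assumptions.

```python
ID_SPACE = 1 << 128  # assumed 128-bit circular id space


def replica_nodes(file_id: int, node_ids: list[int], k: int) -> list[int]:
    """Return the k nodeIds numerically closest to file_id, nearest first."""

    def dist(n: int) -> int:
        d = abs(n - file_id) % ID_SPACE
        return min(d, ID_SPACE - d)

    return sorted(node_ids, key=lambda n: (dist(n), n))[:k]


if __name__ == "__main__":
    nodes = [100, 200, 300, 400, 500]
    print(replica_nodes(290, nodes, 4))  # [300, 200, 400, 100]
```

The first element is the node that receives the insert request; it is the one responsible for placing the remaining k-1 replicas.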

PAST: File Retrieval
[Figure: client C issues Lookup fileId; k replicas]
File located in log_16 N steps (expected)
Usually locates the replica nearest to client C
Notes: The last point is shown pictorially here. A lookup request is routed in at most log_16 N steps to a node that stores a replica, if one exists. In practice, the node among the k that first receives the message serves the file. Furthermore, network locality properties of Pastry (not discussed in this talk) ensure that this node is usually the one closest to the client in the network.

PAST: Exploiting Pastry
Random, uniformly distributed nodeIds: replicas stored on diverse nodes
Uniformly distributed fileIds, e.g. SHA-1(filename, public key, salt): approximate load balance
Pastry routes to closest live nodeId: availability, fault-tolerance
Notes: A number of interesting properties emerge. Since nodeId assignment is random, neighboring nodes in the namespace are diverse in location, ownership, jurisdiction, and network attachment, and are thus excellent candidates for storing replicas of a file. fileIds are pseudo-randomly assigned and, like nodeIds, uniformly distributed in the namespace. Thus the number of files assigned to each node is roughly balanced. Pastry routes requests to the live node with the closest nodeId. Thus, a file is available unless all k nodes die simultaneously.
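
The fileId derivation SHA-1(filename, public key, salt) can be sketched with Python's hashlib. The slide does not say how the three fields are combined or how a 160-bit digest is used as a key in a 128-bit id space, so the length-prefixed concatenation and the "take the top 128 bits" step below are assumptions, as are the function names.

```python
import hashlib


def file_id(filename: str, public_key: bytes, salt: bytes) -> int:
    """160-bit fileId from SHA-1 over the three fields.

    Assumption: each field is length-prefixed so that different splits of
    the same bytes cannot collide.
    """
    h = hashlib.sha1()
    for field in (filename.encode("utf-8"), public_key, salt):
        h.update(len(field).to_bytes(4, "big"))
        h.update(field)
    return int.from_bytes(h.digest(), "big")


def routing_key(fid: int) -> int:
    """Assumed mapping to a 128-bit key: keep the 128 most significant bits."""
    return fid >> 32


if __name__ == "__main__":
    fid = file_id("report.pdf", b"example-public-key", b"salt-1")
    print(hex(fid))
    print(hex(routing_key(fid)))
```

Changing only the salt yields an unrelated fileId, so the same file can be steered to a different set of storage nodes.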

PAST: Storage management
Maintain the storage invariant
Balance free space when global utilization is high (sources of imbalance: statistical variation in the assignment of files to nodes (fileId/nodeId), file size variations, node storage capacity variations)
Local coordination only (leaf sets)

Experimental setup
Web proxy traces from NLANR: 18.7 Gbytes, 10.5K mean, 1.4K median, 0 min, 138MB max
Filesystem: 166.6 Gbytes, 88K mean, 4.5K median, 0 min, 2.7 GB max
2250 PAST nodes (k = 5)
Truncated normal distributions of node storage sizes, mean = 27/270 MB
Notes: Using appropriate workloads to evaluate systems like PAST is difficult, because few such systems exist and workloads are difficult to capture. We used two traces; the filesystem trace is described in the paper. We chose to use two existing workloads with different characteristics, to probe the space of workload characteristics that a system like PAST might encounter in practice. In particular...

Need for storage management
No diversion (t_pri = 1, t_div = 0): max utilization 60.8%, 51.1% of inserts failed
Replica/file diversion (t_pri = 0.1, t_div = 0.05): max utilization > 98%, < 1% of inserts failed

PAST: File insertion failures Leave this out if running out of time

PAST: Caching
Nodes cache files in the unused portion of their allocated disk space
Files are cached on nodes along the route of lookup and insert messages
Goals: maximize query throughput for popular documents; balance query load; improve client latency

PAST: Caching
[Figure labels: fileId, Lookup topicId]

PAST: Caching

PAST: Security
No read access control; users may encrypt content for privacy
File authenticity: file certificates
System integrity: nodeIds and fileIds are non-forgeable, sensitive messages are signed
Routing is randomized

PAST: Storage quotas
Balance storage supply and demand
User holds a smartcard issued by brokers (hides the user private key and usage quota; debits the quota upon issuing a file certificate)
Storage nodes hold smartcards (advertise supply quota; storage nodes are subject to random audits within leaf sets)

PAST: Related Work
CFS [SOSP’01], OceanStore [ASPLOS 2000], FarSite [Sigmetrics 2000]

Outline Background Pastry Pastry locality properties PAST SCRIBE Conclusions

SCRIBE: Large-scale, decentralized multicast
Infrastructure to support topic-based publish-subscribe applications
Scalable: large numbers of topics and subscribers, wide range of subscribers/topic
Efficient: low delay, low link stress, low node overhead

SCRIBE: Large scale multicast
[Figure labels: topicId, Publish topicId, Subscribe topicId]

Scribe: Results
Simulation results
Comparison with IP multicast: delay, node stress and link stress
Experimental setup: Georgia Tech Transit-Stub model; 100,000 nodes randomly selected out of 0.5M; Zipf-like subscription distribution, 1500 topics

Scribe: Topic popularity
Subscribers to the topic were selected randomly with a uniform probability
gsize(r) = floor(N * r^(-1.25) + 0.5); N = 100,000; 1500 topics
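
The group-size formula is easy to evaluate directly, which shows how skewed the Zipf-like distribution is. The sketch assumes r runs from 1 to 1500 as the topic's popularity rank (the slide does not define r explicitly); the function name mirrors the slide's gsize.

```python
import math

N = 100_000      # number of nodes, from the slide
TOPICS = 1_500   # number of topics, from the slide


def gsize(r: int, n: int = N) -> int:
    """Group size for the topic of rank r: floor(N * r^(-1.25) + 0.5)."""
    return math.floor(n * r ** -1.25 + 0.5)


if __name__ == "__main__":
    for rank in (1, 2, 10, 100, TOPICS):
        print(rank, gsize(rank))
```

Under this reading the most popular topic has every node subscribed (100,000 members), while the least popular of the 1500 topics has only 11.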

Scribe: Delay penalty
Relative delay penalty, average and maximum
Notes: We measured the entire distribution of delay for all topics together. What is plotted here is the cumulative distribution of the ratio of average delay

Scribe: Node stress

Scribe: Link stress
One message published in each of the 1,500 topics
Notes: This plots the link stress of both Scribe and IP. The maximum for IP is obviously 1500 and is represented by the blue line. We can see that

Related works
Narada, Bayeux/Tapestry, CAN-Multicast

Summary
Self-configuring P2P framework for topic-based publish-subscribe
Scribe achieves reasonable performance when compared to IP multicast
Scales to a large number of subscribers
Scales to a large number of topics
Good distribution of load

Status
Functional prototypes: Pastry [Middleware 2001], PAST [HotOS-VIII, SOSP’01], SCRIBE [NGC 2001, IEEE JSAC], SplitStream [submitted], Squirrel [PODC’02]
http://www.cs.rice.edu/CS/Systems/Pastry

Current Work
Security (secure routing/overlay maintenance/nodeId assignment, quota system)
Keyword search capabilities
Support for mutable files in PAST
Anonymity/Anti-censorship
New applications
Free software releases

Conclusion
For more information: http://www.cs.rice.edu/CS/Systems/Pastry