Antony Rowstron, Microsoft Research Cambridge, UK

Scalable peer-to-peer substrates: A new foundation for distributed applications? Antony Rowstron, Microsoft Research Cambridge, UK Peter Druschel, Rice University Collaborators: Miguel Castro, Anne-Marie Kermarrec, MSR Cambridge Y. Charlie Hu, Sitaram Iyer, Dan Wallach, Rice University

Outline Background Squirrel Pastry Pastry locality properties SCRIBE Conclusions

Background: Peer-to-peer Systems distributed decentralized control self-organizing symmetric communication/roles

Background Peer-to-peer applications Pioneers: Napster, Gnutella, FreeNet File sharing: CFS, PAST [SOSP’01] Network storage: FarSite [Sigmetrics’00], OceanStore [ASPLOS’00], PAST [SOSP’01] Multicast: Herald [HotOS’01], Bayeux [NOSSDAV’01], CAN-multicast [NGC’01], SCRIBE [NGC’01]

Common issues Organize, maintain overlay network node arrivals node failures Resource allocation/load balancing Resource location Locality (network proximity) Idea: generic P2P substrate

Architecture
P2P application layer: event notification, network storage, ?
P2P substrate (self-organizing overlay network): DHT
TCP/IP: Internet

DHTs Peer-to-peer object location and routing substrate Distributed Hash Table: maps object key to a live node Insert(key,object) Lookup(key) Key typically 128 bits Pastry (developed at MSR Cambridge/Rice) is an example of such an infrastructure. If we want to go about constructing a p2p cache, the easiest way, I believe, is to leverage a p2p routing protocol such as Pastry. The fancy name is a p2p object location and routing substrate; what it really gives you is distributed hash table functionality.
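The Insert/Lookup abstraction can be illustrated with a toy, single-process sketch. This is not Pastry itself: the class name ToyDHT, the method names, and the linear (non-circular) notion of "closest" are assumptions made for illustration, and there is no routing or replication here.

```python
class ToyDHT:
    """Single-process stand-in for a DHT (hypothetical, for illustration).

    Each key is stored on the live node whose id is numerically closest
    to the key. Wraparound of the id space is ignored for brevity.
    """

    def __init__(self, node_ids):
        # One in-memory store per live node.
        self.stores = {nid: {} for nid in node_ids}

    def home(self, key):
        """Return the id of the live node responsible for key."""
        return min(self.stores, key=lambda nid: abs(nid - key))

    def insert(self, key, obj):
        self.stores[self.home(key)][key] = obj

    def lookup(self, key):
        return self.stores[self.home(key)].get(key)

    def fail(self, nid):
        """Simulate a node failure; its objects are lost (no replication)."""
        del self.stores[nid]
```

After a node fails, the same key simply maps to the next closest live node, which is the "maps object key to a live node" property; keeping the object available across failures needs replication on top.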

DHT Related work Chord (MIT/UCB) [Sigcomm’01] CAN (ACIRI/UCB) [Sigcomm’01] Tapestry (UCB) [TR UCB/CSD-01-1141] PNRP (Microsoft) [Huitema et al, unpub.] [Kleinberg ’99] [Plaxton et al. ’97]

Outline Background Squirrel Pastry Pastry locality properties SCRIBE Conclusions

Squirrel: Web Caching Reduce latency, Reduce external bandwidth Reduce web server load. ISPs, Corporate network boundaries, etc. Cooperative Web Caching: group of web caches working together and acting as one web cache. Object = html page, image, etc. Focus on corporate network boundary caches.

Web Cache [Diagram: browsers with their browser caches on a LAN share a centralized web cache, which connects over the Internet to the web server. Label: Sharing!]

Decentralized Web Cache [Diagram: browsers with their browser caches on a LAN, connected over the Internet to the web server.] Why? How? The goal of this talk is to get rid of this box (the centralized web cache). When a node doesn’t have an object in its local cache, it “somehow” discovers that a different node has it, and gets it from there, much like a cooperative cache. So these nodes export their local caches to other nodes in the network, and these combine to form one large virtual cache.

Why peer-to-peer?
Cost of dedicated web cache -> No additional hardware
Administrative costs -> Self-organizing
Scaling needs upgrading -> Resources grow with clients
Single point of failure -> Fault-tolerant by design
Dedicated hardware: why? In some large organizations in the north-west, they use clusters of as many as 30 machines, overprovisioned for peak loads. Administrative costs: both in terms of configuring those machines and, for cooperative web caches, setting up the hierarchy or mesh or whatever. It is much nicer to have it self-configuring; typically there is some software dissemination mechanism, so installing is a few clicks, and done. Scaling: there is constant pressure to upgrade the cluster. With p2p, number of “clients” = number of “servers”, so it is potentially self-scaling. Single point of failure: many people don’t use web caches because they fail and their connections go bang. In this model, so what if some nodes die?! Again, these are only potential benefits. They do not automatically happen.

Setting Corporate LAN 100 - 100,000 desktop machines Single physical location Each node runs an instance of Squirrel Sets it as the browser’s proxy

Approaches Home-store model Directory model Both approaches require key generation: Hash(URL) Collision resistant (e.g. SHA1) Hash(http://www.research.microsoft.com/~antr) -> 4ff367a14b374e3dd99f
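Key generation can be sketched in a few lines. The function name object_key and the choice to keep the top 128 bits of the 160-bit SHA-1 digest are assumptions for illustration; the slide only states that the key is a collision-resistant hash of the URL.

```python
import hashlib


def object_key(url: str, bits: int = 128) -> int:
    """Map a URL to an integer key in [0, 2**bits).

    SHA-1 yields 160 bits; truncating to the most significant `bits`
    is an assumption of this sketch, not something the slide specifies.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (160 - bits)


key = object_key("http://www.research.microsoft.com/~antr")
print(f"{key:032x}")  # 32 hex digits = 128 bits
```

Because the hash is deterministic, every node computes the same key for the same URL, so all requests for one object are routed to the same home node.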

Home-store model [Diagram: client, home node, URL hash, LAN, Internet] Now that we have a routing protocol, I’ll propose two schemes for mapping Squirrel onto it.

Home-store model [Diagram: client and home node] Very simple. …that’s how it works!

Directory model Client nodes always store objects in local caches. Main difference between the two schemes: whether the home node also stores the object. In the directory model, it only stores pointers to recent clients, and forwards requests to them. Objects are always stored at clients; in the home-store model, they are stored at home nodes too. Why? Suppose the home node does not store the object, but merely maintains a pointer to nodes that recently accessed the object, and forwards subsequent requests to these nodes.

Directory model [Diagram: client, home node, LAN, Internet] Home says “go get it yourself”. Meanwhile, it creates a directory and stores a pointer to this client. By pointer I mean, of course, the IP address.

Directory model [Diagram: client, home node, and delegate; the home picks a random directory entry] (explain protocol). What’s the motivation? One is that we don’t store objects at the home. But more interestingly, we expect that for an object accessed by many nodes, the directory of most recently accessing nodes keeps rapidly changing. So a randomly chosen entry will achieve some kind of load balancing. Whether this happens, we’ll see.
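The home node's role in the directory model can be sketched as follows. The class name DirectoryHome, the directory size limit of 4, and the return convention (None meaning "go get it yourself") are assumptions for illustration; conditional GETs and failed delegates are left out.

```python
import random


class DirectoryHome:
    """Home-node logic for the directory model (illustrative sketch)."""

    MAX_ENTRIES = 4  # assumed directory size, not taken from the slides

    def __init__(self, rng=None):
        self.directory = {}  # key -> addresses of recent clients
        self.rng = rng or random.Random()

    def handle_request(self, key, client):
        """Return a delegate to forward to, or None if there is none.

        None corresponds to the home saying "go get it yourself".
        """
        entries = self.directory.setdefault(key, [])
        candidates = [e for e in entries if e != client]
        delegate = self.rng.choice(candidates) if candidates else None
        # Record the requester as the most recent holder of the object.
        if client in entries:
            entries.remove(client)
        entries.append(client)
        del entries[:-self.MAX_ENTRIES]  # keep only the newest entries
        return delegate
```

Choosing a random entry, rather than always the most recent one, is what is hoped to spread load across the clients that hold a popular object.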

(skip) Full directory protocol [Diagram: message flows among client, home, delegate, and origin server, covering conditional GET (cGET) requests, not-modified replies, object transfers, and the “no dir, go to origin” case.] In reality, there is the pesky issue that sometimes nodes store expired versions of objects and issue “conditional GET” requests to validate them. This complicates the protocol a little, but fortunately I shall not get into the details. It’s not a big deal, just a few more cases to consider.

Recap Two endpoints of design space, based on the choice of storage location. At first sight, both seem to do about as well. (e.g. hit ratio, latency). Appreciate that these two designs are the two endpoints of the design space, based on what choice you make regarding object storage location. If an object is in the centralized cache, it should be on some client node in both p2p caching schemes; so we expect, roughly, that hit ratio is the same. User-perceived latency is overshadowed by accesses outside the network, so if the hops within the LAN aren’t too many, then all methods are roughly similar.

Quirk Consider a Web page with many images, or a heavily browsing node. In the Directory scheme: many home nodes pointing to one delegate. Home-store: natural load balancing. There’s a strange quirk in the directory scheme. When some node later accesses this html page with many images, all these requests pounce on a single client, and its load shoots up. Does this matter? We don’t know; simulations will tell. .. evaluation on trace-based workloads ..

Trace characteristics
                              Redmond          Cambridge
Total duration                1 day            31 days
Number of clients             36,782           105
Number of HTTP requests       16.41 million    0.971 million
Peak request rate             606 req/sec      186 req/sec
Number of objects             5.13 million     0.469 million
Number of cacheable objects   2.56 million     0.226 million
Mean cacheable object reuse   5.4 times        3.22 times
Two traces with very different characteristics, chosen in the hope of bringing out the fundamental properties of the protocols. All clients act as Squirrel nodes. Wide variation in number of requests. Peak request rate is high, so we need a cluster of machines (or a powerful machine in the latter case) for a centralized cache. About half the objects are cacheable; others are sent out by the client directly. There is some reuse of cacheable objects; otherwise web caching would be pointless.

Total external bandwidth: Redmond [Plot: total external bandwidth in GB (85 to 105, lower is better) versus per-node cache size in MB (log scale, 0.001 to 10); curves for Directory, Home-store, No web cache, and Centralized cache.] First metric, external bandwidth. Two blue lines depict external bandwidth with a centralized cache of infinite size and with no cache; this difference is the benefit of web caching in terms of external bandwidth. The x-axis is log scale; even for small values of per-node cache, performance is close to centralized. This shows that there is good pooling of disk space from around the network. Home-store performs better than directory. This is because some node in the latter stores a large page with many images, or many web pages; when it fills up and evicts something, the subsequent clients need to go out. Home-store’s natural load balancing helps avoid hotspots of storage. This is interesting, because although home-store stores more, its storage utilization is significantly better than directory.

Total external bandwidth: Cambridge [Plot: total external bandwidth in GB (5.5 to 6.1, lower is better) versus per-node cache size in MB (log scale, 0.001 to 100); curves for Directory, Home-store, No web cache, and Centralized cache.] Similar behaviour, only scaled down in external bandwidth.

LAN Hops: Redmond [Plot: fraction of cacheable requests (0% to 100%) versus total hops within the LAN (1 to 6), for Centralized, Home-store, and Directory.] Latency is overshadowed by external accesses if LAN hops are few (not like 50). Centralized: to and fro. Home-store: 3 to 4 hops + 1. Directory: +1 sometimes when it forwards; when the home node says “go get it yourself” there is no forwarding. On very rare occasions the delegate has failed, which costs two more hops.

LAN Hops: Cambridge [Plot: fraction of cacheable requests (0% to 100%) versus total hops within the LAN (1 to 5), for Centralized, Home-store, and Directory.] Same thing, only shifted left: Home-store 1 to 2 hops + 1.

Load in requests per sec: Redmond [Plot: number of such seconds (log scale, 1 to 100000) versus max objects served per node per second (up to 50), for Home-store and Directory.] Finally, load. This is a problem that isn’t there with a centralized cache: we need to keep both bursty and sustained load low. Consider this point: it means that there are a hundred occasions during the entire trace when some node in the network services as many as 23 requests in some second. So a narrow, left-stacked set of bars is a good thing. Here there is one occasion during the day when some node services 50 requests/sec, which is a tad on the high side. Home-store’s natural load balancing keeps load always as low as 10 req/sec.

Load in requests per sec: Cambridge [Plot: number of such seconds (log scale, 1 to 1e+06) versus max objects served per node per second (up to 50), for Home-store and Directory.] The fact that both peak loads are almost the same numbers as in the Redmond trace speaks for the scalability of the system.

Load in requests per min: Redmond [Plot: number of such minutes (log scale, 1 to 100) versus max objects served per node per minute (up to 350), for Home-store and Directory.] We talked of sustained load; it can be illustrated by measuring per-minute load. 370 requests per minute is rather high, compared to 60 per minute. Recall we talked of two reasons for the directory protocol quirk: a page with many images, and a previously heavily browsing client. The former can result in a few-seconds-long burst of requests, but can hardly be expected to sustain 370 requests for a minute. So the latter reason is important too.

Load in requests per min: Cambridge [Plot: number of such minutes (log scale, 1 to 10000) versus max objects served per node per minute (up to 120), for Home-store and Directory.] Similar.

Outline Background Squirrel Pastry Pastry locality properties SCRIBE Conclusions

Pastry Generic p2p location and routing substrate (DHT) Self-organizing overlay network Consistent hashing Lookup/insert object in < log16 N routing steps (expected) O(log N) per-node state Network locality heuristics

Pastry: Object distribution [Diagram: circular id space from 0 to 2^128 - 1, with nodeIds marked on the ring] Consistent hashing [Karger et al. ‘97] 128 bit circular id space nodeIds (uniform random) objIds/keys (uniform random) Invariant: node with numerically closest nodeId maintains object. Each node has a randomly assigned 128-bit nodeId in a circular namespace. Basic operation: a message with key X, sent by any Pastry node, is delivered to the live node with nodeId closest to X in fewer than log16 N steps on average (barring node failures). Pastry uses a form of generalized hypercube routing, where the routing tables are initialized and updated dynamically.

Pastry: Object insertion/lookup [Diagram: Route(X) on the circular id space from 0 to 2^128 - 1] Msg with key X is routed to live node with nodeId closest to X Problem: complete routing table not feasible

Pastry: Routing Tradeoff O(log N) routing table size O(log N) message forwarding steps

Pastry: Routing table (# 65a1fcx) [Figure: routing table with rows 0 to 3 shown; log16 N rows in total]

Pastry: Routing [Diagram: Route(d46a1c) on the id ring; labeled nodes 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1] Properties: log16 N steps, O(log N) state

Pastry: Leaf sets Each node maintains IP addresses of the nodes with the L numerically closest larger and smaller nodeIds, respectively. routing efficiency/robustness fault detection (keep-alive) application-specific local coordination

Pastry: Routing procedure
if (destination D is within range of our leaf set)
    forward to numerically closest member
else
    let l = length of shared prefix
    let d = value of l-th digit in D’s address
    if (R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node that
        (a) shares at least as long a prefix
        (b) is numerically closer than this node
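The routing procedure can be sketched as a single next-hop decision. This is a simplified illustration under stated assumptions, not the authors' implementation: nodeIds are equal-length hex strings as in the slide examples, the routing table is a hypothetical dict keyed by (row, digit), and wraparound of the circular id space is ignored.

```python
def shared_prefix_len(a, b):
    """Number of leading hex digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def distance(a, b):
    """Numeric distance between two ids (wraparound ignored for brevity)."""
    return abs(int(a, 16) - int(b, 16))


def next_hop(self_id, key, leaf_set, routing_table):
    """Return the id to forward to; returning self_id means deliver here."""
    members = sorted(set(leaf_set) | {self_id}, key=lambda m: int(m, 16))
    # Case 1: the key falls within the span covered by the leaf set.
    if int(members[0], 16) <= int(key, 16) <= int(members[-1], 16):
        return min(members, key=lambda m: distance(m, key))
    # Case 2: use the routing table entry for row l, digit d.
    l = shared_prefix_len(self_id, key)
    entry = routing_table.get((l, key[l]))
    if entry is not None:
        return entry
    # Case 3 (rare): any known node that shares at least as long a
    # prefix with the key and is numerically closer than this node.
    known = set(leaf_set) | set(routing_table.values())
    closer = [n for n in known
              if shared_prefix_len(n, key) >= l
              and distance(n, key) < distance(self_id, key)]
    return min(closer, key=lambda n: distance(n, key)) if closer else self_id
```

For example, next_hop("65a1fc", "d46a1c", ["65a0ff", "65a2aa"], {(0, "d"): "d13da3"}) takes the routing-table branch and returns "d13da3", a node that shares one more digit with the key than the local node does.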

Pastry: Routing Integrity of overlay: guaranteed unless L/2 simultaneous failures of nodes with adjacent nodeIds Number of routing hops: No failures: < log16 N expected, 128/4 + 1 max During failure recovery: O(N) worst case, average case much better
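The hop counts quoted above follow from simple arithmetic, sketched here. The function names are hypothetical; b = 4 corresponds to the base-16 digits used in these slides.

```python
import math


def expected_hop_bound(n_nodes, b=4):
    """log base 2^b of N: the expected-case bound on routing hops."""
    return math.log(n_nodes, 2 ** b)


def max_hops_no_failures(id_bits=128, b=4):
    """One hop per id digit, plus one final hop: 128/4 + 1 with defaults."""
    return id_bits // b + 1


for n in (5_000, 100_000):
    print(n, round(expected_hop_bound(n), 2))
print(max_hops_no_failures())
```

With 128-bit ids and 4-bit digits there are 32 digits to correct, which gives the 128/4 + 1 = 33 maximum, while the expected bound grows only logarithmically with the number of nodes.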

Demonstration VisPastry

Pastry: Self-organization Initializing and maintaining routing tables and leaf sets Node addition Node departure (failure)

Pastry: Node addition [Diagram: new node d46a1c joins via Route(d46a1c) on the id ring; labeled nodes 65a1fc, d13da3, d462ba, d467c4, d471f1]

Node departure (failure) Leaf set members exchange keep-alive messages Leaf set repair (eager): request set from farthest live node in set Routing table repair (lazy): get table from peers in the same row, then higher rows

Pastry: Experimental results Prototype implemented in Java emulated network

Pastry: Average # of hops |L|=16, 100k random queries

Pastry: # of hops (100k nodes) |L|=16, 100k random queries

Pastry: # routing hops (failures)
Average hops per lookup: No failure 2.73; Failure 2.96; After routing table repair 2.74
|L|=16, 100k random queries, 5k nodes, 500 failures

Outline Background Squirrel Pastry Pastry locality properties SCRIBE Conclusions

Pastry: Locality properties Assumption: scalar proximity metric e.g. ping/RTT delay, # IP hops a node can probe distance to any other node Proximity invariant: Each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.
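The proximity invariant amounts to a selection rule when filling a routing table slot, sketched below. The function name pick_entry, the string-prefix representation, and the caller-supplied distance function are assumptions for illustration; how candidates are discovered is not covered.

```python
def pick_entry(local_id, candidates, prefix, distance):
    """Choose a routing table entry for the slot identified by `prefix`.

    Among candidate nodeIds that start with `prefix`, return the one
    closest to the local node under the scalar proximity metric
    `distance(local_id, other_id)` (for example a measured RTT).
    Returns None if no candidate has the required prefix.
    """
    eligible = [c for c in candidates
                if c.startswith(prefix) and c != local_id]
    return min(eligible, key=lambda c: distance(local_id, c), default=None)


# Hypothetical RTT measurements in milliseconds from node 65a1fc.
rtt_ms = {"d13da3": 40, "d4213f": 12, "d462ba": 25, "65b000": 3}
print(pick_entry("65a1fc", rtt_ms, "d4", lambda a, b: rtt_ms[b]))
```

Note that the nearest node overall (65b000 in this made-up data) is not eligible for the "d4" slot; proximity is only compared among nodes with the appropriate nodeId prefix.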

Pastry: Routes in proximity space [Diagram: Route(d46a1c) shown in both the proximity space and the nodeId space; labeled nodes 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1]

Pastry: Distance traveled |L|=16, 100k random queries

Pastry: Locality properties 1) Expected distance traveled by a message in the proximity space is within a small constant of the minimum 2) Routes of messages sent by nearby nodes with same keys converge at a node near the source nodes 3) Among k nodes with nodeIds closest to the key, message likely to reach the node closest to the source node first

Demonstration VisPastry

Pastry: Node addition [Diagram: new node d46a1c joins via Route(d46a1c), shown in both the proximity space and the nodeId space; labeled nodes 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1]

Pastry: API
route(M, X): route message M to node with nodeId numerically closest to X
deliver(M): deliver message M to application (callback)
forwarding(M, X): message M is being forwarded towards key X (callback)
newLeaf(L): report change in leaf set L to application (callback)
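The shape of this API, one downcall and three upcalls, can be sketched as Python classes. The class names and the one-node SingleNodePastry stand-in are hypothetical; the actual prototype was written in Java and its signatures are not shown in these slides.

```python
from abc import ABC, abstractmethod


class PastryApplication(ABC):
    """Callbacks a Pastry node invokes on the application above it."""

    @abstractmethod
    def deliver(self, msg):
        """Called on the node whose nodeId is numerically closest to the key."""

    def forwarding(self, msg, key):
        """Called on each intermediate node as msg is forwarded towards key."""

    def new_leaf(self, leaf_set):
        """Called when the local leaf set changes."""


class SingleNodePastry:
    """Hypothetical one-node overlay: the local node is always closest."""

    def __init__(self, app):
        self.app = app

    def route(self, msg, key):
        # With a single node there are no intermediate hops.
        self.app.deliver(msg)


class RecordingApp(PastryApplication):
    """Minimal application that records delivered messages."""

    def __init__(self):
        self.received = []

    def deliver(self, msg):
        self.received.append(msg)
```

Applications such as Squirrel, SCRIBE, and PAST sit on the callback side: they call route with a key and react in deliver (and optionally forwarding) on whichever node the overlay selects.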

Pastry: Security Secure nodeId assignment Randomized routing Byzantine fault-tolerant leaf set membership protocol

Pastry: Summary Generic p2p overlay network Scalable, fault resilient, self-organizing, secure O(log N) routing steps (expected) O(log N) routing table size Network locality properties

Outline Background Squirrel Pastry Pastry locality properties SCRIBE Conclusions

SCRIBE: Large-scale, decentralized event notification Infrastructure to support topic-based publish-subscribe applications Scalable: large numbers of topics, subscribers, wide range of subscribers/topic Efficient: low delay, low link stress, low node overhead

SCRIBE: Large scale event notification [Diagram: Subscribe topicId and Publish topicId messages routed toward topicId]

Scribe: Results Simulation results Comparison with IP multicast: delay, node stress and link stress Experimental setup: Georgia Tech Transit-Stub model; 60,000 nodes randomly selected out of 500,000; Zipf-like subscription distribution; 1500 topics

Scribe: Topic distribution [Plot: number of subscribers versus topic rank; example topics marked: Windows Update, Stock Alert, Instant Messaging] Subscribers to the topic were selected randomly with a uniform probability.

Scribe: Delay penalty We measured the entire distribution of delay over all topics together. What is plotted here is the cumulative distribution of the ratio of average delay

Scribe: Node stress

Scribe: Link stress This plots the link stress of both Scribe and IP multicast. The maximum for IP is obviously 1500 and is represented by the blue line. We can see that

Related works Narada Bayeux/Tapestry Multicast/CAN

Summary Self-configuring P2P framework for topic-based publish-subscribe Scribe achieves reasonable performance relative to IP multicast Scales to a large number of subscribers Scales to a large number of topics Good distribution of load

Outline Background Squirrel Pastry Pastry locality properties SCRIBE Conclusions

PAST: Cooperative, archival file storage and distribution Layered on top of Pastry Strong persistence High availability Scalability Reduced cost (no backup) Efficient use of pooled resources

PAST API Insert - store replica of a file at k diverse storage nodes Lookup - retrieve file from a nearby live storage node that holds a copy Reclaim - free storage associated with a file Files are immutable

PAST: File storage [Diagram: Insert fileId] PAST file storage is mapped onto the Pastry overlay network by maintaining the invariant that replicas of a file are stored on the k nodes that are numerically closest to the file’s numeric fileId. During an insert operation, an insert request for the file is routed using the fileId as the key. The node closest to fileId replicates the file on the k-1 next nearest nodes in the namespace.

PAST: File storage [Diagram: Insert fileId, k=4] Storage Invariant: File “replicas” are stored on k nodes with nodeIds closest to fileId (k is bounded by the leaf set size)
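The storage invariant can be sketched as a function that picks the replica holders for a file. The function name replica_nodes and the global view of all live nodes are assumptions for illustration; in PAST the closest node finds the other k-1 holders through its leaf set rather than a global list.

```python
def replica_nodes(file_id, live_nodes, k, id_bits=128):
    """Return the k live nodeIds numerically closest to file_id.

    Distance is measured on the circular id space of size 2**id_bits.
    Ties are broken by the smaller nodeId (an arbitrary choice here).
    """
    space = 1 << id_bits

    def ring_distance(node_id):
        d = abs(node_id - file_id) % space
        return min(d, space - d)

    return sorted(live_nodes, key=lambda n: (ring_distance(n), n))[:k]


nodes = [10, 20, 30, 40, 2 ** 128 - 5]
print(replica_nodes(12, nodes, k=3))
```

If one of the k holders fails, recomputing the set over the remaining live nodes shows which node must receive a new copy to restore the invariant.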

PAST: File Retrieval [Diagram: client C issues a Lookup for fileId; k replicas near fileId] file located in log16 N steps (expected) usually locates replica nearest client C. The last point is shown pictorially here. A lookup request is routed in an expected log16 N steps to a node that stores a replica, if one exists. In practice, the node among the k that first receives the message serves the file. Furthermore, the network locality properties of Pastry make it likely that this node is the one closest to the client in the network.

PAST: Exploiting Pastry Random, uniformly distributed nodeIds replicas stored on diverse nodes Uniformly distributed fileIds e.g. SHA-1(filename,public key, salt) approximate load balance Pastry routes to closest live nodeId availability, fault-tolerance A number of interesting properties emerge: since nodeId assignment is random, neighboring nodes in namespace are diverse in location, ownership, jurisdiction, network attachment -- thus, excellent candidates for storing replicas of a file. fileId are pseudo-randomly assigned and, like nodeIds, uniformly distributed in the namespace. Thus the number of files assigned to each node is roughly balanced. Pastry routes requests to live node with closest nodeId. Thus, file is available unless all k nodes die simultaneously.

PAST: Storage management Maintain storage invariant Balance free space when global utilization is high statistical variation in assignment of files to nodes (fileId/nodeId) file size variations node storage capacity variations Local coordination only (leaf sets)

Experimental setup Web proxy traces from NLANR: 18.7 Gbytes, 10.5K mean, 1.4K median, 0 min, 138 MB max. Filesystem: 166.6 Gbytes, 88K mean, 4.5K median, 0 min, 2.7 GB max. 2250 PAST nodes (k = 5); truncated normal distributions of node storage sizes, mean = 27/270 MB. Using appropriate workloads to evaluate systems like PAST is difficult, because few such systems exist and workloads are difficult to capture. We used two traces; the filesystem trace is described in the paper. We chose to use two existing workloads with different characteristics, to probe the space of workload characteristics that a system like PAST might encounter in practice. In particular...

Need for storage management No diversion (tpri = 1, tdiv = 0): max utilization 60.8%, 51.1% inserts failed. Replica/file diversion (tpri = .1, tdiv = .05): max utilization > 98%, < 1% inserts failed.

PAST: File insertion failures Leave this out if running out of time

PAST: Caching Nodes cache files in the unused portion of their allocated disk space Files cached on nodes along the route of lookup and insert messages Goals: maximize query throughput for popular documents balance query load improve client latency

PAST: Caching [Diagram: Lookup messages routed toward fileId]

PAST: Caching

PAST: Security No read access control; users may encrypt content for privacy File authenticity: file certificates System integrity: nodeIds, fileIds non-forgeable, sensitive messages signed Routing randomized

PAST: Storage quotas Balance storage supply and demand user holds smartcard issued by brokers hides user private key, usage quota debits quota upon issuing file certificate storage nodes hold smartcards advertise supply quota storage nodes subject to random audits within leaf sets

PAST: Related Work CFS [SOSP’01] OceanStore [ASPLOS 2000] FarSite [Sigmetrics 2000]

Status Functional prototypes Pastry [Middleware 2001] PAST [HotOS-VIII, SOSP’01] SCRIBE [NGC 2001] Squirrel [submitted] http://www.research.microsoft.com/~antr/Pastry

Current Work Security: secure nodeId assignment, quota system Keyword search capabilities Support for mutable files in PAST Anonymity/Anti-censorship New applications Software releases

Conclusion For more information: http://www.research.microsoft.com/~antr/Pastry I’ll be here till Friday lunchtime; feel free to stop me and talk.