Pastry
Peter Druschel, Rice University
Antony Rowstron, Microsoft Research UK
Some slides are borrowed from the original presentation by the authors.
Primary issues
– Organize and maintain the overlay network
– Resource allocation / load balancing
– Object / resource location
– Network proximity routing
Pastry provides a generic p2p substrate.
Architecture
[Figure: layered architecture: the p2p application layer (network storage, event notification, ?) runs on Pastry, the p2p substrate (a self-organizing overlay network), which runs on TCP/IP over the Internet.]
Pastry
– Generic p2p substrate
– DHT-based self-organizing overlay network
– Lookup/insert of an object in < log_16 N routing steps (expected)
– O(log N) routing table size per node
– Network proximity routing
Pastry: Object distribution
– Consistent hashing [Karger et al. '97]
– 128-bit circular id space
– nodeIds and objIds are assigned uniformly at random
– Invariant: the node with the numerically closest nodeId maintains the object (recall Chord).
[Figure: the circular id space, with nodeIds and an objId placed on the ring.]
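To make the id mapping concrete, here is a minimal sketch in Python. Truncating SHA-1 to 128 bits is an illustrative choice, not Pastry's exact scheme, and the function names are ours:

```python
import hashlib

ID_BITS = 128
ID_SPACE = 1 << ID_BITS          # the 2^128 circular id space

def make_id(key: bytes) -> int:
    """Map a key (node address, object name) to a uniform 128-bit id.
    SHA-1 truncated to 128 bits stands in for the real id assignment."""
    return int.from_bytes(hashlib.sha1(key).digest(), "big") % ID_SPACE

def distance(a: int, b: int) -> int:
    """Numeric distance on the circular id space."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)

def responsible_node(obj_id: int, node_ids: list) -> int:
    """Invariant: the node with the numerically closest nodeId
    maintains the object."""
    return min(node_ids, key=lambda n: distance(n, obj_id))
```

For example, responsible_node(make_id(b"song.mp3"), node_ids) names the node that must store that object under the invariant above.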
Pastry: Routing table
[Figure: routing table of node 65a1fc (b = 4, so 2^b = 16 columns), with log_16 N populated rows plus the leaf set. Row 0 holds nodes 0x..fx, row 1 holds 60x..6fx, row 2 holds 650x..65fx, and so on; in each row the node's own digit at that position (6x, 65x, 65ax, ...) is omitted.]
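As a data structure, that table might look like the following sketch, assuming b = 4 (hex digits) and 128-bit ids; the class and helper names are illustrative:

```python
b = 4
ROWS = 128 // b    # 32 rows; only ~log_16 N are populated in practice
COLS = 1 << b      # 16 columns, one per digit value

def digits(node_id: int) -> list:
    """The id as a sequence of base-2^b digits, most significant first."""
    return [(node_id >> (b * (ROWS - 1 - i))) & (COLS - 1)
            for i in range(ROWS)]

class RoutingTable:
    def __init__(self, my_id: int):
        self.my_id = my_id
        self.my_digits = digits(my_id)
        # table[row][col]: a node that shares `row` digits with my_id
        # and whose next digit is `col`; None if no such node is known.
        self.table = [[None] * COLS for _ in range(ROWS)]

    def slot_for(self, other_id: int):
        """The (row, col) where other_id belongs: row = length of the
        shared prefix, col = other_id's digit at that position."""
        other = digits(other_id)
        row = 0
        while row < ROWS and other[row] == self.my_digits[row]:
            row += 1
        return (row, other[row]) if row < ROWS else None
```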
Pastry: Leaf sets
Each node maintains the IP addresses of the L/2 nodes with numerically closest larger nodeIds and the L/2 with numerically closest smaller nodeIds. The leaf set supports:
– routing efficiency/robustness
– fault detection (keep-alive)
– application-specific local coordination
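A sketch of the leaf set as a structure, assuming L = 16 and ignoring wrap-around on the circular id space for brevity:

```python
L = 16   # typical leaf set size

class LeafSet:
    """The L/2 numerically closest smaller and larger nodeIds.
    Each side is kept ordered by distance from my_id, closest first,
    so [-1] is the farthest member on that side."""
    def __init__(self, my_id: int):
        self.my_id = my_id
        self.smaller = []   # up to L/2 ids below my_id
        self.larger = []    # up to L/2 ids above my_id

    def covers(self, key: int) -> bool:
        """True if key lies within the leaf set's range, so routing can
        finish with one hop to the numerically closest member."""
        lo = self.smaller[-1] if self.smaller else self.my_id
        hi = self.larger[-1] if self.larger else self.my_id
        return lo <= key <= hi

    def closest(self, key: int) -> int:
        """The member (possibly my_id) numerically closest to key."""
        return min(self.smaller + [self.my_id] + self.larger,
                   key=lambda n: abs(n - key))
```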
Pastry: Routing
Prefix routing: log_16 N steps, O(log N) routing table size per node.
[Figure: Route(d46a1c) starting at 65a1fc proceeds via d13da3, d4213f, and d462ba to d467c4, the node numerically closest to the key; d471f1 is its neighbor.]
Pastry: Routing procedure
if the destination D is within the range of the leaf set then
    forward to the numerically closest leaf set member
else
    let l = length of the prefix shared with D
    let d = value of the l-th digit of D
    if R[l][d] exists (the entry at row l, column d) then
        forward to R[l][d]
    else (rare case)
        forward to a known node that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
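Putting the leaf set and routing table together, here is a sketch of one routing step. It builds on the LeafSet and RoutingTable sketches above; known_nodes is an assumed enumeration of every node this node has heard of:

```python
def shared_prefix_len(a: int, b: int) -> int:
    """Number of leading base-16 digits shared by two ids."""
    da, db = digits(a), digits(b)
    n = 0
    while n < ROWS and da[n] == db[n]:
        n += 1
    return n

def next_hop(node, key: int):
    """One step of the routing procedure. Returns the next hop's id,
    or None if this node is the final destination."""
    # Case 1: the key is covered by the leaf set -> deliver directly
    # to the numerically closest member.
    if node.leaf_set.covers(key):
        closest = node.leaf_set.closest(key)
        return None if closest == node.my_id else closest
    # Case 2: use the routing table entry that shares one more digit
    # with the key than this node does.
    row, col = node.routing_table.slot_for(key)
    entry = node.routing_table.table[row][col]
    if entry is not None:
        return entry
    # Case 3 (rare): any known node that shares at least as long a
    # prefix with the key and is numerically closer than this node.
    for cand in node.known_nodes():
        if (shared_prefix_len(cand, key) >= row
                and abs(cand - key) < abs(node.my_id - key)):
            return cand
    return None   # this node is the best node we know of
```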
Pastry: Performance
– Integrity of overlay / message delivery: guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
– Number of routing hops:
  – no failures: < log_16 N expected, 128/b + 1 maximum
  – during failure recovery: O(N) worst case (a loose upper bound); average case much better
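As a quick worked check on those bounds (our arithmetic, not from the slides):

```python
import math

def expected_hops(n: int, b: int = 4) -> float:
    """Expected hop count: log_{2^b} N, i.e. log_16 N when b = 4."""
    return math.log(n, 2 ** b)

def max_hops(b: int = 4) -> int:
    """One hop per id digit (128/b) plus a final leaf-set hop."""
    return 128 // b + 1

print(expected_hops(10**6))   # ~4.98 hops for a million nodes
print(max_hops())             # 33
```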
Self-organization
How are the routing tables and leaf sets initialized and maintained?
– Node addition
– Node departure (failure)
Pastry: Node addition
New node: d46a1c. The new node X asks a nearby node (here 65a1fc) to route a join message to X's own id; the nodes along the route share their routing tables with X.
[Figure: the join route from 65a1fc via d13da3, d4213f, and d462ba to d467c4, the node closest to d46a1c.]
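A sketch of how X might assemble its state from that exchange; route_path and leaf_set_snapshot are hypothetical helpers, not Pastry's actual messages:

```python
def join(new_id: int, bootstrap):
    """The i-th node on the route toward new_id shares at least i
    digits with new_id, so its routing-table row i is a sensible
    initial row i for the new node."""
    table = RoutingTable(new_id)
    path = bootstrap.route_path(new_id)   # hypothetical: the nodes the
                                          # join message traverses
    for row, node in enumerate(path):
        table.table[row] = list(node.routing_table.table[row])
    # The last hop is numerically closest to new_id, so its leaf set
    # is the right starting point for the new node's leaf set.
    leaf_set = path[-1].leaf_set_snapshot()   # hypothetical helper
    return table, leaf_set
```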
Node departure (failure)
– Leaf set members exchange heartbeats.
– Leaf set repair (eager): request the leaf set from the farthest live node on the failed side.
– Routing table repair (lazy): when an entry fails, get a replacement from peers in the same row, then from higher rows.
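A sketch of both repair paths; fetch_leaf_set and fetch_table_entry are hypothetical RPCs, and LeafSet.insert is an assumed helper that keeps only the L/2 closest ids per side:

```python
def repair_leaf_set(node, failed_id: int):
    """Eager repair: on missed keep-alives, ask the farthest live leaf
    on the failed side for its leaf set and merge it into ours."""
    side = (node.leaf_set.larger if failed_id > node.my_id
            else node.leaf_set.smaller)
    if failed_id in side:
        side.remove(failed_id)
    if side:
        donor = side[-1]                        # farthest live member
        for n in fetch_leaf_set(donor):         # hypothetical RPC
            node.leaf_set.insert(n)             # assumed helper

def repair_table_entry(node, row: int, col: int):
    """Lazy repair: when entry (row, col) is found dead, ask peers in
    the same row for a replacement, then peers in higher rows."""
    for r in range(row, ROWS):
        for peer in filter(None, node.routing_table.table[r]):
            repl = fetch_table_entry(peer, row, col)   # hypothetical RPC
            if repl is not None:
                node.routing_table.table[row][col] = repl
                return
```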
Pastry: Average number of hops
[Figure: average hop count vs. number of nodes; L = 16, 100k random queries.]
Pastry: Proximity routing
– Assumption: a scalar proximity metric, e.g., ping delay or number of IP hops; a node can probe its distance to any other node.
– Proximity invariant: each routing table entry refers to a node that is close to the local node in the proximity space, among all nodes with the appropriate nodeId prefix.
Pastry: Routes in proximity space
[Figure: the route 65a1fc → d13da3 → d4213f → d462ba → d467c4 for key d46a1c, shown both in nodeId space and in the proximity space.]
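A sketch of maintaining the proximity invariant when filling a routing table slot; ping_delay stands in for whatever proximity probe is used, and node.addr is an assumed attribute:

```python
def best_proximity_entry(node, row: int, col: int, candidates):
    """Among all known nodes that fit slot (row, col), keep the one
    closest to us in the proximity space, per the proximity invariant."""
    fitting = [c for c in candidates
               if node.routing_table.slot_for(c) == (row, col)]
    return min(fitting, key=lambda c: ping_delay(node.addr, c),
               default=None)
```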
Pastry: Distance traveled
[Figure: distance traveled by messages vs. number of nodes; L = 16, 100k random queries, Euclidean proximity space.]
PAST API
– Insert: store replicas of a file at k diverse storage nodes
– Lookup: retrieve the file from a nearby live storage node that holds a copy
– Reclaim: free the storage associated with a file
Files are immutable.
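Expressed as an interface, the API might look like this sketch; the signatures are illustrative, not PAST's actual ones:

```python
class Past:
    """Illustrative client-side view of the three PAST operations."""

    def insert(self, filename: str, owner_key, data: bytes, k: int) -> int:
        """Store replicas of an immutable file on k diverse storage
        nodes; returns the fileId."""
        raise NotImplementedError

    def lookup(self, file_id: int) -> bytes:
        """Retrieve the file from a nearby live node holding a copy."""
        raise NotImplementedError

    def reclaim(self, file_id: int, owner_key) -> None:
        """Free the storage associated with the file. Files are
        immutable, so there is no update operation."""
        raise NotImplementedError
```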
PAST: File storage
Storage invariant: file replicas are stored on the k nodes with nodeIds closest to the fileId (k is bounded by the leaf set size).
[Figure: Insert(fileId) with k = 4 replicas placed on the four nodes closest to the fileId.]
PAST operations
Insert:
– The fileId is computed from the filename and the client's public key using SHA-1.
– A file certificate is issued, signed with the owner's private key.
– The certificate and the file are sent to the first of the k destinations.
– The storing node verifies the certificate, checks the content hash, and issues a store receipt.
Lookup:
– The client sends a lookup request to the fileId; a node holding a replica responds with the content and the file certificate.
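A sketch of the fileId and certificate construction described above; the certificate layout is illustrative, and sign is a hypothetical signing callback over the owner's private key:

```python
import hashlib

def make_file_id(filename: str, owner_public_key: bytes) -> int:
    """fileId = SHA-1 over the filename and the owner's public key."""
    h = hashlib.sha1(filename.encode() + owner_public_key)
    return int.from_bytes(h.digest(), "big")

def make_certificate(filename: str, data: bytes,
                     owner_public_key: bytes, sign) -> dict:
    """A file certificate binding the fileId to a content hash,
    signed with the owner's private key (sign is hypothetical)."""
    cert = {
        "fileId": make_file_id(filename, owner_public_key),
        "content_hash": hashlib.sha1(data).hexdigest(),
    }
    cert["signature"] = sign(repr(cert).encode())
    return cert

# A storing node verifies the signature and recomputes the content
# hash before issuing a store receipt.
```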
PAST: File retrieval
[Figure: client C looks up a fileId stored with k replicas; the file is located in log_16 N steps (expected), and the lookup usually finds the replica nearest to the client.]
SCRIBE: Large-scale, decentralized multicast
– Infrastructure to support topic-based publish-subscribe applications
– Scalable: large numbers of topics and subscribers, wide range of subscribers per topic
– Efficient: low delay, low link stress, low node overhead
SCRIBE: Large-scale multicast
[Figure: subscribers send Subscribe(topicId) and a publisher sends Publish(topicId); both are routed toward the node whose id is closest to the topicId.]
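A sketch of how subscriptions and publications ride on Pastry routing (reusing next_hop from above; node.send, node.children, and the message tuples are illustrative):

```python
def subscribe(node, topic_id: int):
    """Route a SUBSCRIBE toward the node whose id is closest to the
    topicId (the topic's root). The reverse paths of all
    subscriptions form the multicast tree."""
    hop = next_hop(node, topic_id)
    if hop is not None:                 # this node is not the root
        node.send(hop, ("SUBSCRIBE", topic_id, node.my_id))

def on_subscribe(node, topic_id: int, child: int):
    """Each node on the route adds the previous hop as a child; only
    if it just joined the tree does it forward the subscription on."""
    first_time = topic_id not in node.children
    node.children.setdefault(topic_id, set()).add(child)
    hop = next_hop(node, topic_id)
    if first_time and hop is not None:  # hop is None at the root
        node.send(hop, ("SUBSCRIBE", topic_id, node.my_id))

def publish(node, topic_id: int, payload: bytes):
    """Route the message to the topic's root, which forwards it down
    the tree to every subscriber."""
    hop = next_hop(node, topic_id)
    if hop is not None:
        node.send(hop, ("PUBLISH", topic_id, payload))
```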
Summary
– Self-configuring p2p framework for topic-based publish-subscribe
– Scribe achieves reasonable performance compared to IP multicast:
  – scales to a large number of subscribers
  – scales to a large number of topics
  – good distribution of load