1 Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
Gabi Kliot, Computer Science Department, Technion
Topics in Reliable Distributed Computing, 21/11/2004
Partially borrowed from Peter Druschel's presentation
2 Outline
- Introduction
- Pastry overview
- PAST overview
- Storage management
- Caching
- Experimental results
- Conclusion
3 Sources
- "Storage management and caching in PAST, a large-scale persistent peer-to-peer storage utility", Antony Rowstron (Microsoft Research) and Peter Druschel (Rice University)
- "Pastry: scalable, decentralized object location and routing for large-scale peer-to-peer systems", Antony Rowstron (Microsoft Research) and Peter Druschel (Rice University)
4 PASTRY
5 Pastry
- Generic p2p location and routing substrate (DHT)
- Self-organizing overlay network (joins, departures, locality repair)
- Consistent hashing
- Lookup/insert of an object in < log_{2^b} N routing steps (expected)
- O(log N) per-node state
- Network locality heuristics
- Scalable, fault resilient, self-organizing, locality aware, secure
6 Pastry: API
- nodeId = pastryInit(Credentials, Application): join the local node to the Pastry network
- route(M, X): route message M to the node with nodeId numerically closest to X
- Application callbacks:
  - deliver(M): deliver message M to the application
  - forwarding(M, X): message M is being forwarded towards key X
  - newLeaf(L): report a change in leaf set L to the application
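As a rough illustration only, the API above can be written as a pair of Python interfaces. The method names follow the slide; the signatures, types and class names are assumptions for this sketch (Pastry itself is not a Python library).

```python
from abc import ABC, abstractmethod

class PastryApplication(ABC):
    """Callbacks an application registers with Pastry (names from the slide)."""

    @abstractmethod
    def deliver(self, msg) -> None:
        """Message msg arrived at the node responsible for its key."""

    @abstractmethod
    def forwarding(self, msg, key) -> None:
        """Message msg is passing through this node on its way toward key."""

    @abstractmethod
    def newLeaf(self, leaf_set) -> None:
        """The local node's leaf set has changed."""

class PastryNode(ABC):
    """Operations exposed by the Pastry substrate (names from the slide)."""

    @abstractmethod
    def pastryInit(self, credentials, application: PastryApplication) -> int:
        """Join the local node to the Pastry network and return its nodeId."""

    @abstractmethod
    def route(self, msg, key: int) -> None:
        """Route msg to the live node whose nodeId is numerically closest to key."""
```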
7 Pastry: Object distribution
- Consistent hashing over a 128-bit circular id space (ids range from 0 to 2^128 - 1)
- nodeIds are uniformly random; objIds/keys are uniformly random
- Invariant: the node with nodeId numerically closest to an objId maintains that object
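A minimal sketch of the invariant above, assuming nodeIds and keys are plain 128-bit integers and that "numerically closest" is measured on the circular id space:

```python
ID_BITS = 128
ID_SPACE = 1 << ID_BITS  # ids range over 0 .. 2^128 - 1

def ring_distance(a: int, b: int) -> int:
    """Numerical distance between two ids on the circular id space."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)

def responsible_node(key: int, node_ids: list[int]) -> int:
    """The node that maintains an object: the nodeId numerically closest to the key."""
    return min(node_ids, key=lambda n: ring_distance(n, key))

# Example with three nodes: key 0x05 is closest to nodeId 0x10 (distance 11).
nodes = [0x10, 0x80, ID_SPACE - 0x10]
assert responsible_node(0x05, nodes) == 0x10
```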
8 Pastry: Object insertion/lookup
- A message with key X is routed to the live node with nodeId closest to X
- Problem: a complete routing table is not feasible
[Figure: Route(X) travelling around the 0 .. 2^128 - 1 id circle to the node closest to X.]
9 Pastry: Routing Tradeoff
- O(log N) routing table size: 2^b * log_{2^b} N + 2l entries
- O(log N) message forwarding steps
10 Pastry: Routing table (of the node with nodeId 10233102)
- log_{2^b} N rows are populated (the full table has log_{2^b} 2^128 = 128/b rows)
- 2^b columns per row
- plus L nodes in the leaf set and L neighborhood-set nodes
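Plugging illustrative numbers into the state estimate from the previous slide (2^b * log_{2^b} N + 2l) makes the O(log N) claim concrete. The values b = 4, N = 10^6 and l = 16 below are chosen for the example, not taken from the paper:

```python
import math

# Worked example of the routing-state estimate: 2^b * log_{2^b}(N) + 2l.
b, N, l = 4, 10**6, 16
rows_in_use = math.log(N, 2 ** b)          # ~log_{2^b} N populated rows
state = (2 ** b) * rows_in_use + 2 * l     # routing-table entries + leaf set
print(f"~{rows_in_use:.1f} rows in use, ~{state:.0f} entries of state")
# -> ~5.0 rows in use, ~112 entries of state
```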
11 Pastry: Leaf sets
- Each node maintains the IP addresses of the L/2 nodes with numerically closest larger nodeIds and the L/2 nodes with numerically closest smaller nodeIds
- Used for routing efficiency/robustness, fault detection (keep-alive), and application-specific local coordination
12 Pastry: Routing procedure
if (the destination D is within range of our leaf set)
    forward to the numerically closest leaf-set member
else
    let l = length of the prefix shared with D
    let d = value of the l-th digit of D
    if (the routing-table entry R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node (from the leaf set, routing table, or neighborhood set) that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
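The forwarding rule above as a hedged Python sketch. Ids are treated as fixed-length lowercase hex strings of base-2^b digits (b = 4), the leaf-set range check and tie-breaking are simplified, and `all_known` stands in for the union of leaf set, routing table and neighborhood set.

```python
def shl(a: str, d: str) -> int:
    """Length of the prefix (in base-2^b digits) shared by ids a and d."""
    n = 0
    for x, y in zip(a, d):
        if x != y:
            break
        n += 1
    return n

def num_dist(a: str, d: str) -> int:
    """Numerical distance between two ids given as fixed-length hex strings."""
    return abs(int(a, 16) - int(d, 16))

def next_hop(key: str, local_id: str, leaf_set: list[str],
             routing_table: list[dict], all_known: list[str]) -> str:
    # Case 1: the key lies within the leaf-set range -> forward to the
    # numerically closest member (possibly the local node itself).
    members = leaf_set + [local_id]
    if min(members) <= key <= max(members):   # lexicographic == numeric for fixed-length hex
        return min(members, key=lambda n: num_dist(n, key))

    # Case 2: use the routing-table entry R[l][d] that matches one more digit
    # (assumes the table has 128/b rows and key != local_id).
    l = shl(key, local_id)
    entry = routing_table[l].get(key[l])
    if entry is not None:
        return entry

    # Case 3 (rare): any known node with at least as long a shared prefix
    # that is numerically closer to the key than the local node.
    better = [n for n in all_known
              if shl(key, n) >= l and num_dist(n, key) < num_dist(local_id, key)]
    return min(better, key=lambda n: num_dist(n, key)) if better else local_id
```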
13 Pastry: Routing Properties
- log_{2^b} N steps
- O(log N) state
[Figure: example route for key d46a1c, starting at node 65a1fc and passing through d13da3, d4213f and d462ba to reach d467c4, the node numerically closest to the key; d471f1 is its neighbor.]
14 Pastry: Routing
- Integrity of the overlay: guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
- Number of routing hops:
  - No failures: < log_{2^b} N expected, 128/b + 1 maximum
  - During failure recovery: O(N) worst case, much better in the average case
15 Pastry: Locality properties
- Assumption: a scalar proximity metric, e.g. ping/RTT delay, # IP hops (traceroute), subnet masks; a node can probe its distance to any other node
- Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix
16 Pastry: Geometric Routing in proximity space
- The proximity distance traveled by the message in each routing step increases exponentially (the entry in row l is chosen from a set of nodes of size N/2^{bl})
- The distance traveled by the message from its source increases monotonically at each step (the message takes larger and larger strides)
[Figure: the route for key d46a1c shown both in the nodeId space and in the proximity space.]
17 Pastry: Locality properties
- Each routing step is local, but there is no guarantee of a globally shortest path
- Nevertheless, simulations show:
  - The expected distance traveled by a message in the proximity space is within a small constant of the minimum
  - Among the k nodes with nodeIds closest to the key, the message is likely to reach the node closest to the source node first
18 Pastry: Self-organization
- Initializing and maintaining routing tables and leaf sets
- Node addition
- Node departure (failure)
- The goal is to keep every routing table entry referring to a nearby node, among all live nodes with the appropriate prefix
19 Pastry: Node addition
- New node X contacts a nearby node A
- A routes a "join" message with key X, which arrives at Z, the node with nodeId closest to X
- X obtains its leaf set from Z and the i-th row of its routing table from the i-th node on the path from A to Z
- X informs any nodes that need to be aware of its arrival
- X also improves its table locality by requesting neighborhood sets from all nodes it knows
- In practice: an optimistic approach (see the sketch below)
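A rough sketch of how the arriving node assembles its state from the join path, following the steps above. The node objects, attribute names and the final notification call are illustrative placeholders, not Pastry's actual interfaces.

```python
def initialize_new_node(x, join_path):
    """join_path: the nodes the 'join' message visited, from A (a node near X)
    to Z (the node whose nodeId is numerically closest to X)."""
    z = join_path[-1]
    x.leaf_set = list(z.leaf_set)                # leaf set comes from Z
    for i, hop in enumerate(join_path):
        if i < len(x.routing_table):
            x.routing_table[i] = dict(hop.routing_table[i])  # row i from the i-th hop
    # Nodes whose leaf sets or routing tables should now contain X are informed;
    # X can later refine locality by asking the nodes it knows for their state.
    x.announce_arrival()
```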
20 Pastry: Node addition
[Figure: new node X = d46a1c joins via nearby node A = 65a1fc; the "join" message is routed as Route(d46a1c) and arrives at Z = d467c4, the live node with nodeId closest to X.]
21 Pastry: Node addition
- New node: d46a1c
- X is close to A, and B is close to B's row-one entries B1; why is X also close to B1?
- The expected distance from B to its row-one entries (B1) is much larger than the expected distance from A to B (B was chosen from a set whose size decreases exponentially with the row number), so B1 is approximately as close to X as it is to B
[Figure: the join route shown in both the nodeId space and the proximity space.]
22 Node departure (failure)
- Leaf set repair (eager, runs all the time): leaf-set members exchange keep-alive messages; on a failure, the leaf set is requested from the furthest live node in the set
- Routing table repair (lazy, upon failure): get a replacement entry from peers in the same row; if none is found, from nodes in higher rows
- Neighborhood set repair (eager)
23 Pastry: Security
- Secure nodeId assignment
- Randomized routing: pick a random node among all suitable candidates
- Byzantine fault-tolerant leaf-set membership protocol
24 Pastry: Distance traveled
- |L| = 16, 100k random queries
- Proximity measured in an emulated network; nodes placed randomly
25 Pastry: Summary
- Generic p2p overlay network
- Scalable, fault resilient, self-organizing, secure
- O(log N) routing steps (expected)
- O(log N) routing table size
- Network locality properties
26 PAST
27 INTRODUCTION: the PAST system
- An Internet-based, peer-to-peer global storage utility
- Characteristics: strong persistence; high availability (via k replicas); scalability (due to efficient Pastry routing); short insert and query paths; query load balancing and latency reduction (due to wide dispersion, Pastry locality, and caching); security
- Composed of nodes connected to the Internet; each node has a 128-bit nodeId
- Uses Pastry as its routing scheme
- No support for mutable files, searching, or directory lookup
28 INTRODUCTION
- Functions of a node: store replicas of files; initiate and route client requests to insert or retrieve files in PAST
- File properties: each inserted file has a quasi-unique fileId and is replicated across multiple nodes
- To retrieve a file, a client must know its fileId and decryption key (if necessary)
- fileId: 160 bits, computed as the SHA-1 hash of the file name, the owner's public key, and a random salt
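The fileId computation above takes only a few lines of Python. The exact byte layout and salt length are assumptions, but the ingredients (file name, owner's public key, random salt, SHA-1) follow the slide.

```python
import hashlib
import os

def compute_file_id(file_name: str, owner_public_key: bytes) -> tuple[bytes, bytes]:
    """Return a quasi-unique 160-bit fileId and the salt used to derive it."""
    salt = os.urandom(8)  # fresh random salt; re-salting later enables file diversion
    digest = hashlib.sha1(file_name.encode() + owner_public_key + salt).digest()
    return digest, salt   # 20 bytes = 160 bits
```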
29 PAST Operations
- Insert: fileId = Insert(name, owner-credentials, k, file)
  1. The fileId is computed (a hash of the file name, the owner's public key, and a salt)
  2. The request message reaches one of the k nodes closest to fileId
  3. That node accepts a replica of the file and forwards the message to the other k-1 closest nodes in its leaf set
  4. Once k nodes accept, an 'ack' message with a store receipt is passed back to the client
- Lookup: file = Lookup(fileId)
- Reclaim: Reclaim(fileId, owner-credentials)
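The Insert flow above as a hedged sketch, reusing compute_file_id from the earlier snippet. `pastry.route` is assumed here to return a handle to the node the message was delivered to, and `replicate_to_leaf_set` is an illustrative stand-in; neither is PAST's actual interface. Lookup and Reclaim follow the same route-toward-fileId pattern.

```python
def past_insert(name, owner_credentials, k, file_bytes, pastry):
    # 1. Compute the fileId from the name, the owner's public key and a salt.
    file_id, salt = compute_file_id(name, owner_credentials.public_key)
    # 2. Route the insert request; Pastry delivers it to one of the k live
    #    nodes whose nodeIds are numerically closest to fileId.
    first_node = pastry.route(("INSERT", file_id, file_bytes), key=file_id)
    # 3. That node stores a replica and forwards the request to the other
    #    k-1 closest nodes, which it finds in its leaf set.
    receipts = first_node.replicate_to_leaf_set(file_id, file_bytes, k)
    # 4. Once k nodes have accepted, an ack carrying the store receipts
    #    travels back to the client.
    if len(receipts) < k:
        raise RuntimeError("fewer than k nodes accepted the file")
    return file_id
```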
30 STORAGE MANAGEMENT: why?
- Responsibility: replicas of a file must be maintained by the k nodes with nodeIds closest to its fileId, while balancing the free storage space among the nodes in PAST
- Conflict: the k closest nodes may have insufficient storage while neighboring nodes have plenty
- Causes of load imbalance, three differences among nodes: the number of files assigned to each node, the size of each inserted file, and the storage capacity of each node
- Resolution: replica diversion and file diversion
31 STORAGE MANAGEMENT: Replica Diversion
- GOAL: balance the remaining free storage space among the nodes in a leaf set
- Diversion steps of a node A that received an insertion request but has insufficient space (sketched below):
  1. Choose a node B from A's leaf set, excluding the k closest nodes, such that B does not already hold a diverted replica of the file
  2. Ask B to store a copy
  3. Enter an entry for the file in A's table with a pointer to B
  4. Send a store receipt as usual
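The four diversion steps from node A's point of view, sketched in Python. The data structures, the method names, and the way a candidate B is chosen are assumptions for illustration.

```python
def store_or_divert(node_a, file_id, file_bytes, k):
    if node_a.can_accept(len(file_bytes)):
        node_a.store(file_id, file_bytes)
        return node_a.store_receipt(file_id)

    # 1. Pick B from A's leaf set: not among the k nodes closest to fileId
    #    and not already holding a diverted replica of this file.
    candidates = [b for b in node_a.leaf_set
                  if b not in node_a.k_closest(file_id, k)
                  and not b.holds_diverted_replica(file_id)]
    if not candidates:
        raise RuntimeError("no diversion target; file diversion (next slides) takes over")
    node_b = candidates[0]          # selection among candidates simplified here

    # 2. Ask B to store the copy.
    node_b.store(file_id, file_bytes)
    # 3. Keep an entry for the file in A's table that points to B.
    node_a.file_table[file_id] = ("diverted", node_b)
    # 4. Issue the store receipt as if A held the replica itself.
    return node_a.store_receipt(file_id)
```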
32 STORAGE MANAGEMENT: Replica Diversion
- Policy for accepting a replica at a node: reject the file if file_size / remaining_storage > t
- The threshold t is t_pri for primary replicas and t_div for diverted replicas
- Avoids unnecessary diversion while a node still has space
- Prefers diverting large files, minimizing the number of diversions
- Prefers accepting primary replicas over diverted replicas (t_pri > t_div)
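The acceptance test itself is a single comparison; the default thresholds below (t_pri = 0.1, t_div = 0.05) are the values used later in the experimental slides.

```python
def accepts_replica(file_size: int, free_space: int, primary: bool,
                    t_pri: float = 0.1, t_div: float = 0.05) -> bool:
    """Accept a replica only if it is small relative to the remaining space.

    Since t_pri > t_div, primary replicas are favored over diverted ones,
    and large files are the first to be rejected (and hence diverted)."""
    if free_space <= 0:
        return False
    t = t_pri if primary else t_div
    return file_size / free_space <= t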
33 STORAGE MANAGEMENT: File Diversion
- GOAL: balance the remaining free storage space among the nodes of the whole PAST network
- Used when all k nodes and their leaf sets have insufficient space
- The client node generates a new fileId using a different salt value and retries
- Retry limit: 3 times; after the fourth failure, the client falls back to fragmenting the file into smaller pieces
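File diversion from the client's side as a hedged sketch: re-salt the name, retry up to three times, then give up (the slide's fallback is to fragment the file). `past_insert_with_id` and the exception type are hypothetical helpers, not PAST's API.

```python
class InsufficientStorage(Exception):
    pass

def insert_with_file_diversion(name, creds, k, file_bytes, pastry, retries=3):
    for _ in range(1 + retries):                 # original attempt + 3 retries
        file_id, _ = compute_file_id(name, creds.public_key)  # new salt each time
        try:
            # Hypothetical insert-by-id helper standing in for the Insert flow.
            return past_insert_with_id(file_id, creds, k, file_bytes, pastry)
        except InsufficientStorage:
            continue                             # divert the file to a new fileId
    # Fourth failure: the slide's advice is to fragment the file and retry.
    raise InsufficientStorage("all attempts failed; fragment the file and retry")
```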
34 STORAGE MANAGEMENT: node strategy to maintain k replicas
- In Pastry, neighboring nodes exchange keep-alive messages; if a period T passes without one, the remaining leaf-set members remove the failed node from their leaf sets and include the live node with the next-closest nodeId
- File strategy when nodes join or drop out of leaf sets: if the failed node was one of the k nodes for certain files (as a primary or diverted replica holder), the replicas it held are re-created
- To cope with the failure of a diverting node, diversion pointers are replicated
- Optimization: instead of requesting all of its replicas immediately, a joining node may install pointers to the previous replica holders in its file table (as in replica diversion) and then migrate the files gradually
35 STORAGE MANAGEMENT: Fragmenting and File encoding
- Reed-Solomon encoding can be used to increase availability
- Fragmentation improves disk-utilization balance and bandwidth (parallel download)
- Drawback: higher latency, since several nodes must be contacted for retrieval
36 CACHING
- GOAL: minimize client access latency, maximize query throughput, and balance the query load
- Create and maintain additional copies of highly popular files in the "unused" disk space of nodes
- Copies are cached at the nodes a file is routed through during successful insertions and lookups
- GreedyDual-Size (GD-S) replacement policy: each cached file f is assigned the value H_f = cost(f) / size(f), and the file with the lowest H_f is replaced
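A simplified eviction loop for the GD-S policy as described on the slide (the standard GD-S aging/inflation term is omitted). `cost` is a placeholder for whatever retrieval-cost estimate a node uses.

```python
def make_room(cache: dict, new_file_size: int, capacity: int, cost) -> None:
    """cache maps fileId -> (size, data); evict lowest H_f = cost(f)/size(f) first."""
    used = sum(size for size, _ in cache.values())
    while cache and used + new_file_size > capacity:
        victim = min(cache, key=lambda f: cost(f) / cache[f][0])  # lowest H_f
        used -= cache[victim][0]
        del cache[victim]
```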
37 Security in PAST
- Smartcards with a private/public key scheme ensure the integrity of nodeId and fileId assignment
- Against a malicious node:
  - Store receipts prevent it from storing fewer than k replicas
  - File certificates allow verifying the authenticity of file content
  - File privacy is provided by client-side encryption
  - Routing-table entries are signed
  - Routing is randomized to hinder denial of service
- A malicious node cannot be completely prevented from suppressing valid entries
38 EXPERIMENTAL RESULTS: Effects of Storage Management
- Policy: accept a file if file_size / free_space < t
- No diversion (t_pri = 1, t_div = 0): maximum utilization 60.8%, 51.1% of inserts failed; the leaf-set size determines the reach of local load balancing
- Replica/file diversion (t_pri = 0.1, t_div = 0.05): maximum utilization > 98%, < 1% of inserts failed
39 EXPERIMENTAL RESULTS: Determining Threshold Values
- Policy: accept a file if file_size / free_space < t
[Figures: insertion statistics and utilization as t_pri is varied with t_div = 0.05, and as t_div is varied with t_pri = 0.1.]
- As t_pri increases, fewer files are successfully inserted, but higher storage utilization is achieved: the lower t_pri, the less likely a large file can be stored, so many small files are stored instead, and utilization drops because large files are rejected even at low utilization levels
- As t_div increases, storage utilization likewise improves, but fewer files are successfully inserted
40 EXPERIMENTAL RESULTS: Impact of file and replica diversion
- File diversions are negligible for storage utilization below 83%
- The number of replica diversions is small even at high utilization: at 80% utilization, fewer than 10% of replicas are diverted
- => The overhead imposed by replica and file diversion is small as long as utilization stays below 95%
41 EXPERIMENTAL RESULTS: File Insertion Failure
[Figures: file insertion failures vs. storage utilization, and the failure ratio for smaller files.]
- The failure ratio increases once storage utilization exceeds about 90%
- Failed insertions are heavily biased towards large files
42 EXPERIMENTAL RESULTS: Caching
[Figure: global cache hit ratio and average number of message hops vs. storage utilization.]
- The hit ratio drops as storage utilization and the number of files increase, because cached files are replaced
- A lower hit ratio means more routing hops, approaching the uncached value of log_16 2250 ≈ 3
43 CONCLUSION
- Design and evaluation of PAST: storage management and caching
- Nodes and files are assigned uniformly distributed IDs; replicas of a file are stored at the k nodes closest to its fileId
- Experimental results: storage utilization of 98%, a low file-insertion failure ratio even at high storage utilization, and effective caching that achieves load balancing
44 Weaknesses
- Does not support mutable files (read only)
- No searching or directory lookup
- A local fault in a network segment may leave a functioning node unable to contact the outside world, since its routing table is mostly local
- No direct support for anonymity or confidentiality
- Breaking a large node apart: is it good or bad?
- The simulation is too sterile
- No experimental comparison of PAST to other systems
45 Comparison to other systems
46 Comparison
- Pastry compared to Freenet and Gnutella: guarantees an answer in a bounded number of steps, while retaining the scalability of Freenet and the self-organization of Freenet and Gnutella
- Pastry compared to Chord: Chord makes no explicit effort to achieve good network locality
- PAST compared to OceanStore: PAST has no support for mutable files, searching, or directory lookup; more sophisticated storage semantics could be built on top of PAST
- Pastry (and Tapestry) are similar to Plaxton: routing based on prefixes, a generalization of hypercube routing; but Plaxton is not self-organizing and associates a single node with each file, creating a single point of failure
47 Comparison
- PAST compared to FarSite:
  - FarSite has traditional file-system semantics and a distributed directory service to locate content
  - Every node maintains a partial list of live nodes, from which it chooses nodes to store replicas
  - FarSite's LAN assumptions may not hold in a wide-area environment
- PAST compared to CFS:
  - CFS is built on top of Chord
  - A file-sharing medium: block-oriented, read-only
  - Each block is stored on multiple nodes with adjacent Chord nodeIds, and popular blocks are cached
  - Increased file-retrieval overhead, though parallel block retrieval helps for large files
  - CFS assumes an abundance of free disk space
  - It relies on hosting multiple logical nodes (with separate ids) on one physical Chord node to accommodate nodes with large storage capacity, which increases query overhead
48 Comparison
- PAST compared to LAND:
  - An expected constant number of outgoing links per node
  - A constant number of pointers to each object
  - A constant bound on distortion (stretch): the cumulative route cost divided by the direct distance
  - The choice of links enforces a distance upper bound at each stage of the route
  - LAND uses a two-tier architecture with super-nodes
49 The END