1 Phenix Workshop on Global Computing Systems
Pastis, a peer-to-peer file system for persistent large-scale storage
Fabio Picconi (advisor: Pierre Sens)
Dec. 7, 2006
Laboratoire d'Informatique de Paris 6 (LIP6 Computer Science Lab)
2 Outline
1. Introduction
2. Peer-to-peer file systems: DHTs, previous designs
3. Pastis file system: data structures, updates, consistency
4. Experimental evaluation: Pastis prototype, performance
5. Predicting durability: Markov chain model, parameter estimation problem
6. Conclusions
3 P2P storage system
Goals:
- large aggregate capacity
- low cost (shared ownership)
- high durability (persistence)
- high availability
- self-organization
4 P2P storage system (cont.)
Challenges:
- large-scale routing
- high latency, low bandwidth
- node disconnections/failures
- consistency
- security
5 Distributed file systems
Scalability (number of nodes) vs. architecture:

                            Client-server   P2P
    LAN (100)               NFS             -
    Organization (10,000)   AFS             FARSITE, Pangaea
    Internet (1,000,000)    -               Ivy*, Oceanstore*

* uses a Distributed Hash Table (DHT) to store data
6 Distributed Hash Tables
Persistent storage service with a simple API:
- put(key, object)
- object = get(key)
Properties:
- decentralization
- scalability (e.g., O(log N) routing)
- fault-tolerance (replication)
- self-organization
[Figure: ring overlay with address space [0000, FFFF]; put(8955, obj) is routed to the nodes whose identifiers are closest to the key (e.g., 8954, 8957), which store the object replicas]
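The two-call API above amounts to something like the following sketch in Java (the language of the Pastis prototype); the interface name and types are illustrative, not PAST's actual API:

    // Minimal sketch of the put/get DHT interface described above.
    public interface Dht {
        // Store an object under a key; the DHT replicates it on the
        // k nodes whose identifiers are closest to the key.
        void put(byte[] key, byte[] object);

        // Retrieve the object stored under a key, or null if absent.
        byte[] get(byte[] key);
    }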
7 DHT-based file systems
Ivy [OSDI'02]:
- log-based, one log per user
- fast writes, slow reads
- limited to a small number of users
Oceanstore [FAST'03]:
- updates serialized by primary replicas
- partial centralization: BFT agreement requires well-connected primary replicas
[Figure: in Ivy, each user's log (user A, B, C) is a chain of DHT blocks; in Oceanstore, updates pass through the primary replicas before reaching the secondary replicas]
8 P2P file systems

                 decentralization   Internet scale   large number of users
    Ivy          X                  X
    Oceanstore                      X                X
    FARSITE      X
    Pangaea      X                                   X

None of these systems presents all three characteristics.
9 PASTIS FILE SYSTEM
10 Pastis file system
Characteristics:
- decentralization
- scalability: number of nodes, number of users
- high durability
- performance
- security
The Pastis layer abstracts away most of the problems caused by distribution and large scale.
[Figure: layered architecture. Applications use a POSIX-like API (open, read, write, lseek, etc.); Pastis adds directories, user groups, and ACLs; the PAST DHT below provides put/get with persistence, fault-tolerance, and self-organization]
11 Pastis data structures
Data structures are similar to the Unix file system:
- inodes are stored in modifiable DHT blocks (UCBs)
- file contents are stored in immutable DHT blocks (CHBs)
[Figure: a file inode (UCB) holds metadata and the addresses of the content blocks CHB1 and CHB2; each block maps to a replica set in the DHT address space]
12 Pastis data structures (cont.)
- directories contain <name, key> entries
- indirect blocks are used for large files
(a sketch of these structures follows)
[Figure: a directory inode (UCB) points to a CHB holding the entries (file1, key1), (file2, key2), ...; file1's inode points to content CHBs and to an indirect block (CHB) whose entries point to further content CHBs; old content CHBs remain in the DHT]
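To make the block layout concrete, here is a minimal illustrative sketch of the structures above; the field names are assumptions, not Pastis's actual classes:

    // An inode lives in a UCB and points to immutable content blocks
    // (CHBs) by their DHT keys.
    import java.util.List;

    class Inode {                    // stored in a UCB
        long version;                // incremented on every update
        long size;
        List<byte[]> blockKeys;      // CHB keys of direct content blocks
        byte[] indirectKey;          // CHB key of an indirect block (large files)
    }

    class DirectoryEntry {           // directories are files whose CHBs
        String name;                 // hold <name, inode key> entries
        byte[] inodeKey;             // UCB key of the entry's inode
    }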
13 DHT blocks used by Pastis
Content Hash Block (CHB):
- key = H(contents), so the block is self-verifying
- the block is immutable
User Certificate Block (UCB):
- key = H(K_block_public)
- contents and version number are signed by the writer (K_writer_private); the version number is used for consistency
- a certificate (K_writer_public, K_block_public, expiration date, signed with K_block_private) proves write access
- the block is modifiable
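The two verification rules above can be sketched with the JDK crypto API; the algorithm choices (SHA-1, RSA) are assumptions for illustration, and certificate checking is omitted:

    import java.security.MessageDigest;
    import java.security.PublicKey;
    import java.security.Signature;

    class BlockVerifier {
        // A CHB is valid iff its DHT key equals the hash of its contents.
        static boolean verifyChb(byte[] key, byte[] contents) throws Exception {
            byte[] h = MessageDigest.getInstance("SHA-1").digest(contents);
            return MessageDigest.isEqual(h, key);
        }

        // A UCB is valid iff the writer's signature covers contents + version.
        static boolean verifyUcb(byte[] contents, long version, byte[] sig,
                                 PublicKey writerKey) throws Exception {
            Signature s = Signature.getInstance("SHA1withRSA");
            s.initVerify(writerKey);
            s.update(contents);
            s.update(Long.toString(version).getBytes());
            return s.verify(sig);
        }
    }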
14 Pastis updates
When the user modifies the file contents, the DHT is accessed as follows:
1. retrieve the CHB storing the old file contents → CHB get
2. create a new CHB and store it in the DHT → CHB put
3. update the pointer in the file inode and increase its version number → UCB put
[Figure: the file inode (UCB) goes from version 1, pointing to CHB1 (old contents), to version 2, pointing to CHB2 (new contents) for the block storing the modified offset]
* the old CHB is kept in the DHT to allow relaxed consistency models
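A sketch of this three-step path, reusing the Dht and Inode sketches above; inodeKeyOf() and signAndSerialize() are hypothetical stand-ins for the UCB key derivation and signing described on slide 13:

    import java.security.MessageDigest;

    abstract class Updater {
        abstract byte[] inodeKeyOf(Inode inode);        // H(K_block_public)
        abstract byte[] signAndSerialize(Inode inode);  // sign with K_writer_private

        void update(Dht dht, Inode inode, int blockIndex, byte[] data, int offset)
                throws Exception {
            byte[] contents = dht.get(inode.blockKeys.get(blockIndex)); // 1. CHB get
            System.arraycopy(data, 0, contents, offset, data.length);   // apply write
            byte[] newKey = MessageDigest.getInstance("SHA-1").digest(contents);
            dht.put(newKey, contents);                // 2. CHB put; the old CHB
                                                      //    stays in the DHT
            inode.blockKeys.set(blockIndex, newKey);  // 3. repoint the inode,
            inode.version++;                          //    bump its version,
            dht.put(inodeKeyOf(inode), signAndSerialize(inode)); // and UCB put
        }
    }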
15 Pastis consistency models
Close-to-open [AFS]:
- update propagation is deferred until the file is closed
- a file open must return the most recently closed version in the system
- updates from other clients are ignored while the file is open
- implementation: on file open, fetch all inode replicas from the DHT and keep the most recent version; on file close, update the inode replicas in the DHT
Read-your-writes [Bayou]:
- a file open must return a version consistent with previous local writes
- implementation: on file open, fetch one inode replica not older than those accessed previously
Read-your-writes is useful when files are seldom shared.
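Both open() strategies reduce to a version comparison over the fetched inode replicas; a minimal sketch, reusing the Inode class above:

    import java.util.List;

    class OpenStrategies {
        // Close-to-open: fetch all inode replicas, keep the highest version.
        static Inode openCloseToOpen(List<Inode> replicas) {
            Inode best = replicas.get(0);
            for (Inode r : replicas)
                if (r.version > best.version) best = r;
            return best;
        }

        // Read-your-writes: any replica at least as new as the versions this
        // client has already read or written is acceptable.
        static Inode openReadYourWrites(List<Inode> replicas, long lastSeen) {
            for (Inode r : replicas)
                if (r.version >= lastSeen) return r;
            return null; // all fetched replicas are stale: retry with others
        }
    }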
16 PASTIS EVALUATION
17 Pastis architecture
Prototype:
- uses FreePastry (open-source PAST implementation)
- two interfaces for applications: NFS loop-back server, Java class
- ~10,000 lines of Java
[Figure: prototype architecture]
18 Pastis evaluation
Emulation:
- prototype runs on a local cluster
- a router adds constant latency between the nodes (controlled environment)
LS3 simulator [Busca]:
- intercepts calls to the transport layer and adds latency (sphere topology)
- executes the same upper-layer code
- up to 80,000 DHT nodes with 2 GB of RAM
Andrew Benchmark:
- Phase 1: create subdirectories
- Phase 2: write to files
- Phase 3: read file attributes
- Phase 4: read file contents
- Phase 5: execute "make"
[Figure: nodes N1-N4 connected through the latency-adding router]
19 Pastis performance with concurrent clients
Configuration:
- 16 DHT nodes
- 100 ms constant inter-node latency (Dummynet)
- 4 replicas per object
- close-to-open consistency
Ivy's read overhead increases rapidly with the number of users, since the client must retrieve the records of more logs.
[Figure: normalized execution time [sec.] vs. number of concurrent users, each running an independent benchmark, every user writing to the file system]
20 Pastis consistency models
Configuration:
- 16 DHT nodes
- 100 ms constant inter-node latency (Dummynet)
- 4 replicas per object

Performance penalty compared to NFS (close-to-open consistency):

    Pastis                  1.9
    Ivy [OSDI'02]           2.0 - 3.0
    Oceanstore [FAST'03]    2.55

[Figure: execution time [sec.] per benchmark phase (dirs, write, attr., read, make) for Pastis (close-to-open), Pastis (read-your-writes), and NFSv3]
21 Impact of replication factor
Configuration:
- 100 DHT nodes
- transit-stub topology (Modelnet): 200 AS, 50 stubs
- node bandwidth: 10 Mbit/s
- close-to-open consistency
A higher replication factor means more replicas are transmitted when inserting a CHB/UCB or when retrieving a UCB.
[Figure: execution time [sec.] vs. replication factor]
22 Impact of network size
Configuration:
- sphere topology (LS3 simulator)
- 300 ms max latency
- 16 replicas per object
- close-to-open consistency
Execution time remains essentially constant as the network grows: in the PAST DHT, the lookup latency is mainly due to the last hop.
[Figure: execution time [sec.] vs. number of DHT nodes]
23 Pastis evaluation summary
- graceful degradation with the number of concurrent users (cf. Ivy)
- slightly faster than Ivy and Oceanstore with one active Andrew Benchmark
- the read-your-writes model is 30% faster than close-to-open
- latency increases with the replication factor
- latency remains constant with DHT size
24 PREDICTING DURABILITY
25 Durability vs. availability
Data availability:
- metric: percentage of time that data is accessible (e.g., 99.9%)
- depends on the rate of temporary replica disconnections
Data durability:
- metric: probability of data loss (e.g., 0.1% one year after insertion)
- depends on the rate of disk failures
Data can be stored durably despite being temporarily unavailable (i.e., all replicas off-line).
Both problems are studied in the thesis; only durability is discussed in this talk.
26 Durability in P2P systems
Problem: restoring a crashed hard disk may take a long time.
Example: with 300 GB per node and a 1 Mbit/s connection, restoring the hard disk contents takes about 1 month.
Classical solution:
- replicate objects on different nodes (to avoid correlated failures)
- restore replicas after a disk crash
[Figure: timeline showing replica 1 crashing at t_c1 and finishing its 1-month repair at t_r1; if replica 2 crashes at t_c2 before the repair completes, the data is irrecoverably lost]
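The one-month figure is just capacity divided by bandwidth; as a sanity check:

    T_{restore} = \frac{b}{bw}
                = \frac{300 \times 8 \times 10^9 \ \text{bits}}{10^6 \ \text{bits/s}}
                = 2.4 \times 10^6 \ \text{s} \approx 28 \ \text{days}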
27 Analytical model for durability
Goal: create an analytical model to predict the probability of data loss (instead of using simulations).
Benefits:
- the model can be used to determine the minimum number of replicas needed to achieve a given level of durability
- the model avoids re-running simulations each time the system parameters change (repair bandwidth, node capacity, etc.)
28 Using Markov chains to predict durability
Model the number of replicas present on the network:
- the chain models one data object (replicated k times)
- k+1 states, 0 to k
- state transitions: replica failure → from state i to i-1 with rate λ_i; replica repair → from state i to i+1 with rate µ_i
- state probabilities at t = 0: P_k(0) = 1, P_i(0) = 0 for i < k
- P_LOSS(t) = P_0(t), the probability of having reached the absorbing state 0 by time t
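The slide does not spell out the resulting differential equations, but they follow directly from the transitions it defines (standard Kolmogorov forward equations for a birth-death chain with absorbing state 0):

    \frac{dP_k}{dt} = \mu_{k-1} P_{k-1} - \lambda_k P_k
    \frac{dP_i}{dt} = \lambda_{i+1} P_{i+1} + \mu_{i-1} P_{i-1} - (\lambda_i + \mu_i) P_i  \qquad (0 < i < k)
    \frac{dP_0}{dt} = \lambda_1 P_1

with P_LOSS(t) = P_0(t).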
29 Estimating the model parameters
Express the transition rates using coefficients α_i and β_i.
Failure rates:
- state 1 (one live replica): λ = 1 / MTBF
- state i (i live replicas): λ_i = β_i · λ, with β_i = i
Repair rates:
- state i = k-1 (one missing replica): µ = 1 / T_repair, with T_repair = ?
- state i = k-m (m missing replicas): µ_{k-m} = α_m · µ, with α_m = ?
What are the values of T_repair and α_m?
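In compact form, my reading of the slide's coefficients gives the full set of rates (with β_1 = 1 and α_1 = 1 so that the single-replica cases reduce to λ and µ):

    \lambda_i = \beta_i \, \lambda = \frac{i}{\text{MTBF}} \quad (1 \le i \le k), \qquad
    \mu_{k-m} = \alpha_m \, \mu = \frac{\alpha_m}{T_{repair}} \quad (1 \le m \le k-1)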
30 Parameters proposed by other authors
Ramabhadran et al. [INFOCOM'06]:
- α_m = m (m = number of missing replicas)
- no expression for T_repair (assumed to be configurable)
Chun et al. [NSDI'06]:
- α_m = 1
- T_repair = b / bw, where b is the node capacity [bytes] and bw the node bandwidth [bytes/sec]
Are these expressions for T_repair and α_m accurate? They are too simplistic.
[Figure: α_m vs. number of missing replicas m for the two proposals]
31 Our parameter estimation
Analytical estimation of T_repair must account for:
- multiple crashes: a node may crash again before it finishes restoring its disk
- bandwidth sharing: the uploader's bandwidth bw may be shared among several downloaders; this effect increases as the MTBF shrinks (crashes are more frequent → more nodes are in the restoring state)
Our value of T_repair depends on three parameters: b, bw, and MTBF.
Estimation of α_m:
- the coefficients α_m are very hard to obtain analytically
- we measure α_m with a simulator
32 Simulator
DHT protocol:
- simplified version of the PAST DHT
- object replicas stored on ring neighbors
- a crashed node restores objects sequentially
- each object is downloaded from a random source
Assumptions:
- symmetric node bandwidth
- node inter-failure times drawn from an exponential distribution with mean MTBF
- high node availability: temporary disconnections are ignored (a conservative choice)
33 Validation of our expression for T_repair
Configuration: bw = 1.5 Mbit/s, MTBF = 2 months, k = {3, 7}, b = variable
[Figure: T_repair [days] vs. node capacity b, comparing the analytical estimation against simulation; the estimation error is shown]
34 Measured values of α_m
Configuration: b = 250 GB per node, bw = 1.5 Mbit/s, MTBF = 2 months, k = {5, 7, 9}
The measured coefficients are captured by an empirical function f(m), which can be used to obtain the model parameters from the system characteristics b, bw, k, and MTBF.
[Figure: α_m vs. m, with the empirical fit f(m); the curves show a plateau, due to fewer remaining replicas to download the object from]
35 Predicted loss probability
Configuration: b = 250 GB per node, bw = 1.5 Mbit/s, MTBF = 2 months, k = 5
[Figure: P_LOSS(t) vs. time t after insertion [years]]
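For completeness, here is a minimal sketch of how a curve like this can be computed from the chain of slide 28 by simple Euler integration. T_repair, the α_m values, and all the constants in main() are illustrative assumptions, not the thesis's numbers:

    // Compute P_LOSS(t) = P_0(t) by integrating the forward equations of the
    // birth-death chain. Rates follow slide 29: lambda_i = i/MTBF and
    // mu_{k-m} = alpha[m]/T_repair; alpha[] must be supplied, e.g. from the
    // empirical f(m).
    public class LossProbability {
        // k: replication factor; mtbf, tRepair, horizon, dt in the same unit.
        static double pLoss(int k, double mtbf, double tRepair,
                            double[] alpha, double horizon, double dt) {
            double[] p = new double[k + 1];
            p[k] = 1.0;                                   // all replicas alive at t=0
            for (double t = 0; t < horizon; t += dt) {
                double[] next = p.clone();
                for (int i = 1; i <= k; i++) {
                    double lambda = i / mtbf;             // failure rate in state i
                    double mu = (i < k) ? alpha[k - i] / tRepair : 0.0; // repair rate
                    next[i]     -= dt * (lambda + mu) * p[i];
                    next[i - 1] += dt * lambda * p[i];    // failure: i -> i-1
                    if (i < k) next[i + 1] += dt * mu * p[i]; // repair: i -> i+1
                }
                p = next;                                 // state 0 is absorbing
            }
            return p[0];                                  // P_LOSS(horizon)
        }

        public static void main(String[] args) {
            // Illustrative numbers only: k = 5, MTBF = 60 days, T_repair = 15
            // days, alpha[m] = m (Ramabhadran et al.), horizon = 1 year.
            double[] alpha = {0, 1, 2, 3, 4};
            System.out.println(pLoss(5, 60.0, 15.0, alpha, 365.0, 0.01));
        }
    }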
36 Conclusion
Pastis: a new P2P file system
- completely decentralized
- scalable in the number of nodes and users
- relaxed consistency models: close-to-open, read-your-writes
- performance equal to or better than Ivy and Oceanstore
Durability prediction:
- Markov chains are used to estimate the probability of object loss
- previously proposed model parameters are inaccurate
- a more accurate parameter set, based on a new analytical expression and empirical data
37 Future work
Pastis:
- explore new consistency models
- flexible replica location
- evaluation in a wide-area testbed (PlanetLab)
Durability:
- refine the model to consider other DHT protocols
- impact of temporary node disconnections
38 Questions?