JTE HPC/FS
Pastis: a peer-to-peer file system for persistent large-scale storage
Jean-Michel Busca, Fabio Picconi, Pierre Sens
LIP6, Université Paris 6 – CNRS, Paris, France
INRIA, Rocquencourt, France
Outline
1. DHT-based File Systems
2. Pastis
3. Performance evaluation
Distributed file systems

                                 architecture
scalability (number of nodes)    client-server    P2P
LAN (~100)                       NFS              -
Organization (~10,000)           AFS              FARSITE, Pangaea
Internet (~1,000,000)            -                Ivy*, Oceanstore*, Pastis*

* uses a Distributed Hash Table (DHT) to store data
Distributed Hash Tables
(diagram: nodes with IDs 18, 24, 32, 40, 52, 66, 75, 83, 91 forming a DHT)
DHTs
Nodes spread across continents (North and South America, Europe, Asia, Australia) are mapped onto a logical address space, forming an overlay network.
There can be high latency and low bandwidth between logical neighbors.
(diagram: the node IDs from the previous slide placed on a world map)
Insertion of blocks in DHT
put(8959, block) routes the block to the root of key 8959, the node whose ID is numerically closest to the key; replicas are stored on the nodes closest to the key in the address space (here 8957 and 895D).
(diagram: address space with nodes 04F2, 3A79, 5230, 834B, 8909, 8954, 8957, 895D, 8BB2, AC78, C52A, E25A)
Retrieval of blocks in DHT
get(8959) is routed the same way and can be answered by any of the replicas of key 8959.
(same diagram as the previous slide)
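The put/get routing described on the last two slides can be sketched as a toy in-memory DHT. This is purely illustrative (not the Past API): the class name, node IDs, and replication factor are assumptions; the key property shown is that a block lands on the nodes whose IDs are numerically closest to its key.

```python
# Toy DHT sketch: a block stored under key k lives on the nodes whose
# IDs are numerically closest to k (the key's "root" and its neighbors).
class ToyDHT:
    def __init__(self, node_ids, replicas=2):
        self.nodes = sorted(node_ids)               # logical address space
        self.replicas = replicas
        self.store = {n: {} for n in self.nodes}    # per-node block store

    def _closest(self, key, k):
        # the k nodes with IDs numerically closest to the key
        return sorted(self.nodes, key=lambda n: abs(n - key))[:k]

    def put(self, key, block):
        # route to the root and its neighbors; each stores a replica
        for node in self._closest(key, self.replicas):
            self.store[node][key] = block

    def get(self, key):
        # any replica holder can answer the request
        for node in self._closest(key, self.replicas):
            if key in self.store[node]:
                return self.store[node][key]
        return None

dht = ToyDHT([0x04F2, 0x3A79, 0x5230, 0x834B, 0x8909, 0x8954,
              0x8957, 0x895D, 0x8BB2, 0xAC78, 0xC52A, 0xE25A])
dht.put(0x8959, b"block data")
assert dht.get(0x8959) == b"block data"
```

With these IDs, key 0x8959 is replicated on 0x8957 and 0x895D, matching the example in the slides; a real DHT routes the request hop by hop instead of scanning all nodes.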
P2P file systems architecture

FS layer (Ivy / Pastis): files and directories, read-write access semantics, security and access control; exposes open(), read(), write(), close(), etc.
DHT layer (DHash / Past): block store and message routing, providing put(key, block) and block = get(key); brings scalability, fault tolerance, and self-organization.
DHT-based file systems

Ivy [OSDI'02]
- log-based, one log per user (user A's log, user B's log, ... stored as DHT objects)
- fast writes, slow reads
- limited to a small number of users

Oceanstore [FAST'03]
- updates serialized by primary replicas, then propagated to secondary replicas
- partially centralized system
- BFT agreement protocol requires well-connected primary replicas
Pastis
Pastis design

Design goals:
- simple
- completely decentralized
- scalable (network size and number of users)

Pastis (the FS layer) is built on Past (DHT storage: put(key, block), block = get(key)) and Pastry (routing).
Pastis data structures

Data structures are similar to the Unix file system:
- inodes are stored in modifiable DHT blocks (User Certificate Blocks, UCBs)
- file contents are stored in immutable DHT blocks (Content Hash Blocks, CHBs)

A file's inode UCB, located in the DHT address space by its inode key and stored on a replica set, holds the file metadata and the addresses of the content CHBs (CHB1, CHB2, ...).
Pastis data structures (cont.)

- directories contain entries: a directory inode (UCB) points to a CHB holding (name, inode key) pairs, e.g. (file1, key1), (file2, key2), ...
- large files use indirect blocks: the file inode points to an indirect CHB, which in turn points to the content CHBs
- since CHBs are immutable, updates insert new content CHBs; the old contents remain as unreferenced blocks
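The inode and directory layouts above can be sketched as plain data structures. Field names here are illustrative, not Pastis's actual on-wire format: the point is that an inode (mutable UCB) stores metadata plus the DHT keys of immutable content blocks, while a directory's contents map names to inode keys.

```python
# Hypothetical sketch of Pastis-like on-DHT data structures.
from dataclasses import dataclass, field

@dataclass
class Inode:                    # stored in a UCB under a fixed inode key
    metadata: dict              # size, timestamps, permissions, ...
    block_keys: list = field(default_factory=list)  # addresses of content CHBs

@dataclass
class Directory:                # serialized into an immutable CHB
    entries: dict = field(default_factory=dict)     # name -> inode key

root = Directory()
root.entries["file1"] = "key1"  # file1's inode is stored under key1
inode = Inode(metadata={"size": 4}, block_keys=["chb1", "chb2"])
```

For a large file, `block_keys` would point at an indirect CHB that itself lists content-block keys, mirroring Unix indirect blocks.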
Content Hash Block

- block key = Hash(block contents)
- the contents determine the key, so any modification of the block can be detected
- the block is immutable
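The self-certifying property of a CHB is easy to demonstrate. The slide only says Hash(); SHA-1 below is an assumption for illustration.

```python
# Sketch of a content-hash block (CHB): the key is the hash of the
# contents, so any holder of the key can verify the block's integrity.
import hashlib

def chb_key(contents: bytes) -> str:
    return hashlib.sha1(contents).hexdigest()   # hash function assumed

def verify_chb(key: str, contents: bytes) -> bool:
    # detects a modified (or corrupted) block
    return chb_key(contents) == key

data = b"file contents"
key = chb_key(data)
assert verify_chb(key, data)
assert not verify_chb(key, b"tampered contents")
```

This is also why CHBs must be immutable: changing the contents necessarily changes the key, i.e. produces a different block.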
User Certificate Blocks

Keys:
- (KBpub, KBpriv): key pair associated with each block; the block key = Hash(KBpub)
- (KUpub, KUpriv): key pair associated with each user

Certificate:
- grants write access to a given user (identified by KUpub)
- issued by the file owner, signed with KBpriv
- an expiration date allows access revocation

Authentication of a UCB (which carries the inode contents, a timestamp, the certificate, and a signature made with KUpriv):
- verify the signature of the certificate using the storage key (KBpub)
- verify the signature of the UCB using KUpub
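The two-step check above can be sketched as follows. Real Pastis uses asymmetric signatures (KBpub/KBpriv per block, KUpub/KUpriv per user); the keyed hash below is a toy stand-in where "public" and "private" keys coincide, used purely to show the verification logic. All names are hypothetical.

```python
# Toy sketch of UCB write-access checking (NOT real cryptography:
# sign/verify share one secret here, standing in for a key pair).
import hashlib

def sign(key: bytes, msg: bytes) -> str:
    return hashlib.sha256(key + msg).hexdigest()

def verify(key: bytes, msg: bytes, sig: str) -> bool:
    return sign(key, msg) == sig

def issue_certificate(kb_key: bytes, ku_key: bytes, expires: int) -> dict:
    # the file owner (holder of KBpriv) grants write access to the
    # user identified by KUpub, with an expiration date
    body = ku_key + str(expires).encode()
    return {"ku": ku_key, "expires": expires, "sig": sign(kb_key, body)}

def check_update(cert, kb_key, ku_key, inode: bytes, sig: str, now: int) -> bool:
    # 1. certificate signed with the block's storage key and not expired
    body = cert["ku"] + str(cert["expires"]).encode()
    if not verify(kb_key, body, cert["sig"]) or now > cert["expires"]:
        return False
    # 2. UCB contents signed by the authorized user
    return verify(ku_key, inode, sig)

kb, ku = b"KB-secret", b"KU-secret"
cert = issue_certificate(kb, ku, expires=2000)
sig = sign(ku, b"inode v1")
assert check_update(cert, kb, ku, b"inode v1", sig, now=1000)
assert not check_update(cert, kb, ku, b"inode v1", sig, now=3000)  # expired
```

The expiration check is what makes revocation possible without contacting every replica: the owner simply stops renewing the certificate.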
Pastis – Update handling

Updating a file:
1. insert the new file contents into the DHT as new CHBs (e.g. CHB4 holding "foo bar" replaces CHB3 holding "foo")
2. update the file inode so its block addresses point to the new CHBs (@CHB3 becomes @CHB4)
3. reinsert the inode UCB into the DHT

The directory inode (UCB1) and directory contents (CHB1, mapping file1 to UCB2, file2 to UCB3, ...) are unchanged; only the file's inode (UCB2) and its content blocks are replaced.
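The three update steps can be sketched against a simple key-to-block store. Names (`put_chb`, `write_file`, `ucb2`) are hypothetical; the point shown is that only new CHBs and the reinserted inode UCB touch the DHT, while old content blocks are simply left unreferenced.

```python
# Sketch of the Pastis update path over a dict standing in for the DHT.
import hashlib

dht = {}                                         # key -> block

def put_chb(contents: bytes) -> str:
    key = hashlib.sha1(contents).hexdigest()     # CHB key = Hash(contents)
    dht[key] = contents
    return key

def write_file(inode_key: str, new_contents: bytes) -> None:
    new_chb = put_chb(new_contents)              # 1. insert the new CHB
    inode = dict(dht[inode_key])
    inode["block_keys"] = [new_chb]              # 2. point the inode at it
    dht[inode_key] = inode                       # 3. reinsert the inode UCB

dht["ucb2"] = {"block_keys": [put_chb(b"foo")]}  # file inode, contents "foo"
write_file("ucb2", b"foo bar")
assert dht[dht["ucb2"]["block_keys"][0]] == b"foo bar"
```

Note that the old CHB (holding "foo") is still present in the store but no longer reachable from the inode, matching the "old contents" blocks in the slides.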
Pastis – Consistency

Strict consistency is too expensive: it requires too many network accesses.

Close-to-open consistency:
- open() returns the latest version of the file committed by close()
- between open() and close(), a user only sees his own updates
- writes are deferred until the file is closed

Example: client A opens, reads '1', then writes '2'; the write is cached until close (the CHBs and the inode UCB are stored in a local buffer). Meanwhile client B opens and still reads '1'. When A closes, the write is sent to the network (the CHBs and UCB are inserted into the DHT). A "close-to-open" path makes updates visible: when B next opens, it retrieves the inode from the DHT and reads '2'.

Still quite expensive: an open requires retrieving the most up-to-date inode replica.
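The A/B scenario above can be sketched with a minimal client class (hypothetical, with a dict standing in for the DHT): writes stay in a local buffer until close(), and open() fetches the committed state from the network.

```python
# Sketch of close-to-open consistency: writes are buffered locally
# and only become visible to other clients on a close()/open() pair.
class CTOFile:
    def __init__(self, dht: dict, inode_key: str):
        self.dht, self.inode_key = dht, inode_key
        self.buffer = b""

    def open(self):
        # fetch the most up-to-date committed inode from the DHT
        self.buffer = self.dht.get(self.inode_key, b"")

    def read(self) -> bytes:
        return self.buffer               # reflects only local writes

    def write(self, data: bytes):
        self.buffer = data               # cached, not yet sent

    def close(self):
        self.dht[self.inode_key] = self.buffer   # commit (CHBs + UCB)

dht = {}
a, b = CTOFile(dht, "f"), CTOFile(dht, "f")
a.open(); a.write(b"2")
b.open()
assert b.read() == b""                   # A's write is still cached
a.close()
b.open()
assert b.read() == b"2"                  # visible after close-to-open
```

The cost noted on the slide shows up in `open()`: in the real system it must locate the newest inode replica in the DHT, not just read a local dict.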
Pastis – Consistency (cont.)

Read-your-writes consistency, a relaxation of the close-to-open model:
- a read() must reflect the client's previous local writes
- writes from other clients may or may not be visible

Example: client B opens, writes '2', closes, then opens again and reads '2' (the read must reflect its own previous writes). Client A, opening concurrently, may still read '1': A's read need not reflect B's writes.

An open no longer requires retrieving the most up-to-date inode replica: it is enough to fetch one inode replica not older than those accessed previously.
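The cheaper open rule can be sketched in a few lines (function and version scheme are hypothetical): the client remembers the highest inode version it has seen, and any replica at least that fresh is acceptable.

```python
# Sketch of a read-your-writes open: accept any inode replica whose
# version is not older than what this client has already observed.
def open_ryw(replicas, last_seen_version):
    # replicas: (version, inode) pairs gathered from the replica set
    for version, inode in replicas:
        if version >= last_seen_version:     # fresh enough for this client
            return version, inode
    raise RuntimeError("no sufficiently fresh replica reachable")

replicas = [(3, "inode v3"), (5, "inode v5")]   # some replicas are stale
assert open_ryw(replicas, 3) == (3, "inode v3") # a stale-but-seen replica is fine
assert open_ryw(replicas, 4) == (5, "inode v5") # must skip too-old replicas
```

This is why read-your-writes is cheaper than close-to-open: the client can stop at the first acceptable replica instead of querying enough replicas to find the globally newest one.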
Evaluation
Evaluation

Prototype:
- programmed in Java
- client interfaces: NFS, FUSE

Test program: Andrew Benchmark
- Phase 1: create subdirectories
- Phase 2: copy files
- Phase 3: read file attributes
- Phase 4: read file contents
- Phase 5: make command

Emulation: a LAN with one DHT node per machine; a DummyNet router emulates WAN latencies.
Simulation: LS3, a discrete-event simulator, simulates overlay network latency.
Pastis performance with concurrent clients

Configuration:
- 16 DHT nodes
- 100 ms constant inter-node latency (DummyNet)
- 4 replicas per object
- close-to-open consistency

(graph: normalized execution time [sec.] vs. number of concurrent clients, each running an independent benchmark, every user reading and writing to the FS)

Ivy's read overhead increases rapidly with the number of users, because the client must retrieve the records of more logs.
Pastis consistency models

Configuration: 16 DHT nodes, 100 ms constant inter-node latency (DummyNet), 4 replicas per object.

(graph: Andrew Benchmark execution time [sec.] per phase — dirs, write, attr., read, make — for Pastis close-to-open, Pastis read-your-writes, and NFSv3)

Performance penalty compared to NFS (close-to-open):

Pastis                  1.9
Ivy [OSDI'02]           2.0 – 3.0
Oceanstore [FAST'03]    2.55
Evaluation: consistency models

Simulation: N = 32768, sphere topology, max. latency 300 ms, k = 16.

(graph: close-to-open (CTO) vs. read-your-writes (RYW) with 10% stale UCB replicas)
Conclusion

Pastis is:
- simple
- completely decentralized (cf. Oceanstore)
- scalable in the number of users (cf. Ivy)
- good performance, thanks to PAST/Pastry's locality properties and relaxed consistency models (close-to-open, read-your-writes)

Future work:
- explore new consistency models
- flexible replica location
- evaluation in a wide-area testbed (PlanetLab)
Links

Pastis: http://regal.lip6.fr/projects/pastis
Pastry, Past: http://freepastry.rice.edu
LS3: http://regal.lip6.fr/projects/pastis/ls3
Questions?
Pastis design (backup slide)
(diagram: the Pastis FS layered on the Past/Pastry overlay over the Internet, showing block distribution and root replication)