Phenix Workshop on Global Computing Systems: Pastis, a peer-to-peer file system for persistent large-scale storage, by Fabio Picconi (advisor: Pierre Sens).

Presentation transcript:

1 Phenix Workshop on Global Computing Systems. Pastis, a peer-to-peer file system for persistent large-scale storage. Fabio Picconi, advisor: Pierre Sens. Dec. 7, 2006. Laboratoire d’Informatique de Paris 6 (LIP6 Computer Science Lab).

2 Outline
1. Introduction
2. Peer-to-peer file systems: DHTs, previous designs
3. Pastis file system: data structures, updates, consistency
4. Experimental evaluation: Pastis prototype, performance
5. Predicting durability: Markov chain model, parameter estimation problem
6. Conclusions

3 P2P storage system. Goals: large aggregate capacity; low cost (shared ownership); high durability (persistence); high availability; self-organization. Introduction – P2P file systems – Pastis – Evaluation – Durability – Conclusions

4 P2P storage system (cont.). Challenges: large-scale routing; high latency and low bandwidth; node disconnections and failures; consistency; security.

5 Distributed file systems. Scalability (number of nodes) vs. architecture:

scalability             client-server   P2P
LAN (100)               NFS             -
Organization (10,000)   AFS             FARSITE, Pangaea
Internet ( )            -               Ivy *, Oceanstore *

* uses a Distributed Hash Table (DHT) to store data

6 Distributed Hash Tables. Persistent storage service with a simple API: put(key, object) and object = get(key). Properties: decentralization; scalability (e.g., O(log N) routing); fault-tolerance (replication); self-organization. [Figure: overlay address space [0000, FFFF] with nodes 04F2, 620F, 85E0, AC78, C52A, E25A; put(8955, obj) stores obj replicas on the nodes closest to the key.]
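As a toy illustration of the put/get API above, here is a minimal in-memory DHT sketch (all names hypothetical, not the PAST API) that stores each object on the k nodes that follow its key on the ring:

```python
class ToyDHT:
    """Minimal in-memory DHT sketch: each object is replicated on the
    k nodes that follow its key clockwise on the overlay ring."""

    def __init__(self, node_ids, replicas=4):
        self.nodes = sorted(node_ids)            # overlay address space
        self.k = replicas
        self.store = {n: {} for n in self.nodes}

    def _replica_set(self, key):
        # first node at or after the key, then its k-1 successors (wrapping)
        idx = next((i for i, n in enumerate(self.nodes) if n >= key), 0)
        return [self.nodes[(idx + j) % len(self.nodes)] for j in range(self.k)]

    def put(self, key, obj):
        for n in self._replica_set(key):         # replicate on k nodes
            self.store[n][key] = obj

    def get(self, key):
        for n in self._replica_set(key):         # any live replica suffices
            if key in self.store[n]:
                return self.store[n][key]
        return None

dht = ToyDHT([0x04F2, 0x620F, 0x85E0, 0xAC78, 0xC52A, 0xE25A], replicas=4)
dht.put(0x8955, "obj")
assert dht.get(0x8955) == "obj"
```

Real DHTs add O(log N) routing to reach the replica set; this sketch only shows the placement and replication idea behind the put/get interface.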

7 DHT-based file systems. Ivy [OSDI’02]: log-based, one log per user; fast writes, slow reads; limited to a small number of users. Oceanstore [FAST’03]: updates serialized by primary replicas; partial centralization; BFT agreement requires well-connected primary replicas. [Figure: per-user logs (users A, B, C) stored in DHT blocks; primary and secondary replicas.]

8 P2P file systems.

system       decentralization   Internet scale   large number of users
Ivy          X                  X
Oceanstore                      X                X
FARSITE      X
Pangaea      X                                   X

None of these systems presents all three characteristics.

9 PASTIS FILE SYSTEM

10 Pastis file system. Characteristics: decentralization; scalability in the number of nodes and the number of users; high durability; performance; security. The Pastis layer abstracts away most problems due to distribution and large scale. Layered architecture: applications call Pastis through a POSIX-like API (open, read, write, lseek, etc.); Pastis provides directories, user groups, and ACLs, and calls the PAST DHT through put/get; PAST provides persistence, fault-tolerance, and self-organization.

11 Pastis data structures. Similar to the Unix file system: inodes are stored in modifiable DHT blocks (UCBs); file contents are stored in immutable DHT blocks (CHBs). [Figure: a file inode (UCB) holds metadata and block addresses pointing to content blocks CHB1 and CHB2; each block has a replica set in the DHT address space.]

12 Pastis data structures (cont.). Directories contain entries mapping names to inode keys (file1 → key1, file2 → key2, …); large files use indirect blocks. [Figure: a directory inode (UCB) points to a CHB holding its entries; a file inode points to content CHBs, directly or through an indirect-block CHB; old content CHBs remain in the DHT alongside the new ones.]

13 DHT blocks used by Pastis. Content Hash Block (CHB): key = Hash(contents); self-verifying; the block is immutable. User Certificate Block (UCB): key = H(K_block_public); the block is modifiable; its contents and version number (used for consistency) are signed with K_writer_private; a certificate containing K_writer_public, K_block_public, and an expiration date, signed with K_block_private, proves write access.
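The self-verifying property of CHBs can be sketched in a few lines; the choice of SHA-1 here is illustrative, not prescribed by the talk:

```python
import hashlib

def chb_key(contents: bytes) -> str:
    # A CHB's key is the hash of its contents: the block cannot change
    # without its key changing, which is what makes it immutable.
    return hashlib.sha1(contents).hexdigest()

def verify_chb(key: str, contents: bytes) -> bool:
    # Self-verifying: any reader can recompute the hash and compare it
    # to the key under which the block was fetched.
    return chb_key(contents) == key

key = chb_key(b"file contents v1")
assert verify_chb(key, b"file contents v1")
assert not verify_chb(key, b"tampered contents")
```

UCBs need the signature check instead, since their contents legitimately change under a fixed key.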

14 Pastis updates. A user modifies the file contents. DHT accesses: retrieve the CHB storing the old file contents → CHB get; create and store a new CHB in the DHT → CHB put; update the pointer in the file inode and increase its version number → UCB put. The old CHB is kept in the DHT to allow relaxed consistency models. [Figure: the file inode (UCB) at version 1 points to CHB1 (old contents); after the update, version 2 points to CHB2, the block storing the modified offset.]
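A minimal sketch of this update sequence, using a plain dict as the DHT and hypothetical inode field names; the hash function is illustrative:

```python
import hashlib

def chb_key(contents: bytes) -> str:
    # illustrative hash choice; any collision-resistant hash works
    return hashlib.sha1(contents).hexdigest()

def write_file(dht: dict, inode_key: str, new_contents: bytes) -> None:
    inode = dict(dht[inode_key])        # fetch the file inode (UCB get)
    new_key = chb_key(new_contents)
    dht[new_key] = new_contents         # CHB put: new immutable contents block
    inode["version"] += 1               # version number is increased
    inode["contents_key"] = new_key     # inode pointer now names the new CHB
    dht[inode_key] = inode              # UCB put: modified inode
    # note: the old CHB is deliberately left in the DHT

dht = {chb_key(b"v1"): b"v1",
       "inode-42": {"version": 1, "contents_key": chb_key(b"v1")}}
write_file(dht, "inode-42", b"v2")
assert dht["inode-42"]["version"] == 2
assert dht[dht["inode-42"]["contents_key"]] == b"v2"
assert chb_key(b"v1") in dht            # old contents still retrievable
```

Keeping the old CHB is what lets a lagging client still resolve an older inode version, which the relaxed consistency models below exploit.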

15 Pastis consistency models. Close-to-open [AFS]: defer update propagation until the file is closed; open must return the most recently closed version in the system; updates from other clients are ignored while the file is open. Implementation: on file open, fetch all inode replicas from the DHT and keep the most recent version; on file close, update the inode replicas in the DHT. Read-your-writes [Bayou]: open must return a version consistent with the client's previous local writes. Implementation: on file open, fetch one inode replica that is not older than those accessed previously. Read-your-writes is useful when files are seldom shared.
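The two open() policies can be sketched as follows (hypothetical helper names; replicas are represented as dicts carrying the signed version number):

```python
def open_close_to_open(replicas):
    """Close-to-open sketch: fetch all inode replicas and keep the one
    with the highest (signed) version number."""
    return max(replicas, key=lambda inode: inode["version"])

def open_read_your_writes(replicas, last_seen):
    """Read-your-writes sketch: any replica at least as recent as what
    this client has already observed is acceptable."""
    for inode in replicas:
        if inode["version"] >= last_seen:
            return inode
    raise IOError("no sufficiently recent inode replica reachable")

replicas = [{"version": 3}, {"version": 5}, {"version": 4}]
assert open_close_to_open(replicas)["version"] == 5
assert open_read_your_writes(replicas, 4)["version"] >= 4
```

The design trade-off is visible in the code: close-to-open must contact every replica on each open, while read-your-writes can stop at the first acceptable one, which is why it is faster when files are seldom shared.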

16 PASTIS EVALUATION

17 Pastis architecture. Prototype: uses FreePastry (an open-source PAST implementation); two interfaces for applications: an NFS loop-back server and a Java class; ~ lines of Java. [Figure: prototype architecture.]

18 Pastis evaluation. Emulation: the prototype runs on a local cluster; a router adds constant latency between nodes (controlled environment). LS3 simulator [Busca]: intercepts calls to the transport layer and adds latency (sphere topology); executes the same upper-layer code; up to DHT nodes with 2 GB RAM. Andrew Benchmark: phase 1, create subdirectories; phase 2, write to files; phase 3, read file attributes; phase 4, read file contents; phase 5, execute the "make" command. [Figure: router connecting nodes N1–N4.]

19 Pastis performance with concurrent clients. Configuration: 16 DHT nodes; 100 ms constant inter-node latency (Dummynet); 4 replicas per object; close-to-open consistency; every user writes to the file system, each running an independent benchmark. Ivy's read overhead increases rapidly with the number of users, since the client must retrieve the records of more logs. [Figure: normalized execution time [sec.] vs. number of users.]

20 Pastis consistency models. Configuration: 16 DHT nodes; 100 ms constant inter-node latency (Dummynet); 4 replicas per object. Performance penalty compared to NFS (close-to-open):

Pastis                 1.9
Ivy [OSDI’02]          2.0 – 3.0
Oceanstore [FAST’03]   2.55

[Figure: execution time [sec.] per Andrew Benchmark phase (dirs, write, attr., read, make) for Pastis (close-to-open), Pastis (read-your-writes), and NFSv3.]

21 Impact of replication factor. Configuration: 100 DHT nodes; transit-stub topology (Modelnet) with 200 ASes and 50 stubs; node bandwidth: 10 Mbit/s; close-to-open consistency. A higher replication factor means more replicas are transmitted when inserting a CHB/UCB or when retrieving a UCB. [Figure: execution time [sec.] vs. replication factor.]

22 Impact of network size. Configuration: sphere topology (LS3 simulator); 300 ms maximum latency; 16 replicas per object; close-to-open consistency. Latency remains stable because, in the PAST DHT, the lookup latency is mainly due to the last hop. [Figure: execution time [sec.] vs. number of DHT nodes.]

23 Pastis evaluation summary: graceful degradation with the number of concurrent users (cf. Ivy); slightly faster than Ivy and Oceanstore with one active Andrew Benchmark; the read-your-writes model is 30% faster than close-to-open; latency increases with the replication factor; latency remains constant as the DHT grows.

24 PREDICTING DURABILITY

25 Durability vs. availability. Data availability: metric: percentage of time that data is accessible (e.g., 99.9%); depends on the rate of temporary replica disconnections. Data durability: metric: probability of data loss (e.g., 0.1% one year after insertion); depends on the rate of disk failures; data can be stored durably despite being temporarily unavailable (i.e., all replicas off-line). Both problems are studied in the thesis; this talk only discusses durability.

26 Durability in P2P systems. Problem: restoring a crashed hard disk may take a long time. Example: with 300 GB per node and a 1 Mbit/s connection, the time needed to restore the hard disk contents is about 1 month. Classical solution: replicate objects on different nodes (avoiding correlated failures) and restore replicas after a disk crash. [Figure: timeline with crash times t_c1 and t_c2 and repair time t_r1; if replica 2 crashes during the month replica 1 spends restoring, the data loss is irrecoverable.]
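The one-month restore time in the example is simple arithmetic, which can be checked directly:

```python
# Sanity-check of the example above: 300 GB over a 1 Mbit/s link.
b_bits = 300 * 8 * 10**9        # 300 GB expressed in bits
bw = 10**6                      # 1 Mbit/s in bits per second
days = b_bits / bw / 86400      # seconds per day
assert 25 < days < 30           # ~27.8 days, i.e. roughly one month
```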

27 Analytical model for durability. Goal: create an analytical model to predict the probability of data loss, instead of using simulations. Benefits: the model can be used to determine the minimum number of replicas needed to achieve a given level of durability, and it avoids running new simulations each time system parameters change (repair bandwidth, node capacity, etc.).

28 Using Markov chains to predict durability. Model the number of replicas present in the network: the chain models one data object (replicated k times); it has k+1 states, 0 to k; state transitions: a replica failure moves the chain from state i to i-1 with rate λ_i, a replica repair from state i to i+1 with rate µ_i; initial state probabilities: P_k(0) = 1 and P_i(0) = 0 for i < k. Then P_LOSS(t) = P_0(t), the probability of having reached the absorbing state 0 by time t.
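The chain's loss probability can be computed numerically. The sketch below integrates the forward (Kolmogorov) equations with explicit Euler steps, using illustrative rates only: the talk's refined α_m coefficients are replaced here by the simplest constant-repair-rate assumption.

```python
def p_loss(k, lam, mu, t, steps=50000):
    """Integrate the replica-count chain's forward equations with explicit
    Euler steps. State i = number of live replicas; i -> i-1 at rate lam[i]
    (failure), i -> i+1 at rate mu[i] (repair); state 0 is absorbing.
    Returns P_0(t), the probability the object is lost by time t."""
    P = [0.0] * (k + 1)
    P[k] = 1.0                                   # P_k(0) = 1
    dt = t / steps
    for _ in range(steps):
        Pn = P[:]
        for i in range(1, k + 1):
            out = lam[i] + (mu[i] if i < k else 0.0)
            Pn[i] -= dt * out * P[i]
            Pn[i - 1] += dt * lam[i] * P[i]      # replica failure
            if i < k:
                Pn[i + 1] += dt * mu[i] * P[i]   # replica repair
        P = Pn
    return P[0]

# Illustrative rates (not the talk's measured parameters): lam_i = i / MTBF,
# and a constant repair rate 1 / T_repair in every degraded state (alpha_m = 1).
MTBF, T_repair, k = 60.0, 5.0, 5                 # all times in days
lam = [0.0] + [i / MTBF for i in range(1, k + 1)]
mu = [0.0] + [1.0 / T_repair] * k
loss_1y = p_loss(k, lam, mu, 365.0)
assert 0.0 < loss_1y < 1.0
```

Because state 0 is absorbing, P_0(t) only grows with t; the subsequent slides are precisely about choosing λ_i and µ_i (via T_repair and α_m) so that this curve matches reality.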

29 Estimating the model parameters. Express the transition rates using coefficients α_i and β_i. Failure rates: with one live replica (state 1), λ = 1 / MTBF; with i live replicas (state i), λ_i = β_i · λ, with β_i = i. Repair rates: with one missing replica (state i = k-1), µ = 1 / T_repair, but T_repair = ?; with m missing replicas (state i = k-m), µ_i = α_m · µ, with α_m = ? What are the values of T_repair and α_m?

30 Parameters proposed by other authors. Ramabhadran et al. [INFOCOM’06]: α_m = m (m = number of missing replicas); no expression for T_repair (assumed to be configurable). Chun et al. [NSDI’06]: α_m = 1; T_repair = b / bw, where b is the node capacity [bytes] and bw the node bandwidth [bytes/sec]. Are these expressions for T_repair and α_m accurate? They are too simplistic. [Figure: α_m vs. the number of missing replicas m for both proposals.]

31 Our parameter estimation. Analytical estimation of T_repair: it must account for multiple crashes (a node may crash again before it finishes restoring its disk) and for bandwidth sharing (the uploader's bandwidth bw may be shared among several downloaders; this effect grows as the MTBF shrinks, since more frequent crashes mean more nodes are in the restoring state). Our value of T_repair therefore depends on three parameters: b, bw, and MTBF. Estimation of α_m: the coefficients α_m are very hard to obtain analytically, so we measure them with a simulator.

32 Simulator. DHT protocol: a simplified version of the PAST DHT; object replicas are stored on ring neighbors; a crashed node restores its objects sequentially; each object is downloaded from a random source. Assumptions: symmetric node bandwidth; node inter-failure times drawn from an exponential distribution with mean MTBF; high node availability (temporary disconnections are ignored, a conservative choice).

33 Validation of our expression for T_repair. Parameters: bw = 1.5 Mbit/s, MTBF = 2 months, k = {3, 7}, b variable. [Figure: T_repair [days] vs. node capacity b, with the estimation error of the model against simulation.]

34 Measured values of α_m. Parameters: b = 250 GB per node, bw = 1.5 Mbit/s, MTBF = 2 months, k = {5, 7, 9}. An empirical function f(m) fits the measured α_m and can be used to obtain the model parameters from the system characteristics b, bw, k, and MTBF. The curve plateaus at large m because fewer replicas remain from which to download the object. [Figure: α_m vs. m, with the fitted f(m).]

35 Predicted loss probability. Parameters: b = 250 GB per node, bw = 1.5 Mbit/s, MTBF = 2 months, k = 5. [Figure: P_LOSS(t) vs. time t after insertion [years].]

36 Conclusion. Pastis: a new P2P file system: completely decentralized; scalable in the number of nodes and users; relaxed consistency models (close-to-open, read-your-writes); performance equal to or better than Ivy and Oceanstore. Durability prediction: Markov chains are used to estimate the probability of object loss; previously proposed model parameters are inaccurate; we derive a more accurate parameter set based on a new analytical expression and empirical data.

37 Future work. Pastis: explore new consistency models; flexible replica location; evaluation in a wide-area testbed (PlanetLab). Durability: refine the model to consider other DHT protocols and the impact of temporary node disconnections.

38 Questions?

39 Pastis updates (backup slide). A user modifies the file contents. DHT accesses: retrieve the CHB storing the old file contents → CHB get; create and store a new CHB in the DHT → CHB put; update the pointer in the file inode and increase its version number → UCB put. The old CHBs are kept in the DHT to allow relaxed consistency models.