Wide-Area Cooperative Storage with CFS Morris et al. Presented by Milo Martin UPenn Feb 17, 2004 (some slides based on slides by authors)

Overview Problem: content distribution Solution: distributed read-only file system Implementation Provide file system interface Using a Distributed Hash Table (DHT) Extend Chord with redundancy and caching

Problem: Content Distribution Serving static content with inexpensive hosts: open-source distributions, off-site backups, tech report archives (diagram: nodes scattered across the Internet)

Example: mirror open-source distributions Multiple independent distributions Each has high peak load, low average Individual servers are wasteful Solution: Option 1: single powerful server Option 2: distributed service But how do you find the data?

Assumptions Storage is cheap and plentiful Many participants Many reads, few updates Heterogeneous nodes Storage Bandwidth Physical locality Exploit for performance Avoid for resilience

Goals Avoid hot spots due to popular content Distribute the load (traffic and storage) “Many hands make light work” High availability Using replication Data integrity Using secure hashes Limited support for updates Self managing/repairing

CFS Approach Break content into immutable “blocks” Use blocks to build a file system Identify blocks via secure hash Unambiguous name for a block Self-verifying Distributed Hash Table (DHT) of blocks Distribute blocks Replicate and cache blocks Algorithm for finding blocks (e.g., Chord) Content search beyond the scope
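A minimal sketch (not the CFS code) of why content-hashed block names are self-verifying: the block's ID is the SHA-1 hash of its bytes, so a fetched block can be checked against the key it was fetched under.

```python
import hashlib

def block_id(data: bytes) -> str:
    # Name a block by the SHA-1 hash of its contents (content hashing).
    return hashlib.sha1(data).hexdigest()

def verify_block(key: str, data: bytes) -> bool:
    # A fetched block is genuine iff hashing it reproduces the key used to fetch it.
    return hashlib.sha1(data).hexdigest() == key
```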

Outline File system structure Review: Chord distributed hashing DHash block management CFS evaluation Brief overview of OceanStore Discussion

Hash-based Read-only File System Assume retrieval substrate put(data): inserts data with key Hash(data) get(h) -> data Assume “root” identifier known Build a read-only file system using this interface Based on SFSRO [OSDI 2000]
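A sketch of the assumed retrieval substrate, with a plain dictionary standing in for the DHT; the class and method names are illustrative, not the CFS API.

```python
import hashlib

class BlockStore:
    """Toy stand-in for the DHT: put() keys data by its content hash, get() verifies on read."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("block failed integrity check")
        return data
```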

Client-server interface Files have unique names Files are read-only (single writer, many readers) Publishers split files into blocks Clients check files for authenticity [SFSRO] (diagram: an FS client's insert file f / lookup file f at a server becomes insert block / lookup block among server nodes)

File System Structure Build file system bottom up Break files into blocks Hash each block Create a “file block” that points to hashes Create “directory blocks” that point to files Ultimately arrive at “root” hash SFSRO: blocks can be encrypted
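A sketch, under the assumption of the put()/get() substrate above, of building the tree bottom up; the helper names (publish_file, publish_directory) and the block encodings are hypothetical.

```python
BLOCK_SIZE = 8192  # CFS uses 8 KByte blocks

def publish_file(store, content: bytes) -> str:
    # Split into data blocks, insert each, then insert a "file block"
    # that is simply the ordered list of data-block hashes.
    data_keys = [store.put(content[i:i + BLOCK_SIZE])
                 for i in range(0, len(content), BLOCK_SIZE)]
    return store.put("\n".join(data_keys).encode())

def publish_directory(store, entries: dict) -> str:
    # entries maps file name -> file-block hash; the hash of the top-level
    # directory block plays the role of the file system's root hash.
    listing = "\n".join(f"{name} {key}" for name, key in sorted(entries.items()))
    return store.put(listing.encode())
```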

File System Example (diagram: a root block points to directory blocks, which point to file blocks, which point to data blocks)

File System Lookup Example Root hash verifies data integrity Recursive verification: if the root hash is correct, every block reachable from it can be verified Prefetch data blocks (diagram: lookup descends root block, directory block, file block, data blocks)

File System Updates Updates supported, but kludgey Single updater All or nothing updates Export a new version of the filesystem Requires a new root hash Conceptually rebuild file system Insert modified blocks with new hashes Introduce signed “root block” Key is hash of public key (not the hash of contents) Allow updates if they are signed with private key Sequence number prevents replay
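A sketch of the signed root-block update rule; check_sig is a hypothetical signature-verification callback and the record layout is an assumption. The points from the slide are that the root block's key is Hash(public key), not a content hash, and that the sequence number must increase.

```python
import hashlib

def root_key(public_key: bytes) -> str:
    # The root block is named by the hash of the public key, not of its
    # contents, so the same name can refer to successive versions.
    return hashlib.sha1(public_key).hexdigest()

def accept_root_update(stored, new, check_sig):
    # new = {"public_key": ..., "root_hash": ..., "seq": ..., "signature": ...}
    signed_bytes = f"{new['root_hash']}:{new['seq']}".encode()
    if not check_sig(new["public_key"], new["signature"], signed_bytes):
        return False              # must be signed with the matching private key
    if stored is not None and new["seq"] <= stored["seq"]:
        return False              # monotonic sequence number prevents replay
    return True
```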

Update Example (diagram: a new root block, keyed by Hash(public key), points to new directory, file, and data blocks; unmodified blocks are shared with the old version)

File System Summary Simple interface put(data): inserts data with key Hash(data) get(h) -> data put_root(sign_private(hash, seq#), public key): publishes a signed root block refresh(key): allows discarding of stale data Recursive integrity verification

Review of Goals and Assumptions Data integrity Using secure hashes Limited support for updates Distribute load; avoid hot spots High availability Self managing/repairing Exploit locality Assumptions Many participants Heterogeneous nodes File System Layer

Outline File system structure Review: Chord distributed hashing DHash block management CFS evaluation Brief overview of OceanStore Discussion

Server Structure DHash stores, balances, replicates, caches blocks DHash uses Chord [SIGCOMM 2001] to locate blocks (diagram: each node runs a DHash layer on top of a Chord layer)

Chord Hashes a Block ID to its Successor Circular ID space: nodes and blocks have randomly distributed IDs Successor: the node with the next-highest ID stores the block (diagram: nodes N10…N100 on the ring, each responsible for the block IDs between its predecessor and itself)
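A sketch of consistent hashing on the ring: IDs are 160-bit SHA-1 values, and a block's successor is the first node ID at or after the block ID, wrapping around.

```python
import hashlib
from bisect import bisect_left

RING = 2 ** 160  # Chord's circular 160-bit ID space

def chord_id(name: bytes) -> int:
    return int.from_bytes(hashlib.sha1(name).digest(), "big") % RING

def successor(sorted_node_ids, key):
    # First node whose ID is >= key, wrapping past the top of the ring.
    i = bisect_left(sorted_node_ids, key)
    return sorted_node_ids[i] if i < len(sorted_node_ids) else sorted_node_ids[0]
```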

Basic Lookup - Linear Time Lookups walk the ring to find the block ID’s predecessor; that node’s successor holds the block Correct as long as successor pointers are correct (diagram: “Where is block 70?” is forwarded node to node until the answer “N80” comes back)
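A sketch of the O(N) baseline, assuming hypothetical node objects that know only their own ID and their successor.

```python
def between(key, start, end):
    # True if key lies in the half-open ring interval (start, end],
    # handling the wrap past zero.
    if start < end:
        return start < key <= end
    return key > start or key <= end

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.successor = None   # set when the ring is assembled

def linear_lookup(start, key):
    # Walk successor pointers until the key falls between a node and its
    # successor; that successor is the block's home.
    node = start
    while not between(key, node.node_id, node.successor.node_id):
        node = node.successor
    return node.successor
```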

Successor Lists Ensure Robust Lookup Each node stores r successors, r = 2 log N Lookup can skip over dead nodes to find blocks (diagram: each node’s three-entry successor list, e.g., N5 holds 10, 20, 32)
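A small sketch of the robustness mechanism: with r cached successors, a lookup simply skips entries that do not respond (is_alive is an assumed liveness probe).

```python
def first_live_successor(successor_list, is_alive):
    # Return the first reachable node in the r-entry successor list.
    for node in successor_list:
        if is_alive(node):
            return node
    raise RuntimeError("all cached successors appear to have failed")
```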

Chord Finger Table Allows O(log N) Lookups (diagram: N80’s fingers reach ½, ¼, 1/8, 1/16, 1/32, 1/64, 1/128 of the way around the ring) See [SIGCOMM 2001] for table maintenance Reasonable lookup latency
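A sketch of why fingers give O(log N) hops: each node keeps pointers near n + 2^k, and each step jumps to the closest finger that precedes the key, roughly halving the remaining ID-space distance. It reuses the Node objects from the linear-lookup sketch, assuming each also has a fingers list that includes its successor; the between() helper is repeated for self-containment.

```python
def between(key, start, end):
    # Half-open ring interval (start, end], wrapping past zero.
    if start < end:
        return start < key <= end
    return key > start or key <= end

def closest_preceding_finger(node, key):
    # Scan fingers from farthest to nearest for one strictly between us and the key.
    for finger in reversed(node.fingers):
        if between(finger.node_id, node.node_id, key) and finger.node_id != key:
            return finger
    return node.successor   # fallback; unreachable if the successor is a finger

def finger_lookup(node, key):
    while not between(key, node.node_id, node.successor.node_id):
        node = closest_preceding_finger(node, key)
    return node.successor
```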

DHash/Chord Interface lookup(blockID) returns a list of node IDs closer in ID space to the block ID, sorted closest first (diagram: the DHash layer on a server calls Lookup(blockID) on its local Chord layer, which answers from its finger table)

DHash Uses Other Nodes to Locate Blocks (diagram: Lookup(BlockID=45) hops across several nodes on the ring before reaching the block’s successor)

Availability and Resilience: Replicate blocks at r successors Hashed IP addresses ensure independent replica failure High storage cost for replication; does this matter? (diagram: block 17 stored at its successor, with copies at the next nodes on the ring)
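A sketch of the placement rule: a block lives at its successor and the next r-1 nodes on the ring (node IDs derived from hashed IP addresses, so the replica set spans unrelated machines).

```python
from bisect import bisect_left

def replica_nodes(sorted_node_ids, block_id, r):
    # The block's successor plus the following r-1 nodes, wrapping around.
    i = bisect_left(sorted_node_ids, block_id)
    n = len(sorted_node_ids)
    return [sorted_node_ids[(i + k) % n] for k in range(r)]
```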

Lookups find replicas RPCs: 1. Lookup step 2. Get successor list 3. Failed block fetch 4. Block fetch (diagram: a lookup for block 17 falls back to a replica when the fetch from the original fails)

First Live Successor Manages Replicas A node can locally determine that it is the first live successor; it automatically repairs and re-replicates (diagram: a copy of block 17 is re-created on a new successor after a failure)

Reduce Overheads: Caches Along Lookup Path RPCs: 1. Chord lookup 2. Chord lookup 3. Block fetch 4. Send to cache Send the block only to the second-to-last hop in the routing path (diagram: a lookup for block 45 leaves a cached copy at the hop just before the block’s home node)

Caching at Fingers Limits Load N32 Only O(log N) nodes have fingers pointing to N32 This limits the single-block load on N32

Virtual Nodes Allow Heterogeneity Hosts may differ in disk/net capacity Hosts may advertise multiple IDs Chosen as SHA-1(IP Address, index) Each ID represents a “virtual node” Host load is proportional to the number of virtual nodes Manually controlled; automatic adaptation is possible (diagram: node A advertises N60, N10, N101, while node B advertises only N5)
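A sketch of deriving multiple virtual-node IDs from SHA-1(IP address, index) as the slide describes; the exact string encoding of the (IP, index) pair here is an assumption.

```python
import hashlib

def virtual_node_ids(ip_addr: str, count: int):
    # One 160-bit Chord ID per virtual node; a more capable host simply
    # advertises more of them and so owns a larger share of the ring.
    return [int.from_bytes(hashlib.sha1(f"{ip_addr}#{i}".encode()).digest(), "big")
            for i in range(count)]
```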

Physical Locality-aware Path Choice Each node monitors RTTs to its own fingers Pick the finger with the smallest RTT (a greedy choice) Tradeoff: ID-space progress vs. delay (diagram: Lookup(47) choosing among fingers with RTTs of 10 ms, 12 ms, 50 ms, and 100 ms)
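A sketch of the greedy, latency-aware hop choice: among fingers that still make ID-space progress toward the key, take the one with the lowest measured RTT. The rtt_ms table of round-trip times is an assumed input.

```python
RING = 2 ** 160

def makes_progress(finger_id, node_id, key):
    # True if the finger lies strictly between this node and the key on the ring.
    return 0 < (finger_id - node_id) % RING < (key - node_id) % RING

def next_hop(finger_ids, node_id, key, rtt_ms):
    candidates = [f for f in finger_ids if makes_progress(f, node_id, key)]
    return min(candidates, key=lambda f: rtt_ms[f]) if candidates else None
```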

Why Blocks Instead of Files? Cost: one lookup per block Can tailor cost by choosing a good block size Need prefetching for high throughput What is a good block size? Higher latency? Benefit: load balance is simple For large files, storage cost is spread out and popular files are served in parallel

Block Storage Long-term blocks are stored for a fixed time Publishers need to refresh periodically Cache uses LRU replacement (diagram: each node’s disk is split into a cache area and long-term block storage)
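A sketch of per-server storage, with hypothetical class and parameter names: long-term blocks expire unless the publisher refreshes them, while cached copies live in a bounded LRU area.

```python
import time
from collections import OrderedDict

class BlockServer:
    def __init__(self, max_cached_blocks=1024):
        self.long_term = {}                 # key -> (data, expiry timestamp)
        self.cache = OrderedDict()          # key -> data, in LRU order
        self.max_cached_blocks = max_cached_blocks

    def store_long_term(self, key, data, ttl_secs):
        # Publishers must call this again before ttl_secs elapses ("refresh").
        self.long_term[key] = (data, time.time() + ttl_secs)

    def cache_block(self, key, data):
        self.cache[key] = data
        self.cache.move_to_end(key)
        if len(self.cache) > self.max_cached_blocks:
            self.cache.popitem(last=False)  # evict the least recently used copy

    def drop_expired(self):
        now = time.time()
        self.long_term = {k: v for k, v in self.long_term.items() if v[1] > now}
```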

Preventing Flooding What prevents a malicious host from inserting junk data? Answer: not much Kludge: capacity limit (e.g., 0.1%) Per source/destination pair How real is this problem? Refresh requirement allows recovery
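A sketch of the per-publisher quota the slide calls a kludge; 0.1% is the example figure from the slide, and the byte-level accounting granularity here is an assumption.

```python
def admit_insert(bytes_already_stored_for_source, block_size, disk_capacity,
                 quota_fraction=0.001):
    # Refuse the insert if this source would exceed its share of this server's disk.
    return bytes_already_stored_for_source + block_size <= quota_fraction * disk_capacity
```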

Review of Goals and Assumptions Data integrity Using secure hashes Limited support for updates Distribute load; avoid hot spots High availability Self managing/repairing Exploit locality Assumptions Many participants Heterogeneous nodes File System Layer DHash/Chord Layer

Outline File system structure Review: Chord distributed hashing DHash block management CFS evaluation Brief overview of OceanStore Discussion

CFS Project Status Working prototype software Some abuse prevention mechanisms SFSRO file system client Guarantees authenticity of files, updates, etc. Some measurements on RON testbed Simulation results to test scalability

Experimental Setup (12 nodes) One virtual node per host 8 KByte blocks RPCs use UDP Caching turned off Proximity routing turned off (diagram: RON testbed hosts, with wide-area links to vu.nl, lulea.se, ucl.uk, kaist.kr, and a site in .ve)

CFS Fetch Time for 1MB File Average over the 12 hosts No replication, no caching; 8 KByte blocks (plot: fetch time in seconds vs. prefetch window in KBytes)

Distribution of Fetch Times for 1MB File (plot: fraction of fetches vs. time in seconds, for 8 KByte, 24 KByte, and 40 KByte prefetch windows)

CFS Fetch Time vs. Whole File TCP (plot: fraction of fetches vs. time in seconds, comparing a 40 KByte prefetch window against whole-file TCP)

Robustness vs. Failures Six replicas per block; (1/2)^6 is the chance that every replica of a block is lost when half the nodes fail (plot: fraction of failed lookups vs. fraction of failed nodes)
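The arithmetic behind the slide's figure, assuming independent node failures with failed fraction f and k replicas per block:

```latex
\Pr[\text{all } k \text{ replicas fail}] = f^{k},
\qquad f = \tfrac{1}{2},\; k = 6 \;\Rightarrow\; \left(\tfrac{1}{2}\right)^{6} = \tfrac{1}{64} \approx 1.6\%
```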

Much Related Work SFSRO (Secure file system, read only) Freenet Napster Gnutella PAST CAN …

Later Work: Read/Write Filesystems Ivy and Eliot Extend the idea to read/write file systems Full NFS or AFS-like semantics All the nasty issues of distributed file systems Consistency Partitioning Conflicts OceanStore (earlier work, actually)

Outline File system structure Review: Chord distributed hashing DHash block management CFS evaluation Brief overview of OceanStore Discussion

CFS Summary (from their talk) CFS provides peer-to-peer r/o storage Structure: DHash and Chord It is efficient, robust, and load-balanced It uses block-level distribution The prototype is as fast as whole-file TCP

OceanStore (Similarities to CFS) Ambitious global-scale storage “utility” built from an overlay network Similar goals: Distributed, replicated, highly available Explicit caching Similar solutions: Hash-based identifiers Multi-hop overlay-network routing

OceanStore (Differences) “Utility” model Implies some managed layers Handles explicit motivation of participants Many applications/interfaces: filesystem, web content, PDA synchronization, …

OceanStore (More Differences) Explicit support for updates Built into system at almost every layer Uses Byzantine commit Requires “maintained” inner ring Updates on encrypted data Plaxton tree based routing Fast, probabilistic method Slower, reliable method Erasure codes limit replication overhead

Aside: Research Approach OceanStore Original ASPLOS “vision paper” Paints the big picture Spent 3+ years fleshing it out Chord Evolution from Chord -> CFS -> Ivy More follow-on research by others

Issues and Discussion Kludge or pragmatic? CFS’s replication CFS’s caching (is “closer” server better?) CFS’s multiple virtual servers CFS’s Locality-based routing Are “root” blocks cacheable? Bandwidth is reasonable… …what about latency? Block-based or File-based? How separate are the layers, really?