Wide-Area Cooperative Storage with CFS Morris et al. Presented by Milo Martin UPenn Feb 17, 2004 (some slides based on slides by authors)
Overview Problem: content distribution Solution: distributed read-only file system Implementation Provide file system interface Using a Distributed Hash Table (DHT) Extend Chord with redundancy and caching
Problem: Content Distribution Serving static content with inexpensive hosts open-source distributions off-site backups tech report archive [Figure: nodes connected through the Internet]
Example: mirror open-source distributions Multiple independent distributions Each has high peak load, low average Individual servers are wasteful Solution: Option 1: single powerful server Option 2: distributed service But how do you find the data?
Assumptions Storage is cheap and plentiful Many participants Many reads, few updates Heterogeneous nodes Storage Bandwidth Physical locality Exploit for performance Avoid for resilience
Goals Avoid hot spots due to popular content Distribute the load (traffic and storage) “Many hands make light work” High availability Using replication Data integrity Using secure hashes Limited support for updates Self managing/repairing
CFS Approach Break content into immutable “blocks” Use blocks to build a file system Identify blocks via secure hash Unambiguous name for a block Self-verifying Distributed Hash Table (DHT) of blocks Distribute blocks Replicate and cache blocks Algorithm for finding blocks (e.g., Chord) Content search beyond the scope
Outline File system structure Review: Chord distributed hashing DHash block management CFS evaluation Brief overview of Oceanstore Discussion
Hash-based Read-only File System Assume retrieval substrate put(data): inserts data with key Hash(data) get(h) -> data Assume “root” identifier known Build a read-only file system using this interface Based on SFSRO [OSDI 2000]
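A minimal sketch of the put/get interface above, assuming SHA-1 as the secure hash and an in-memory dictionary standing in for the retrieval substrate. The class name `BlockStore` is hypothetical and not part of CFS or SFSRO.

```python
import hashlib


class BlockStore:
    """Hypothetical in-memory stand-in for the retrieval substrate."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The key is the hash of the contents, so it names the block unambiguously.
        key = hashlib.sha1(data).hexdigest()
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Self-verifying: re-hash what came back and compare with the requested key.
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("block does not match its key")
        return data


store = BlockStore()
h = store.put(b"hello, CFS")
```

Because the key is derived from the data, a client that knows `h` can detect a corrupted or forged block without trusting the node that served it.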
Client-server interface Files have unique names Files are read-only (single writer, many readers) Publishers split files into blocks Clients check files for authenticity [SFSRO] [Figure: FS client and server; file-level operations (insert file f, lookup file f) become block-level operations (insert block, lookup block) on server nodes]
File System Structure Build file system bottom up Break files into blocks Hash each block Create a “file block” that points to hashes Create “directory blocks” that point to files Ultimately arrive at “root” hash SFSRO: blocks can be encrypted
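The bottom-up construction can be sketched as follows. This is an illustration only: the JSON encoding of file and directory blocks, the 8192-byte block size, and the helper names `put_file` and `put_directory` are assumptions of the sketch, not the SFSRO on-disk format.

```python
import hashlib
import json

BLOCK_SIZE = 8192  # assumed block size for this sketch
store = {}         # hypothetical stand-in for the block store: hash -> bytes


def put(data: bytes) -> str:
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key


def put_file(content: bytes) -> str:
    # Break the file into data blocks and hash each one.
    hashes = [put(content[i:i + BLOCK_SIZE])
              for i in range(0, len(content), BLOCK_SIZE)]
    # The "file block" is just the list of data-block hashes.
    return put(json.dumps(hashes).encode())


def put_directory(entries: dict) -> str:
    # entries maps a name to the hash of a file block or directory block.
    return put(json.dumps(entries, sort_keys=True).encode())


readme = put_file(b"a" * 20000)
root = put_directory({"README": readme})
```

Any change to a data block changes the file block's hash, which changes the directory block's hash, and so on up to the root.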
File System Example [Figure: the root block points to directory blocks, which point to file blocks, which point to data blocks]
File System Lookup Example Root hash verifies data integrity Recursive verification Root hash is correct, can verify data Prefetch data blocks [Figure: lookup path from the root block through a directory block and a file block to the data blocks]
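A small sketch of recursive verification, assuming a three-level tree (root directory, file block, data blocks) encoded as JSON. The tree contents and the function names are hypothetical; only the idea of checking each block against the hash that named it comes from the slide.

```python
import hashlib
import json


def h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


# Hypothetical tree: root directory -> file block -> data blocks.
data_blocks = [b"first block ", b"second block"]
file_block = json.dumps([h(b) for b in data_blocks]).encode()
root_block = json.dumps({"notes.txt": h(file_block)}).encode()
store = {h(b): b for b in data_blocks + [file_block, root_block]}
root_hash = h(root_block)


def get_verified(key: str) -> bytes:
    data = store[key]
    if h(data) != key:
        raise ValueError(f"block {key} failed verification")
    return data


def read_file(root: str, name: str) -> bytes:
    directory = json.loads(get_verified(root))                 # checked against the root hash
    block_hashes = json.loads(get_verified(directory[name]))   # checked against the directory entry
    return b"".join(get_verified(k) for k in block_hashes)     # checked against the file block
```

A client that trusts only `root_hash` ends up trusting every block it reads, because each hash it follows was itself read from a verified block.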
File System Updates Updates supported, but kludgey Single updater All or nothing updates Export a new version of the filesystem Requires a new root hash Conceptually rebuild file system Insert modified blocks with new hashes Introduce signed “root block” Key is hash of public key (not the hash of contents) Allow updates if they are signed with private key Sequence number prevents replay
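The acceptance rule for a signed root block might look like the sketch below. It does not implement a signature scheme: the caller passes in a `verify(public_key, message, signature)` function, and that parameter, the message layout, and the `roots` table are all assumptions made for illustration.

```python
import hashlib

# Hypothetical table of root blocks: Hash(public key) -> (seq, root_hash)
roots = {}


def put_root(public_key: bytes, root_hash: str, seq: int,
             signature: bytes, verify) -> bool:
    """Accept a new root only if it is signed and newer than the stored one.

    `verify(public_key, message, signature) -> bool` is supplied by the caller;
    this sketch does not implement any signature scheme itself.
    """
    message = f"{root_hash}:{seq}".encode()   # assumed message layout
    if not verify(public_key, message, signature):
        return False                          # not signed with the matching private key
    key = hashlib.sha1(public_key).hexdigest()  # keyed by Hash(public key), not Hash(contents)
    current = roots.get(key)
    if current is not None and seq <= current[0]:
        return False                          # stale sequence number: a replayed update
    roots[key] = (seq, root_hash)
    return True
```

The root's key stays fixed across updates because it depends only on the public key, which is what lets readers keep using the same identifier for a file system whose contents change.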
Update Example [Figure: new data produces a new file block, directory block, and root block; the root block is keyed by Hash(Public Key); the other blocks are unmodified]
File System Summary Simple interface put(data): inserts data with key Hash(data) get(h) -> data put_root(sign_private(hash, seq#), public key) refresh(key): allows discarding of stale data Recursive integrity verification
Review of Goals and Assumptions Data integrity Using secure hashes Limited support for updates Distribute load; avoid hot spots High availability Self managing/repairing Exploit locality Assumptions Many participants Heterogeneous nodes File System Layer
Server Structure DHash stores, balances, replicates, caches blocks DHash uses Chord [SIGCOMM 2001] to locate blocks [Figure: Node 1 and Node 2, each running DHash on top of Chord]
Chord Hashes a Block ID to its Successor Circular ID Space Nodes and blocks have randomly distributed IDs Successor: node with next highest ID [Figure: ring with node IDs N10, N32, N60, N80, N100 and block IDs placed at their successors: B11, B30 at N32; B33, B40, B52 at N60; B65, B70 at N80; B99 at N100; B112, B120, …, B10 at N10]
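The successor rule can be sketched with the node IDs from the slide. The 7-bit ID space is an assumption chosen to fit those small IDs (Chord itself uses much larger hashed IDs), and the global sorted list stands in for knowledge that real nodes hold only partially.

```python
from bisect import bisect_left

ID_SPACE = 128                        # assumed 7-bit ID space, enough for the slide's IDs
nodes = sorted([10, 32, 60, 80, 100])  # node IDs from the slide


def successor(block_id: int) -> int:
    # First node whose ID is >= the block ID, wrapping around the circle.
    i = bisect_left(nodes, block_id % ID_SPACE)
    return nodes[i % len(nodes)]
```

For example, `successor(33)` is 60 and `successor(112)` wraps around to 10, matching the block placement shown on the slide.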
Basic Lookup - Linear Time Lookups find the ID’s predecessor Correct if successors are correct [Figure: ring of N5, N10, N20, N32, N40, N60, N80, N99, N110; the query “Where is block 70?” is passed around the ring and answered “N80”]
Successor Lists Ensure Robust Lookup Each node stores r successors, r = 2 log N Lookup can skip over dead nodes to find blocks [Figure: ring with three-entry successor lists: N5: 10, 20, 32; N10: 20, 32, 40; N20: 32, 40, 60; N32: 40, 60, 80; N40: 60, 80, 99; N60: 80, 99, 110; N80: 99, 110, 5; N99: 110, 5, 10; N110: 5, 10, 20]
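A sketch of the linear lookup with successor lists, using the ring from the figure and three-entry lists as drawn there. The functions simulate the whole ring in one process, which is an assumption for illustration; in Chord each node knows only its own list.

```python
nodes = [5, 10, 20, 32, 40, 60, 80, 99, 110]   # ring from the slide
R = 3                                          # successor-list length used in the figure


def successor_list(n: int) -> list:
    i = nodes.index(n)
    return [nodes[(i + k) % len(nodes)] for k in range(1, R + 1)]


def in_interval(x: int, a: int, b: int) -> bool:
    """True if x lies in the half-open ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)


def lookup(start: int, block_id: int, dead: set) -> int:
    """Walk the ring one live successor at a time; return the block's first live successor."""
    n = start
    while True:
        live = [s for s in successor_list(n) if s not in dead]
        if not live:
            raise RuntimeError(f"all successors of {n} are dead")
        nxt = live[0]
        if in_interval(block_id, n, nxt):
            return nxt
        n = nxt
```

With every node alive, `lookup(5, 70, set())` returns 80. If N80 has failed, `lookup(5, 70, {80})` skips it and returns 99, the next entry in N60's list.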
Chord Finger Table Allows O(log N) Lookups See [SIGCOMM 2001] for table maintenance Reasonable lookup latency [Figure: N80's fingers span ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the ring]
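A simulation sketch of finger-table routing, assuming a 7-bit ID space and the node IDs used on the earlier ring slides. Finger i of node n is taken to be the successor of n + 2^i, as in the Chord paper; table maintenance is left out entirely.

```python
from bisect import bisect_left

M = 7                                            # assumed 7-bit ID space (IDs 0..127)
nodes = sorted([5, 10, 20, 32, 40, 60, 80, 99, 110])


def successor(x: int) -> int:
    i = bisect_left(nodes, x % 2**M)
    return nodes[i % len(nodes)]


def fingers(n: int) -> list:
    # Finger i points at the first node at least 2**i past n on the ring.
    return [successor(n + 2**i) for i in range(M)]


def between(x: int, a: int, b: int) -> bool:
    """True if x lies strictly inside the ring interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)


def lookup(n: int, key: int):
    """Return (node responsible for key, number of hops taken)."""
    hops = 0
    while True:
        succ = fingers(n)[0]
        if key == succ or between(key, n, succ):   # key in (n, succ]
            return succ, hops
        # Jump to the farthest finger that still precedes the key.
        nxt = next((f for f in reversed(fingers(n)) if between(f, n, key)), succ)
        n, hops = nxt, hops + 1
```

Starting at N80, a lookup for block 70 hops to N20, then N60, and finishes at N80, instead of walking all the way around the ring.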
DHash/Chord Interface lookup() returns list with node IDs closer in ID space to block ID Sorted, closest first [Figure: on each server, DHash calls Lookup(blockID) on Chord and receives a list of nodes; Chord holds the finger table]
DHash Uses Other Nodes to Locate Blocks [Figure: Lookup(BlockID=45) forwarded across the ring of N5, N10, N20, N40, N50, N60, N68, N80, N99, N110]
Availability and Resilience: Replicate blocks at r successors Hashed IP Addr. ensures independent replica failure High storage cost for replication; does this matter? [Figure: Block 17 on the ring, with the original and its replicas on successive nodes]
Lookups find replicas RPCs: 1. Lookup step 2. Get successor list 3. Failed block fetch 4. Block fetch [Figure: Lookup(BlockID=17) on the ring, showing Block 17's original and replicas with the four RPCs numbered]
First Live Successor Manages Replicas Node can locally determine that it is the first live successor; automatically repair and re-replicate [Figure: Block 17 and a copy of 17 on the ring]
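Replica placement and the first-live-successor rule can be sketched as below, using the ring from these slides. The sketch assumes the block's successor holds the original and the next r nodes hold the replicas; the function names are hypothetical.

```python
from bisect import bisect_left

nodes = sorted([5, 10, 20, 40, 50, 60, 68, 80, 99, 110])   # ring from the slides


def holders(block_id: int, r: int) -> list:
    """Successor of the block followed by the r nodes after it on the ring."""
    i = bisect_left(nodes, block_id) % len(nodes)
    return [nodes[(i + k) % len(nodes)] for k in range(r + 1)]


def first_live_holder(block_id: int, r: int, dead: set):
    """The node currently responsible for the block, or None if every copy is lost."""
    return next((n for n in holders(block_id, r) if n not in dead), None)
```

For Block 17 with `r = 2`, the holders are N20, N40, and N50. If N20 fails, N40 becomes the first live successor and is the node that would re-replicate the block.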
Reduce Overheads: Caches Along Lookup Path RPCs: 1. Chord lookup 2. Chord lookup 3. Block fetch 4. Send to cache Send only to second-to-last hop in routing path [Figure: Lookup(BlockID=45) on the ring, with the four RPCs numbered]
Caching at Fingers Limits Load N32 Only O(log N) nodes have fingers pointing to N32 This limits the single-block load on N32
Virtual Nodes Allow Heterogeneity Hosts may differ in disk/net capacity Hosts may advertise multiple IDs Chosen as SHA-1(IP Address, index) Each ID represents a “virtual node” Host load proportional to number of virtual nodes Manually controlled; automatic adaptation possible [Figure: Node A hosts virtual nodes N60, N10, N101; Node B hosts N5]
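Deriving virtual-node IDs from SHA-1(IP Address, index) might look like this. The byte encoding of the (IP, index) pair is an assumption of the sketch, and the address used in the example is a documentation address, not a real host.

```python
import hashlib


def virtual_node_ids(ip: str, count: int) -> list:
    # One ID per virtual node, derived from the host's IP address and an index.
    # The "ip,index" string encoding is an assumption, not the CFS wire format.
    return [hashlib.sha1(f"{ip},{i}".encode()).hexdigest() for i in range(count)]


ids = virtual_node_ids("192.0.2.7", 3)   # a host advertising three virtual nodes
```

Because the IDs are a function of the IP address, a host cannot choose where on the ring its virtual nodes land, and a host with more capacity simply runs more of them.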
Physical Locality-aware Path Choice Each node monitors RTTs to its own fingers Pick smallest RTT (a greedy choice) Tradeoff: ID-space progress vs delay [Figure: Lookup(47) for B47, stored at N48; candidate hops among N18, N25, N37, N55, N70, N80, N90, N96, N115, with RTTs of 10ms, 12ms, 50ms, and 100ms]
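The greedy choice on this slide can be sketched as: among the fingers that do not overshoot the key, pick the one with the smallest measured RTT. The RTT values below are made up for illustration and do not reproduce the figure.

```python
def between(x: int, a: int, b: int) -> bool:
    """True if x lies strictly inside the ring interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)


def next_hop(n: int, key: int, finger_rtts: dict):
    """Pick the lowest-RTT finger of node n that still precedes the key."""
    candidates = {f: rtt for f, rtt in finger_rtts.items() if between(f, n, key)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical RTTs in milliseconds from node 90 to its fingers.
rtts = {96: 12, 115: 50, 18: 10, 37: 100, 55: 5}
hop = next_hop(90, 47, rtts)
```

Here node 55 has the lowest RTT but lies past key 47, so it is excluded. Node 18 wins on latency even though node 37 would make more progress in ID space, which is the tradeoff the slide names.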
Why Blocks Instead of Files? Cost: one lookup per block Can tailor cost by choosing good block size Need prefetching for high throughput What is a good block size? Higher latency? Benefit: load balance is simple For large files Storage cost of large files is spread out Popular files are served in parallel
Block Storage Long-term blocks are stored for a fixed time Publishers need to refresh periodically Cache uses LRU [Figure: disk divided into a cache and long-term block storage]
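A minimal LRU block cache, sketched with `collections.OrderedDict`. Measuring capacity in number of blocks and the class name `BlockCache` are assumptions; the slide only says that the cache uses LRU replacement.

```python
from collections import OrderedDict


class BlockCache:
    """Hypothetical LRU cache holding at most `capacity` blocks."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._blocks = OrderedDict()

    def get(self, key: str):
        if key not in self._blocks:
            return None
        self._blocks.move_to_end(key)        # mark as most recently used
        return self._blocks[key]

    def put(self, key: str, data: bytes) -> None:
        self._blocks[key] = data
        self._blocks.move_to_end(key)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block


cache = BlockCache(capacity=2)
cache.put("a", b"1")
cache.put("b", b"2")
cache.get("a")          # "a" is now the most recently used
cache.put("c", b"3")    # evicts "b"
```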
Preventing Flooding What prevents a malicious host from inserting junk data? Answer: not much Kludge: capacity limit (e.g., 0.1%) Per source/destination pair How real is this problem? Refresh requirement allows recovery
Review of Goals and Assumptions Data integrity Using secure hashes Limited support for updates Distribute load; avoid hot spots High availability Self managing/repairing Exploit locality Assumptions Many participants Heterogeneous nodes File System Layer DHash/Chord Layer
CFS Project Status Working prototype software Some abuse prevention mechanisms SFSRO file system client Guarantees authenticity of files, updates, etc. Some measurements on RON testbed Simulation results to test scalability
Experimental Setup (12 nodes) One virtual node per host 8 KByte blocks RPCs use UDP Caching turned off Proximity routing turned off [Figure: map of the testbed hosts, with links to vu.nl, lulea.se, ucl.uk and to kaist.kr, .ve]
CFS Fetch Time for 1MB File Average over the 12 hosts No replication, no caching; 8 KByte blocks [Plot: fetch time (seconds) vs. prefetch window (KBytes)]
Distribution of Fetch Times for 1MB [Plot: fraction of fetches vs. time (seconds), for 8 KByte, 24 KByte, and 40 KByte prefetch]
CFS Fetch Time vs. Whole File TCP [Plot: fraction of fetches vs. time (seconds), for 40 KByte prefetch and whole-file TCP]
Robustness vs. Failures Six replicas per block; (1/2)^6 ≈ 0.016 [Plot: failed lookups (fraction) vs. failed nodes (fraction)]
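The figure on this slide follows from a short calculation, assuming replicas fail independently (the earlier slide attributes this to hashed IP addresses) and that a lookup fails only when every replica of the block is on a failed node. The symbol p, for the fraction of failed nodes, is introduced here for illustration; r is the number of replicas.

```latex
\Pr[\text{lookup fails}] = p^{r},
\qquad
p = \tfrac{1}{2},\ r = 6
\;\Longrightarrow\;
\left(\tfrac{1}{2}\right)^{6} = \tfrac{1}{64} \approx 0.016
```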
Much Related Work SFSRO (Secure file system, read only) Freenet Napster Gnutella PAST CAN …
Later Work: Read/Write Filesystems Ivy and Eliot Extend the idea to read/write file systems Full NFS or AFS-like semantics All the nasty issues of distributed file systems Consistency Partitioning Conflicts Oceanstore (earlier work, actually)
CFS Summary (from their talk) CFS provides peer-to-peer r/o storage Structure: DHash and Chord It is efficient, robust, and load-balanced It uses block-level distribution The prototype is as fast as whole-file TCP
OceanStore (Similarities to CFS) Ambitious global-scale storage “utility” built from an overlay network Similar goals: Distributed, replicated, highly available Explicit caching Similar solutions: Hash-based identifiers Multi-hop overlay-network routing
OceanStore (Differences) “Utility” model Implies some managed layers Handles explicit motivation of participants Many applications/interfaces Filesystem, web content, PDA sync.,
OceanStore (More Differences) Explicit support for updates Built into system at almost every layer Uses Byzantine commit Requires “maintained” inner ring Updates on encrypted data Plaxton tree based routing Fast, probabilistic method Slower, reliable method Erasure codes limit replication overhead
Aside: Research Approach OceanStore Original ASPLOS “vision paper” Paints the big picture Spent 3+ years fleshing it out Chord Evolution from Chord -> CFS -> Ivy More follow-on research by others
Issues and Discussion Kludge or pragmatic? CFS’s replication CFS’s caching (is “closer” server better?) CFS’s multiple virtual servers CFS’s Locality-based routing Are “root” blocks cacheable? Bandwidth is reasonable… …what about latency? Block-based or File-based? How separate are the layers, really?