Chord+DHash+Ivy: Building Principled Peer-to-Peer Systems
Robert Morris
Joint work with F. Kaashoek, D. Karger, I. Stoica, H. Balakrishnan, F. Dabek, T. Gil, B. Chen, and A. Muthitacharoen

What is a P2P system?
A distributed system architecture:
- No centralized control
- Nodes are symmetric in function
- Large number of unreliable nodes
Enabled by technology improvements
(Diagram: many nodes connected through the Internet)

The promise of P2P computing
High capacity through parallelism:
- Many disks
- Many network connections
- Many CPUs
Reliability:
- Many replicas
- Geographic distribution
Automatic configuration
Useful in public and proprietary settings

Distributed hash table (DHT)
(Diagram: a three-layer stack, each layer spread over many nodes)
- Distributed application (Ivy): calls put(key, data) and get(key) -> data
- Distributed hash table (DHash): stores the data
- Lookup service (Chord): lookup(key) -> node IP address
Application may be distributed over many nodes
DHT distributes data storage over many nodes

A DHT has a good interface
put(key, value) and get(key) -> value
Call a key/value pair a "block"
The API supports a wide range of applications
- DHT imposes no structure/meaning on keys
Key/value pairs are persistent and global
- Can store keys in other DHT values
- And thus build complex data structures
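
A minimal sketch of this two-call interface, layered on a lookup service as in the diagram above; the DHTClient class and the store/fetch method names are illustrative assumptions, not the actual DHash code.

    class DHTClient:
        """Illustrative client-side view of the put/get block interface."""

        def __init__(self, lookup_service):
            # lookup_service maps a key to the node responsible for it (Chord's job).
            self.lookup = lookup_service

        def put(self, key: bytes, value: bytes) -> None:
            # Find the responsible node, then ask it to store the block.
            node = self.lookup(key)
            node.store(key, value)

        def get(self, key: bytes) -> bytes:
            # Find the responsible node, then fetch the block from it.
            node = self.lookup(key)
            return node.fetch(key)

    # Because blocks can hold the keys of other blocks, applications can build
    # linked structures (trees, logs) entirely out of key/value pairs.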

A DHT makes a good shared infrastructure
Many applications can share one DHT service
- Much as applications share the Internet
Eases deployment of new applications
Pools resources from many participants
- Efficient due to statistical multiplexing
- Fault-tolerant due to geographic distribution

Many recent DHT-based projects
- File sharing [CFS, OceanStore, PAST, ...]
- Web cache [Squirrel, ...]
- Backup store [Pastiche]
- Censor-resistant stores [Eternity, FreeNet, ...]
- DB query and indexing [Hellerstein, ...]
- Event notification [Scribe]
- Naming systems [ChordDNS, Twine, ...]
- Communication primitives [I3, ...]
Common thread: data is location-independent

The lookup problem
(Diagram: nodes N1-N6 scattered across the Internet; a publisher calls put(key="title", value=file data...) and a client elsewhere calls get(key="title") - which node has the data?)
At the heart of all DHTs

Centralized lookup (Napster)
(Diagram: the publisher of key="title" registers its location at a central database with SetLoc("title", N4); the client asks the database with Lookup("title") and then fetches from N4)
Simple, but O(N) state and a single point of failure

Flooded queries (Gnutella)
(Diagram: the client floods Lookup("title") to its neighbors, which forward it until the node holding key="title" is found)
Robust, but worst case O(N) messages per lookup

Routed queries (Freenet, Chord, etc.)
(Diagram: the client's Lookup("title") is forwarded hop by hop along a route that converges on the node holding key="title")

Chord lookup algorithm properties
- Interface: lookup(key) -> IP address
- Efficient: O(log N) messages per lookup, where N is the total number of servers
- Scalable: O(log N) state per node
- Robust: survives massive failures
- Simple to analyze

Chord IDs
- Key identifier = SHA-1(key)
- Node identifier = SHA-1(IP address)
- SHA-1 distributes both uniformly
How to map key IDs to node IDs?
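
A small sketch of how these identifiers might be computed: 160-bit SHA-1 digests interpreted as integers on the ID circle (the helper name chord_id and the example IP address are mine, not Chord's).

    import hashlib

    ID_BITS = 160                    # SHA-1 produces 160-bit identifiers
    ID_SPACE = 2 ** ID_BITS          # IDs live on a circle modulo 2^160

    def chord_id(data: bytes) -> int:
        """Hash arbitrary bytes to a point on the Chord ID circle."""
        return int.from_bytes(hashlib.sha1(data).digest(), "big") % ID_SPACE

    key_id = chord_id(b"my-block-key")     # key identifier = SHA-1(key)
    node_id = chord_id(b"18.26.4.9")       # node identifier = SHA-1(IP address)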

Chord Hashes a Key to its Successor
(Diagram: circular ID space with nodes N10, N32, N60, N80, N100; keys K5 and K10 map to N10, K11 and K30 to N32, K33, K40, and K52 to N60, K65 and K70 to N80, K100 to N100)
Successor: node with next highest ID
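
A minimal sketch of the successor rule, assuming a global list of node IDs; in the real protocol no node has this view, so this only illustrates the mapping.

    def successor(key_id: int, node_ids: list[int]) -> int:
        """Return the first node ID clockwise from key_id on the circle."""
        ring = sorted(node_ids)
        for nid in ring:
            if nid >= key_id:
                return nid
        # Wrapped past the top of the circle: the lowest node ID is the successor.
        return ring[0]

    # Matching the diagram above (small IDs for readability):
    nodes = [10, 32, 60, 80, 100]
    assert successor(33, nodes) == 60      # K33 is stored at N60
    assert successor(5, nodes) == 10       # K5 is stored at N10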

Basic Lookup
(Diagram: the query "Where is key 50?" is forwarded along successor pointers around the ring until the answer "Key 50 is at N60" comes back)
Lookups find the ID's predecessor
Correct if successors are correct
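
A toy sketch of this basic lookup, assuming each node knows only its own ID and its successor; Node and find_successor are illustrative names, not the protocol's RPC names.

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the half-open circular interval (a, b]."""
        if a < b:
            return a < x <= b
        return x > a or x <= b       # the interval wraps around zero

    class Node:
        def __init__(self, node_id: int):
            self.id = node_id
            self.successor: "Node" = self   # fixed up once the ring is built

        def find_successor(self, key_id: int) -> "Node":
            """Walk the ring one successor pointer at a time (O(N) hops)."""
            node = self
            while not in_interval(key_id, node.id, node.successor.id):
                node = node.successor
            return node.successor

    # Build a tiny ring and look up key 50, as in the diagram:
    ids = sorted([5, 10, 20, 32, 40, 60, 80, 99, 110])
    ring = {i: Node(i) for i in ids}
    for a, b in zip(ids, ids[1:] + ids[:1]):
        ring[a].successor = ring[b]
    assert ring[10].find_successor(50).id == 60    # "Key 50 is at N60"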

Successor Lists Ensure Robust Lookup
Each node remembers r successors
Lookup can skip over dead nodes to find blocks
(Diagram: each node on the ring holds a list of its next three successors, e.g. N5 remembers 10, 20, 32 and N110 remembers 5, 10, 20)

Chord “Finger Table” Accelerates Lookups
(Diagram: node N80's finger pointers reach 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ID circle)
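
A rough sketch of how a finger table turns the linear walk above into O(log N) hops. For illustration the ring is built with global knowledge and a tiny 8-bit ID space; the method names are assumptions, and the stabilization protocol that keeps fingers correct on a real ring is omitted.

    ID_BITS = 8                       # tiny ID space for illustration; real Chord uses 160

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the half-open circular interval (a, b]."""
        if a < b:
            return a < x <= b
        return x > a or x <= b

    class FingerNode:
        def __init__(self, node_id: int):
            self.id = node_id
            self.successor: "FingerNode" = self
            # finger[i] is the first node at least 2^i past this node on the circle.
            self.finger: list["FingerNode"] = [self] * ID_BITS

        def closest_preceding_finger(self, key_id: int) -> "FingerNode":
            """Highest finger strictly between this node and key_id."""
            for f in reversed(self.finger):
                if f.id != key_id and in_interval(f.id, self.id, key_id):
                    return f
            return self

        def find_successor(self, key_id: int) -> "FingerNode":
            # Each hop roughly halves the remaining distance to the key.
            node = self
            while not in_interval(key_id, node.id, node.successor.id):
                node = node.closest_preceding_finger(key_id)
            return node.successor

    def first_at_or_after(point: int, sorted_ids: list[int]) -> int:
        for i in sorted_ids:
            if i >= point:
                return i
        return sorted_ids[0]

    def build_ring(ids: list[int]) -> dict[int, FingerNode]:
        """Populate successors and fingers from a known ID set (illustration only)."""
        ring_nodes, ring = {i: FingerNode(i) for i in ids}, sorted(ids)
        for nid, node in ring_nodes.items():
            node.successor = ring_nodes[first_at_or_after((nid + 1) % 2 ** ID_BITS, ring)]
            for i in range(ID_BITS):
                node.finger[i] = ring_nodes[first_at_or_after((nid + 2 ** i) % 2 ** ID_BITS, ring)]
        return ring_nodes

    nodes = build_ring([5, 10, 20, 32, 40, 60, 80, 99, 110])
    assert nodes[80].find_successor(19).id == 20   # a lookup for K19, as on the next slide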

Chord lookups take O(log N) hops
(Diagram: Lookup(K19) hops across the ring via fingers to the node responsible for K19)

Simulation Results: ½ log2(N)
(Plot: average messages per lookup vs. number of nodes; error bars mark the 1st and 99th percentiles)

DHash Properties
- Builds key/value storage on Chord
- Replicates blocks for availability
- Caches blocks for load balance
- Authenticates block contents

DHash Replicates blocks at r successors
(Diagram: block 17 is stored at its successor node and replicated at the next r nodes on the ring)
Replicas are easy to find if the successor fails
Hashed node IDs ensure independent failure
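
A sketch of the placement rule, again assuming global knowledge of the ring purely for illustration; in the running system the successor list would come from Chord.

    def replica_nodes(key_id: int, node_ids: list[int], r: int = 2) -> list[int]:
        """The key's home node (its successor) plus the next r nodes on the ring."""
        ring = sorted(node_ids)
        home = next((i for i, nid in enumerate(ring) if nid >= key_id), 0)
        return [ring[(home + k) % len(ring)] for k in range(r + 1)]

    # Example: block 17 on a small ring is stored at N20 and replicated at N32 and N40.
    assert replica_nodes(17, [5, 10, 20, 32, 40, 60, 80, 99, 110], r=2) == [20, 32, 40]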

DHash Data Authentication
Two types of DHash blocks:
- Content-hash: key = SHA-1(data)
- Public-key: key is a public key, data is signed by that key
DHash servers verify before accepting
Clients verify the result of get(key)
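
A sketch of the check a server or client could run on a content-hash block; the public-key case is only stubbed, since the signature scheme and key encoding are not given on these slides.

    import hashlib

    def verify_content_hash_block(key: bytes, data: bytes) -> bool:
        """A content-hash block is valid only if its key equals SHA-1(data)."""
        return hashlib.sha1(data).digest() == key

    def verify_public_key_block(public_key, data: bytes, signature: bytes) -> bool:
        """Placeholder: accept only if signature over data verifies under public_key."""
        raise NotImplementedError("requires a real signature library and key format")

    # DHash servers run these checks before storing a block, and clients run them
    # on the result of get(key), so a bad node cannot substitute bogus data.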

Ivy File System Properties
- Traditional file-system interface (almost)
- Read/write for multiple users
- No central components
- Trusted service from untrusted components

Straw Man: Shared Structure
Standard meta-data in DHT blocks?
What about locking during updates?
Requires 100% trust
(Diagram: a conventional metadata tree stored in DHT blocks - root inode, directory block, file inodes, file data)

Ivy Design Overview
Log structured
- Avoids in-place updates
Each participant writes only its own log
- Avoids concurrent updates to DHT data
Each participant reads all logs
Private snapshots for speed

Ivy Software Structure
(Diagram: an application makes system calls to the in-kernel NFS client; the NFS client sends NFS RPCs to a user-level Ivy server, which talks to DHT nodes across the Internet)

One Participant’s Ivy Log
(Diagram: a log head pointing to record 3, which points to record 2, which points to record 1)
- The log head is a mutable, public-key signed DHT block
- Records are immutable content-hash DHT blocks
- The log head contains the DHT key of the most recent record
- Each record contains the DHT key of the previous record
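
A sketch of appending to such a log on top of the put/get interface; the JSON record encoding and the dht helper are assumptions, and the signing of the log head is omitted.

    import hashlib
    import json

    def append_record(dht, log_head_key: bytes, prev_record_key: bytes, record: dict) -> bytes:
        """Store a new immutable record and repoint the mutable log head at it."""
        record = dict(record, prev=prev_record_key.hex())    # chain to the previous record
        data = json.dumps(record, sort_keys=True).encode()
        record_key = hashlib.sha1(data).digest()             # content-hash key
        dht.put(record_key, data)                            # immutable record block
        dht.put(log_head_key, record_key)                    # public-key block: same key, new contents
        return record_key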

Ivy I-Numbers
- Every file has a unique I-Number
- Log records contain I-Numbers
- Ivy returns I-Numbers to the NFS client
- NFS requests carry I-Numbers in the NFS file handle

NFS/Ivy Communication Example
Shell command: echo hello > d/aaa
Local NFS client -> local Ivy server:
- LOOKUP("d", I-Num=1000) returns I-Num=1000 (LOOKUP finds the I-Number of directory "d")
- CREATE("aaa", I-Num=1000) returns I-Num=9956 (CREATE creates file "aaa" in directory "d")
- WRITE("hello", 0, I-Num=9956) returns OK (WRITE writes "hello" at offset 0 in file "aaa")

Log Records for File Creation
- Type: Create, I-num: 9956
- Type: Link, Dir I-num: 1000, File I-num: 9956, Name: "aaa"
- Type: Write, I-num: 9956, Offset: 0, Data: "hello"
(The log head points to the most recent record)
A log record describes a change to the file system
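
The same three records written out as plain Python dictionaries, only to make the fields concrete; the field names follow the slide, not Ivy's actual encoding.

    # Records appended (oldest first) for: echo hello > d/aaa
    create_rec = {"type": "Create", "inum": 9956}
    link_rec   = {"type": "Link", "dir_inum": 1000, "file_inum": 9956, "name": "aaa"}
    write_rec  = {"type": "Write", "inum": 9956, "offset": 0, "data": "hello"}

    # Each record would be chained into the participant's log (e.g. with the
    # append_record sketch above), with the newest reachable from the log head.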

Scanning an Ivy Log
(Example records, oldest first: Link "aaa" (file I-num 9956) into dir 1000; Link "bbb" (file I-num 9876) into dir 1000; Remove "aaa" from dir 1000)
A scan follows the log backwards in time
- LOOKUP(name, dir I-num): take the last Link, but stop at a Remove
- READDIR(dir I-num): accumulate Links, minus Removes
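
A sketch of both scans over a list of records shaped like the dictionaries above, newest first; illustrative only, not Ivy's implementation.

    def lookup(records_newest_first, dir_inum: int, name: str):
        """File I-number for name in dir_inum, or None if absent or removed."""
        for rec in records_newest_first:
            if rec["type"] == "Remove" and rec["dir_inum"] == dir_inum and rec["name"] == name:
                return None                  # a newer Remove hides any older Link
            if rec["type"] == "Link" and rec["dir_inum"] == dir_inum and rec["name"] == name:
                return rec["file_inum"]
        return None

    def readdir(records_newest_first, dir_inum: int):
        """Accumulate directory entries: Links, minus names Removed later."""
        removed, entries = set(), {}
        for rec in records_newest_first:
            if rec.get("dir_inum") != dir_inum:
                continue
            if rec["type"] == "Remove":
                removed.add(rec["name"])
            elif rec["type"] == "Link" and rec["name"] not in removed and rec["name"] not in entries:
                entries[rec["name"]] = rec["file_inum"]
        return entries

    log = [
        {"type": "Remove", "dir_inum": 1000, "name": "aaa"},
        {"type": "Link", "dir_inum": 1000, "file_inum": 9876, "name": "bbb"},
        {"type": "Link", "dir_inum": 1000, "file_inum": 9956, "name": "aaa"},
    ]
    assert lookup(log, 1000, "aaa") is None        # hidden by the newer Remove
    assert readdir(log, 1000) == {"bbb": 9876}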

Finding Other Logs: The View Block
(Diagram: a view block listing Pub Key 1, Pub Key 2, Pub Key 3, each naming one participant's log head)
- The view block is immutable (a content-hash DHT block)
- The view block's DHT key names the file system
- Example: /ivy/37ae5ff901/aaa

Reading Multiple Logs
Problem: how to interleave log records?
(Diagram: two log heads whose records are annotated in red with the real time of their creation)
But we cannot count on synchronized clocks

Vector Timestamps Encode Partial Order
(Diagram: two logs whose records each carry a vector timestamp)
- Each log record contains a vector of DHT keys
- One vector entry per log
- Each entry points to that log's most recent record
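
In Ivy the vector entries are DHT keys of log records; purely to illustrate the ordering, the sketch below stands in integer sequence numbers per log (how far into each log the writer had seen). That substitution is an assumption of this example, not Ivy's representation.

    def happened_before_or_equal(a: dict, b: dict) -> bool:
        """True if vector a is <= vector b entry-by-entry (a is an ancestor of b)."""
        logs = set(a) | set(b)
        return all(a.get(log, 0) <= b.get(log, 0) for log in logs)

    def concurrent(a: dict, b: dict) -> bool:
        """Neither vector dominates the other: the two records are unordered."""
        return not happened_before_or_equal(a, b) and not happened_before_or_equal(b, a)

    # Two writers; each entry counts records seen in that writer's log.
    r1 = {"alice": 3, "bob": 1}
    r2 = {"alice": 2, "bob": 2}
    assert concurrent(r1, r2)      # no order between them without synchronized clocks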

Snapshots
Scanning the logs is slow
Each participant keeps a private snapshot:
- Log pointers as of snapshot creation time
- A table of all I-nodes, with each file's attributes and contents
- Reflects all participants' logs
Participant updates it periodically from the logs
All snapshots share storage in the DHT

Simultaneous Updates
Ordinary file servers serialize all updates; Ivy does not
Most cases are not a problem:
- Simultaneous writes to the same file
- Simultaneous creation of different files in the same directory
Problem case:
- Unlink("a") and rename("a", "b") at the same time
- Ivy correctly lets only one take effect
- But it may return "success" status for both

Integrity
Can an attacker corrupt my files?
- Not unless the attacker is in my Ivy view
What if a participant goes bad? Others can:
- Ignore the participant's whole log
- Ignore entries after some date
- Ignore just the harmful records

Ivy Performance
- Half as fast as NFS on LAN and WAN
- Scalable with the number of participants
These results were taken yesterday…

Local Benchmark Configuration
(Diagram: App -> NFS Client -> Ivy Server -> DHash Server)
- One log
- One DHash server
- Ivy + DHash all on one host

Ivy Local Performance on MAB
(Table: Modified Andrew Benchmark times in seconds for Ivy and NFS, by phase: Mkdir, Write, Stat, Read, Compile, Total)
NFS comparison: client - LAN - server
7 seconds doing public key signatures, 3 in DHash

WAN Benchmark Details
- 4 DHash nodes at MIT, CMU, NYU, Cornell
- Round-trip times: 8, 14, 22 milliseconds
- No DHash replication
- 4 logs
- One active writer at MIT
- Whole-file read on open()
- Whole-file write on close()
- NFS client/server round-trip time is 14 ms

Ivy WAN Performance
(Table: MAB times in seconds for Ivy and NFS, by phase: Mkdir, Write, Stat, Read, Compile, Total)
Breakdown (seconds): fetching log heads, 4 writing the log head, 16 inserting log records, 22 in crypto and CPU

Ivy Performance w/ Many Logs
- MAB on a 4-node WAN
- One active writer
- Increasing cost due to growing vector timestamps

Related Work
- DHTs: Pastry, CAN, Tapestry
- File systems: LFS, Zebra, xFS
- Byzantine agreement: BFS, OceanStore, Farsite

Summary
Exploring use of DHTs as a building block:
- The put/get API is general
- Provides availability, authentication
- Harnesses decentralized peer-to-peer groups
Case study of DHT use: Ivy
- A read/write peer-to-peer file system
- A trustable system from untrusted pieces