Pastiche: Making Backup Cheap and Easy
Presented by: Boon Thau Loo, CS294-4

Introduction
Backup is cumbersome and expensive: ~$4/GB/month (now $0.02/GB [Google 2014])
Small-scale solutions are dominated by administrative effort
Large-scale solutions require centralized management

Pastiche
Observation 1: disks are no longer full
Can use the excess capacity for efficient, effective, administration-free backup
Use untrusted machines to perform backup services
Need replication for reliability
Need to balance locality and reliability

Pastiche
Observation 2: much of the data on a given machine is not unique
Office 2000: 217 MB footprint
Different installations are largely the same
Exploiting this redundancy can achieve storage savings

Pastiche
Built on three pieces of research:
Pastry: peer-to-peer, self-administering, scalable routing
Content-based indexing: easy discovery of redundant data
Convergent encryption: use the same encrypted representation without sharing keys

Challenges
How to discover backup buddies without a centralized directory?
How can nodes reuse their own state to back up others?
How can nodes restore files/machines without requiring administrative intervention?
How can nodes detect unfaithful buddies?

Basic Idea
Summarize storage content with abstracts
Use abstracts to locate buddies
A skeleton tree is used to represent and restore an entire file system
Periodic queries of buddies for stored data

Enabling Technologies
Peer-to-peer routing
Content-based indexing
Convergent encryption

Peer-to-Peer Routing
Pastry: a scalable, self-organizing routing and object location infrastructure
Each node has a nodeID
IDs are uniformly distributed in the ID space
A proximity metric measures the distance between two IDs

More on Pastry
Each node maintains three sets of state:
Leaf set: closest nodes in terms of nodeIDs
Neighborhood set: closest nodes in terms of the proximity metric
Routing table: prefix routing

Prefix Routing
In each step, a node forwards the message to a node whose nodeID shares with the destination a prefix at least one digit longer than the current node's shared prefix
Destination: 1230
Current nodeID: 1023
Next hop: 12--
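The routing rule above can be sketched in a few lines of Python. This is a toy illustration with 4-digit IDs; a real Pastry node consults a structured routing table rather than a flat list of known nodes:

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading digits two nodeIDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(current: str, dest: str, known: list[str]) -> str:
    """Forward to a known node whose ID shares a strictly longer
    prefix with the destination than the current node's ID does."""
    base = shared_prefix_len(current, dest)
    candidates = [n for n in known if shared_prefix_len(n, dest) > base]
    # Prefer the longest shared prefix among eligible candidates.
    return max(candidates, key=lambda n: shared_prefix_len(n, dest))

# The slide's example: from 1023 toward 1230, the next hop is a
# node whose ID starts with "12".
print(next_hop("1023", "1230", ["1201", "3312", "1022"]))  # -> 1201
```

Because each hop lengthens the matched prefix by at least one digit, routing terminates in a number of hops logarithmic in the size of the ID space.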

Pastiche's Use of Pastry
Uses two separate Pastry overlay networks during buddy discovery
Once a node is discovered, traffic is sent directly via IP
Pastiche adds two mechanisms:
A lighthouse sweep to discover distinct Pastry nodes
A distance metric based on file-system contents

Content-Based Indexing
Goal: identify file regions suitable for sharing
Uses Rabin fingerprints
A fingerprint is generated for each overlapping k-byte substring in a file
If the low-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor
Anchors divide files into chunks; each chunk is named by a secure hash of its contents
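A minimal sketch of anchor-based chunking. A simple additive rolling hash stands in for a true Rabin fingerprint, and the window size, mask, and anchor value are arbitrary illustrative choices:

```python
import hashlib

WINDOW = 16    # k-byte substring length (illustrative)
MASK = 0xFF    # selects the low-order bits of the fingerprint
ANCHOR = 0x42  # predetermined anchor value (arbitrary here)

def chunk_boundaries(data: bytes) -> list[int]:
    """Mark an anchor wherever the low bits of a rolling hash of
    the last WINDOW bytes match ANCHOR."""
    anchors = []
    h = 0
    for i, b in enumerate(data):
        h += b
        if i >= WINDOW:
            h -= data[i - WINDOW]  # slide the window forward
        if i >= WINDOW - 1 and (h & MASK) == ANCHOR:
            anchors.append(i + 1)
    return anchors

def chunks_with_hashes(data: bytes) -> list[tuple[str, bytes]]:
    """Split at anchors; name each chunk by its secure hash."""
    bounds = chunk_boundaries(data) + [len(data)]
    out, start = [], 0
    for end in bounds:
        if end > start:
            chunk = data[start:end]
            out.append((hashlib.sha256(chunk).hexdigest(), chunk))
            start = end
    return out
```

Because boundaries depend only on local content, inserting bytes into a file perturbs only nearby chunk boundaries, so most chunks (and their hashes) survive edits unchanged.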

Sharing with Confidentiality
Goal: share encrypted data without sharing keys
Need a single encrypted representation, for ease of comparison
Solution: convergent encryption

Convergent Encryption
So… say… how do you share a door without sharing its corresponding keys?

Convergent Encryption
How about using different safes to store those keys?

Convergent Encryption
And use different keys to open those safes
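Dropping the metaphor: the chunk's own hash is the door key, so every holder of the same chunk derives the same key and produces the same ciphertext; each participant then protects that key with its own private key (the safe). A toy sketch, in which a SHA-256 counter-mode keystream stands in for a real vetted cipher:

```python
import hashlib

def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only;
    a real system would use a standard cipher."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def convergent_encrypt(chunk: bytes) -> tuple[bytes, bytes]:
    """Key = hash of the content itself, so identical chunks encrypt
    to byte-identical ciphertext without any key exchange."""
    key = hashlib.sha256(chunk).digest()
    ct = bytes(a ^ b for a, b in zip(chunk, _keystream(key, len(chunk))))
    return key, ct

def convergent_decrypt(key: bytes, ct: bytes) -> bytes:
    """XOR with the same keystream restores the plaintext."""
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, len(ct))))

# Two nodes holding the same chunk produce the same ciphertext,
# which is what makes cross-node deduplication possible.
k1, c1 = convergent_encrypt(b"office 2000 dll")
k2, c2 = convergent_encrypt(b"office 2000 dll")
assert c1 == c2 and convergent_decrypt(k1, c1) == b"office 2000 dll"
```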

Implications of the Use of Convergent Encryption
A backup node outside the participating group cannot decrypt the data
A backup node inside the group learns that the client also stores that data
Tradeoff: information leak vs. storage efficiency

Design
Pastiche data is stored in chunks
Chunk boundaries are determined by content-based indexing
Chunks are encrypted with convergent encryption
Chunks carry owner lists

Design
When a newly written file is closed, it is scheduled for chunking
If a chunk already exists, the local host is added to its owner list
If not, the chunk is encrypted and written out
Chunking and writing are deferred to avoid work on short-lived files

Design
Chunks are immutable
When a file is written, its set of chunks may change
A chunk is not deleted until the last reference to it is removed
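The write and delete paths on the preceding slides might look like this in outline. Class and method names are assumptions; the real system writes convergent-encrypted chunks to disk rather than holding plaintext in a dict:

```python
import hashlib

class ChunkStore:
    """Sketch of the design above: content-addressed chunks with
    owner lists, deleted only when the last reference disappears."""

    def __init__(self) -> None:
        self.chunks = {}  # chunk hash -> chunk data
        self.owners = {}  # chunk hash -> set of owner hosts

    def put(self, data: bytes, owner: str) -> str:
        """If the chunk already exists, just extend its owner list;
        otherwise store the bytes for the first writer."""
        h = hashlib.sha256(data).hexdigest()
        if h not in self.chunks:
            self.chunks[h] = data
        self.owners.setdefault(h, set()).add(owner)
        return h

    def remove_owner(self, h: str, owner: str) -> None:
        """Drop one reference; reclaim the chunk only when the
        owner list becomes empty."""
        self.owners[h].discard(owner)
        if not self.owners[h]:
            del self.chunks[h], self.owners[h]
```

Because chunks are immutable and shared, rewriting a file only changes which hashes it references; the store itself never mutates chunk contents in place.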

Abstracts: Finding Redundancy
An ideal backup buddy is one that holds a superset of the new machine's data
To find it, send the full signature (all chunk hashes) of the new node to candidate buddies
However, this requires transferring 1.3 MB per GB of stored data
Solution: abstracts, which transfer only a random subset of the signatures
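A sketch of how an abstract could be drawn and used to score a candidate buddy. Function names and the sample size are illustrative; the key point is that the sampled fraction of hits is an unbiased estimate of the true overlap:

```python
import random

def make_abstract(signature: set[str], k: int, seed: int = 0) -> list[str]:
    """An abstract is a small random sample of a node's chunk hashes."""
    rng = random.Random(seed)
    return rng.sample(sorted(signature), min(k, len(signature)))

def estimate_coverage(abstract: list[str], candidate: set[str]) -> float:
    """Fraction of the sampled hashes the candidate buddy already
    holds; estimates the overlap without shipping the full signature."""
    hits = sum(1 for h in abstract if h in candidate)
    return hits / len(abstract)

# A candidate already holding a superset of our data scores 1.0.
mine = {f"chunk{i}" for i in range(1000)}
abstract = make_abstract(mine, 50)
print(estimate_coverage(abstract, mine | {"extra"}))  # -> 1.0
```

A 50-hash abstract is a few kilobytes regardless of disk size, versus roughly 1.3 MB of full signature per GB stored.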

Compare one disk to another
[Figure: Node 1's abstract, sampled from its signature, is matched against Node 2's signature]

Overlays: Finding a Set of Buddies
A desirable buddy should have:
A substantial overlap
Physical proximity (with at least one buddy far away, to survive geographically correlated failures)

Applied Use of Pastry
Pastiche uses two Pastry overlays to facilitate buddy discovery:
One for network proximity
One for file-system overlap
Coverage: the fraction of overlapping chunks stored on a site

Security Problems
A malicious node can:
Under-report coverage to avoid being chosen as a buddy
Over-report coverage to attract clients just to discard their chunks

Backup Protocol
A Pastiche node has full control over its backup schedule
A snapshot consists of three things:
Chunks to be added
Chunks to be removed
Metadata for those chunks
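The three-part snapshot named above can be sketched as a simple record plus the buddy-side apply step. Field and function names here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One backup snapshot: chunks added, chunks removed, and
    per-chunk metadata."""
    added: dict     # chunk hash -> (encrypted) chunk data
    removed: set    # chunk hashes no longer referenced
    metadata: dict  # chunk hash -> file metadata

def apply_snapshot(store: dict, snap: Snapshot) -> None:
    """How a buddy might fold a client's snapshot into the chunks
    it holds for that client."""
    store.update(snap.added)
    for h in snap.removed:
        store.pop(h, None)

store = {"oldhash": b"stale chunk"}
apply_snapshot(store, Snapshot(added={"newhash": b"fresh chunk"},
                               removed={"oldhash"}, metadata={}))
print(sorted(store))  # -> ['newhash']
```

Since the client alone decides when to take snapshots, buddies only ever see these deltas, never the client's write traffic.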

Restoration
A Pastiche node retains its archive skeleton, so performing partial restores is easy
To recover the whole machine, a node must first obtain its root node from one of the backup machines…

Detecting Failure and Malice
A node randomly requests data from its buddies
This bounds the probability that failed or malicious buddies go undetected
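The bound works because each uniform random probe independently lands on a dropped chunk with probability f, so k probes all miss with probability (1-f)^k. A small sketch of the arithmetic:

```python
import math

def undetected_prob(f: float, k: int) -> float:
    """Probability that k uniform random chunk probes all miss a
    silently dropped fraction f, so the cheating buddy goes undetected."""
    return (1 - f) ** k

def probes_needed(f: float, target: float) -> int:
    """Smallest k that drives the undetected probability below target."""
    return math.ceil(math.log(target) / math.log(1 - f))

# If a buddy dropped 10% of our chunks, about 44 random probes
# push the chance of missing the cheat below 1%.
print(probes_needed(0.10, 0.01))  # -> 44
```

The cost of auditing therefore grows only logarithmically in the desired confidence, independent of how many chunks the buddy stores.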

Preventing Greed
Someone could store things everywhere
Need to institute distributed quotas
Very difficult
Some proposed solutions:
Each node monitors the overall storage cost imposed by its backup clients
Problem: Sybil attacks (forge many identities, each of which consumes little storage)

Preventing Greed
Force each node to solve puzzles proportional to its storage consumption
Problems:
Needlessly expensive
Storage is traded against something other than storage
Heterogeneous computing power

Preventing Greed
Electronic currency
Problems:
Need to add atomic currency transactions
Complicated

Implementation
Chunkstore file system
Backup daemon

Performance Overhead

The Chance of Finding Buddies