Pastiche: Making Backup Cheap and Easy

Introduction
Backup is cumbersome and expensive
  ~$4/GB/Month
Small-scale solutions are dominated by administrative effort
Large-scale solutions require centralized management

Pastiche
Observation 1: disks are no longer full
  Can use excess capacity for efficient, effective, and administration-free backup
Use untrusted machines to perform backup services
  Need replication for reliability
  Need to balance locality and reliability

Pastiche
Observation 2: much of the data on a given machine is not unique
  Office 2000: 217 MB footprint
  Different installations are largely the same
Exploiting this redundancy can achieve storage savings

Pastiche
Built on three pieces of research
  Pastry: peer-to-peer, self-administering, scalable routing
  Content-based indexing: easy discovery of redundant data
  Convergent encryption: use the same encrypted representation without sharing keys

Challenges
How to discover backup buddies without a centralized directory?
How can nodes reuse their own state to back up others?
How can nodes restore files/machines without requiring administrative intervention?
How can nodes detect unfaithful buddies?

Basic Idea
Summarize storage content with abstracts
Use abstracts to locate buddies
A skeleton tree is used to represent and restore an entire file system
Periodic queries of buddies for stored data

Enabling Technologies
Peer-to-peer routing
Content-based indexing
Convergent encryption

Peer-to-Peer Routing
Pastry: scalable, self-organizing routing and object location infrastructure
Each node has a nodeID
IDs are uniformly distributed in the ID space
A proximity metric measures the network distance between two nodes

More on Pastry
Each node maintains three sets of state
  Leaf set: closest nodes in terms of nodeIDs
  Neighborhood set: closest nodes in terms of the proximity metric
  Routing table: prefix routing

Prefix Routing
In each step, a node forwards the message to a node whose nodeID shares a prefix with the destination that is at least one digit longer than the prefix shared by the current nodeID
  Destination: 1230
  Current NodeID: 1023
  Next Hop: 12--
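
The forwarding rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Pastry's actual routing code: `shared_prefix_len` and `next_hop` are hypothetical names, nodeIDs are plain digit strings, and the flat `known` list stands in for Pastry's routing table. Picking the longest match is a simplification; the slide only requires a prefix at least one digit longer.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known: List[str]) -> Optional[str]:
    """Pick a known node that shares a longer prefix with dest than current does.

    Returns None when no known node makes progress (real Pastry would then
    fall back to other state; that fallback is not modeled here).
    """
    have = shared_prefix_len(current, dest)
    better = [n for n in known if shared_prefix_len(n, dest) > have]
    if not better:
        return None
    return max(better, key=lambda n: shared_prefix_len(n, dest))


# The slide's example: node 1023 routing toward 1230 forwards to some 12-- node.
hop = next_hop("1023", "1230", ["1200", "1031", "3321"])
```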

Pastiche’s Use of Pastry
Uses two separate Pastry overlay networks during buddy discovery
Once a node is discovered, traffic is sent directly via IP
Pastiche adds two mechanisms
  Lighthouse sweep to discover distinct Pastry nodes
  Distance metric based on the file system contents

Content-Based Indexing
Goal: identify file regions for sharing
Use Rabin fingerprints
  A fingerprint is generated for each overlapping k-byte substring in a file
  If the low-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor
Anchors divide files into chunks; each chunk is associated with a secure hash value
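
A rough sketch of anchor-based chunking follows. It is not the presenters' implementation: a Rabin-Karp style polynomial rolling hash stands in for true Rabin fingerprints, and `WINDOW`, `MASK`, `MAGIC`, `BASE`, and `MOD` are assumed values chosen only to keep the example small. SHA-1 is used as the per-chunk secure hash purely for illustration.

```python
import hashlib

WINDOW = 16          # k: bytes per sliding window (assumed value)
MASK = (1 << 6) - 1  # compare the low 6 bits of the fingerprint (assumed)
MAGIC = 0            # predetermined value the low bits must match (assumed)
BASE = 257           # rolling-hash base (stand-in for a Rabin polynomial)
MOD = (1 << 61) - 1  # rolling-hash modulus


def chunk_boundaries(data: bytes) -> list:
    """Return the end offset of every chunk; the last one is len(data)."""
    if not data:
        return []
    cuts = []
    if len(data) >= WINDOW:
        top = pow(BASE, WINDOW - 1, MOD)
        h = 0
        for byte in data[:WINDOW]:
            h = (h * BASE + byte) % MOD
        end = WINDOW  # h always covers data[end - WINDOW:end]
        while True:
            if (h & MASK) == MAGIC:
                cuts.append(end)  # this offset is an anchor
            if end == len(data):
                break
            # Slide the window one byte: drop the oldest byte, add the next.
            h = ((h - data[end - WINDOW] * top) * BASE + data[end]) % MOD
            end += 1
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))
    return cuts


def chunks(data: bytes) -> list:
    """Split data at anchors and pair each chunk with its SHA-1 digest."""
    out, start = [], 0
    for end in chunk_boundaries(data):
        piece = data[start:end]
        out.append((hashlib.sha1(piece).hexdigest(), piece))
        start = end
    return out
```

Because anchors depend only on the bytes inside the window, inserting data at the front of a file shifts later anchors but does not change them, so every chunk after the first keeps its hash.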

Sharing with Confidentiality
Sharing encrypted data without sharing keys
Need to have a single encrypted representation
  For ease of comparison
Use convergent encryption

Convergent Encryption
So…say…how do you share a door without sharing its corresponding keys?

Convergent Encryption
How about using different safes to store those keys?

Convergent Encryption
And use different keys to access those keys
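
The door-and-safe analogy can be made concrete with a toy sketch. The assumptions are loud ones: the chunk key is taken to be the SHA-256 hash of the plaintext, and a SHA-256 counter keystream XORed with the data stands in for a real cipher only so the example runs on the standard library. It is not secure and not the scheme's actual construction. `wrap_key` plays the role of the per-owner "safe"; all function names are hypothetical.

```python
import hashlib


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 over key plus a counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def _xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def convergent_encrypt(chunk: bytes):
    """Derive the key from the content, so equal chunks encrypt identically."""
    key = hashlib.sha256(chunk).digest()
    return key, _xor(chunk, key)


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return _xor(ciphertext, key)


def wrap_key(chunk_key: bytes, owner_secret: bytes) -> bytes:
    """Lock the chunk key in the owner's own 'safe' (XOR is its own inverse)."""
    return _xor(chunk_key, owner_secret)


# Two owners encrypt the same chunk independently and get the same ciphertext,
# yet each stores the chunk key under a secret the other never sees.
key_a, ct_a = convergent_encrypt(b"shared library bytes")
key_b, ct_b = convergent_encrypt(b"shared library bytes")
safe_a = wrap_key(key_a, b"alice-secret")
safe_b = wrap_key(key_b, b"bob-secret")
```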

Implications of the Use of Convergent Encryption
If a backup node does not itself hold the data
  It cannot decrypt the data
If it does, the backup node knows that the other node also stores that data
Information leak vs. storage efficiency

Design
Pastiche data is stored in chunks
  Chunk boundaries are determined by content-based indexing
  Encrypted with convergent encryption
Chunks carry owner lists

Design
When a newly written file is closed, it is scheduled for chunking
  If a chunk already exists, the local host is added to the owner list
  If not, encrypt the chunk and write it out
Chunking and writing are deferred to avoid processing short-lived files

Design
Chunks are immutable
When a file is written, its set of chunks may change
A chunk is not deleted until the last reference to it is removed
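
The owner-list behavior from the Design slides can be sketched as a small in-memory store. `ChunkStore`, `put`, and `release` are hypothetical names, SHA-1 of the chunk is assumed as the chunk ID, each owner counts as one reference, and encryption is left out to keep the focus on sharing and deletion.

```python
import hashlib


class ChunkStore:
    """In-memory stand-in for a chunk store with per-chunk owner lists."""

    def __init__(self):
        self.chunks = {}  # chunk id -> immutable chunk bytes
        self.owners = {}  # chunk id -> set of hosts referencing the chunk

    def put(self, owner: str, data: bytes) -> str:
        """Store a chunk, or just add the owner if the chunk already exists."""
        cid = hashlib.sha1(data).hexdigest()
        if cid not in self.chunks:
            self.chunks[cid] = data
            self.owners[cid] = set()
        self.owners[cid].add(owner)
        return cid

    def release(self, owner: str, cid: str) -> None:
        """Drop one owner's reference; delete the chunk when none remain."""
        holders = self.owners.get(cid)
        if holders is None:
            return
        holders.discard(owner)
        if not holders:
            del self.owners[cid]
            del self.chunks[cid]


store = ChunkStore()
cid = store.put("local-host", b"common data")
store.put("buddy-client", b"common data")  # same chunk, second owner
store.release("local-host", cid)           # still referenced, so it stays
```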

Abstracts: Finding Redundancy
An ideal backup buddy is one that holds a superset of the new machine’s data
To find it, send the full signature (hashes) of the new node to candidate buddies
However, that requires transferring 1.3 MB per GB of stored data
Solution: abstracts, which transfer only a random subset of the signature
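
A minimal sketch of the abstract idea, under stated assumptions: a signature is modeled as a set of chunk hashes (small integers here for readability), `make_abstract` draws a uniform random sample, and `coverage` is the fraction of sampled hashes that a candidate also holds. Both function names and the sample size are hypothetical.

```python
import random


def make_abstract(signature: set, size: int, seed=None) -> set:
    """Pick a random subset of the signature to send instead of all of it."""
    rng = random.Random(seed)
    ordered = sorted(signature)  # random.sample needs a sequence, not a set
    return set(rng.sample(ordered, min(size, len(ordered))))


def coverage(abstract: set, candidate_signature: set) -> float:
    """Fraction of the abstract's hashes that the candidate already stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate_signature) / len(abstract)


new_node = set(range(100))        # hypothetical chunk hashes on the new node
candidate = set(range(50, 150))   # a candidate buddy sharing half of them
abstract = make_abstract(new_node, 20, seed=1)
estimate = coverage(abstract, candidate)  # sampled estimate of the overlap
```

The estimate is only as good as the sample; a larger abstract costs more bandwidth but tracks the true overlap more closely.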

Compare one disk to another
[Figure: the chunk signatures of Node1 and Node2 shown side by side, with Node1's abstract (1, 67, 13, 15) drawn from its signature]

Overlays: Finding a Set of Buddies
A desirable buddy should
  Have a substantial overlap
  Be physically nearby (with at least one buddy far away to survive geographically correlated failures)

Applied Use of Pastry
Pastiche uses two Pastry overlays to facilitate buddy discovery
  One for network proximity
  One for file system overlap
Coverage: the fraction of overlapping chunks stored on a site

Security Problems
A malicious node can
  Under-report coverage to avoid being chosen as a buddy
  Over-report coverage to attract clients just to discard their chunks

Backup Protocol
A Pastiche node has full control over the backup schedule
A snapshot consists of three things
  Chunks to be added
  Chunks to be removed
  Metadata of those chunks
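
One way to picture the three parts of a snapshot is as a diff between two views of the file system. The sketch below is an assumption-heavy illustration: each view is a hypothetical mapping from file path to its list of chunk IDs, and "metadata" is reduced to the chunk lists of files that changed. `Snapshot` and `make_snapshot` are invented names.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Snapshot:
    add: Set[str]                   # chunks the buddies do not hold yet
    remove: Set[str]                # chunks no longer referenced
    metadata: Dict[str, List[str]]  # changed files -> their new chunk lists


def make_snapshot(previous: Dict[str, List[str]],
                  current: Dict[str, List[str]]) -> Snapshot:
    """Diff two path -> chunk-id maps into the three parts of a snapshot."""
    old_ids = {cid for ids in previous.values() for cid in ids}
    new_ids = {cid for ids in current.values() for cid in ids}
    changed = {path: ids for path, ids in current.items()
               if previous.get(path) != ids}
    return Snapshot(add=new_ids - old_ids,
                    remove=old_ids - new_ids,
                    metadata=changed)


before = {"a.txt": ["c1", "c2"], "b.txt": ["c3"]}
after = {"a.txt": ["c1", "c4"], "b.txt": ["c3"]}
snap = make_snapshot(before, after)
```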

Restoration
A Pastiche node retains its archive skeleton, so performing partial restores is easy
To recover the whole machine, a node has to obtain its root node from one of the backup machines first…

Detecting Failure and Malice
A node randomly requests data from its buddies
This bounds the probability that failures and malicious nodes go undetected
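
The bound can be illustrated with a simple model that is an assumption of this sketch, not something stated on the slide: a buddy has lost or discarded a fraction f of a node's chunks, and each query checks one chunk chosen uniformly and independently. The buddy then survives q queries undetected with probability (1 - f)^q. The function names are hypothetical.

```python
import math


def undetected_probability(missing_fraction: float, queries: int) -> float:
    """Chance that `queries` independent random checks all hit intact chunks."""
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, max_undetected: float) -> int:
    """Smallest query count that pushes the undetected chance below the target."""
    if not 0.0 < missing_fraction < 1.0 or not 0.0 < max_undetected < 1.0:
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(max_undetected) / math.log(1.0 - missing_fraction))


# A buddy that dropped 10% of the chunks is caught with 99% confidence
# after this many random queries.
needed = queries_needed(0.1, 0.01)
```

Under this model, the number of queries grows only logarithmically as the tolerated miss probability shrinks, so periodic spot checks stay cheap.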

Preventing Greed
Someone can store things everywhere
Need to institute distributed quotas
  Very difficult
Some proposed solutions
  Each node monitors the overall storage costs imposed by its backup clients
  Problem: Sybil attacks (forging many entities that each consume little storage)

Preventing Greed
Force each node to solve puzzles proportional to storage consumption
Problems
  Needlessly expensive
  Storage is traded against something other than storage
  Heterogeneous computing power

Preventing Greed
Electronic currency
Problems
  Need to add atomic currency transactions
  Complicated

Implementation
Chunkstore file system
Backup daemon

Performance Overhead

The Chance of Finding Buddies