Pastiche: Making Backup Cheap and Easy


Presentation on theme: "Pastiche: Making Backup Cheap and Easy"— Presentation transcript:

1 Pastiche: Making Backup Cheap and Easy

2 Introduction Backup is cumbersome and expensive ~$4/GB/Month
Small-scale solutions dominated by administrative efforts Large-scale solutions require centralized management

3 Pastiche Observation 1: disk is no longer full
Can use excess capacity for efficient, effective, and administration-free backup Use untrusted machines to perform backup services Need replication for reliability Need to balance locality and reliability

4 Pastiche Observation 2: Much of the data on a given machine is not unique Office 2000: 217 MB footprint Different installations are largely the same Exploiting this redundancy can achieve storage savings

5 Pastiche Built on three pieces of research
Pastry: Peer-to-peer, self-administering, scalable routing Content-based indexing: easy discovering of redundant data Convergent encryption: use the same encrypted representation without sharing keys

6 Challenges How to discover backup buddies without a centralized directory? How can nodes reuse their own state to back up others? How can nodes restore files/machines without requiring administrative intervention? How can nodes detect unfaithful buddies?

7 Basic Idea Summarize storage content with abstracts
Use abstracts to locate buddies A skeleton tree is used to represent and restore an entire file system Periodic queries of buddies for stored data

8 Enabling Technologies
Peer-to-peer routing Content-based indexing Convergent encryption

9 Peer-to-Peer Routing Pastry: scalable, self-organizing, routing and object location infrastructure Each node has a nodeID IDs are uniformly distributed in the ID space A proximity metric to measure the distance between two IDs

10 More on Pastry Each node maintains three sets of states Leaf set
Closest nodes in terms of nodeIDs Neighborhood set Closest nodes in terms of the proximity metric Routing table Prefix routing

11 Prefix Routing In each step, a node forwards the message to a node whose nodeID shares a prefix that is at least one digit longer than the prefix of the current nodeID Destination: 1230 Current NodeID: 1023 Next Hop: 12--
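The prefix-routing step on this slide can be sketched in a few lines. This is only an illustration under simplifying assumptions: nodeIDs are digit strings, `known_nodes` stands in for the node's routing state, and the function names are hypothetical. Pastry's fallback for the case where no such node is known is omitted.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known_nodes: List[str]) -> Optional[str]:
    """Pick a known node whose ID shares a longer prefix with dest than ours does.

    Simplification (assumption): among qualifying nodes we take the one with
    the longest shared prefix. Returns None if no known node qualifies.
    """
    have = shared_prefix_len(current, dest)
    candidates = [n for n in known_nodes if shared_prefix_len(n, dest) > have]
    if not candidates:
        return None
    return max(candidates, key=lambda n: shared_prefix_len(n, dest))
```

With the slide's example, `next_hop("1023", "1230", ["1201", "3120", "1033"])` returns `"1201"`, a node matching the `12--` pattern.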

12 Pastiche’s Use of Pastry
Uses two separate Pastry overlay networks during buddy discovery Once a node is discovered, traffic is sent directly via IP Pastiche adds two mechanisms Lighthouse sweep to discover distinct Pastry nodes Distance metric based on the FS contents

13 Content-Based Indexing
Goal: identify file regions for sharing Use Rabin fingerprints A fingerprint is generated for each overlapping k-byte substring in a file If the lower-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor Anchors divide files into chunks; each chunk is associated with a secure hash value
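A rough sketch of anchor-based chunking follows. It is not the scheme's actual fingerprint code: CRC32 over each window (from Python's `zlib`) is used as a stand-in for Rabin fingerprints, and the window size, bit mask, and magic value are arbitrary assumptions chosen for illustration.

```python
import hashlib
import zlib
from typing import List

WINDOW = 16   # k: bytes per fingerprint window (assumed value)
MASK = 0x3F   # compare the 6 low-order bits (assumed value)
MAGIC = 0     # the "predetermined value" (assumed value)


def chunk(data: bytes) -> List[bytes]:
    """Split data at anchors found by fingerprinting every k-byte window."""
    chunks = []
    start = 0
    for end in range(WINDOW, len(data) + 1):
        # Stand-in fingerprint: CRC32 of the window ending at `end`.
        fp = zlib.crc32(data[end - WINDOW:end])
        if (fp & MASK) == MAGIC:
            chunks.append(data[start:end])  # anchor: close the chunk here
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last anchor
    return chunks


def chunk_ids(data: bytes) -> List[str]:
    """Associate each chunk with a secure hash (SHA-256 here)."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because anchors depend only on window contents, inserting bytes at the front of a file changes the first chunk but leaves the chunks after the first original anchor unchanged, which is what makes the chunks shareable.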

14 Sharing with Confidentiality
Sharing encrypted data without sharing keys Need to have a single encrypted representation For the ease of comparisons Use convergent encryption

15 Convergent Encryption
So…say…how do you share a door without sharing its corresponding keys?

16 Convergent Encryption
How about using different safes to store those keys?

17 Convergent Encryption
And use different keys to access those keys
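In code, the idea behind the door-and-safe metaphor might look like the sketch below: the encryption key is derived from the data itself, so identical chunks always produce identical ciphertext. The cipher here is a toy XOR keystream built from SHA-256, used only to keep the example self-contained; it is an assumption for illustration and not a secure construction. Each owner would then protect its copy of the key with its own secret, which is not shown.

```python
import hashlib
from typing import Tuple


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256(key || counter) blocks, truncated to n bytes."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def convergent_encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    """Return (key, ciphertext); the key is the hash of the plaintext."""
    key = hashlib.sha256(plaintext).digest()
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return key, ciphertext


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Invert convergent_encrypt given the content-derived key."""
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))
```

Two machines holding the same chunk compute the same key and the same ciphertext independently, so the encrypted chunks can be compared and shared without either machine handing its own secrets to the other.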

18 Implications of the Use of Convergent Encryption
If a backup node does not hold the data itself, it cannot decrypt the data If it does, the backup node knows that the other node also stores that data Information leak vs. storage efficiency

19 Design Pastiche data is stored in chunks
Chunk boundaries determined by content-based indexing Encrypted with convergent encryption Chunks carry owner lists

20 Design When a newly written file is closed, it is scheduled for chunking If a chunk already exists, the local host is added to the owner list If not, encrypt the chunk and write it out Chunking and writing deferred to avoid short-lived files

21 Design Chunks are immutable
When a file is written, its set of chunks may change A chunk is not deleted until the last reference to it is removed
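The owner-list behavior from the Design slides can be sketched as a small in-memory store. The class and method names are hypothetical, encryption is left out, and a real chunkstore would persist to disk; the point is only the add-owner-or-write and delete-on-last-reference logic.

```python
import hashlib
from typing import Dict, Set


class ChunkStore:
    """Illustrative immutable chunk store with per-chunk owner lists."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}     # chunk id -> chunk bytes
        self.owners: Dict[str, Set[str]] = {}  # chunk id -> owning hosts

    def put(self, owner: str, data: bytes) -> str:
        """Store a chunk, or just record a new owner if it already exists."""
        cid = hashlib.sha256(data).hexdigest()
        if cid not in self.chunks:
            self.chunks[cid] = data  # written once, never modified
            self.owners[cid] = set()
        self.owners[cid].add(owner)
        return cid

    def release(self, owner: str, cid: str) -> None:
        """Drop one owner's reference; delete the chunk when none remain."""
        owners = self.owners.get(cid)
        if owners is None:
            return
        owners.discard(owner)
        if not owners:
            del self.owners[cid]
            del self.chunks[cid]
```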

22 Abstracts: Finding Redundancy
An ideal backup buddy is one that holds a superset of the new machine’s data To find it, send the full signature (hashes) of the new node to candidate buddies However, we need to transfer 1.3MB per GB of stored data Solution: Abstracts—transfer only a random subset of signatures

23 Compare one disk to another
[Figure: Node1 signature and Node2 signature shown side by side as lists of chunk hashes. Node1 abstract: 1 67 13 15]
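A minimal sketch of the abstract idea: sample a random subset of a node's chunk hashes and let a candidate buddy report what fraction of that sample it already stores. The function names, the sample size, and the use of a seeded generator are assumptions for illustration only.

```python
import random
from typing import Optional


def make_abstract(signature: set, size: int, seed: Optional[int] = None) -> set:
    """Pick a random subset of the full signature (the set of chunk hashes)."""
    rng = random.Random(seed)
    k = min(size, len(signature))
    # sorted() gives random.sample a sequence and makes seeded runs repeatable
    return set(rng.sample(sorted(signature), k))


def estimated_coverage(abstract: set, candidate_signature: set) -> float:
    """Fraction of the abstract's hashes that the candidate already stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate_signature) / len(abstract)
```

A candidate whose signature is a superset of the new node's signature scores 1.0 for any abstract, while the amount of data sent is bounded by the abstract size rather than by the size of the disk.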

24 Overlays: Finding a Set of Buddies
A desirable buddy should have A substantial overlap Physically nearby (with at least one far away to survive geographically correlated failures)

25 Applied Use of Pastry Pastiche uses two Pastry overlays to facilitate buddy discovery One for network proximity One for file system overlap Coverage—the fraction of overlapping chunks stored on a site

26 Security Problems A malicious node can
Under-report coverage to avoid being chosen as a buddy Over-report coverage to attract clients just to discard their chunks

27 Backup Protocol A Pastiche node has full control over the backup schedule A snapshot consists of three things Chunks to be added Chunks to be removed Metadata of those chunks
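The three parts of a snapshot can be sketched as a set difference between the chunk IDs of the previous and current backup state. The `Snapshot` type, the function name, and the shape of the metadata (a dict keyed by chunk ID) are assumptions made for this example; the slides do not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Snapshot:
    add: Set[str]                 # chunks the buddy must start storing
    remove: Set[str]              # chunks the buddy may stop storing
    metadata: Dict[str, dict] = field(default_factory=dict)


def make_snapshot(previous: Set[str], current: Set[str],
                  metadata: Dict[str, dict]) -> Snapshot:
    """Describe the change from the previous chunk set to the current one."""
    add = current - previous
    remove = previous - current
    # Keep metadata only for chunks mentioned in this snapshot.
    touched = {cid: metadata[cid] for cid in add | remove if cid in metadata}
    return Snapshot(add=add, remove=remove, metadata=touched)
```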

28 Restoration A Pastiche node retains its archive skeleton, so performing partial restores is easy To recover the whole machine, a node has to obtain its root node from one of the backup machines first…

29 Detecting Failure and Malice
A node randomly requests data from its buddies Can bound probability of having failures and malicious nodes undetected
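The bound mentioned here can be worked out under a simple model, which is an assumption of this sketch rather than something stated on the slide: a buddy has silently dropped a fraction f of the chunks, and each query picks a chunk uniformly and independently. The chance that k queries all hit surviving chunks is then (1 - f)^k.

```python
import math


def undetected_probability(missing_fraction: float, queries: int) -> float:
    """Chance that `queries` independent random checks all miss the loss."""
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, max_undetected: float) -> int:
    """Smallest number of queries that pushes the miss chance to the target.

    Requires 0 < missing_fraction < 1 and 0 < max_undetected < 1.
    """
    return math.ceil(math.log(max_undetected) / math.log(1.0 - missing_fraction))
```

The miss probability falls geometrically with the number of queries, so even a buddy that discards a small share of chunks is likely to be caught after a modest number of periodic checks.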

30 Preventing Greed Someone can store things everywhere
Need to institute distributed quota Very difficult Some proposed solutions Each node monitors the overall storage costs imposed by its backup clients Problem: Sybil attacks (forge many entities that each consume little storage)

31 Preventing Greed Force each node to solve puzzles proportional to storage consumption Problems: Needlessly expensive Storage is traded against something other than storage Heterogeneous computing power

32 Preventing Greed Electronic currency Problems:
Need to add atomic currency transactions Complicated

33 Implementation Chunkstore file system Backup daemon

34 Performance Overhead

35 The Chance of Finding Buddies

