Published by Kathy Howse. Modified over 9 years ago.
1
Pastiche: Making Backup Cheap and Easy
2
Introduction
- Backup is cumbersome and expensive
- ~$4/GB/Month (now $0.02/GB)
- Small-scale solutions dominated by administrative efforts
- Large-scale solutions require centralized management
[Google 2014]
3
Pastiche
- Observation 1: disk is no longer full
- Can use excess capacity for efficient, effective, and administration-free backup
- Use untrusted machines to perform backup services
- Need replication for reliability
- Need to balance locality and reliability
4
Pastiche
- Observation 2: Much of the data on a given machine is not unique
- Office 2000: 217 MB footprint
- Different installations are largely the same
- Exploiting this redundancy can achieve storage savings
5
Pastiche
- Built on three pieces of research
  - Pastry: peer-to-peer, self-administering, scalable routing
  - Content-based indexing: easy discovery of redundant data
  - Convergent encryption: use the same encrypted representation without sharing keys
6
Challenges
- How to discover backup buddies without a centralized directory?
- How can nodes reuse their own state to back up others?
- How can nodes restore files/machines without requiring administrative intervention?
- How can nodes detect unfaithful buddies?
7
Basic Idea
- Summarize storage content with abstracts
- Use abstracts to locate buddies
- A skeleton tree is used to represent and restore an entire file system
- Periodic queries of buddies for stored data
8
Enabling Technologies
- Peer-to-peer routing
- Content-based indexing
- Convergent encryption
9
Peer-to-Peer Routing
- Pastry: scalable, self-organizing routing and object location infrastructure
- Each node has a nodeID
- IDs are uniformly distributed in the ID space
- A proximity metric to measure the distance between two IDs
10
More on Pastry
- Each node maintains three sets of state
  - Leaf set: closest nodes in terms of nodeIDs
  - Neighborhood set: closest nodes in terms of the proximity metric
  - Routing table: prefix routing
11
Prefix Routing
- In each step, a node forwards the message to a node whose nodeID shares a prefix with the destination that is at least one digit longer than the prefix the current nodeID shares with it
- Destination: 1230
- Current NodeID: 1023
- Next Hop: 12--
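The forwarding rule on this slide can be sketched in a few lines. This is a simplified illustration, not Pastry's actual routing code: the function names and the flat `known_nodes` list are assumptions made for the example, and real Pastry also consults its leaf set and falls back to numerically closer nodes when no longer-prefix entry exists.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known_nodes: List[str]) -> Optional[str]:
    """Pick a known node that shares a strictly longer prefix with dest
    than the current node does (hypothetical helper, simplified)."""
    have = shared_prefix_len(current, dest)
    better = [n for n in known_nodes if shared_prefix_len(n, dest) > have]
    if not better:
        return None  # real Pastry would fall back to its leaf set here
    return max(better, key=lambda n: shared_prefix_len(n, dest))


# Slide example: destination 1230, current node 1023, next hop matches 12--.
hop = next_hop("1023", "1230", ["0312", "1100", "1201"])
```

With the slide's values, the current node shares only the prefix `1` with the destination, so any known node whose ID starts with `12` qualifies as the next hop.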
12
Pastiche’s Use of Pastry
- Uses two separate Pastry overlay networks during buddy discovery
- Once a node is discovered, traffic is sent directly via IP
- Pastiche adds two mechanisms
  - Lighthouse sweep to discover distinct Pastry nodes
  - Distance metric based on the FS contents
13
Content-Based Indexing
- Goal: identify file regions for sharing
- Use Rabin fingerprints
- A fingerprint is generated for each overlapping k-byte substring in a file
- If the lower-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor
- Anchors divide files into chunks; each chunk is associated with a secure hash value
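A minimal sketch of anchor-based chunking follows. It substitutes a simple polynomial rolling hash for true Rabin fingerprints, and the window size, bit mask, anchor value, and the use of SHA-256 as the secure hash are all assumptions chosen for illustration rather than Pastiche's parameters.

```python
import hashlib
from typing import List

WINDOW = 16          # k: bytes covered by each fingerprint (assumed value)
MASK = 0x3F          # compare the low 6 bits (assumed; ~64-byte chunks)
ANCHOR_VALUE = 0     # the "predetermined value" (assumed)
BASE = 257           # rolling-hash base; a stand-in for a Rabin polynomial
MOD = (1 << 61) - 1


def split_chunks(data: bytes) -> List[bytes]:
    """Split data at offsets whose window fingerprint matches ANCHOR_VALUE."""
    chunks, start, fp = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            fp = (fp - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        fp = (fp * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= WINDOW and (fp & MASK) == ANCHOR_VALUE:
            chunks.append(data[start:i + 1])          # anchor ends a chunk
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                   # trailing partial chunk
    return chunks


def chunk_ids(data: bytes) -> List[str]:
    """Secure hash per chunk, used to recognize identical chunks."""
    return [hashlib.sha256(c).hexdigest() for c in split_chunks(data)]
```

Because each fingerprint depends only on the last `WINDOW` bytes, inserting data near the start of a file shifts the anchors but leaves later chunks, and therefore their hashes, unchanged. That property is what lets different machines discover shared regions.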
14
Sharing with Confidentiality
- Sharing encrypted data without sharing keys
- Need to have a single encrypted representation
  - For the ease of comparisons
- Use convergent encryption
15
Convergent Encryption
- So…say…how do you share a door without sharing its corresponding keys?
16
Convergent Encryption
- How about using different safes to store those keys?
17
Convergent Encryption
- And use different keys to access those keys
18
Implications of the Use of Convergent Encryption
- If a backup node does not itself hold the same data
  - It cannot decrypt the data
- If it does, the backup node knows that the client also stores that data
- Information leak vs. storage efficiency
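The door-and-safe analogy can be made concrete with a toy sketch. The chunk key is derived from the chunk's own contents (the shared "door key"), and each owner wraps that key under a private secret (their own "safe"). The SHA-256 keystream cipher below is a stand-in written only so the example runs with the standard library; it is not a vetted cipher, and all function names are hypothetical.

```python
import hashlib
from typing import Tuple


def _xor_keystream(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key || offset) blocks.
    Illustrative only; a real system would use a standard cipher."""
    out = bytearray()
    for off in range(0, len(data), 32):
        pad = hashlib.sha256(key + off.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[off:off + 32], pad))
    return bytes(out)


def convergent_encrypt(chunk: bytes) -> Tuple[bytes, bytes, str]:
    """Key comes from the plaintext, so equal chunks give equal ciphertext."""
    key = hashlib.sha256(chunk).digest()
    ciphertext = _xor_keystream(key, chunk)
    chunk_id = hashlib.sha256(ciphertext).hexdigest()
    return key, ciphertext, chunk_id


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return _xor_keystream(key, ciphertext)


def wrap_key(owner_secret: bytes, chunk_id: str, chunk_key: bytes) -> bytes:
    """Protect the chunk key under an owner's private secret.
    XOR is its own inverse, so the same call also unwraps."""
    wrapping = hashlib.sha256(owner_secret + chunk_id.encode()).digest()
    return _xor_keystream(wrapping, chunk_key)
```

Two owners of the same chunk produce the same ciphertext and chunk ID, so a store can keep a single copy, yet neither can read the other's wrapped key. The same property is the source of the information leak noted on the slide: a node that holds a chunk can tell that another node holds it too.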
19
Design
- Pastiche data is stored in chunks
- Chunk boundaries determined by content-based indexing
- Encrypted with convergent encryption
- Chunks carry owner lists
20
Design
- When a newly written file is closed, it is scheduled for chunking
- If a chunk already exists, the local host is added to the owner list
- If not, encrypt the chunk and write it out
- Chunking and writing deferred to avoid short-lived files
21
Design
- Chunks are immutable
- When a file is written, its set of chunks may change
- A chunk is not deleted until the last reference to it is removed
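The owner-list bookkeeping described on the Design slides can be sketched as a small in-memory store. The class and method names are hypothetical, and the sketch ignores encryption, persistence, and deferred writes.

```python
from typing import Dict, Set


class ChunkStore:
    """Immutable chunks keyed by ID, each with a set of owning hosts."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.owners: Dict[str, Set[str]] = {}

    def put(self, chunk_id: str, payload: bytes, host: str) -> bool:
        """Store a chunk for host. Returns True only if bytes were written."""
        if chunk_id in self.data:
            self.owners[chunk_id].add(host)   # chunk exists: just add owner
            return False
        self.data[chunk_id] = payload          # new chunk: write it once
        self.owners[chunk_id] = {host}
        return True

    def release(self, chunk_id: str, host: str) -> bool:
        """Drop host's reference. Returns True if the chunk was deleted."""
        owners = self.owners[chunk_id]
        owners.discard(host)
        if owners:
            return False                       # other references remain
        del self.owners[chunk_id]
        del self.data[chunk_id]
        return True
```

A second `put` of an existing chunk costs no extra storage, and the chunk is removed only when the last owner releases it.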
22
Abstracts: Finding Redundancy
- An ideal backup buddy is one that holds a superset of the new machine’s data
- To find it, send the full signature (hashes) of the new node to candidate buddies
- However, we need to transfer 1.3MB per GB of stored data
- Solution: Abstracts, which transfer only a random subset of signatures
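The 1.3MB-per-GB figure is consistent with 20-byte hashes over chunks averaging 16 KB; those two parameters are assumptions here, chosen because they reproduce the slide's number. The sketch below shows that arithmetic and one simple way to draw an abstract, with `make_abstract` being a hypothetical helper rather than the authors' sampling method.

```python
import random
from typing import Iterable, Optional, Set

HASH_BYTES = 20            # assumed size of one chunk hash
AVG_CHUNK_BYTES = 16 * 1024  # assumed average chunk size


def signature_bytes(stored_bytes: int) -> int:
    """Size of the full signature for a given amount of stored data."""
    return (stored_bytes // AVG_CHUNK_BYTES) * HASH_BYTES


def make_abstract(signature: Iterable[str], sample_size: int,
                  seed: Optional[int] = None) -> Set[str]:
    """Random subset of the signature, small enough to send to candidates."""
    hashes = sorted(set(signature))
    rng = random.Random(seed)
    return set(rng.sample(hashes, min(sample_size, len(hashes))))


per_gb = signature_bytes(2 ** 30)  # 65,536 chunks * 20 bytes
```

Under these assumptions one GB of data yields 65,536 chunk hashes, or 1,310,720 bytes of signature, while an abstract of a few hundred hashes is only a few kilobytes.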
23
Compare one disk to another
[Figure: Node1's signature and abstract compared against Node2's signature; the chunk values were lost in extraction]
24
Overlays: Finding a Set of Buddies
- A desirable buddy should
  - Have a substantial overlap
  - Be physically nearby (with at least one far away to survive geographically correlated failures)
25
Applied Use of Pastry
- Pastiche uses two Pastry overlays to facilitate buddy discovery
  - One for network proximity
  - One for file system overlap
- Coverage: the fraction of overlapping chunks stored on a site
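Coverage as defined here can be estimated by checking an abstract against a candidate's chunk set. The sketch below is illustrative only: `coverage` and `rank_candidates` are hypothetical names, and the candidate sets are passed in directly, whereas in Pastiche the candidate computes and reports the value itself.

```python
from typing import Dict, List, Set, Tuple


def coverage(abstract: Set[str], candidate_chunks: Set[str]) -> float:
    """Fraction of the abstract's chunk hashes that the candidate stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate_chunks) / len(abstract)


def rank_candidates(abstract: Set[str],
                    candidates: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Order candidate buddies from highest to lowest coverage."""
    scored = [(name, coverage(abstract, chunks))
              for name, chunks in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates(
    {"a", "b", "c", "d"},
    {"n1": {"a", "z"}, "n2": {"a", "b", "c"}},
)
```

In this example `n2` holds three of the four sampled chunks and ranks ahead of `n1`, which holds one.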
26
Security Problems
- A malicious node can
  - Under-report coverage to avoid being chosen as a buddy
  - Over-report coverage to attract clients just to discard their chunks
27
Backup Protocol
- A Pastiche node has full control over the backup schedule
- A snapshot consists of three things
  - Chunks to be added
  - Chunks to be removed
  - Metadata of those chunks
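The three parts of a snapshot can be illustrated by diffing two views of the file system. This is a sketch under stated assumptions: each view is modeled as a map from file path to its list of chunk IDs, and "metadata" is reduced to the chunk lists of files that changed. The `Snapshot` type and `make_snapshot` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Snapshot:
    add: Set[str]                   # chunks the buddies must now store
    remove: Set[str]                # chunks no longer referenced
    metadata: Dict[str, List[str]]  # changed files -> their chunk lists


def make_snapshot(previous: Dict[str, List[str]],
                  current: Dict[str, List[str]]) -> Snapshot:
    """Compute what changed between two file -> chunk-ID views."""
    old = {c for ids in previous.values() for c in ids}
    new = {c for ids in current.values() for c in ids}
    changed = {path: ids for path, ids in current.items()
               if previous.get(path) != ids}
    return Snapshot(add=new - old, remove=old - new, metadata=changed)


snap = make_snapshot(
    {"/a": ["c1", "c2"], "/b": ["c3"]},
    {"/a": ["c1", "c4"], "/b": ["c3"]},
)
```

Only `/a` changed, so the snapshot adds `c4`, removes `c2`, and carries metadata for `/a` alone; unchanged chunks are never resent.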
28
Restoration
- A Pastiche node retains its archive skeleton, so performing partial restores is easy
- To recover the whole machine, a node has to obtain its root node from one of the backup machines first…
29
Detecting Failure and Malice
- A node randomly requests data from its buddies
- Can bound the probability that failures and malicious nodes go undetected
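One simple way to see such a bound: if a buddy has silently dropped a fraction f of a node's chunks and the node probes q chunks chosen independently and uniformly at random, the buddy passes every probe with probability (1 - f)^q. The independence assumption and the function names below are illustrative and are not taken from the slides.

```python
import math


def p_undetected(missing_fraction: float, queries: int) -> float:
    """Chance that every one of `queries` random probes hits a kept chunk."""
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, max_undetected: float) -> int:
    """Smallest number of probes that pushes p_undetected to the target."""
    if not 0.0 < missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be in (0, 1]")
    if not 0.0 < max_undetected < 1.0:
        raise ValueError("max_undetected must be in (0, 1)")
    if missing_fraction == 1.0:
        return 1  # any single probe exposes a buddy that kept nothing
    return math.ceil(math.log(max_undetected) / math.log(1.0 - missing_fraction))
```

For example, a buddy that discarded half of the chunks escapes seven independent probes with probability below 1%, so a handful of periodic queries already gives a meaningful guarantee against large-scale loss.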
30
Preventing Greed
- Someone can store things everywhere
- Need to institute distributed quota
  - Very difficult
- Some proposed solutions
  - Each node monitors the overall storage costs imposed by its backup clients
  - Problem: Sybil attacks (forge many entities that each consume little storage)
31
Preventing Greed
- Force each node to solve puzzles proportional to storage consumption
- Problems:
  - Needlessly expensive
  - Storage is traded against something other than storage
  - Heterogeneous computing power
32
Preventing Greed
- Electronic currency
- Problems:
  - Need to add atomic currency transactions
  - Complicated
33
Implementation
- Chunkstore file system
- Backup daemon
34
Performance Overhead
35
The Chance of Finding Buddies