Published by Thomasine Morrison. Modified over 9 years ago.
1
New Protocols for Remote File Synchronization Based on Erasure Codes. Utku Irmak, Svilen Mihaylov, Torsten Suel. Polytechnic University
2
Outline: Introduction and Common Applications; Problem Formalization; Contributions; An Approach Based on Erasure Codes; A Simple Multi-Round Protocol; An Efficient Single-Round Protocol; A Practical Protocol Based on Erasure Codes; Implementation Overview; Preliminary Results; Conclusions
3
Introduction Remote File Synchronization Problem: How to update an outdated version of a file over a network with a minimal amount of communication. When the versions are very similar, the total data transmitted should be significantly smaller than the file size. [Figure: Machine A and Machine B, with labels Current Version and Outdated Version]
4
Common Applications Synchronization of User Files: synchronization between different machines that may only be connected over a slow network (home and work machine). Both rsync and unison are widely used tools. Web and FTP Site Mirroring: there are significant similarities between successive versions, including on sites distributing new versions of software. rsync is widely used.
5
Common Applications Content Distribution Networks: file synchronization is a natural approach for updating content replicated at the network edge. Web Access over Slow Links: a user revisiting a webpage may already have a previous version in the browser cache, so it would be desirable to avoid retransmitting the entire page. This idea is implemented in rproxy, which uses the rsync algorithm.
6
Problem Formalization We have two files (strings) over some alphabet Σ: f_new (current file) and f_old (outdated file). We have two machines, C (the client) and S (the server), connected by a communication link. C only has a copy of f_old, and S only has a copy of f_new. Goal: Design a protocol between the parties that results in C holding a copy of f_new while minimizing the total communication cost.
7
Problem Formalization The communication cost should depend on the degree of similarity between the two files. Possible measures: the Hamming distance, the edit distance, and the edit distance with block moves. We focus mainly on the edit distance with block moves. We assume that each block move operation adds 3 to the distance, while other operations add 1.
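To make the difference between these measures concrete, here is a minimal sketch that computes the Hamming distance and the plain edit distance for small byte strings. It does not compute the edit distance with block moves, which is much harder to compute exactly; the function names and the sample strings are illustrative assumptions, not part of the presentation.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


# Swapping the two halves of a file is a single block move (cost 3 under the
# convention on the slide), yet both simpler measures report a large distance.
old = bytes(range(16))
swapped = old[8:] + old[:8]
print(hamming_distance(old, swapped), edit_distance(old, swapped))
```

This is why a measure that allows block moves is the more natural fit for files whose content has been rearranged rather than rewritten.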
8
Problem Formalization We focus on single-round protocols between client and server. Single-round protocols can be more easily integrated into existing tools that currently rely on rsync. Multiple rounds are undesirable in many scenarios involving small files or large latencies. Multi-round protocols can introduce other complications due to state that may have to be kept at the server for best performance.
9
Assumptions The collection consists of unstructured files. We are not concerned with issues of consistency between synchronization steps. We consider a simple two-party scenario where it is known which files need to be updated and which is the current version.
10
Contributions We describe a new approach to single-round file synchronization based on erasure codes. We derive a protocol that communicates at most O(k lg(n) lg(n/k)) bits on files with edit distance with block moves of at most k. We derive another practical algorithm and optimized implementation that achieves very promising improvements over rsync.
12
A Simple Multi-Round Protocol The protocol runs in a number of rounds. In the first round, the server partitions the file into blocks of size b_max and sends a hash (MD5) for each block. The client attempts to match the received hashes against all possible alignments in the outdated file. The client responds with a bit vector that tells the server which hashes were matched. The server repeats the process, with smaller blocks, for the blocks whose hashes did not find a match. Once block size b_min is reached, the server sends all the unmatched blocks.
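The rounds described above can be sketched as a single-process simulation. This is only an illustration under several assumptions that the slides do not state: the block size is halved each round, the client learns the length of the new file up front, hashing continues down to and including block size b_min before literals are sent, and the brute-force index over all alignments stands in for whatever matching technique a real implementation would use. All function names are hypothetical.

```python
import hashlib


def block_hash(data: bytes) -> bytes:
    # The slides mention MD5; any strong hash would serve in this sketch.
    return hashlib.md5(data).digest()


def window_index(f_old: bytes, size: int) -> dict:
    """Client side: map hash -> offset for every alignment of `size` bytes."""
    return {block_hash(f_old[i:i + size]): i
            for i in range(len(f_old) - size + 1)}


def multi_round_sync(f_new: bytes, f_old: bytes, b_max: int, b_min: int):
    """Simulate server (holds f_new) and client (holds f_old) together.

    Returns (file rebuilt by the client, hashes sent, literal bytes sent).
    """
    if b_min < 1 or b_max < b_min:
        raise ValueError("expected 1 <= b_min <= b_max")
    rebuilt = bytearray(len(f_new))
    pending = [(0, len(f_new))]          # unmatched (offset, length) ranges
    hashes_sent = 0
    size = b_max
    while size >= b_min and pending:
        # Server: cut every unmatched range into blocks of at most `size`.
        blocks = []
        for start, length in pending:
            for off in range(start, start + length, size):
                blocks.append((off, min(size, start + length - off)))
        hashes_sent += len(blocks)
        # Client: look each hash up among all alignments of f_old.
        indexes = {}
        pending = []
        for off, length in blocks:
            if length not in indexes:
                indexes[length] = window_index(f_old, length)
            src = indexes[length].get(block_hash(f_new[off:off + length]))
            if src is None:
                pending.append((off, length))                    # bit 0
            else:
                rebuilt[off:off + length] = f_old[src:src + length]  # bit 1
        size //= 2
    # Server: send whatever is still unmatched as raw literals.
    literal_bytes = 0
    for off, length in pending:
        rebuilt[off:off + length] = f_new[off:off + length]
        literal_bytes += length
    return bytes(rebuilt), hashes_sent, literal_bytes
```

With a small insertion into an otherwise unchanged file, only the blocks overlapping the insertion stay unmatched at each level, so the literal data sent at the end is a small fraction of the file.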
13
A Simple Multi-Round Protocol
14
Given two files with edit distance with block moves of k, we choose: b_max = the next smaller power of 2 of n/k; b_min = lg(n); hash size = 4 lg(n) bits. Lemma: If we partition f_new into some number of blocks, then at most k of these blocks do not occur in f_old. On each level, at most k hashes do not find a match. The algorithm transmits at most O(k lg(n) lg(n/k)) bits and correctly updates the file with probability at least 1-1/n.
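As a worked example of these parameter choices, the sketch below plugs in concrete values for n and k. It makes assumptions the slide leaves open: "next smaller power of 2 of n/k" is read as the largest power of two not exceeding n/k, lg(n) is rounded up, block sizes are halved per level, n is measured in bytes, and the bit count is a back-of-envelope upper bound (at most 2k hashes per level plus at most k literal blocks), not the exact constant from the analysis. The function name is hypothetical.

```python
import math


def protocol_parameters(n: int, k: int) -> dict:
    """Block sizes, hash width, and a rough bit bound for file size n, distance k."""
    if not 1 <= k <= n:
        raise ValueError("expected 1 <= k <= n")
    lg_n = max(1, math.ceil(math.log2(n)))
    b_max = 1 << ((n // k).bit_length() - 1)   # largest power of 2 <= n/k
    b_min = lg_n
    hash_bits = 4 * lg_n
    # Block sizes used on successive levels, halving until below b_min.
    sizes = []
    size = b_max
    while size >= b_min:
        sizes.append(size)
        size //= 2
    # Each level sends at most 2k hashes: the <= k unmatched blocks of the
    # previous level each split into two children.
    hash_total = hash_bits * 2 * k * len(sizes)
    # At most k blocks of the last hashed size remain and are sent literally.
    literal_total = 8 * k * sizes[-1] if sizes else 8 * n
    return {"b_max": b_max, "b_min": b_min, "hash_bits": hash_bits,
            "levels": len(sizes), "bound_bits": hash_total + literal_total}


print(protocol_parameters(n=2**20, k=16))
```

For a 1 MiB file with k = 16 this gives b_max = 65536, b_min = 20, 80-bit hashes, and 12 levels, so the estimated bound is a few kilobytes rather than the full file size.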
16
An Efficient Single-Round Protocol First, we define the complete multi-round algorithm, which sends hashes for all blocks. Second, we briefly describe systematic erasure codes.
17
Erasure Code Erasure code: k source data items of size s are encoded into n > k encoded items of the same size s. If any n-k of the encoded items are lost, the source items can still be recovered from the remaining k. A systematic erasure code is one where the encoded data items consist of the k source items plus n-k additional items. Figure by Luigi Rizzo
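The definition above can be illustrated with a toy systematic erasure code built from polynomial interpolation over a prime field (the idea behind Reed-Solomon codes). This is a sketch chosen for brevity, not the code used by the authors: the prime `P`, the integer symbol representation, and the function names are all assumptions, and it needs Python 3.8 or later for modular inverses via `pow`.

```python
P = 2**61 - 1  # a Mersenne prime; every symbol is an integer in [0, P)


def _interpolate(points: dict, x: int) -> int:
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (index -> value) points, modulo P."""
    total = 0
    for xi, yi in points.items():
        num, den = 1, 1
        for xj in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(source: list, n: int) -> list:
    """Systematic code: items 0..k-1 are the source, items k..n-1 are extra."""
    k = len(source)
    known = dict(enumerate(source))
    return list(source) + [_interpolate(known, x) for x in range(k, n)]


def recover(received: dict, k: int) -> list:
    """Rebuild the k source items from any k surviving (index -> value) pairs."""
    if len(received) < k:
        raise ValueError("more than n-k items were lost")
    points = dict(list(received.items())[:k])
    return [points[x] if x in points else _interpolate(points, x)
            for x in range(k)]


source = [10, 20, 30, 40]                 # k = 4 source items
coded = encode(source, n=7)               # n - k = 3 additional items
survivors = {i: v for i, v in enumerate(coded) if i not in (0, 2, 5)}
print(recover(survivors, k=4) == source)
```

Because the first k encoded items are the source items themselves, a receiver that already knows most of them only needs the additional items to fill in the ones it is missing.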
18
An Efficient Single-Round Protocol Any hash value sent in the complete multi-round algorithm that would not be sent in the simple multi-round algorithm is not transmitted.
19
An Efficient Single-Round Protocol Any hash value that would be sent by the simple multi-round algorithm is also not sent to the client, but is considered lost.
20
An Efficient Single-Round Protocol On each level there can be at most 2k lost block hashes. The client can recreate the entire level of hashes by using the 2k erasure hashes to recover the lost ones.
21
An Efficient Single-Round Protocol Theorem: Given a bound k on the edit distance between f_old and f_new, the erasure-based file synchronization algorithm correctly updates f_old to f_new with probability at least 1-1/n, using a single message of O(k lg(n) lg(n/k)) bits. We note that there are highly efficient single-message protocols for estimating the file distance k. Another property of the protocol is that by broadcasting a single message, the current version can be communicated to several clients that have different outdated versions.
23
A Practical Protocol Based on Erasure Codes The previous protocol has two main shortcomings. First, it requires us to estimate an upper bound on the file distance k, and an underestimate would make recovery impossible at the client. Second, and more importantly, the algorithm does not support compression of unmatched literals. To address these problems, we design another erasure-based algorithm that works better in practice.
24
A Practical Protocol Based on Erasure Codes The hashes are sent from client to server. For level i, m_i erasure hashes are sent. The server identifies the common blocks and then sends the unmatched literals in compressed form.
25
Implementation Overview We included three additional optimizations over rsync: 1) We replace the gzip algorithm used for transmission of the unmatched literals and match tokens with an optimized delta compressor. The server now transmits the resulting delta and a bit vector to allow the client to create the same reference file.
26
Implementation Overview 2) We make a better choice of the number of bits per hash: assuming some upper bound on the probability of a collision, say 1/2^d for some d, we use lg(n)+lg(y)+d bits per hash, where n is the file size and y is the total number of hashes sent from client to server. 3) We integrate decomposable hashes: this technique allows the hash of a child block to be computed from the hashes of its parent and sibling, halving the number of erasure hashes transmitted.
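Both ideas on this slide can be illustrated with a short sketch. The polynomial hash below is one simple hash with the decomposable property; it is an assumption chosen for illustration and is not necessarily the hash function used in the authors' implementation. The constants `MOD` and `BASE`, the rounding in `bits_per_hash`, and all function names are likewise hypothetical, and the sketch does not show how hash values would be truncated to the chosen bit width.

```python
import math

MOD = (1 << 61) - 1   # prime modulus (assumed for this sketch)
BASE = 257            # any base in [2, MOD) works


def poly_hash(data: bytes) -> int:
    """h(d) = sum(d[i] * BASE**(len(d) - 1 - i)) mod MOD."""
    h = 0
    for byte in data:
        h = (h * BASE + byte) % MOD
    return h


def right_child_hash(parent: int, left: int, right_len: int) -> int:
    """Hash of the right half, from the parent hash and the left sibling hash."""
    return (parent - left * pow(BASE, right_len, MOD)) % MOD


def left_child_hash(parent: int, right: int, right_len: int) -> int:
    """Hash of the left half, from the parent hash and the right sibling hash."""
    inverse = pow(pow(BASE, right_len, MOD), -1, MOD)
    return (parent - right) * inverse % MOD


def bits_per_hash(n: int, y: int, d: int) -> int:
    """lg(n) + lg(y) + d, rounded up to a whole number of bits."""
    return math.ceil(math.log2(n) + math.log2(y)) + d


block = b"remote file synchronization"
left, right = block[:13], block[13:]
parent = poly_hash(block)
print(right_child_hash(parent, poly_hash(left), len(right)) == poly_hash(right))
print(left_child_hash(parent, poly_hash(right), len(right)) == poly_hash(left))
print(bits_per_hash(n=2**20, y=2**10, d=10))
```

Since one child hash follows from the parent and the other child, only one hash per sibling pair has to be covered by the transmitted erasure hashes, which is where the halving comes from.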
27
Preliminary Results For the experiments we used the gcc and emacs datasets, consisting of versions 2.7.0 and 2.7.1 of gcc and versions 19.28 and 19.29 of emacs.
28
Conclusions We have described a new approach to remote file synchronization based on erasure codes. Using this approach, we derived a single-round protocol that is feasible and communication-efficient with respect to a common file distance measure.