Rsync: Efficiently Synchronizing Files Using Hashing
By David Shao
For CS 265, Spring 2004
Problem
- Want to synchronize with a newer version of a file on a remote server
- Want to minimize data sent over a slow network link
- Want to minimize (round-trip) communication latencies
Solution: Rsync
- Open source software project
- Command-line driven server and client for Unix-like systems
- Synchronizes directories as well as files
- Algorithm described in Andrew Tridgell’s Ph.D. thesis
Overview of How Hashing Is Used
- Can reduce the amount of data sent if willing to live with a very small probability of inaccuracy
- Several layers of hashing: a fast but less accurate hash and a slower but almost always accurate hash are both used
Ideal Case
- Divide files into equal-sized blocks
- Files are almost identical except for relatively few blocks
- The receiver already has almost all of the data blocks it needs, but how can it know which ones?
[Diagram: Receiver and Sender]
Ideal Protocol
- Receiver to Sender: hashes of blocks
- Sender to Receiver: commands on how to build the file
Sender Analyzes Own Blocks
[Diagram: the hashes of Receiver Blocks 1, 2, 3 and 4 are compared against the hash of a Sender Block "?"]
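The comparison on this slide can be sketched as a table lookup. This is a minimal illustration, not rsync's actual code: the names `block_hashes` and `find_block` and the tiny block size are hypothetical, and MD5 stands in for MD4 because MD4 is often unavailable in Python's hashlib.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size chosen for illustration


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Map the hash of each aligned receiver block to its block index."""
    table = {}
    for index, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        # Assumption: MD5 is used as a stand-in for MD4.
        table.setdefault(hashlib.md5(block).digest(), index)
    return table


def find_block(table: dict, sender_block: bytes):
    """Return the receiver block index with the same hash, or None."""
    return table.get(hashlib.md5(sender_block).digest())


receiver_file = b"aaaabbbbccccdddd"
table = block_hashes(receiver_file)
print(find_block(table, b"cccc"))  # index of the matching receiver block
print(find_block(table, b"zzzz"))  # None: the receiver has no such block
```

Only the table of hashes has to cross the network for the sender to answer the "?" question for each of its own blocks.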
Commands: Copy or Add
- COPY: If the receiver already has the data block, just tell it to copy the block.
- ADD: If the receiver does not have a data block, send the block.
- COPY is cheap, ADD is expensive.
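A rough sketch of how COPY and ADD commands might be produced and applied, assuming aligned blocks only. The tuple encoding of commands and the function names `make_commands` and `apply_commands` are made up for this example, and MD5 again stands in for MD4.

```python
import hashlib


def digest(block: bytes) -> bytes:
    # Assumption: MD5 as a stand-in for the MD4 hash named in the slides.
    return hashlib.md5(block).digest()


def make_commands(receiver_hashes, new_data: bytes, block_size: int):
    """Sender side: emit COPY for blocks the receiver has, ADD otherwise."""
    index_of = {h: i for i, h in enumerate(receiver_hashes)}
    commands = []
    for start in range(0, len(new_data), block_size):
        block = new_data[start:start + block_size]
        index = index_of.get(digest(block))
        if index is not None:
            commands.append(("COPY", index))  # cheap: only an index is sent
        else:
            commands.append(("ADD", block))   # expensive: the data is sent
    return commands


def apply_commands(old_data: bytes, commands, block_size: int) -> bytes:
    """Receiver side: rebuild the new file from old blocks and added data."""
    out = bytearray()
    for op, arg in commands:
        if op == "COPY":
            out += old_data[arg * block_size:(arg + 1) * block_size]
        else:
            out += arg
    return bytes(out)


old = b"aaaabbbbccccdddd"
new = b"aaaaXXXXccccdddd"
hashes = [digest(old[s:s + 4]) for s in range(0, len(old), 4)]
commands = make_commands(hashes, new, 4)
rebuilt = apply_commands(old, commands, 4)
print(commands)
print(rebuilt == new)
```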
Advantage of Ideal
- For each COPY, network traffic is reduced by a factor of approximately L / h, where L is the block size and h is the size of the hash of a block of size L
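As a worked instance of the L / h ratio: the block size of 1024 bytes below is a hypothetical choice for illustration, while 16 bytes is the size of an MD4 digest (128 bits). The sketch ignores any other per-block overhead.

```python
def copy_savings_factor(block_size: int, hash_size: int) -> float:
    """Approximate traffic reduction for a block that can be COPYed."""
    return block_size / hash_size


BLOCK_SIZE = 1024  # L, hypothetical block size in bytes
HASH_SIZE = 16     # h, size of an MD4 digest in bytes

factor = copy_savings_factor(BLOCK_SIZE, HASH_SIZE)
print(factor)
```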
Disadvantage of Ideal
- Example: edit source code and delete a comment at the beginning
- Blocks are no longer neatly aligned
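The alignment problem can be seen on a toy example. The data and the 4-byte block size below are made up; deleting a single leading byte is enough to shift every block boundary.

```python
def aligned_blocks(data: bytes, block_size: int):
    """Split data into consecutive blocks starting at offset 0."""
    return [data[s:s + block_size] for s in range(0, len(data), block_size)]


old = b"abcdefghijklmnop"
new = old[1:]  # one byte deleted at the beginning

old_blocks = aligned_blocks(old, 4)
new_blocks = aligned_blocks(new, 4)

# No aligned block of the new file equals an aligned block of the old file.
shared = set(old_blocks) & set(new_blocks)

# Yet most old blocks still occur in the new file, at unaligned offsets
# (-1 means the block is not found at all).
found_at_offsets = [new.find(block) for block in old_blocks]
print(shared, found_at_offsets)
```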
Compute More Hashes
- Sender needs to compute a hash at every byte position
- More expensive: about L times more hashes computed by the sender
- Use a weaker, faster hash to weed out most positions before computing the stronger hash
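A minimal sketch of the two-level check, under several assumptions: the weak hash is a plain byte sum truncated to 16 bits, MD5 stands in for the strong hash (MD4), and the weak hash is recomputed from scratch at each offset for simplicity. The function name `find_matches` is hypothetical, and unlike a real implementation this sketch does not skip ahead after a match.

```python
import hashlib


def weak_hash(block: bytes) -> int:
    return sum(block) & 0xFFFF  # cheap, but collisions are likely


def strong_hash(block: bytes) -> bytes:
    return hashlib.md5(block).digest()  # assumption: stand-in for MD4


def find_matches(receiver_blocks, sender_data: bytes, block_size: int):
    """Return (offset, block index) matches and the number of strong checks."""
    weak_table = {}
    for index, block in enumerate(receiver_blocks):
        entry = (strong_hash(block), index)
        weak_table.setdefault(weak_hash(block), []).append(entry)

    matches = []
    strong_checks = 0
    for offset in range(len(sender_data) - block_size + 1):
        window = sender_data[offset:offset + block_size]
        candidates = weak_table.get(weak_hash(window))
        if not candidates:
            continue  # weeded out by the weak hash alone
        strong_checks += 1
        window_strong = strong_hash(window)
        for strong, index in candidates:
            if strong == window_strong:
                matches.append((offset, index))
    return matches, strong_checks


receiver_blocks = [b"abcd", b"efgh", b"ijkl"]
sender_data = b"Zabcdefghijkl"
matches, strong_checks = find_matches(receiver_blocks, sender_data, 4)
print(matches, strong_checks)
```

In this toy input there are 10 candidate offsets, but the strong hash is computed only where the weak hash already agrees.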
Ordinary Sum of Bytes
- Rolling-type property: the sum of L bytes starting at position i+1 is almost the same as the sum starting at i
[Diagram: sum starting at i versus sum starting at i+1; subtract the red byte, add the green byte, the yellow bytes are the same]
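The rolling property can be sketched as follows. The function name `rolling_sums` is made up, and the sketch assumes the data is at least one window long.

```python
def rolling_sums(data: bytes, window: int):
    """Sum of every window of `window` bytes, updated in O(1) per step."""
    total = sum(data[:window])
    sums = [total]
    for i in range(1, len(data) - window + 1):
        # Subtract the byte that left the window, add the byte that entered.
        total = total - data[i - 1] + data[i + window - 1]
        sums.append(total)
    return sums


data = b"hello world"
sums = rolling_sums(data, 4)

# Cross-check against recomputing each window from scratch.
direct = [sum(data[i:i + 4]) for i in range(len(data) - 4 + 1)]
print(sums == direct)
```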
Disadvantage of a Simple Sum
- A simple sum is too symmetric
- The sum of “All men are mortals” is the same as the sum of “All mortals are men”
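The example on this slide can be checked directly; the helper name `simple_sum` is just for illustration.

```python
def simple_sum(data: bytes) -> int:
    """Order-insensitive sum of byte values."""
    return sum(data)


first = "All men are mortals".encode("ascii")
second = "All mortals are men".encode("ascii")

# Both strings contain the same bytes in a different order,
# so the simple sum cannot tell them apart.
same_sum = simple_sum(first) == simple_sum(second)
print(first != second, same_sum)
```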
Weighted Sum
- The first bytes get more weight than the trailing ones (an arbitrary decision)
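One possible weighting, assumed here for illustration, gives the k-th byte of an L-byte block the weight L - k, so the first byte counts L times and the last byte once. With such weights the order of the bytes matters.

```python
def weighted_sum(block: bytes) -> int:
    """Sum of bytes where earlier bytes carry larger weights (L - k)."""
    length = len(block)
    return sum((length - k) * byte for k, byte in enumerate(block))


first = "All men are mortals".encode("ascii")
second = "All mortals are men".encode("ascii")

# The plain sums agree, but the weighted sums do not.
plain_equal = sum(first) == sum(second)
weighted_equal = weighted_sum(first) == weighted_sum(second)
print(plain_equal, weighted_equal)
```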
Reordering the i + 1 Sum
[Diagram: the weighted sum starting at i+1, rearranged into the red part to be subtracted, the green part to be added, and the yellow part that stays the same]
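The reordering can be sketched in code, assuming the weights L - k for the k-th byte of the window (an assumption; the slide does not spell out the weights). Under that assumption, W(i+1) = W(i) - L * x[i] + S(i+1), where S is the simple sum of the window. Truncation to a fixed number of bits is omitted here.

```python
def weighted_sum(block: bytes) -> int:
    length = len(block)
    return sum((length - k) * byte for k, byte in enumerate(block))


def rolling_weighted(data: bytes, window: int):
    """Weighted sum of every window, updated in O(1) per step."""
    simple = sum(data[:window])
    weighted = weighted_sum(data[:window])
    results = [weighted]
    for i in range(len(data) - window):
        # Simple sum of the next window: drop data[i], add data[i + window].
        simple = simple - data[i] + data[i + window]
        # Remove the old first byte (weight L), then add the new simple sum,
        # which raises every remaining weight by one and adds the new byte.
        weighted = weighted - window * data[i] + simple
        results.append(weighted)
    return results


data = b"All men are mortals"
rolled = rolling_weighted(data, 5)
direct = [weighted_sum(data[i:i + 5]) for i in range(len(data) - 5 + 1)]
print(rolled == direct)
```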
Further Enhancements
- Compute a separate (MD4) signature for the entire file
- Reconstruct the new file using temporary storage so that the old version is never removed until the new one is known to be good
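A minimal sketch of the "temporary storage" idea, not rsync's actual implementation: the function name `install_if_good` is hypothetical, MD5 stands in for MD4, and the whole file is held in memory for brevity. The new contents are written to a temporary file, checked against the whole-file signature, and only then moved over the old version.

```python
import hashlib
import os
import tempfile


def install_if_good(target_path: str, new_bytes: bytes, expected_digest: str) -> bool:
    """Replace target_path with new_bytes only if the whole-file digest matches."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(new_bytes)
        with open(temp_path, "rb") as handle:
            # Assumption: MD5 as a stand-in for the MD4 whole-file signature.
            actual = hashlib.md5(handle.read()).hexdigest()
        if actual != expected_digest:
            return False  # the old version is left untouched
        os.replace(temp_path, target_path)  # swap in the verified file
        return True
    finally:
        # Clean up the temporary file if it was not moved into place.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

Creating the temporary file in the same directory as the target keeps the final `os.replace` on one filesystem.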
Synchronizing Directories
- Divide the receiving side into a separate receiver and generator
[Diagram: Receiver, Generator, Sender]
Summary of Hashing Used
- Weaker, easier-to-compute hash with the rolling property
- Stronger hash (MD4) once most candidates have been weeded out
- Signature over the entire file as a separate check