Lecture 11: Data Synchronization Techniques for Mobile Devices © Dimitre Trendafilov 2003 Modified by T. Suel 2004 CS623, 4/20/2004
Problem Definition Given two versions of a data set on different machines, say an outdated one and a current one, how can we update the outdated one with minimum communication cost? Related Problem: What if the data has been changed on several machines? (How to reconcile the data is difficult and application dependent.)
Obvious Solutions Send all of the current data. Compress the current data and then send it. Send only the compressed difference between the two data sets. If the sender has both versions, use a suitable delta compression tool. What if the sender has no access to the outdated version?
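The "compressed difference" option can be illustrated with a rough sketch. It assumes the sender holds both versions and uses zlib's preset-dictionary feature as a stand-in for a real delta compression tool; the helper names `make_delta` and `apply_delta` and the sample data are hypothetical.

```python
import hashlib
import zlib

def make_delta(old: bytes, new: bytes) -> bytes:
    """Compress `new` with `old` as a preset dictionary (a crude delta)."""
    comp = zlib.compressobj(level=9, zdict=old)
    return comp.compress(new) + comp.flush()

def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the current version from the outdated one plus the delta."""
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(delta) + decomp.flush()

# Hypothetical data: 2 KB of hard-to-compress bytes, then a small edit.
old = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
new = old[:1000] + b"EDITED" + old[1000:]

plain = zlib.compress(new, 9)   # option 2: compress and send everything
delta = make_delta(old, new)    # option 3: send only what `old` cannot explain
print(len(new), len(plain), len(delta))
```

The receiver must hold exactly the same `old` bytes for `apply_delta` to work, which is why this approach breaks down when the sender has no access to the outdated version.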
Two Aspects of the Problem File Synchronization (rsync): Update an outdated file so that it becomes identical to a current one. Set Reconciliation (today): Assume you have many small data records, but you only want to send modified records. E.g., a database with a set of 100-byte records. Unordered: the order of records is not important. Find which records need to be transmitted, then send the entire record. Each record is identified by a number (hash, record ID).
Applications for Data Synchronization Synchronizing data between PDA and PC (Microsoft Briefcase etc.). Synchronizing databases over a network. Synchronizing a file system in two stages: (1) find which files have changed (MD5 of files); (2) use rsync on those that have changed.
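The first stage of the two-stage file system approach might look like the following sketch. File contents are modeled as an in-memory mapping from name to bytes rather than read from disk; `digests` and `changed_files` are hypothetical helper names.

```python
import hashlib

def digests(files: dict) -> dict:
    """Map each file name to the MD5 hex digest of its contents."""
    return {name: hashlib.md5(data).hexdigest() for name, data in files.items()}

def changed_files(current: dict, outdated_digests: dict) -> list:
    """Names of files that are new or whose digest differs on the outdated side."""
    current_digests = digests(current)
    return sorted(name for name, digest in current_digests.items()
                  if outdated_digests.get(name) != digest)

# Hypothetical file trees: b.txt was modified and c.txt was added.
current = {"a.txt": b"hello", "b.txt": b"new text", "c.txt": b"extra"}
outdated = {"a.txt": b"hello", "b.txt": b"old text"}

# Only the digests cross the network, not the file contents.
to_rsync = changed_files(current, digests(outdated))
```

Only the files named in `to_rsync` would then be handed to rsync in the second stage.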
Palm Hot Sync Relies on metadata maintained on both machines. The metadata is stored in a Palm DB. There is one Palm DB for each application (Date Book, To Do, Address Book, etc.). A record in a Palm DB consists of a unique ID, a pointer to the object, and a status flag.
Palm Hot Sync Preferred mode of operation: Fast Sync. Exchange only the modified records. Works only if the device always synchronizes with the same machine, i.e., only two machines are involved.
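A minimal sketch of the Fast Sync idea, assuming a simplified record with a single "modified" status flag. This is not Palm's actual record layout or protocol; `Record` and `fast_sync` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Record:
    record_id: int
    payload: str
    modified: bool = False   # status flag, set whenever the record is edited

def fast_sync(source: dict, target: dict) -> int:
    """Copy only flagged records from source to target; return how many were sent."""
    sent = 0
    for rec in source.values():
        if rec.modified:
            target[rec.record_id] = Record(rec.record_id, rec.payload)
            rec.modified = False   # clear the flag once the record is synchronized
            sent += 1
    return sent

# Hypothetical handheld and desktop databases; record 2 was edited on the handheld.
handheld = {1: Record(1, "lunch"), 2: Record(2, "dentist 3pm", modified=True)}
desktop = {1: Record(1, "lunch"), 2: Record(2, "dentist 2pm")}
sent = fast_sync(handheld, desktop)
```

Because the flags are cleared after one synchronization, they say nothing about what a second desktop has seen, which is why this mode is limited to a single pair of machines.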
Palm Hot Sync “Backup” mode of operation: Slow Sync. Copy all of the data. Used when the last synchronization was done with a different machine.
Timestamps Maintain a timestamp for each record. Send only the records with a timestamp greater than the timestamp of the last synchronization. Good for synchronization between two machines, but inefficient for more.
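The timestamp rule can be sketched as a simple filter. The record layout (ID mapped to a `(timestamp, data)` pair) and the function name `records_to_send` are assumptions made for illustration.

```python
def records_to_send(records: dict, last_sync: int) -> dict:
    """Select the records modified after the last synchronization."""
    return {rid: (ts, data) for rid, (ts, data) in records.items() if ts > last_sync}

# Hypothetical records: id -> (last-modified timestamp, data).
records = {
    101: (1000, "alice"),
    102: (1500, "bob"),
    103: (2100, "carol"),
}
last_sync = 1500   # timestamp of the previous synchronization with this peer
pending = records_to_send(records, last_sync)
```

With more than two machines, a separate `last_sync` value is needed per peer, which is where the bookkeeping starts to grow.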
SyncML (now part of the Open Mobile Alliance) Fairly large initiative funded by Ericsson, IBM, Lotus, Matsushita, Motorola, Nokia. Seeks to provide an open standard for synchronization between different platforms and devices. Uses XML. Based on timestamps: a device stores a timestamp for each record and each device it communicates with, so N records and M devices result in N*M timestamps. Not scalable!
Intellisync Anywhere Developed by Puma Technologies. Relies on a central server. Similar to Fast Sync, but each device synchronizes only with the central server. It has a single point of failure. The central server can get congested.
Intellisync Anywhere [figure omitted; source: Puma Technologies]
Characteristic Polynomial Interpolation Synchronization (CPISync) Time/bandwidth complexity depends on the number of differences. Computationally expensive: cubic in the number of differences, but this can be improved. Computations could be done on only one of the two devices (the faster one). Works in a general peer-to-peer environment.
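CPISync Preliminaries Each data set can be represented as a set of numbers (using hash functions). The characteristic polynomial of a set $S = \{x_1, x_2, \ldots, x_n\}$ is $\chi_S(Z) = (Z - x_1)(Z - x_2) \cdots (Z - x_n)$. Note that for two sets $S_A$ and $S_B$, with $\Delta_A = S_A \setminus S_B$ and $\Delta_B = S_B \setminus S_A$, the factors for common elements cancel: $\chi_{S_A}(Z) / \chi_{S_B}(Z) = \chi_{\Delta_A}(Z) / \chi_{\Delta_B}(Z)$.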
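The cancellation property of characteristic polynomials can be checked numerically. The sketch below works modulo a small prime `P`, chosen only for readability; the example sets and the helper names `chi` and `ratio` are assumptions.

```python
P = 97  # hypothetical small prime; a real system would use a much larger field

def chi(s, z):
    """Evaluate the characteristic polynomial of set s at the point z, mod P."""
    out = 1
    for x in s:
        out = out * (z - x) % P
    return out

def ratio(num, den):
    """Compute num / den in the field of integers mod P."""
    return num * pow(den, -1, P) % P

S_A = {1, 2, 9, 12, 33}
S_B = {1, 2, 9, 10, 12, 28}
delta_A = S_A - S_B   # elements only A has
delta_B = S_B - S_A   # elements only B has

# Sample points must lie outside S_B so that the denominator is nonzero.
for z in (50, 60, 70):
    full = ratio(chi(S_A, z), chi(S_B, z))
    reduced = ratio(chi(delta_A, z), chi(delta_B, z))
    print(z, full, reduced)   # the two ratios agree at every sample point
```

The ratio of two large polynomials therefore carries only as much information as the (usually small) set of differences.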
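CPISync Hosts A and B evaluate their characteristic polynomials $\chi_{S_A}(Z)$ and $\chi_{S_B}(Z)$ at the same sample points $z_1, \ldots, z_m$. Host B sends its evaluations to host A. The evaluations are combined at host A to compute the ratios $\chi_{S_A}(z_i) / \chi_{S_B}(z_i)$, which are interpolated to recover the reduced rational function $\chi_{S_A \setminus S_B}(Z) / \chi_{S_B \setminus S_A}(Z)$. The zeros of its numerator and denominator are determined. Those are the differences!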
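The protocol steps above can be sketched end to end on a toy example. This is an illustration under strong assumptions, not the published implementation: the field is a small prime `P`, the exact number of differences on each side (`m_a`, `m_b`) is assumed known, the sample points lie outside both sets, and roots are found by brute force over the whole field. All function names are hypothetical.

```python
P = 97  # hypothetical small prime field

def chi(s, z):
    """Evaluate the characteristic polynomial of set s at z, mod P."""
    out = 1
    for x in s:
        out = out * (z - x) % P
    return out

def solve_mod(rows, rhs):
    """Gauss-Jordan elimination mod P for a square, nonsingular system."""
    n = len(rows)
    aug = [row[:] + [b] for row, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % P for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def poly_eval(coeffs, z):
    """Evaluate a polynomial with low-to-high coefficients at z, mod P."""
    out = 0
    for c in reversed(coeffs):
        out = (out * z + c) % P
    return out

def reconcile(evals_a, evals_b, points, m_a, m_b):
    """Interpolate num/den from m_a + m_b evaluations and return their roots."""
    rows, rhs = [], []
    for z, ea, eb in zip(points, evals_a, evals_b):
        f = ea * pow(eb, -1, P) % P                   # ratio chi_A(z) / chi_B(z)
        row = [pow(z, j, P) for j in range(m_a)]      # unknown numerator coefficients
        row += [-f * pow(z, j, P) % P for j in range(m_b)]  # unknown denominator coefficients
        rows.append(row)
        rhs.append((f * pow(z, m_b, P) - pow(z, m_a, P)) % P)
    sol = solve_mod(rows, rhs)
    num = sol[:m_a] + [1]    # monic numerator, degree m_a
    den = sol[m_a:] + [1]    # monic denominator, degree m_b
    roots = lambda poly: {x for x in range(P) if poly_eval(poly, x) == 0}
    return roots(num), roots(den)

S_A = {1, 2, 9, 12, 33}
S_B = {1, 2, 9, 10, 12, 28}
points = [50, 51, 52]                      # agreed sample points outside both sets
evals_b = [chi(S_B, z) for z in points]    # the only data host B transmits
evals_a = [chi(S_A, z) for z in points]
only_a, only_b = reconcile(evals_a, evals_b, points, m_a=1, m_b=2)
# only_a is {33} and only_b is {10, 28}
```

Host B sends three field elements instead of six records. The Gaussian elimination step is the source of the cubic cost in the number of differences.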
CPISync – Finding the Number of Differences Guess a bound on the number of differences. Send evaluations at k additional random points. Verify the interpolated result at those k points. Repeat with another bound if needed. The probability of error is:
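The verification step might be sketched as follows. It assumes a candidate numerator and denominator have already been interpolated for the current guess, and that host A holds both hosts' evaluations at the k extra points. The check is written in cross-multiplied form, $\chi_{S_A}(z) \cdot den(z) = \chi_{S_B}(z) \cdot num(z)$, to avoid division. The small prime `P` and all names are hypothetical.

```python
import random

P = 97  # hypothetical small prime field

def chi(s, z):
    """Evaluate the characteristic polynomial of set s at z, mod P."""
    out = 1
    for x in s:
        out = out * (z - x) % P
    return out

def poly_eval(coeffs, z):
    """Evaluate a polynomial with low-to-high coefficients at z, mod P."""
    out = 0
    for c in reversed(coeffs):
        out = (out * z + c) % P
    return out

def verify(num, den, points, evals_a, evals_b):
    """Accept the candidate only if it matches the true ratio at every check point."""
    return all(ea * poly_eval(den, z) % P == eb * poly_eval(num, z) % P
               for z, ea, eb in zip(points, evals_a, evals_b))

def random_points(k, rng):
    """Draw k distinct random check points from the field."""
    return rng.sample(range(P), k)

S_A = {1, 2, 9, 12, 33}
S_B = {1, 2, 9, 10, 12, 28}
points = random_points(4, random.Random())
evals_a = [chi(S_A, z) for z in points]
evals_b = [chi(S_B, z) for z in points]

good_num, good_den = [(-33) % P, 1], [280 % P, (-38) % P, 1]  # (Z-33) / ((Z-10)(Z-28))
bad_num, bad_den = [1], [(-10) % P, 1]                        # guess too low: 1 / (Z-10)
accepted = verify(good_num, good_den, points, evals_a, evals_b)
```

A candidate produced from too small a bound can pass only if every check point happens to hit one of the few values where the two sides coincide, so a failed check tells the hosts to retry with a larger bound.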
CPISync vs. Slow Sync
Taxonomy of Synchronization Techniques
More Techniques: Bloom Filters Get a Bloom filter for the receiver's data set. Send only the elements that are not found in the Bloom filter.
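A minimal sketch of the Bloom filter approach, assuming records are byte strings and using SHA-256 with a one-byte salt to derive the hash positions. The filter size, the number of hashes, and the names `BloomFilter` and `elements_to_send` are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # a Python int used as a bit array

    def _positions(self, item: bytes):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

def elements_to_send(sender_set, receiver_filter):
    """Sender side: keep only elements the receiver's filter does not report."""
    return {x for x in sender_set if x not in receiver_filter}

# Hypothetical data: the receiver is missing one record.
receiver_set = {b"rec-1", b"rec-2", b"rec-3"}
sender_set = receiver_set | {b"rec-4"}

receiver_filter = BloomFilter()
for item in receiver_set:
    receiver_filter.add(item)   # the receiver builds the filter and ships it to the sender

to_send = elements_to_send(sender_set, receiver_filter)
```

A Bloom filter has no false negatives, so nothing the receiver already holds is resent. It can report false positives, so a small fraction of missing elements may be skipped, and the result is approximate unless the filter is sized generously.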
More Techniques: Using Error Correction Codes Send an error correction code for the data set. The receiver “corrects the errors” in its outdated data set. Reed-Solomon codes: decoding time depends only on the number of differences between the sets (almost linear, not cubic), but at the cost of an extra factor of 2 in transmission.