What’s the Difference? Efficient Set Reconciliation without Prior Context. Frank Uyeda (University of California, San Diego), with David Eppstein, Michael T. Goodrich, and George Varghese.


What’s the Difference? Efficient Set Reconciliation without Prior Context. Frank Uyeda, University of California, San Diego; David Eppstein, Michael T. Goodrich & George Varghese.

Motivation: Distributed applications often need to compare remote state, for example when two replicas R1 and R2 must resynchronize after a partition heals. They must solve the Set-Difference Problem!

What is the Set-Difference problem? What objects are unique to host 1? What objects are unique to host 2? (In the running example, host 1 and host 2 share some objects, such as A and F, while each also holds objects the other lacks.)

Example 1: Data Synchronization. Identify the missing data blocks, then transfer those blocks to synchronize the two sets.

Example 2: Data De-duplication. Identify all unique blocks, then replace duplicate data with pointers.

Set-Difference Solutions
– Trade a sorted list of objects: O(n) communication, O(n log n) computation.
– Approximate solutions, e.g., the Approximate Reconciliation Tree (Byers): O(n) communication, O(n log n) computation.
– Polynomial Encodings (Minsky & Trachtenberg): letting d be the size of the difference, O(d) communication, O(dn + d^3) computation.
– Invertible Bloom Filter: O(d) communication, O(n + d) computation.

Difference Digests efficiently solve the set-difference problem. They consist of two data structures:
– Invertible Bloom Filter (IBF): efficiently computes the set difference, but needs to know the size of the difference.
– Strata Estimator: approximates the size of the set difference, using IBFs as a building block.

Invertible Bloom Filters (IBF): each host encodes its local object identifiers into an IBF (host 1 builds IBF 1, host 2 builds IBF 2).

IBF Data Structure: an array of IBF cells. For a set difference of size d, αd cells are required (α > 1). Each ID is assigned to several IBF cells. Each IBF cell contains:
– idSum: XOR of all IDs assigned to the cell
– hashSum: XOR of hash(ID) for all IDs assigned to the cell
– count: number of IDs assigned to the cell

IBF Encode: to add an ID (say A), hash it with Hash1, Hash2, Hash3 to pick cells; in each chosen cell, update idSum ⊕= A, hashSum ⊕= H(A), and count++. The IBF has only αd cells, not O(n) as in an ordinary Bloom filter, and all hosts must use the same hash functions.
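A minimal Python sketch of this encoding step (an illustration, not the authors' implementation: the cell layout follows the slides, while the particular hash functions, k = 3, and the representation of each cell as an [idSum, hashSum, count] list are assumptions made here):

    import hashlib

    def bucket_hash(item: int, salt: int) -> int:
        """One of the k cell-assignment hash functions (Hash1, Hash2, ...)."""
        digest = hashlib.blake2b(f"bucket:{salt}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def checksum_hash(item: int) -> int:
        """The H(.) used for hashSum, independent of the cell-assignment hashes."""
        digest = hashlib.blake2b(f"check:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def cell_indices(item: int, num_cells: int, k: int = 3):
        """Pick k distinct cells for an ID; every host must use the same rule."""
        idx, salt = [], 0
        while len(idx) < k:
            j = bucket_hash(item, salt) % num_cells
            if j not in idx:
                idx.append(j)
            salt += 1
        return idx

    def ibf_encode(ids, num_cells, k=3):
        """Encode integer IDs into an IBF: a list of [idSum, hashSum, count] cells."""
        cells = [[0, 0, 0] for _ in range(num_cells)]   # num_cells is about alpha * d
        for x in ids:
            for j in cell_indices(x, num_cells, k):
                cells[j][0] ^= x                  # idSum   ^= ID
                cells[j][1] ^= checksum_hash(x)   # hashSum ^= H(ID)
                cells[j][2] += 1                  # count++
        return cells

Note that num_cells is sized from the estimated difference (αd), not from the number of elements n.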

Invertible Bloom Filters (IBF): the hosts trade IBFs with each other.

Invertible Bloom Filters (IBF): “subtract” the IBF structures. This produces a new IBF, IBF(2 - 1), containing only the objects unique to one of the hosts.

IBF Subtract: the subtraction works cell by cell, XORing the idSums, XORing the hashSums, and subtracting the counts.
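A sketch of the subtraction, assuming the [idSum, hashSum, count] cell lists produced by the encoding sketch above; both IBFs must have the same number of cells and use the same hash functions:

    def ibf_subtract(cells_a, cells_b):
        """Cell-by-cell subtraction: XOR the idSums, XOR the hashSums, subtract the counts."""
        assert len(cells_a) == len(cells_b), "IBFs must have the same number of cells"
        return [[a[0] ^ b[0],        # idSum_a   XOR idSum_b
                 a[1] ^ b[1],        # hashSum_a XOR hashSum_b
                 a[2] - b[2]]        # count_a   -   count_b (may go negative)
                for a, b in zip(cells_a, cells_b)]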

Timeout for Intuition. After subtraction, all elements common to both sets have disappeared. Why? Any common element (e.g., W) is assigned to the same cells on both hosts (assuming both sides use the same hash functions), so on subtraction W XOR W = 0 and W vanishes. The elements in the set difference remain, but they may be randomly mixed together, so we need a decode procedure.
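A quick check of this intuition, reusing the ibf_encode and ibf_subtract sketches above on made-up integer IDs: identical sets subtract to an all-zero IBF, and only differing elements leave a trace.

    # Identical sets: every common element cancels, leaving an all-zero IBF.
    a = ibf_encode({1, 2, 3, 4}, num_cells=20)
    b = ibf_encode({1, 2, 3, 4}, num_cells=20)
    assert all(cell == [0, 0, 0] for cell in ibf_subtract(a, b))

    # Add one extra element on one side: only its k = 3 cells stay non-zero.
    b_plus = ibf_encode({1, 2, 3, 4, 99}, num_cells=20)
    diff = ibf_subtract(b_plus, a)
    assert sum(1 for cell in diff if cell != [0, 0, 0]) == 3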

Invertible Bloom Filters (IBF): decode the resulting IBF to recover the object identifiers in the set difference (in the running example, B, C, D, and E) along with which host holds each one.

IBF Decode. Test for purity: a cell is pure if H(idSum) = hashSum. A cell holding a single ID V passes, since its idSum is V and its hashSum is H(V). A cell holding several IDs fails with high probability, since H(V ⊕ X ⊕ Z) ≠ H(V) ⊕ H(X) ⊕ H(Z).
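A sketch of the decode (peeling) procedure, again assuming the cell format, checksum_hash, and cell_indices from the encode sketch (they must match what was used when encoding). The purity test follows this slide; treating count = ±1 as the signal for which host holds the recovered ID is my reading of how the subtracted counts behave.

    def ibf_decode(cells, k=3):
        """Peel pure cells until nothing is left.
        Returns (fully_decoded, ids_only_in_a, ids_only_in_b) for diff = a - b."""
        only_a, only_b = set(), set()

        def pure(c):
            # A pure cell holds exactly one ID: count is +1 or -1 and H(idSum) == hashSum.
            return c[2] in (1, -1) and checksum_hash(c[0]) == c[1]

        queue = [i for i, c in enumerate(cells) if pure(c)]
        while queue:
            i = queue.pop()
            c = cells[i]
            if not pure(c):                 # may have been cleaned up by an earlier peel
                continue
            item, sign = c[0], c[2]
            (only_a if sign == 1 else only_b).add(item)
            # Remove the recovered ID from every cell it was assigned to;
            # this can make further cells pure, continuing the peeling.
            for j in cell_indices(item, len(cells), k):
                cells[j][0] ^= item
                cells[j][1] ^= checksum_hash(item)
                cells[j][2] -= sign
                if pure(cells[j]):
                    queue.append(j)

        fully_decoded = all(c == [0, 0, 0] for c in cells)
        return fully_decoded, only_a, only_b

If fully_decoded comes back False, the IBF had too few cells for the actual difference; the protocol would then retry with a larger IBF.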

How many IBF cells? [Plot: space overhead versus set-difference size, for 3 and 4 hash functions, measured as the overhead needed to decode with >99% success. Small differences: 1.4x to 2.3x; large differences: 1.25x to 1.4x.]

How many hash functions? One hash function produces many pure cells initially, but there is nothing left to undo when an element is removed. Many hash functions (say 10) cause too many collisions. We find by experiment that 3 or 4 hash functions work well. Is there some theoretical reason?

Theory. Let d = difference size and k = number of hash functions. Theorem 1: with (k + 1)d cells, the failure probability falls exponentially. For k = 3 this implies a 4x tax on storage, which is a bit weak. [Goodrich, Mitzenmacher]: failure is equivalent to finding a 2-core (a loop) in a random hypergraph. Theorem 2: with c_k · d cells, the failure probability falls exponentially; c_4 ≈ 1.3x tax, which agrees with experiments.

How many IBF cells? [Plot repeated: for large differences the space overhead to decode at >99% success is only 1.25x to 1.4x, consistent with the c_4 ≈ 1.3 bound above.]

Connection to Coding. Mystery: IBF decode looks like the peeling procedure used to decode Tornado codes. Why? Explanation: set difference is equivalent to coding over insertion/deletion channels. Intuition: given a code for set A, send only the codewords to B, and think of B's set as a corrupted form of A's. Reduction: if the code can correct D insertions/deletions, then B can recover A, and hence the set difference. (The analogy pairs Reed-Solomon codes with polynomial methods and LDPC/Tornado codes with difference digests.)

Difference Digests (recap) consist of two data structures: the Invertible Bloom Filter (IBF), which efficiently computes the set difference but needs the size of the difference, and the Strata Estimator, which approximates the size of the set difference using IBFs as a building block.

Strata Estimator: using a consistent partitioning, divide the keys into partitions containing roughly 1/2, 1/4, 1/8, 1/16, ... of the key space (partition k holds about a 1/2^k fraction). Encode each partition into an IBF of fixed size, giving log(n) IBFs of roughly 80 cells each.
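One concrete way to realize the consistent partitioning (an assumption; the slide does not spell it out) is to bucket each key by the number of trailing zero bits in a hash of the key, so roughly half the keys land in stratum 0, a quarter in stratum 1, and so on. This sketch reuses ibf_encode from the IBF sketch earlier; the 80-cell size is the figure quoted on the slide.

    import hashlib

    def stratum_index(key, num_strata):
        """Assign a key to a stratum by counting trailing zero bits of a hash:
        about 1/2 of keys get stratum 0, 1/4 get stratum 1, and so on."""
        digest = hashlib.blake2b(f"strata:{key}".encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        i = 0
        while i < num_strata - 1 and (h >> i) & 1 == 0:
            i += 1
        return i

    def build_strata_estimator(ids, num_strata=16, cells_per_stratum=80, k=3):
        """Encode each partition of the keys into its own small, fixed-size IBF."""
        partitions = [[] for _ in range(num_strata)]
        for x in ids:
            partitions[stratum_index(x, num_strata)].append(x)
        return [ibf_encode(p, cells_per_stratum, k) for p in partitions]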

4x Strata Estimator: attempt to subtract and decode the IBFs at each level. If level k decodes, return 2^k × (the number of IDs recovered). But what about the other strata?

2x Strata Estimator. Observation: the extra partitions hold useful data. Sum the elements recovered from all decoded strata and return 2^(k-1) × (the number of IDs recovered), where k is the coarsest level that still decodes.
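Putting the two estimator slides together, a hedged sketch of the estimation step (reusing ibf_subtract and ibf_decode from the earlier sketches; the indexing and scale factor follow my reading of the slides): decode from the finest stratum downward, and when a stratum fails to decode, scale the IDs recovered so far by the inverse of the fraction of the key space the decoded strata cover.

    def estimate_difference(strata_a, strata_b):
        """strata_x[i] is the fixed-size IBF for stratum i (about a 1/2^(i+1)
        fraction of the keys). Decode from the finest stratum downward."""
        recovered = 0
        for i in range(len(strata_a) - 1, -1, -1):
            diff = ibf_subtract(strata_a[i], strata_b[i])
            decoded, only_a, only_b = ibf_decode(diff)
            if not decoded:
                # The strata already decoded cover about a 1/2^(i+1) fraction
                # of the keys, so scale the recovered count up accordingly.
                return (2 ** (i + 1)) * recovered
            recovered += len(only_a) + len(only_b)
        return recovered   # every stratum decoded: the count is the exact difference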

Estimation Accuracy. [Plot: relative estimation error (%) versus set-difference size, average error for 15.3 KByte estimators.] Strata is good for small differences; Min-Wise is good for large differences.

Hybrid Estimator: combine the Strata and Min-Wise estimators, using the IBF strata for small differences and Min-Wise for large differences (the hybrid keeps the finer strata and replaces the remaining ones with a Min-Wise estimator).

Hybrid Estimator Accuracy. [Plot: relative estimation error (%) versus set-difference size, average error for 15.3 KByte estimators.] The Hybrid matches Strata for small differences and converges with Min-Wise for large differences.

Application: KeyDiff Service. Applications register keys with a key service that exposes Add(key), Remove(key), and Diff(host1, host2). Promising applications: file synchronization, P2P file sharing, failure recovery.
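A hypothetical sketch of the KeyDiff interface as an application might see it; the method names Add, Remove, and Diff come from the slide, while the types and the in-process "remote" are made up for illustration.

    from typing import Set, Tuple

    class KeyDiffService:
        """Tracks an application's keys and computes diffs against a peer."""

        def __init__(self) -> None:
            self._keys: Set[int] = set()

        def add(self, key: int) -> None:        # Add(key) on the slide
            self._keys.add(key)

        def remove(self, key: int) -> None:     # Remove(key) on the slide
            self._keys.discard(key)

        def diff(self, remote: "KeyDiffService") -> Tuple[Set[int], Set[int]]:
            """Diff(host1, host2): a real service would exchange a strata
            estimator and an IBF with the remote host; this stand-in just
            compares the sets directly to show the contract."""
            return self._keys - remote._keys, remote._keys - self._keys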

Difference Digests Summary
– Strata & Hybrid Estimators: estimate the size of the set difference. For 100K-element sets, a 15 KB estimator has <15% error; O(log n) communication, O(log n) computation.
– Invertible Bloom Filter: identifies all IDs in the set difference at 16 to 28 bytes per ID in the difference; O(d) communication, O(n + d) computation.
– Implemented in the KeyDiff service.

Conclusions: Got Diffs? Difference digests are a new randomized algorithm for set difference, or equivalently for insertion/deletion coding. Could they be useful for your system? They fit best when the sets are large but of roughly equal size and the set differences are small (less than 10% of the set size).

Extra Slides

Comparison to Logs. IBFs work with no prior context. Logs work with prior context, but: they carry redundant information when syncing with multiple parties; logging must be built into the system for each write; logging adds overhead at runtime; and logging requires non-volatile storage, which is often not present in network devices. IBFs may out-perform logs when synchronizing multiple parties or when synchronizations happen infrequently.