Peeling Arguments: Invertible Bloom Lookup Tables and Biff Codes


Peeling Arguments: Invertible Bloom Lookup Tables and Biff Codes. Michael Mitzenmacher.

Beginning: a survey of some of the history of peeling arguments.

A Matching Peeling Argument
Take a random graph with n vertices and m edges. Form a matching greedily as follows:
- Find a vertex of degree 1.
- Put the corresponding edge in the matching.
- Repeat until no vertices of degree 1 remain.
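The greedy degree-1 peeling above can be sketched in a few lines. This is a minimal illustrative version (the adjacency-set representation and function name are mine, not from the talk):

```python
from collections import defaultdict

def peel_matching(edges):
    """Greedy matching by peeling: repeatedly take the unique edge at a
    degree-1 vertex, add it to the matching, and remove both endpoints."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    matching = []
    stack = [v for v in adj if len(adj[v]) == 1]  # degree-1 vertices
    while stack:
        u = stack.pop()
        if len(adj[u]) != 1:
            continue  # u's degree changed after it was queued
        v = next(iter(adj[u]))
        matching.append((u, v))
        for x in (u, v):  # remove both endpoints and their incident edges
            for w in list(adj[x]):
                adj[w].discard(x)
                if len(adj[w]) == 1:
                    stack.append(w)
            adj[x].clear()
    return matching
```

On a path 1-2-3-4 this peels from the endpoints and returns a maximum matching; on general graphs it simply stops when no degree-1 vertex remains, possibly leaving edges unmatched.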

History
Studied by Karp and Sipser in 1981. Threshold behavior: they showed that if m < en/2 (here e is the base of the natural logarithm), the number of leftover edges is o(n). They used an analysis based on differential equations.

A SAT Peeling Argument
Random k-SAT and the pure literal rule. Typical setting: m clauses, n variables, k literals per clause.
Greedy algorithm: while there exists a literal that appears 0 times, set its value to 0 (and its negation to 1), and peel away the clauses containing its negation. Continue until done.

Random Graph Interpretation
A bipartite graph with vertices for literal pairs on one side, vertices for clauses on the other. When the number of neighbors of one literal in a pair drops to 0, we can remove both literals of the pair, along with the neighboring clauses.

History
Studied by Broder, Frieze, and Upfal (1996). Put in a fluid limit/differential equations framework by Mitzenmacher. Threshold: when m/n < c_k, a solution is found with high probability.

A Peeling Paradigm
- Start with a random graph.
- Peel away part of the graph greedily. Generally: find a node of degree 1, remove the edge and the other adjacent node; there are variations.
- Example variation: the k-core is the maximal subgraph with degree at least k. Find a vertex with degree less than k, remove it and all its edges, and continue.
- Analysis approach: after each step we are left with a random graph (with a different degree distribution). Keep reducing the graph all the way down, to an empty graph or a stopping point.
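The k-core variant of the paradigm can be sketched directly. This repeated-scan version is illustrative only; a queue-based version runs in linear time:

```python
def k_core(vertices, edges, k):
    """Peel vertices of degree < k until none remain; the survivors form
    the k-core, the maximal subgraph with minimum degree >= k."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(vertices)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if len(adj[v]) < k:  # peel v and all its edges
                alive.discard(v)
                for w in adj[v]:
                    adj[w].discard(v)
                adj[v].clear()
                changed = True
    return alive
```

Decoding success for the peeling processes later in the talk corresponds to this returning the empty set for k = 2.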

Not Just for Theory
Applications to codes and hash-based data structures; these ideas keep popping up. The resulting algorithms and data structures are usually very efficient: a simple greedy approach, low overhead accounting (counts on vertices), often linear or quasi-linear time. Useful for big data applications.

Low Density Parity Check Codes

Decoding by Peeling

Decoding Step (figure: the highlighted right node has exactly one remaining edge, so it can be peeled)

Decoding Results
Successful decoding corresponds to erasing the entire bipartite graph; equivalently, the graph has an empty 2-core. Using the peeling analysis, we can determine when the 2-core is empty with high probability. Thresholds can be found by differential equation analysis or "And-Or" tree analysis.

Tabulation Hashing
Example: hash a 32-bit string into a 12-bit string, using one table of 2^8 random 12-bit entries per 8-bit character position.

          Char 1        Char 2        Char 3        Char 4
00000000  101001010101  000011010101  111100111100  010011110001
00000001  110100101010  101010101010  101101010101  101010000111
00000010  001001011010  101010100000  000010110101  100101000111
00000011  010100001100  001010100100  101010010111  101010101100
00000100  010100101000  101010100111  000011001011  010100110101
00000101  010101010000  000010010101  111010100001  101001010001
00000110  110011001110  110010101110  000110101010  101011110101
00000111  101010100001  111000011110  000011010000  101010100010
…

Tabulation Hashing
Hash(00000000000000100000011100000001): split the key into characters 00000000, 00000010, 00000111, 00000001 and look each up in its own column:
Hash = 101001010101 xor 101010100000 xor 000011010000 xor 101010000111
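The table lookups above translate directly into code. This sketch fixes the same shape as the example (four 8-bit characters, 12-bit output); the function and parameter names are mine, not from the talk:

```python
import random

def make_tabulation_hash(chars=4, char_bits=8, out_bits=12, seed=1):
    """One table of 2^char_bits random out_bits-bit entries per character
    position; the hash of a key is the XOR of the looked-up entries."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(chars)]
    mask = (1 << char_bits) - 1

    def h(x):
        out = 0
        for i in range(chars):
            out ^= tables[i][(x >> (i * char_bits)) & mask]
        return out

    return h
```

Here `h = make_tabulation_hash()` hashes a 32-bit integer down to 12 bits.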

Peeling and Tabulation Hashing
Given a set of strings being hashed, we can peel as follows: find a string in the set that uniquely maps to a specific (character value, position); remove that string, and continue. The peeled set has completely independent hash values, because each peeled string is xor'ed with its very own random table entry.

Tabulation Hashing
Result due to Patrascu and Thorup.
Lemma: Suppose we hash n < m^(1-eps) keys of c characters into m bins, for some constant eps. For any constant a, all bins get fewer than d = ((1+a)/eps)^c keys with probability > 1 - m^(-a).
Bootstrap this result to get Chernoff bounds for tabulation hashing.

Proof
Let t = (1+a)/eps, so d = t^c.
Step 1: Every set of d elements has a peelable subset of size t. By the pigeonhole principle, some character position is hit by at least d^(1/c) distinct character values; otherwise there would be fewer than (d^(1/c))^c = d elements in total. Choose one element for each character value hit in that position.
Step 2: Use this to bound the maximum load. A maximum load of d implies some t-element peelable set landed in the same bin. There are at most (n choose t) such sets, and each lands in the same bin with probability m^(-(t-1)). A union bound gives a heavily loaded bin with probability at most m^(-a).

End Survey. Now, on to some new stuff.

Invertible Bloom Lookup Tables. Michael Goodrich and Michael Mitzenmacher.

Bloom Filters, + Values + Listing
Bloom filters are useful for set membership, but they don't allow (key,value) lookups. Bloomier filters extend them to (key,value) pairs and are also based on peeling methods, but they don't allow you to reconstruct the elements in the set. Can we find something that does key-value pair lookups and allows reconstruction? The Invertible Bloom Lookup Table (IBLT).

Functionality
IBLT operations:
- Insert(k,v)
- Delete(k,v)
- Get(k): returns the value for k if one exists, null otherwise; might return "not found" with some probability.
- ListEntries(): lists all current key-value pairs; succeeds as long as the current load is not too high (a design threshold).

Listing, Details
Consider data streams that insert and delete a lot of pairs: flows through a router, people entering and leaving a building. We want listing not at all times, but at "reasonable" or "off-peak" times, when the current working set size is bounded. If we do all N insertions, then all N-M deletions, and want a list at the end, the data structure size should be proportional to the listing size, not the maximum size: proportional to M, not to N. Proportional to the size you want to be able to list, not the number of pairs your system has to handle.

Sample Applications
- Network flow tracking: track flows on insertions/deletions. Possible to list flows at any time, as long as the network load is not too high; if too high, wait till it gets lower again. Can also do flow lookups (with small failure probability).
- Oblivious table selection.
- Database/set reconciliation: Alice sends Bob an IBLT of her data; Bob deletes his data; the IBLT difference determines the set difference.

Possible Scenarios
- A nice system: each key has (at most) 1 value; we delete only items that were inserted.
- A less nice system: keys can have multiple values; deletions might happen for keys not inserted, or for the wrong value.
- A further less nice system: key-value pairs might be duplicated.

The Nice System
A (k,v) pair is hashed to j cells.
- Insert: increase Count, update KeySum and ValueSum.
- Delete: decrease Count, update KeySum and ValueSum.
- Get: if Count = 0 in any of the key's cells, return null. Else, if Count = 1 in some cell and KeySum = key, return ValueSum. Else, return "not found".

Get Performance
Bloom filter style analysis. Let m = number of cells, n = number of key-value pairs, j = number of hash functions. The probability that a Get for a key k in the system returns "not found" is approximately (1 - e^(-jn/m))^j: every one of its j cells must be shared with other pairs. The probability that a Get for a key k not in the system returns "not found" (rather than null) is also approximately (1 - e^(-jn/m))^j: every one of its cells must be nonempty.

The Nice System: Listing
While some cell has a count of 1:
- Set (k,v) = (KeySum, ValueSum) of that cell.
- Output (k,v).
- Call Delete(k,v) on the IBLT.
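Putting the cell operations and the listing loop together, here is a toy "nice system" in code. To keep the example reproducible it uses a deliberately simple (and weak) cell-choice function; a real implementation would use strong hashes (e.g. tabulation hashing), and the class and method names here are mine:

```python
class IBLT:
    """Toy 'nice system' IBLT for integer keys and values: each (k,v)
    pair goes to j cells, each holding Count, KeySum, ValueSum."""

    def __init__(self, m=64, j=3):
        self.m, self.j = m, j
        self.count = [0] * m
        self.keysum = [0] * m
        self.valsum = [0] * m

    def _cells(self, k):
        # toy cell choice for reproducibility; use real hashes in practice
        return [(k * (i + 1) + i) % self.m for i in range(self.j)]

    def insert(self, k, v):
        for c in self._cells(k):
            self.count[c] += 1
            self.keysum[c] += k
            self.valsum[c] += v

    def delete(self, k, v):
        for c in self._cells(k):
            self.count[c] -= 1
            self.keysum[c] -= k
            self.valsum[c] -= v

    def get(self, k):
        for c in self._cells(k):
            if self.count[c] == 0:
                return None                      # key definitely absent
            if self.count[c] == 1 and self.keysum[c] == k:
                return self.valsum[c]
        return "not found"

    def list_entries(self):
        out, progress = [], True
        while progress:                          # the peeling loop
            progress = False
            for c in range(self.m):
                if self.count[c] == 1:
                    k, v = self.keysum[c], self.valsum[c]
                    out.append((k, v))
                    self.delete(k, v)
                    progress = True
        return out
```

After a successful ListEntries the table is empty again: every count is back to zero, mirroring the "erase the whole graph" condition from the decoding slides.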

Listing Example (figure: a table of cell counts 3 2 3 3 2 2 1 4 2 1; peeling begins at the cell with count 1)

The Nice System: Listing
The listing loop is a peeling process. This is the same process used to find the 2-core of a random hypergraph, and the same process used to decode families of low-density parity-check codes.

Listing Performance
Results on random peeling processes: thresholds for complete recovery depend on the number of hash functions.

  j    3      4      5      6
  m/n  1.222  1.295  1.425  1.570

An interesting possibility: use "irregular" IBLTs, with different numbers of hash functions for different keys; the same idea is used in LDPC codes.

Fault Tolerance: Extraneous Deletions
Now a count of 1 does not mean 1 key in the cell: a cell might hold two inserted keys plus one extraneous deletion. We need an additional check: hash the keys and sum the hashes into a HashKeySum field. If the count is 1 and the hash of KeySum equals HashKeySum, then there is 1 key in the cell. What about a count of -1? If the count is -1 and the hash of -KeySum equals -HashKeySum, then there is 1 (deleted) key in the cell.

Fault Tolerance: Keys with Multiple Values
We need another additional check: HashKeySum and HashValueSum. Multiply-valued keys "poison" a cell: the cell is unusable, since it will never have a count of 1. Small numbers of poisoned cells have minimal effect; usually, all other keys can still be listed. If not, the number of unrecovered keys is usually small, like 1.

Biff Codes: Using IBLTs for Codes. George Varghese and Michael Mitzenmacher.

Result
A fast and simple low-density parity-check code variant for correcting errors on a q-ary channel, for reasonable-sized q. Everything is expressed in terms of "hashing": no graphs or matrices are needed for coding. Builds on the intrinsic connection between set reconciliation and coding, which is worth greater exploration, and on the intrinsic connection between LDPC codes and various hashing methods, previously used e.g. for thresholds on cuckoo hashing.

Reminder: Invertible Bloom Lookup Tables
A Bloom-filter-like structure for storing key-value pairs (k,v). Functionality:
- Insert(k,v)
- Delete(k,v)
- Get(k): returns the value for k if one exists, null otherwise; might return "not found" with some probability.
- ListEntries(): lists all current key-value pairs; succeeds as long as the current load is not too high (a design threshold).

The Basic IBLT Framework
A (k,v) pair is hashed to j cells; each cell holds Count, KeySum, and ValueSum.
- Insert: increase Count, update KeySum and ValueSum.
- Delete: decrease Count, update KeySum and ValueSum.
Notes: KeySum and ValueSum can be XORs. Counts can be negative! (Deletions without insertion.) In fact, in our application, counts are not needed!

Set Reconciliation Problem
Alice and Bob each hold a set of keys, with a large overlap. Example: Alice is your smartphone phone book, Bob is your desktop phone book, and new entries or changes need to be synched. We want one or both parties to learn the set difference, with communication proportional to the size of the difference. IBLTs yield an effective solution for set reconciliation. (Used in the code construction…)

Biff Code Framework
Alice has message x1, x2, ..., xn. She creates an IBLT with ordered pairs as keys: (x1,1), (x2,2), ..., (xn,n); the value for each key is a checksum hash of the key. Alice's "set" is these n ordered pairs. She sends the message values and the IBLT. Bob receives y1, y2, ..., yn; Bob's set has ordered pairs (y1,1), (y2,2), ..., (yn,n). Bob uses the IBLT to perform set reconciliation on the two sets.

Reconciliation/Decoding For now, assume no errors in IBLT sent by Alice. Bob deletes his key-value pairs from the IBLT. What remains in the IBLT is the set difference, corresponding to symbol errors. Bob lists elements of the IBLT to find the errors.

Decoding Process
Suppose a cell has one pair in it: then the checksum should match the key value. And if a cell holds more than one pair, the checksum should not match. Choose the checksum length so there are no false matches with good probability. If the checksum matches, recover the element and delete it from the IBLT. Continue until all pairs are recovered.
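A compact end-to-end sketch of this decoding loop. Assumptions worth flagging: the cell-choice function is a toy linear map chosen for reproducibility (a real Biff code uses strong hashes), and the sketch keeps a count field so Bob can tell Alice's pairs (count +1) from his own (count -1), even though the talk notes counts can be dropped:

```python
import hashlib

M, J = 50, 3  # toy table size and number of hash functions

def chk(key):
    """Checksum hash of an (index, symbol) key, for singleton detection."""
    return int(hashlib.sha256(repr(key).encode()).hexdigest(), 16) & 0xFFFFFFFF

def cells(key):
    # toy cell-choice function (chosen for reproducibility, not strength)
    idx, sym = key
    return [(7 * idx + 13 * sym + 17 * i) % M for i in range(J)]

def update(table, key, d):
    """Insert (d=+1) or delete (d=-1) a key; the sums are XORs."""
    idx, sym = key
    for c in cells(key):
        cnt, ti, ts, tc = table[c]
        table[c] = (cnt + d, ti ^ idx, ts ^ sym, tc ^ chk(key))

def encode(msg):
    table = [(0, 0, 0, 0)] * M
    for i, x in enumerate(msg):
        update(table, (i, x), +1)
    return table

def decode(received, table):
    """Bob deletes his pairs, then peels the set difference: a cell whose
    checksum matches holds exactly one pair; count +1 means it is Alice's
    (correct) pair, so patch the received word with it."""
    received = list(received)
    for i, y in enumerate(received):
        update(table, (i, y), -1)
    progress = True
    while progress:
        progress = False
        for c in range(M):
            cnt, idx, ts, tc = table[c]
            if cnt in (1, -1) and chk((idx, ts)) == tc:
                update(table, (idx, ts), -cnt)  # peel the pair out
                if cnt == 1:
                    received[idx] = ts          # Alice's correct symbol
                progress = True
    return received
```

With no errors, Bob's deletions cancel the table to all-zero cells and decoding is a no-op; each symbol error leaves two pairs in the difference (Alice's and Bob's), which is the source of the factor-of-2 overhead.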

Peeling Process (figure: keys hashed into cells of the hash table)

Analysis
The analysis follows standard approaches for LDPC codes: e.g., differential equation/fluid limit analysis, and Chernoff-like bounds on behavior.
Overheads: the set difference size is 2x the number of errors; could we get rid of the factor of 2? There is also decoding structure overhead: it is best not to use a "regular graph" (the same number of hash functions per item), but simpler to do so.

Fault Tolerance
What about errors in the IBLT cells themselves? Such a cell is "bad" and can't be used for recovery; if the checksum works, there is a low probability of a decoding error. Enough bad IBLT cells will harm decoding. The most likely scenario: one key-value pair has all of its IBLT cells go bad, so it cannot be recovered. So the most likely error is 1 unrecovered value (or a small number). Various remedies are possible: small additional error-correction in the original message, or recursively protecting the IBLT cells with a code.

Simple Code
The code is essentially a lot of hashing and XORing of values.

Experimental Results
Test implementation: 1 million 20-bit symbols; 20 bits also describes the location, used for the checksum. All computations are 64 bits. IBLTs of 30000 cells. 10000 symbol errors (1% rate), 600 IBLT errors (2% rate).

Experimental Results
Parameters were chosen so we expect rare but noticeable failures with 4 hash functions: all IBLT cells for some symbol are in error in approximately 1.6 x 10^-3 of trials. Failures are rarer with 5 hash functions: approximately 3.2 x 10^-5 of trials. Observed: 16 failures in 1000 trials for 4 hash functions, none for 5 hash functions. The failures are all a single unrecovered element! Experiments match the analysis.

Experimental Results: Timing
Less than 0.1 seconds per decoding; most of the time is simply putting elements into the hash table. With 4 hash functions: 0.0561 seconds on average to load data into the table, 0.0069 for the subsequent decoding. With 5 hash functions: 0.0651 seconds on average to load, 0.0078 to decode. Optimizations and parallelizations are possible.

Conclusions for Biff Codes
Biff codes are extremely simple LDPC-style codes for symbol-level errors: a simple code, very fast, with some cost in overhead. Expectation: they will prove useful for large-data scenarios. Didactic value: an easy way to introduce LDPC concepts.

Peeling
The peeling method is a useful approach when analyzing random graph processes, and many problems can be put in this framework: matchings, codes, hash tables, Bloom filter structures. Designing processes to use peeling leads to quick algorithms: simple and time-efficient, often with space tradeoffs. What other data structures and algorithms can we put in the peeling framework?