Network Applications of Bloom Filters: A Survey


Network Applications of Bloom Filters: A Survey
Andrei Broder and Michael Mitzenmacher
Presenter: Chen Qian
Slides credit: Hongkun Yang

Outline Bloom Filter Overview Standard Bloom Filters Counting Bloom Filters Historical Applications Network Applications: Distributed Caching, P2P/Overlay Networks, Resource Routing Conclusion. First of all, we will talk about the mathematics behind the Bloom filter. We will consider standard Bloom filters and an extension, counting Bloom filters. We will see how they work, their operations, and how to optimize them. Then I will briefly introduce historical applications of BFs. The BF was invented in the 1970s, so it is about 40 years old. In those early days, memory was scarce in computers, and people used BFs to dramatically reduce the memory usage of programs. Then we will spend more time on modern applications of BFs in computer networking, with examples from distributed caching, P2P networks, and resource routing. Finally we draw conclusions.

Overview Burton Bloom introduced it in the 1970s Randomized data structure Representing a set to support membership queries Dramatic space savings Allows false positives. The BF was introduced by Burton Bloom in the 1970s and is named after him. Since then, Bloom filters have been very popular in database applications, and recently they have started receiving more interest in the networking literature. A Bloom filter is a simple, space-efficient randomized data structure for representing a set in order to support membership queries: given a set and an element, you can use a BF to check whether the element is in the set. The BF offers dramatic space savings, but at the cost of a small false positive rate. So it is very useful when memory consumption is a great concern and a small false positive rate can be tolerated.

Standard Bloom Filters: Notations S: the set of n elements {x1, x2, …, xn} k independent hash functions h1, …, hk with range {1, …, m} Assume: the hash functions map each item in the universe to a number uniformly at random over the range {1, …, m} (MD5 fits this assumption well in practice) An array B of m bits, initially filled with 0s. Now let's look at the standard BF and introduce some notation. Let S be a set of n elements x1, x2, …, xn. We have k independent hash functions, h1 through hk, each of which hashes an element to a number in the range 1 to m. For the convenience of the mathematical analysis, we assume the hash functions map each item in the universe to a random number uniformly distributed over {1, …, m}: if you pick an element of the universe uniformly at random, the hash value is a random variable uniformly distributed over 1…m. In practice hash functions cannot exactly satisfy this assumption, but the paper suggests that MD5 fits it well. A BF is an array B of m bits, all initially set to 0.

Standard Bloom Filters: How It Works Hash each xi in S k times; set B[hj(xi)] = 1 for j = 1, …, k To check whether y is in S, check B at hj(y) for j = 1, 2, …, k If all k bits are 1, y is assumed to be in S; if not, y is clearly not in S No false negatives, but possible false positives. How does the standard BF work? First we hash each element of S k times, using the k hash functions; if a hash function maps element x to position a, we set the a-th bit of B to 1. Given an element y, to check whether y is in S, we examine the k bits h1(y), h2(y), …, hk(y) of B. If all k bits are set to 1, we claim y is in S; if some bit is 0, we claim y is not in S. The second case is always correct: if y were in S, all k of its bits would have been set to 1, so there are no false negatives. The first case, however, is not necessarily correct: it is possible that y is not in S, yet its k bits have all been set to 1 by some (say, two or three) elements of S. So the filter has a false positive rate.
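The insert-and-check procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation; deriving the k hash values by salting MD5 is one simple way to approximate the independent-uniform-hashing assumption the slides make.

```python
import hashlib

class BloomFilter:
    """Minimal standard Bloom filter: m bits, k salted-MD5 hash functions."""

    def __init__(self, m, k):
        self.m = m              # number of bits
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _hashes(self, item):
        # Derive k hash values by hashing the item with k different salts.
        for j in range(self.k):
            digest = hashlib.md5(f"{j}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def query(self, item):
        # True may occasionally be a false positive; False is always correct.
        return all(self.bits[h] == 1 for h in self._hashes(item))
```

A `query` that returns False is definitive (no false negatives); a True answer is only probably correct.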

Standard Bloom Filters: An Example B. Now we consider an example of how a BF works; from it we can see how a false positive happens. Assume the BF has 6 bits, all initially set to 0, and that there are two hash functions. INITIAL STATE

Standard Bloom Filters: An Example 1 1 1 B. Now we insert two elements of S into the BF. x1 is hashed twice by the two hash functions and the corresponding bits are set to 1; then x2 is inserted the same way. (Here the two elements share one bit, so three bits are set in total.) INSERTION

Standard Bloom Filters: An Example y1 y2 1 1 1 B. Now, using the BF, we check whether two elements y1 and y2 are in the set S. y1 is hashed twice, and we find that its corresponding bits are not both set to 1, so we can confidently conclude that y1 is not in S, since the BF has no false negatives. Then we check y2. We find that both of its bits have been set to 1, by x1 and x2 respectively, as shown on the previous slide. So we falsely claim that y2 is in S: a false positive occurs. CHECK

Standard Bloom Filters: False Positive Rate (1) Pr[a given bit in B is 0] = p' = (1 − 1/m)^{kn} ≈ e^{−kn/m} = p The probability of a false positive is f' = (1 − p')^k ≈ (1 − e^{−kn/m})^k = f Let ρ be the proportion of 0 bits after all elements are inserted in the Bloom filter Conditioned on ρ, the probability of a false positive is (1 − ρ)^k. Now we calculate the false positive rate of the BF. Under the assumption that the hash functions are uniformly random, the probability that a given bit of B is still 0 after inserting the n elements of S is p'. This is not hard to see: each element of S is hashed k times, so we can view the process as kn independent trials, each of which picks one bit at random and sets it to 1. Since (1 − 1/m) is the probability that a given bit is not chosen in one trial, after kn trials the bit is still 0 with probability (1 − 1/m)^{kn} = p'. We usually use the approximation p = e^{−kn/m} in place of the exact value p' because it is very accurate and it simplifies the math. The false positive rate is then the probability that an element not in S is accepted by the BF, which happens if and only if all k of its bits are set to 1; so the false positive rate is f' = (1 − p')^k, with f as its approximation. This false positive rate is an average: we choose the set S of size n at random from the universe and then consider the false positive rate. Now consider a conditional false positive rate. Let ρ be the proportion of 0 bits after all elements are inserted in the BF; if S is selected at random, the expected value of ρ is p'. Conditioned on ρ (i.e., ρ is known), an element not in S hits a 1 bit with probability 1 − ρ on each hash, so the conditional false positive rate is (1 − ρ)^k. Why do we need the conditional rate? If ρ had large variation, the false positive rate would also vary a lot; it would depend heavily on the particular set S inserted in the BF, and the average rate alone would not give us much insight.

Standard Bloom Filters: False Positive Rate (2) The fraction of 0 bits is extremely concentrated around its expectation Therefore, with high probability, (1 − ρ)^k ≈ (1 − p)^k = f. Theoretical results show that ρ is close to its mean value with high probability, so this approximation holds with high probability.
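The approximation f = (1 − e^{−kn/m})^k is easy to evaluate numerically; the short sketch below (parameters chosen only for illustration) shows the kind of rate a typical configuration gives.

```python
import math

def false_positive_rate(n, m, k):
    """Approximate false positive rate f = (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Illustrative parameters: n = 1000 elements, m = 10000 bits, k = 5 hashes.
f = false_positive_rate(1000, 10000, 5)
print(f)  # roughly 0.0094, i.e. just under a 1% false positive rate
```

As expected, inserting more elements into the same array (larger n, fixed m and k) drives the rate up.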

Standard Bloom Filters: Optimal Number of Hash Functions (1) Two competing forces: More hash functions give more chances to find a 0 bit for an element that is not a member of S Fewer hash functions increase the fraction of 0 bits in the array. Now we consider the optimal number of hash functions. There are two competing forces. First, more hash functions give more chances to find a 0 bit for an element that is not a member of S, since the BF accepts an element only if all corresponding bits are set to 1. Second, fewer hash functions increase the fraction of 0 bits in the array, since we set fewer bits to 1 when inserting each element. So there should be an optimal value in between.

Standard Bloom Filters: Optimal Number of Hash Functions (2) This is a numerical example: m is the length of the BF and n is the size of the set S. The graph plots the false positive rate against k, and we can see that at some point the rate is minimized; in this case the optimal k is about 5.5.

Standard Bloom Filters: Optimal Number of Hash Functions (3) Note that f = (1 − e^{−kn/m})^k = e^{g}, where g = k ln(1 − e^{−kn/m}) Solve dg/dk = 0 Rewrite g as g = −(m/n) ln(p) ln(1 − p), where p = e^{−kn/m} Using symmetry, g is minimal when p = 1/2 Then k = (m/n) ln 2 and f = (1/2)^k ≈ (0.6185)^{m/n}. We can find the optimal number of hash functions using calculus. Consider the false positive rate f; since the exponential function is increasing, we can minimize g instead of f, and once g is minimized, f is minimized. We take the derivative of g with respect to k and solve dg/dk = 0 for k; it can be shown that this k is the optimal number of hash functions. But we can avoid the calculus by exploiting the symmetry of g: g can be rewritten as above, where p = e^{−kn/m} is the probability that a specific bit is still 0. By the symmetry between p and 1 − p, we can guess that g is minimized when p = 1/2, and this is indeed true. At that point k = (m/n) ln 2.
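The closed form k = (m/n) ln 2 can be checked numerically; this small sketch uses m/n = 8 bits per element, which matches the graph on the previous slide where the optimum was about 5.5.

```python
import math

def optimal_k(m, n):
    """Optimal number of hash functions: k = (m/n) * ln 2."""
    return (m / n) * math.log(2)

m, n = 80, 10                 # 8 bits per element
k = optimal_k(m, n)           # ≈ 5.545; in practice round to 5 or 6
f_min = 0.5 ** k              # minimal false positive rate (1/2)^k
print(k, f_min)               # ≈ 5.545 and ≈ 0.021
```

Since k must be an integer, a real filter would use k = 5 or k = 6, with a slightly higher false positive rate than the continuous optimum.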

Standard Bloom Filters: Space Efficiency A lower bound: let ε be the false positive ratio; then any structure answering membership queries with false positive rate ε needs m ≥ n log2(1/ε) bits The optimal case: the false positive rate of the optimal Bloom filter is f = (0.6185)^{m/n} Requiring f ≤ ε gives m ≥ n log2(e) · log2(1/ε) ≈ 1.44 n log2(1/ε). Now we consider how many bits a BF needs for a given threshold on the false positive rate. Using some counting tricks (viewing a BF as a mapping from a set of size n to an m-bit binary string), we can derive the lower bound above. For the optimal BF, the false positive rate is f = (0.6185)^{m/n}; requiring f ≤ ε and solving for m yields the second bound. So the optimal BF is within a factor of log2(e) ≈ 1.44 of the lower bound.
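The factor of 1.44 can be made concrete; the sketch below (parameters illustrative) compares the lower bound with the space of an optimally tuned Bloom filter.

```python
import math

def lower_bound_bits(n, eps):
    """Information-theoretic lower bound: n * log2(1/eps) bits."""
    return n * math.log2(1 / eps)

def optimal_bloom_bits(n, eps):
    """Space of an optimally tuned Bloom filter: n * log2(e) * log2(1/eps)."""
    return n * math.log2(math.e) * math.log2(1 / eps)

n, eps = 1000, 0.01
print(lower_bound_bits(n, eps))    # ≈ 6644 bits
print(optimal_bloom_bits(n, eps))  # ≈ 9585 bits, a factor of ≈ 1.44 more
```

For a 1% false positive rate this works out to roughly 9.6 bits per element for the Bloom filter, against a theoretical minimum of about 6.6.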

Standard Bloom Filters: Operations (1) Union: build a Bloom filter representing the union of A and B by taking the OR of BF(A) and BF(B) Shrinking a Bloom filter: halve the size by taking the OR of the first and second halves of the Bloom filter (this increases the false positive rate) The intersection of two sets. Let us look at the operations on BFs; some can be done easily. If we have two BFs representing sets A and B respectively (with the same size and the same hash functions), then the Bloom filter representing the union of A and B is simply the OR of the two filters. If you want to reduce the size of a BF, you can take the OR of its first and second halves: you get a half-sized BF, but the false positive rate increases. There seems to be no simple way to generate a BF representing the intersection of two sets from their BFs; note that simply taking the AND of the two BFs does not yield the BF of the intersection set. But we can still estimate the size of the intersection.
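Both operations are plain bitwise manipulations; a sketch, representing a filter as a Python list of bits and assuming both filters share the same length and hash functions:

```python
def bf_union(bits_a, bits_b):
    """OR of two same-size, same-hash Bloom filters represents A ∪ B:
    the result is bit-for-bit what building BF(A ∪ B) directly would give."""
    return [a | b for a, b in zip(bits_a, bits_b)]

def bf_halve(bits):
    """Shrink a filter by OR-ing its first and second halves; lookups must
    then reduce each hash value modulo the new, smaller size. The false
    positive rate goes up because distinct positions are merged."""
    half = len(bits) // 2
    return [bits[i] | bits[half + i] for i in range(half)]
```

For example, `bf_halve([1, 0, 0, 1])` folds a 4-bit filter to `[1, 1]`.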

Standard Bloom Filters: Operations (2) The intersection of S1 and S2: let Z1 be the number of 0 bits in BF(S1), Z2 in BF(S2), and Z12 in the AND of BF(S1) and BF(S2) A bit is 0 in the AND if it is 0 in at least one filter, so E[Z12] = E[Z1] + E[Z2] − m(1 − 1/m)^{k(n1 + n2 − |S1 ∩ S2|)} where E[Z1] = m(1 − 1/m)^{k n1} and E[Z2] = m(1 − 1/m)^{k n2}. So we can take the AND of the two BFs, count its bits, set the observed counts equal to their expectations, and solve the equation for the size of the intersection set. This yields the estimate |S1 ∩ S2| ≈ ln( Z1 Z2 / (m (Z1 + Z2 − Z12)) ) / ( k ln(1 − 1/m) ).
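One way to estimate the intersection size is to invert the expected zero-bit counts; a sketch under the uniform-hashing assumption, where z1, z2, z12 are the numbers of 0 bits in BF(S1), BF(S2), and their bitwise AND:

```python
import math

def estimate_intersection(z1, z2, z12, m, k):
    """Estimate |S1 ∩ S2| from zero-bit counts of BF(S1), BF(S2), and
    their bitwise AND, by inverting the expected counts."""
    q = 1.0 - 1.0 / m          # Pr[one hash leaves a given bit at 0]
    # E[Z1] = m q^(k n1), E[Z2] = m q^(k n2), and
    # E[Z1] + E[Z2] - E[Z12] = m q^(k (n1 + n2 - |S1 ∩ S2|)).
    return math.log(z1 * z2 / (m * (z1 + z2 - z12))) / (k * math.log(q))
```

Feeding the estimator the exact expected counts recovers the true intersection size; with observed counts from real filters the answer is approximate.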

Counting Bloom Filters: Motivation Standard Bloom filters: easy to insert elements, but cannot perform deletion operations Counting Bloom filters: each entry is not a single bit but a small counter Insert an element: increment the corresponding counters Delete an element: decrement the corresponding counters. So far we have talked about the standard BF. Inserting into a standard BF is easy: you just set the corresponding bits to 1. But if the set of elements changes over time, we may need to delete old elements; for example, in a web cache, unpopular items are deleted and popular items are inserted. We cannot perform deletion in a standard BF: if you reset a bit to 0, some other elements may also hash to that bit, and the filter would then give false negatives. To address this problem, the counting BF was introduced; we will talk about its applications later. In a counting BF, each entry is a small counter instead of a single bit.

Counting Bloom Filters: An Example B. Now we consider an example of how a counting BF works. Assume it has 6 counters, all initially set to 0, and that there are two hash functions. INITIAL STATE

Counting Bloom Filters: An Example 1 2 1 1 B. Now we insert two elements of S. x1 is hashed twice by the two hash functions and the corresponding counters are incremented; then x2 is inserted the same way. One counter reaches 2 because both elements hash to that position. INSERTION

Counting Bloom Filters: An Example 1 1 2 1 B. To delete an element, we hash it with the same two hash functions and decrement the corresponding counters. A position whose counter drops back to 0 then behaves like a 0 bit, while counters shared with remaining elements stay positive, so lookups for the other elements are unaffected. DELETION
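The insert/delete mechanics can be sketched as follows; a minimal illustration that uses unbounded Python ints where a real counting Bloom filter would use small (e.g., 4-bit) counters, with MD5 salting standing in for the k independent hash functions.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: counters instead of bits, so deletion
    is possible by decrementing."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _hashes(self, item):
        for j in range(self.k):
            digest = hashlib.md5(f"{j}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for h in self._hashes(item):
            self.counters[h] += 1

    def delete(self, item):
        # Only delete items known to be present, or counters can underflow.
        for h in self._hashes(item):
            self.counters[h] -= 1

    def query(self, item):
        return all(self.counters[h] > 0 for h in self._hashes(item))
```

Deleting x2 only removes x2's own increments, so shared counters stay positive and x1 is still found.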

Counting Bloom Filters: How Large Counters Do We Need? (1) n elements, k hash functions, m counters, and c(i) the count associated with the ith counter; c(i) is a Binomial(nk, 1/m) random variable The tail probability is bounded by Pr[c(i) ≥ j] ≤ C(nk, j)(1/m)^j ≤ (e·nk/(jm))^j Then use the union bound over all m counters: Pr[max_i c(i) ≥ j] ≤ m(e·nk/(jm))^j. One may ask: how large do the counters need to be? If a counter is too small, it will overflow after some insertions; if it is too large, it wastes space. The value of a counter can be viewed as a binomial random variable: each of the nk hashes lands on a given counter with probability 1/m, and the tail probability can be bounded as above.

Counting Bloom Filters: How Large Counters Do We Need? (2) 4 bits per counter is enough The maximum counter value is O(log m) with high probability, and hence O(log log m) bits are sufficient (let j = 3 ln m / ln ln m in the bound) If each counter has 4 bits, its maximal value is 15, since the counter ranges over 0–15; the counting BF overflows when the maximum of the counters exceeds 15. With the optimal k = (m/n) ln 2, the union bound gives Pr[max_i c(i) ≥ 16] ≤ m(e ln 2 / 16)^16 ≈ 1.37 × 10^−15 · m, so 4 bits per counter suffice in practice. More generally, plugging j = 3 ln m / ln ln m into the bound shows the maximal counter value is O(log m) with high probability.
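The overflow bound Pr[max c(i) ≥ j] ≤ m(e ln 2 / j)^j is easy to evaluate; this sketch assumes the filter is tuned to the optimal k = (m/n) ln 2, so that nk/m = ln 2.

```python
import math

def overflow_bound(m, j):
    """Union bound on Pr[some counter reaches j] for a counting Bloom
    filter tuned to the optimal k = (m/n) ln 2 (so nk/m = ln 2):
    Pr[max c(i) >= j] <= m * (e * ln 2 / j)^j."""
    return m * (math.e * math.log(2) / j) ** j

# 4-bit counters overflow at j = 16; even for a million counters the
# probability is astronomically small.
print(overflow_bound(1_000_000, 16))
```

The per-counter term at j = 16 is about 1.37 × 10⁻¹⁵, which is why 4-bit counters are safe for any realistic m.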

Historical Applications Dictionaries Hyphenation programs UNIX spell-checkers Dictionary of unsuitable passwords Databases Semi-join operations Differential files. Now let us look at applications of BFs. In the early days, memory was a scarce resource in computers, and BFs are succinct representations of sets of items, so programs at that time used BFs to reduce memory usage. In dictionary applications, BFs are used to represent a dictionary of words: for example, a spell-checker can use a BF to store all correct words. Memory is saved, but the implication of false positives is that the spell-checker is not 100% accurate: it is possible to misspell a word and still have it accepted by the checker. BFs are also popular in database applications, where they are used to compute set intersections and set differences; some peer-to-peer applications use the same idea, as we will elaborate later.

Distributed Caching: Scenario Let us look at more recent applications, starting with distributed caching. This work is from the University of Wisconsin-Madison and appeared in SIGCOMM 1998. The paper considers the following scenario: when a client sends a URL request, a web proxy checks whether its web cache holds the desired page. If so, the proxy returns the cached page to the client rather than making a request to the origin web server. Using web caches reduces latency and the workload on web servers.

Distributed Caching: Summary Cache Motivation Sharing of caches among Web proxies to reduce Web traffic and alleviate network bottlenecks Directly sharing lists of URLs has too much overhead Solution Use Bloom filters to reduce network traffic Use a counting Bloom filter to track cache contents Broadcast the corresponding standard Bloom filter to other proxies. We can further improve performance by sharing caches among web proxies, which further reduces web traffic and alleviates network bottlenecks, since one proxy can check whether another proxy has the desired page rather than sending the request to the origin server. However, directly sharing lists of URLs causes large communication overhead. The solution proposed by the paper is to use BFs to reduce network traffic. Since the contents of a web cache change over time, a standard BF is not suitable because it cannot perform deletion; the paper therefore uses a counting BF to track the local cache contents and broadcasts the corresponding standard BF to the other proxies, since they only need to know whether a particular item is in this cache.

P2P/Overlay Networks: Content Delivery Problem Peer A has a set of items SA, peer B has SB; B wants the useful items from A (SA − SB) Solution B sends A its Bloom filter BF(B) A sends B its items that are not in SB according to BF(B) Implications of false positives Not all elements of SA − SB will be sent: an item in SA − SB that is a false positive of BF(B) is withheld This is acceptable for redundant items (e.g., erasure-coded content), where a large fraction of SA − SB is sufficient and the entire set is not needed. Now we look at some applications in peer-to-peer networks; in content delivery, the problem can be formulated as above.

P2P/Overlay Networks: Efficient P2P Keyword Searching (1) Problem Peer A has a set of items SA, peer B has SB; the goal is to determine SA ∩ SB Solution A sends B its Bloom filter BF(A) B sends A its items that appear to be in SA according to BF(A) A eliminates the false positives and determines SA ∩ SB exactly Fewer bits are transmitted than if A sent the entire set SA. Let us look at another application, efficient peer-to-peer keyword searching. In this application, each document has several keywords, and each peer node is responsible for one keyword, storing the IDs of all documents containing that keyword. A client searches with, for example, two keywords; to find the documents matching both keywords, we need to compute a set intersection.

P2P/Overlay Networks: Efficient P2P Keyword Searching (2) (Figure) The client sends its request (1) to server A; A sends BF(A) (2) to server B, with SA = {1, 2, 3, 4} and SB = {3, 4, 5, 6}. A and B are servers storing the document ID lists for keywords kA and kB, i.e., the sets of document IDs matching each keyword, and BF(A) is a Bloom filter representation of SA; Bloom filters help reduce the bandwidth requirement of such "AND" queries. B sends back SB ∩ BF(A) = {3, 4, 6}; note the false positive 6, which server A eliminates in order to return the exact intersection SA ∩ SB = {3, 4} to the client.

Resource Routing (1) Network is in the form of a rooted tree Nodes hold resources Each node keeps Bloom filters representing A unified list of the resources it holds or that are reachable through one of its children Individual lists of resources for itself and for each child When receiving a request for a resource Check the unified list to see whether the node or its descendants hold the resource Yes: check the individual lists and forward the request to the matching child (or answer locally) No: forward the request up the tree toward the root
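The up-then-down routing rule can be sketched as below. This is an illustration only: plain Python sets stand in for the unified and per-child Bloom filters (a real deployment would use filters, trading space for occasional false-positive detours), and the tree must be built bottom-up.

```python
class Node:
    """Tree node for resource routing; sets stand in for Bloom filters."""

    def __init__(self, name, resources=(), children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        # Individual lists: one for this node, one per child subtree.
        self.lists = {"self": set(resources)}
        for c in self.children:
            c.parent = self
            self.lists[c] = c.unified
        # Unified list: everything held here or reachable below.
        self.unified = set().union(*self.lists.values())

    def route(self, resource):
        """Return the name of a node holding the resource, or None."""
        if resource in self.unified:
            if resource in self.lists["self"]:
                return self.name
            for c in self.children:
                if resource in self.lists[c]:
                    return c.route(resource)
        if self.parent is not None:
            return self.parent.route(resource)
        return None
```

With Bloom filters in place of the sets, a false positive would send a request down a fruitless branch, but never cause a reachable resource to be missed.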

Resource Routing (2) Let us look at the animation. The red node asks for a resource held by the blue node: it first forwards the request up to its parent, and the request travels up the tree until it reaches a node whose unified list contains the resource; it then travels down through the children whose individual lists match, until it arrives at the blue node.

Conclusion A simple, space-efficient representation of a set or a list that can handle membership queries Applications in numerous networking problems The Bloom filter principle: whenever a list or set is used and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated

THANK YOU!