1
Network Applications of Bloom Filters: A Survey
Andrei Broder and Michael Mitzenmacher Presenter: Chen Qian Slides credit: Hongkun Yang
2
Outline
Bloom Filter Overview: Standard Bloom Filters, Counting Bloom Filters. Historical Applications. Network Applications: Distributed Caching, P2P/Overlay Networks, Resource Routing. Conclusion. First we will cover the mathematics behind the Bloom filter. We will consider standard Bloom filters and an extension, counting Bloom filters: how they work, what operations they support, and how to optimize them. Then I will briefly introduce historical applications of Bloom filters. The Bloom filter was invented in 1970, so it is about 40 years old; in those early days memory was scarce, and programs used Bloom filters to dramatically reduce their memory footprint. We will then spend more time on modern applications of Bloom filters in computer networking, with examples from distributed caching, P2P networks, and resource routing. Finally we draw the conclusions of this lecture.
3
Overview Burton Bloom introduced it in 1970. Randomized data structure
Representing a set to support membership queries. Dramatic space savings. Allows false positives. The Bloom filter was introduced by Burton Bloom in 1970 and is named after him. Since then Bloom filters have been very popular in database applications, and more recently they have received growing interest in the networking literature. A Bloom filter is a simple, space-efficient, randomized data structure for representing a set in order to support membership queries: given a set and an element, the filter tells you whether the element is in the set. The dramatic space savings come at the cost of a small false positive rate, so a Bloom filter is most useful when memory consumption is a major concern and a small false positive rate can be tolerated.
4
Standard Bloom Filters: Notations
S is the set of n elements {x1, x2, …, xn}. There are k independent hash functions h1, …, hk with range {1, …, m}. Assumption: the hash functions map each item in the universe to a random number distributed uniformly over the range {1, …, m} (the paper suggests MD5 fits this assumption well). B is an array of m bits, initially all 0. Now let's look at the standard Bloom filter and introduce some notation. Let S be a set of n elements x1, x2, …, xn. We have k independent hash functions h1 through hk, each hashing an element of S to a number in the range 1 to m. For the convenience of the mathematical analysis, we assume the hash functions map each item in the universe to a number chosen uniformly at random from {1, …, m}; that is, if you pick an element of the universe uniformly at random, each hash value is a random variable uniformly distributed over 1…m. In practice hash functions cannot exactly satisfy this assumption, but the paper suggests that MD5 fits it well. A Bloom filter is then an array B of m bits, with all bits initially set to 0.
5
Standard Bloom Filters: How It Works
Hash each xi in S k times: for j = 1, …, k, set B[hj(xi)] = 1. To check whether y is in S, check B at hj(y) for j = 1, 2, …, k. If all k bits are set to 1, y is assumed to be in S; if not, y is clearly not in S. No false negatives, but possible false positives. How does a standard Bloom filter work? First we hash each element of S k times, using the k hash functions; if a hash function maps an element x to position a, we set the a-th bit of B to 1. Given an element y, to check whether y is in S we examine the k bits B[h1(y)], …, B[hk(y)]. If all k bits are 1, we claim that y is in S; if some bit is 0, we claim that y is not in S. In the second case the answer is always correct: if y were in S, all k of its bits would have been set to 1, so there are no false negatives. In the first case, however, it is possible that y is not in S and yet the k corresponding bits were set to 1 by other elements of S (say, two or three of them), so there is a false positive rate.
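To make the insert and check operations concrete, here is a minimal Python sketch. The class name, the trick of salting a single hash to simulate k independent hash functions, and the parameter values are illustrative assumptions, not the paper's implementation.

import hashlib

class BloomFilter:
    """Minimal standard Bloom filter: m bits, k (simulated) hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m                      # the array B, initially all 0

    def _positions(self, item):
        # Simulate k independent hash functions by salting one strong hash.
        for j in range(self.k):
            digest = hashlib.sha1(f"{j}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1                   # set B[h_j(x)] = 1

    def __contains__(self, item):
        # All k bits set: "probably in S"; any 0 bit: definitely not in S.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, k=5)
bf.insert("x1"); bf.insert("x2")
print("x1" in bf)    # True
print("y1" in bf)    # False, except for the occasional false positive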
6
Standard Bloom Filters: An Example
INITIAL STATE: B = [0 0 0 0 0 0]. Now we consider an example of how a Bloom filter works; from this example we can see how a false positive happens. Assume the filter has 6 bits, all initially set to 0, and that there are two hash functions.
7
Standard Bloom Filters: An Example
INSERTION: three bits of B are now 1. We insert the two elements of S into the filter: x1 is hashed by the two hash functions and the corresponding bits are set to 1, and then the same is done for x2 (the two elements share one position, so only three bits are set in total).
8
Standard Bloom Filters: An Example
CHECK: using the filter, we test two elements y1 and y2 for membership in S. y1 is hashed twice, and at least one of its corresponding bits is still 0, so we can conclude with certainty that y1 is not in S, since a Bloom filter has no false negatives. Then we check y2: both of its bits happen to have been set to 1, one by x1 and one by x2 as shown in the previous slide, so we falsely claim that y2 is in S. A false positive occurs.
9
Standard Bloom Filters: False Positive Rate (1)
Pr[a given bit in B is 0] = (1 − 1/m)^{kn} = p′ ≈ e^{−kn/m} = p. The probability of a false positive is f′ = (1 − p′)^k ≈ (1 − p)^k = f. Let ρ be the proportion of 0 bits after all elements are inserted in the Bloom filter. Conditioned on ρ, the probability of a false positive is (1 − ρ)^k. Now we calculate the false positive rate of the Bloom filter. Under the assumption that the hash functions are uniformly random, the probability that a given bit of B is still 0 after inserting the n elements of S is p′: each element is hashed k times, so kn bits are chosen at random and set to 1, and (1 − 1/m) is the probability that a particular bit is not chosen in one of these kn trials. We usually use the approximation p in place of the exact value p′ because it is very accurate and simplifies the math. The false positive rate is the probability that an element not in S is accepted by the filter, which happens if and only if all k of its corresponding bits are 1; this gives f′, approximated by f. Note that this false positive rate is an average: we choose the set S of size n at random from the universe and then consider the rate. We can also consider a conditional false positive rate. Let ρ be the proportion of 0 bits after all elements are inserted; if S is selected at random, the expected value of ρ is p′. Conditioned on ρ, an element not in S hits a 1 bit with probability 1 − ρ at each of its k positions, so the conditional false positive rate is (1 − ρ)^k. Why do we need the conditional rate? If ρ had a large variance, the false positive rate would depend heavily on the particular set S inserted into the filter, and the average rate alone would not give us much insight.
10
Standard Bloom Filters: False Positive Rate (2)
The fraction of 0 bits, ρ, is extremely concentrated around its expectation. Therefore, with high probability, (1 − ρ)^k ≈ (1 − p)^k = f. Standard concentration results show that ρ is close to its mean with high probability, so this approximation holds with high probability.
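As a quick numerical check, the approximation f ≈ (1 − e^{−kn/m})^k can be evaluated directly; the values of n, m, and k below are illustrative.

from math import exp

def false_positive_rate(n, m, k):
    """Approximate false positive rate f = (1 - e^(-kn/m))^k."""
    return (1.0 - exp(-k * n / m)) ** k

# 1,000 elements, 10,000 bits (10 bits per element), 5 hash functions
print(false_positive_rate(n=1000, m=10000, k=5))   # about 0.0094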
11
Standard Bloom Filters: Optimal Number of Hash Functions (1)
Two competing forces: more hash functions give more chances to find a 0 bit for an element that is not a member of S; fewer hash functions increase the fraction of 0 bits in the array. Now we consider the optimal number of hash functions. There are two competing forces. First, with more hash functions we have more chances to find a 0 bit for an element that is not in S, since the filter accepts an element only if all of its corresponding bits are set to 1. Second, fewer hash functions increase the fraction of 0 bits in the array, since we set fewer bits to 1 when inserting an element. So there should be an optimal value of k in between.
12
Standard Bloom Filters: Optimal Number of Hash Functions (2)
This is a numerical example: m is the length of the Bloom filter and n is the size of the set S. From the graph we can see that at some point the false positive rate is minimized; in this case the optimal k is about 5.5.
13
Standard Bloom Filters: Optimal Number of Hash Functions (3)
Note that f ≈ (1 − e^{−kn/m})^k. Let g = k ln(1 − e^{−kn/m}), so that f = e^g, and solve dg/dk = 0. Rewrite g as g = −(m/n) ln(p) ln(1 − p), where p = e^{−kn/m}. Using symmetry, g is minimal when p = 1/2. Then k = (m/n) ln 2. We can find the optimal number of hash functions using calculus. Consider the false positive rate f; since the exponential function is increasing, minimizing g minimizes f, so we can work with g instead of f. Taking the derivative of g with respect to k and solving the resulting equation shows that k = (m/n) ln 2 is the optimal number of hash functions. We can also avoid the calculus by exploiting the symmetry of g: rewritten in terms of p, the probability that a specific bit is still 0, the expression is symmetric in p and 1 − p, so we can guess that g is minimized when p equals 1/2, which is indeed true. At that point k = ln 2 times m over n.
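A small illustrative computation of the optimal k and the resulting false positive rate; the values of m and n are assumptions chosen for the example.

from math import exp, log

def optimal_k(m, n):
    """Optimal number of hash functions k = (m/n) ln 2."""
    return (m / n) * log(2)

m, n = 10000, 1000
k = optimal_k(m, n)                  # about 6.93; rounded to an integer in practice
f = (1 - exp(-k * n / m)) ** k       # at the optimum this equals (1/2)^k, about 0.0082
print(round(k), f)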
14
Standard Bloom Filters: Space Efficiency
A lower bound: let ε be the allowed false positive rate; then any data structure that answers membership queries with no false negatives needs at least m ≥ n log2(1/ε) bits. The optimal case: the false positive rate of the optimal Bloom filter is f = (1/2)^k = (1/2)^{(m/n) ln 2}. Requiring f ≤ ε gives m ≥ n log2(1/ε) / ln 2 ≈ 1.44 n log2(1/ε). Now we consider how many bits a Bloom filter needs if we impose a threshold on the false positive rate. With a counting argument, viewing such a structure as a mapping from sets of size n to m-bit strings, we get the lower bound above. For the optimal Bloom filter, setting its false positive rate f to be at most ε and solving for m shows that the optimal Bloom filter is within a factor of 1.44 of the lower bound.
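Filling in the algebra that connects the optimal filter to the lower bound:

\[
(1/2)^{k} = \epsilon \;\Rightarrow\; k = \log_2(1/\epsilon), \qquad
k = \frac{m}{n}\ln 2 \;\Rightarrow\;
m = \frac{n k}{\ln 2} = \frac{n \log_2(1/\epsilon)}{\ln 2} \approx 1.44\, n \log_2(1/\epsilon),
\]

compared with the lower bound \( m \ge n \log_2(1/\epsilon) \) that holds for any such data structure.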
15
Standard Bloom Filters: Operations (1)
Union: build a Bloom filter representing the union of A and B by taking the OR of BF(A) and BF(B). Shrinking a Bloom filter: halve the size by taking the OR of the first and second halves of the filter; this increases the false positive rate. The intersection of two sets is harder. Let us look at operations on Bloom filters; some can be done easily. If we have two Bloom filters representing sets A and B, built with the same length and the same hash functions, then the Bloom filter representing their union is simply the bitwise OR of the two filters. If you want to reduce the size of a Bloom filter, you can take the OR of its first and second halves to get a half-sized filter, at the cost of an increased false positive rate. There seems to be no simple way to build a Bloom filter representing the intersection of two sets from their filters; in particular, taking the AND of the two filters does not produce the Bloom filter of the intersection set. However, we can still estimate the size of the intersection, as shown on the next slide.
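A minimal sketch of the two easy operations on plain bit arrays; the helper names are illustrative.

def bf_union(bits_a, bits_b):
    # Bitwise OR of two same-sized filters built with the same hash functions
    # represents the union of the two underlying sets.
    return [a | b for a, b in zip(bits_a, bits_b)]

def bf_halve(bits):
    # Fold the filter in half by OR-ing the first and second halves.
    # Lookups must then reduce hash values modulo m/2; the false positive rate grows.
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]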
16
Standard Bloom Filters: Operations (2)
The intersection of S1 and S2: the average number of 1 bits in the AND of BF(S1) and BF(S2) is m[1 − (1 − 1/m)^{k|S1|} − (1 − 1/m)^{k|S2|} + (1 − 1/m)^{k(|S1| + |S2| − |S1 ∩ S2|)}]. Let Z1 be the number of 0 bits in BF(S1), Z2 the number of 0 bits in BF(S2), and Z12 the number of 0 bits in the AND of BF(S1) and BF(S2). We take the AND of the two filters, count its 1 bits, set that count equal to the expected value above, and solve the equation for the size of the intersection; this gives an estimate of |S1 ∩ S2|.
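Writing out one way to solve for the intersection size, under the slide's definitions of Z1, Z2, and Z12 as the 0-bit counts of BF(S1), BF(S2), and their bitwise AND:

\[
\frac{Z_1}{m} \approx \Big(1-\tfrac{1}{m}\Big)^{k|S_1|},\quad
\frac{Z_2}{m} \approx \Big(1-\tfrac{1}{m}\Big)^{k|S_2|},\quad
\frac{Z_1 + Z_2 - Z_{12}}{m} \approx \Big(1-\tfrac{1}{m}\Big)^{k(|S_1|+|S_2|-|S_1\cap S_2|)},
\]

since a bit is 0 in the AND exactly when it is 0 in at least one of the filters. Dividing the third expression by the product of the first two and taking logarithms gives the estimate

\[
|S_1 \cap S_2| \approx
\frac{\ln\!\big( m (Z_1 + Z_2 - Z_{12}) / (Z_1 Z_2) \big)}{k \,\ln\!\big( m/(m-1) \big)} .
\]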
17
Counting Bloom Filters: Motivation
Standard Bloom filters: easy to insert elements, but cannot perform deletion. Counting Bloom filters: each entry is not a single bit but a small counter; to insert an element, increment the corresponding counters; to delete an element, decrement the corresponding counters. So far we have talked about standard Bloom filters. Insertion is easy: you just set the corresponding bits to 1. But if the set of elements changes over time, we may need to delete old elements, for example the contents of a web cache, where unpopular items are evicted and popular items are inserted. A standard Bloom filter cannot support deletion: you cannot simply reset a bit to 0, because other elements may also hash to that bit, and the filter would then be incorrect. To address this problem, the counting Bloom filter was introduced. We will discuss its applications later; for now, let us look at how it works.
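A minimal sketch of a counting Bloom filter, in the same illustrative style as the earlier standard filter; the class name and the salted-hash trick are assumptions, not the paper's code.

import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: m small counters, k (simulated) hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m                  # counters instead of single bits

    def _positions(self, item):
        for j in range(self.k):
            digest = hashlib.sha1(f"{j}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1              # increment instead of set-to-1

    def delete(self, item):
        # Only meaningful for items that were actually inserted.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))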
18
Counting Bloom Filters: An Example
INITIAL STATE: B = [0 0 0 0 0 0]. Now we consider an example of how a counting Bloom filter works. Assume the filter has 6 counters, all initially 0, and that there are two hash functions.
19
Counting Bloom Filters: An Example
INSERTION: we insert the two elements of S into the counting filter. For each element, the counters at its two hash positions are incremented; a position hit by both x1 and x2 now holds 2, while the other positions that were hit hold 1.
20
Counting Bloom Filters: An Example
DELETION: to delete an element, we decrement the counters at its hash positions. A counter shared with a remaining element drops from 2 to 1 rather than to 0, so the membership information for the elements still in the set is preserved.
21
Counting Bloom Filters: How Large Counters Do We Need? (1)
There are n elements, k hash functions, and m counters; c(i) is the count associated with the i-th counter. The tail probability is bounded by Pr[c(i) ≥ j] ≤ C(nk, j)(1/m)^j ≤ (enk/(jm))^j. Then use the union bound again, over all m counters: Pr[max_i c(i) ≥ j] ≤ m(enk/(jm))^j ≤ m(e ln 2 / j)^j when k ≤ (m/n) ln 2. One may ask how large the counters need to be. If a counter is too small, it will overflow after some insertions; if it is too large, it wastes space. The value of a counter can be viewed as a binomial random variable with nk trials and success probability 1/m, which gives the tail bound above.
22
Counting Bloom Filters: How Large Counters Do We Need? (2)
4 bits per counter is enough. The maximum counter value is O(log m) with high probability, and hence O(log log m) bits per counter are sufficient. Let j = 3 ln m / ln ln m; plugging this into the bound from the previous slide shows that the probability that any counter reaches j is polynomially small in m. If each counter has 4 bits, its maximum value is 15, so the counting Bloom filter overflows only when some counter exceeds 15; as the calculation below shows, this is extraordinarily unlikely.
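As a worked instance of the bound for 4-bit counters with the optimal number of hash functions (k ≤ (m/n) ln 2):

\[
\Pr\Big[\max_i c(i) \ge 16\Big] \;\le\; m \left(\frac{e \ln 2}{16}\right)^{16} \;\approx\; 1.37 \times 10^{-15} \times m ,
\]

which is negligible for any realistic filter size m.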
23
Historical Applications
Dictionaries: hyphenation programs, UNIX spell-checkers, dictionaries of unsuitable passwords. Databases: semi-join operations, differential files. Now let us look at early applications of Bloom filters. In the early days memory was a scarce resource in computers, and Bloom filters are succinct representations of sets of items, so programs at that time used them to reduce memory usage. In the dictionary applications, a Bloom filter represents a dictionary of words; for example, a spell-checker stores all correct words in a Bloom filter. Memory is saved, but the implication of false positives is that the spell-checker is not 100% accurate: you may misspell a word and still have it accepted by the checker. Bloom filters were also popular in database applications, where they were used to compute set intersections and set differences; some peer-to-peer applications use the same idea, and we will elaborate on them later.
24
Distributed Caching: Scenario
Let us look at more recent applications, starting with distributed caching. This work is from the University of Wisconsin-Madison and appeared in SIGCOMM. The paper considers the following scenario: when a client sends a URL request, a web proxy checks whether its cache holds the desired web page; if so, the proxy returns the page to the client instead of forwarding the request to the Web. Using web caches reduces latency and reduces the load on web servers.
25
Distributed Caching: Summary Cache
Motivation: sharing caches among Web proxies reduces Web traffic and alleviates network bottlenecks, but directly sharing lists of URLs has too much overhead. Solution: use Bloom filters to reduce network traffic; use a counting Bloom filter to track the local cache contents, and broadcast the corresponding standard Bloom filter to other proxies. We can further improve performance by sharing caches among web proxies, which further reduces web traffic and alleviates network bottlenecks, since one proxy can check whether other proxies have the desired page rather than sending the request directly to the Web. However, simply exchanging lists of URLs would cause a large communication overhead. The solution proposed in the paper is to use Bloom filters to reduce this traffic. Because cache contents change over time, a standard Bloom filter is not suitable, as it cannot perform deletion; the paper therefore uses a counting Bloom filter to track the contents of the local cache, and broadcasts the corresponding standard Bloom filter to the other proxies, since they only need to know whether a particular item is in this cache.
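A minimal sketch of the last step, assuming the CountingBloomFilter class sketched earlier: the broadcast summary only needs membership information, so each counter collapses to a single bit.

def summary_bits(counting_filter):
    # Standard Bloom filter to broadcast: bit i is 1 iff counter i is non-zero.
    return [1 if c > 0 else 0 for c in counting_filter.counters]

Remote proxies then test a requested URL against the received bit array before asking this proxy for the page.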
26
P2P/Overlay Networks: Content Delivery
Problem: peer A has a set of items SA, peer B has SB, and B wants the useful items from A (SA − SB). Solution: B sends A its Bloom filter BF(B); A sends B the items that are not in SB according to BF(B). Implications of false positives: not all elements of SA − SB will be sent, which is acceptable when the items are redundant (e.g. with erasure coding) and a large fraction of SA − SB is sufficient rather than the entire set. Now we look at some applications in peer-to-peer networks; in content delivery, the problem can be formulated as above.
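A small sketch of A's side of this exchange, reusing the BloomFilter class sketched earlier; function and variable names are illustrative.

def items_to_send(items_a, bf_b):
    # Keep only the items that B's filter does not claim to have.
    # A false positive means a useful item is occasionally skipped, which is
    # tolerable when items are redundant (e.g. erasure-coded blocks).
    return [x for x in items_a if x not in bf_b]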
27
P2P/Overlay Networks: Efficient P2P Keyword Searching (1)
Problem: peer A has a set of items SA, peer B has SB, and A wants to determine SA ∩ SB. Solution: A sends B its Bloom filter BF(A); B sends A its items that appear to be in SA according to BF(A); A eliminates the false positives and determines SA ∩ SB exactly. Far fewer bits are transmitted than if A sent the entire set SA. Let us look at another application, efficient peer-to-peer keyword searching. In this application documents have several keywords, and each peer node is responsible for one keyword and stores the IDs of the documents containing that keyword. A client searches with, for example, two keywords; to find the documents matching both keywords, we need to compute a set intersection.
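A minimal sketch of the two remaining steps; the names are illustrative, with items_a and items_b standing for the document-ID lists SA and SB.

def candidates_from_B(items_b, bf_a):
    # Step 2 at server B: return the IDs that appear to be in SA according to BF(A).
    return [doc for doc in items_b if doc in bf_a]

def intersection_at_A(items_a, candidates):
    # Step 3 at server A: remove false positives by checking against the real set SA.
    sa = set(items_a)
    return [doc for doc in candidates if doc in sa]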
28
P2P/Overlay Networks: Efficient P2P Keyword Searching (2)
Figure: Server A holds SA = {1, 2, 3, 4} and Server B holds SB = {3, 4, 5, 6}. (1) The client sends its request; (2) A sends BF(A) to B; B replies with the candidate set {3, 4, 6}, where 6 is a false positive of BF(A); A eliminates it and returns the true intersection {3, 4} to the client. Here A and B are the servers storing the document-ID lists for the keywords kA and kB, and BF(A) is the Bloom filter representation of SA. Note the false positive in the set SB ∧ BF(A) that server B sends back to server A, which A eliminates in order to send SA ∩ SB to the client. Bloom filters thus reduce the bandwidth requirement of "AND" queries.
29
Resource Routing (1) Network is in the form of a rooted tree
Nodes hold resources. Each node keeps Bloom filters representing (a) a unified list of the resources that it holds or that are reachable through one of its children, and (b) individual lists of resources for itself and for each child. When receiving a request for a resource: check the unified list to see whether the node or its descendants hold the resource; if yes, check the individual lists to decide whether to serve it locally or which child to forward to; if no, forward the request up the tree toward the root. A sketch of this per-node decision follows below.
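A sketch of the per-node routing decision; the node fields (unified, local, per_child, parent) are assumptions made for illustration, not notation from the survey.

def route(node, resource):
    # node.unified:   Bloom filter over resources held here or anywhere in the subtree
    # node.local:     Bloom filter over resources held at this node itself
    # node.per_child: {child: Bloom filter over that child's subtree}
    if resource in node.unified:
        if resource in node.local:
            return ("serve locally", node)
        for child, bf in node.per_child.items():
            if resource in bf:
                return ("forward down", child)   # descend toward that subtree
        # all matches were false positives; fall through and go up
    return ("forward up", node.parent)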
30
Resource Routing (2) Let us see the animation: the red node asks for a resource held by the blue node. It first forwards the request to its parent.
31
Conclusion: a Bloom filter is a simple, space-efficient representation of a set or a list that can handle membership queries, with applications in numerous networking problems. The Bloom filter principle: whenever a list or set is used and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.
32
THANK YOU!