1
Lecture 1: Bloom Filters
Chen Qian Department of Computer Engineering
2
Outline
- Pre-Knowledge
- Bloom filters
- Counting Bloom filters
- Applications
3
Hash Functions
A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.
[Figure: a hash function maps names (keys such as John Smith, Lisa Smith, Sam Doe, Sandra Dee) to integers from 00 to 15; two keys that map to the same integer form a collision.]
4
Hash Functions
Requirements for hash functions
- Can be applied to a key $x$ of any size
- Produces a fixed-length output $d$
- Easy to compute: $d = h(x)$
- One-way property: given $d$, it is infeasible to find $x$ such that $h(x) = d$
- Weak collision resistance: given $x$, it is infeasible to find $y \ne x$ such that $h(y) = h(x)$
- Strong collision resistance: it is infeasible to find any pair $x \ne y$ such that $h(x) = h(y)$
- Uniformity: every hash value in the output range should be generated with roughly the same probability
The properties marked in red on the slide are required for cryptographic hashes.
5
Examples of Hash functions
- mod
- CRC-16, CRC-32
- Jenkins, CityHash, FarmHash, MetroHash, etc.
- MD5, SHA-1, MD6, SHA-2, SHA-256
CRCs are based on the theory of cyclic error-correcting codes.
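As a rough illustration of the examples above, the sketch below maps the four names from the earlier figure to integers from 0 to 15 using three of the listed families. The helper names (`mod_hash`, `crc32_hash`, `md5_hash`) and the byte-sum form of the mod hash are assumptions made for this example, not part of the lecture.

```python
import hashlib
import zlib

NUM_BUCKETS = 16  # map every key to an integer from 0 to 15


def mod_hash(key: str) -> int:
    """Toy hash (assumed form): sum of the key's bytes modulo the table size."""
    return sum(key.encode("utf-8")) % NUM_BUCKETS


def crc32_hash(key: str) -> int:
    """CRC-32 checksum reduced to the bucket range."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def md5_hash(key: str) -> int:
    """MD5 digest (always 16 bytes) reduced to the bucket range."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_BUCKETS


names = ["John Smith", "Lisa Smith", "Sam Doe", "Sandra Dee"]
table = {n: (mod_hash(n), crc32_hash(n), md5_hash(n)) for n in names}
for name, (a, b, c) in table.items():
    print(f"{name:12s} mod={a:2d} crc32={b:2d} md5={c:2d}")
```

Whatever the input length, each function returns a value in the same fixed range; two names landing on the same integer is a collision.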
7
Bloom Filters
Membership query
- A Bloom filter is a simple, space-efficient randomized data structure
- It represents a set in order to support membership queries
- Burton Bloom introduced Bloom filters in the 1970s
[Figure: a set S containing elements A, B, C, E, Z, ⋯ and the query "Is element P in the set S?"]
8
Bloom Filters
Standard Bloom filters:
- Represent a set $S = \{x_1, x_2, \ldots, x_n\}$ by an array of m bits
- Use k independent hash functions $h_1, h_2, \ldots, h_k$
- Assumption: the hash functions map each item to a random number uniformly over the range $\{1, 2, \ldots, m\}$
How do we construct a Bloom filter?
9
Bloom Filters
An m-bit array, initially all set to 0. Assume k is 3.
[Figure: insert $x_1$, $x_2$. First, $x_1$ is hashed to the three positions $h_1(x_1)$, $h_2(x_1)$, $h_3(x_1)$.]
10
Bloom Filters
[Figure: insert $x_1$, $x_2$ (k is 3). The bits at positions $h_1(x_1)$, $h_2(x_1)$, $h_3(x_1)$ are set to 1.]
11
Bloom Filters
[Figure: insert $x_1$, $x_2$ (k is 3). Next, $x_2$ is hashed to the three positions $h_1(x_2)$, $h_2(x_2)$, $h_3(x_2)$.]
12
Bloom Filters
[Figure: insert $x_1$, $x_2$ (k is 3). The bits at positions $h_1(x_2)$, $h_2(x_2)$, $h_3(x_2)$ are set to 1; a bit that is already 1 stays 1.]
13
Bloom Filters
[Figure: after inserting $x_1$ and $x_2$ (k is 3), query $x_2$ and $y_1$, where $y_1$ is an alien element that was never inserted. All bits at the positions of $x_2$ are 1, so $x_2$ is a member. At least one bit at the positions of $y_1$ is 0, so $y_1$ is not a member.]
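The insert and query steps walked through on the preceding slides can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name `BloomFilter`, the salted SHA-256 construction used to simulate k hash functions, and the 0-based positions (the slides use the range 1 to m) are all assumptions.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all set to 0

    def _positions(self, item: str):
        # Assumption: h_i(x) is simulated by hashing "i:x" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # a bit that is already 1 stays 1

    def query(self, item: str) -> bool:
        # True means "possibly a member"; False means "definitely not".
        return all(self.bits[pos] == 1 for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.insert("x1")
bf.insert("x2")
print(bf.query("x2"))  # inserted items are always reported as members
print(bf.query("y1"))  # usually False, but a false positive is possible
```

A query can never miss an inserted item, because insertion sets exactly the bits the query later checks.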
14
Bloom Filters
False positive
False negative
15
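Bloom Filters
- Have false positives
- No false negatives
- There is a clear tradeoff between m and the probability of a false positive
[Figure: y is an alien element, but the bits at all of its positions were already set to 1 by $x_1$ and $x_2$, so the query reports that y is a member.]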
16
Bloom Filters
n keys, an m-bit array, k hash functions
After inserting n keys, let p denote the probability that a particular bit is still 0:
$$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$
Let f denote the probability of a false positive:
$$f = (1 - p)^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$$
When $k = \ln 2 \cdot \frac{m}{n}$, f achieves its optimal (minimum) value.
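To see the tradeoff numerically, the following sketch evaluates the two formulas above for a few values of k. The function names and the example sizes (n = 1000 keys, m = 8000 bits) are hypothetical choices for illustration.

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """f = (1 - p)^k with p = (1 - 1/m)^(k*n)."""
    p = (1.0 - 1.0 / m) ** (k * n)  # probability that a given bit is still 0
    return (1.0 - p) ** k


def optimal_k(n: int, m: int) -> float:
    """Real-valued k that minimizes f: k = ln(2) * m / n."""
    return math.log(2) * m / n


n, m = 1000, 8000  # hypothetical: 8 bits per key
print(f"optimal k is about {optimal_k(n, m):.2f}")
for k in (2, 3, 5, 6, 8, 12):
    print(f"k={k:2d}  f={false_positive_rate(n, m, k):.4f}")
```

Because k must be an integer in practice, one would typically compare the integers on either side of the real-valued optimum.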
17
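Bloom Filters
No deletion
[Figure: $x_1$ and $x_2$ have been inserted and share a bit that is set to 1; deleting $x_1$ would reset the bits of $x_1$ to 0.]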
18
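Bloom Filters
No deletion
[Figure: after the bits of $x_1$ are reset to 0 to delete $x_1$, the bit shared with $x_2$ is also 0, so a query wrongly reports that $x_2$ is not a member.]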
19
Bloom Filters
Features
- Low storage requirement
- Fast membership checking
- No false negatives
- Low false positive probability
- No deletion is allowed
21
Counting Bloom Filters
- Allow deletion
- Each entry in the counting Bloom filter is a small counter, with l bits per counter
[Figure: insert $x_1$, $x_2$. The counters at the positions of $x_1$ are incremented to 1.]
22
Counting Bloom Filters
- Each entry in the counting Bloom filter is a small counter, with l bits per counter
[Figure: insert $x_1$, $x_2$. After $x_1$ has been inserted, $x_2$ is hashed to its counter positions.]
23
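Counting Bloom Filters
- Each entry in the counting Bloom filter is a small counter, with l bits per counter
[Figure: insert $x_1$, $x_2$. The counters at the positions of $x_2$ are incremented; a counter shared by $x_1$ and $x_2$ now holds 2.]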
24
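Counting Bloom Filters
- Each entry in the counting Bloom filter is a small counter, with l bits per counter
[Figure: after inserting $x_1$ and $x_2$, delete $x_1$. The counters at the positions of $x_1$ are decremented; the counter shared with $x_2$ drops from 2 to 1, so $x_2$ is still a member.]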
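The insert and delete steps shown on these slides can be sketched as below. The class name, the salted SHA-256 hash family, and the saturation policy (a counter that reaches its maximum is never incremented or decremented again) are assumptions chosen for this illustration; the slides only state that each entry is an l-bit counter.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose entries are small counters, so deletion is possible."""

    def __init__(self, m: int, k: int, bits_per_counter: int = 4) -> None:
        self.m = m
        self.k = k
        self.max_count = (1 << bits_per_counter) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: h_i(x) is simulated by hashing "i:x" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:  # saturate, never wrap around
                self.counters[pos] += 1

    def delete(self, item: str) -> None:
        # Assumption: callers only delete items that were inserted earlier.
        for pos in self._positions(item):
            # A saturated counter has lost the true count, so leave it alone.
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1

    def query(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("x1")
cbf.insert("x2")
cbf.delete("x1")
print(cbf.query("x2"))  # x2 is still reported as a member after x1 is deleted
```

Unlike the bit-array version, deleting one item only lowers shared counters instead of clearing them, so other items stay visible.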
25
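Counting Bloom Filters
How many bits does each counter need? Let $c(i)$ be the count associated with the i-th counter.
$$P(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk-j}$$
$$P(c(i) \ge j) \le \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \le \left(\frac{enk}{jm}\right)^{j}$$
The probability that any counter is at least $j$ is bounded by $m\,P(c(i) \ge j)$ (a loose bound):
$$P(\max_i c(i) \ge j) \le m\,P(c(i) \ge j)$$
$$P(\max_i c(i) \ge j) \le m \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \le m \left(\frac{enk}{jm}\right)^{j}$$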
26
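Counting Bloom Filters
$$P(\max_i c(i) \ge j) \le m \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \le m \left(\frac{enk}{jm}\right)^{j}$$
Let $k \le \ln 2 \cdot \frac{m}{n}$. Then
$$P(\max_i c(i) \ge j) \le m \left(\frac{e \ln 2}{j}\right)^{j}$$
$$P(\max_i c(i) \ge 16) \le 1.37 \times 10^{-15} \cdot m$$
If we allow 4 bits per counter, the probability of overflow for practical values of m during the initial insertion in the CBF is minuscule.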
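The constant in the last bound can be checked with a few lines of arithmetic. The function name and the example filter size (m = 2^20 counters) are hypothetical.

```python
import math


def per_counter_bound(j: int) -> float:
    """(e * ln 2 / j)^j, the bound on P(c(i) >= j) when k <= ln(2) * m / n."""
    return (math.e * math.log(2) / j) ** j


bound_16 = per_counter_bound(16)  # a 4-bit counter overflows at 16
print(f"per-counter bound for j=16: {bound_16:.3e}")

m = 2**20  # hypothetical filter with about one million counters
print(f"union bound over m counters: {m * bound_16:.3e}")
```

Multiplying by m is the union bound from the slide, so even a filter with many counters keeps the overflow probability tiny.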
28
Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
- Background and Problems
- Internet cache protocol (ICP)
- Summary cache
- Evaluation
29
Background and Problems
With the growth of the World Wide Web, caching has been recognized as one of the most important techniques for reducing bandwidth consumption.
[Figure: users, proxy caches, a bottleneck, and the rest of the Internet.]
30
Background and Problems
- Proxy caches should cooperate and serve each other's misses.
- If an object misses in the current proxy cache, it can be fetched from another proxy cache.
- How do we know which proxy caches hold a requested object that misses in the current proxy cache?
- This process is called "Web Cache Sharing".
- The Internet Cache Protocol (ICP) is one of the protocols for web cache sharing.
32
Internet Cache Protocol (ICP)
ICP supports discovery and retrieval of documents from neighboring caches. Proxy caches connect hierarchically.
[Figure: a client sends a web page request (step 1) to its proxy cache. In step 2, on a hit the proxy returns the result; on a miss it sends ICP queries to its parent and sibling proxy caches. The parent sits between the proxy caches and the web server.]
33
Internet Cache Protocol (ICP)
[Figure, case 2.1: after the web page request (step 1), the parent and sibling proxy caches all reply with an ICP miss.]
34
Internet Cache Protocol (ICP)
[Figure, step 3: the web page request is forwarded toward the web server, and the result is returned to the client.]
35
Internet Cache Protocol (ICP)
[Figure, case 2.2: after the web page request (step 1), two neighboring caches reply with an ICP hit and one replies with an ICP miss.]
36
Internet Cache Protocol (ICP)
[Figure, step 3: the web page request is sent to a neighboring proxy cache that reported a hit, and the result is returned to the client.]
37
Overhead of ICP
ICP is not a scalable protocol
- ICP relies on query messages to find cache hits in other proxies
- Each proxy cache needs to handle $(N-1)(1-H)R$ inquiries from neighboring caches
  - N: number of proxy caches
  - H: hit rate
  - R: average number of requests
- As N increases, the overhead quickly becomes prohibitive
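A quick calculation shows how the inquiry load from the formula above grows with the number of proxies. The function name and the sample values (a 25% hit rate and 100 requests per proxy) are hypothetical inputs chosen for illustration.

```python
def icp_inquiries_per_proxy(num_proxies: int, hit_rate: float, requests: float) -> float:
    """(N - 1) * (1 - H) * R: inquiries each proxy must handle from its neighbors."""
    return (num_proxies - 1) * (1.0 - hit_rate) * requests


H = 0.25   # hypothetical hit rate
R = 100.0  # hypothetical average number of requests per proxy
for N in (4, 16, 64):
    print(f"N={N:3d}  inquiries per proxy = {icp_inquiries_per_proxy(N, H, R):.0f}")
```

The load on every proxy grows linearly with N, so the total query traffic across all proxies grows roughly quadratically.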
38
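Overhead of ICP
4 proxy caches

Exp1 (hit ratio 25%)
| Setting | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.75 (5%) | 94.42 (5%) | … (6%) | 615 (28%) | 334K (8%) | 355K (7%) |
| ICP | 3.07 (0.7%), 12% | … (5%), 24% | … (5%), 10% | 54774 (0%), 9000% | 328K (4%), -2% | 402K (3%), 13% |

Exp2 (hit ratio 45%)
| Setting | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.21 (1%) | 80.83 (2%) | … (2%) | 540 (3%) | 272K (3%) | 290K (3%) |
| ICP | 2.39 (1%), 8% | 97.36 (1%), 20% | … (1%), 7% | 39968 (0%), 7300% | 257K (2%), -1% | 314K (1%) |

In the ICP rows, the percentage after the comma is the overhead relative to no ICP.
ICP incurs considerable overhead even when the number of proxies is as low as four.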
39
Benefits of Web Cache Sharing
Use simulation results to illustrate. Compare four different cooperation schemes:
- No cache sharing: proxies do not collaborate
- Simple cache sharing (ICP-style): proxies do not coordinate cache replacement (each runs LRU)
[Figure: a client requests D from Proxy Cache 1 (contents A B C …), which misses. Proxy Cache 1 requests D from Proxy Cache 2 (contents D E B …), which returns D. Proxy Cache 1 caches D locally (contents A B C D …) and returns D to the client.]
40
Benefits of Web Cache Sharing
Compare four different cooperation schemes:
- Single-copy cache sharing: the proxy that holds D marks D as most-recently-accessed and increases its caching priority
[Figure: a client requests D from Proxy Cache 1 (contents A B C …), which misses. Proxy Cache 1 requests D from Proxy Cache 2 (contents D E B …), which returns D. Proxy Cache 1 returns D to the client without storing a copy.]
Compared to simple cache sharing, this scheme eliminates the storage of duplicate copies and increases the utilization of available cache space.
41
Benefits of Web Cache Sharing
Compare four different cooperation schemes:
- Global cache: proxies share cache contents and coordinate replacement
[Figure: a client requests D, and D is returned from the shared contents of Proxy Cache 1 and Proxy Cache 2. Under coordinated replacement, C is replaced by F in the cache.]
42
Benefits of Web Cache Sharing
Compare four different cooperation schemes:
- Global cache: proxies share cache contents and coordinate replacement
[Figure: after the replacement, the shared contents of Proxy Cache 1 and Proxy Cache 2 are A B F D E …; a client requests D and D is returned.]
To the users, the global cache can be regarded as one unified cache with global LRU replacement.
43
Benefits of Web Cache Sharing
- Use simulation results to illustrate
- Compare four different cooperation schemes
- Collect five sets of traces of HTTP requests: DEC, UCB, Upisa, Questnet, NLANR
44
Benefits of Web Cache Sharing
Simulation result
The hit ratios under single-copy cache sharing and simple cache sharing are generally the same as, or even higher than, the hit ratio under global cache.
45
Benefits of Web Cache Sharing
Simulation result
- All cache sharing schemes significantly improve the hit ratio over no cache sharing
- The hit ratios under single-copy cache sharing and simple cache sharing are generally the same
  - The memory wasted on duplicate copies has only a small effect
  - A smaller effective cache does not make a significant difference in the hit ratio
46
Benefits of Web Cache Sharing
Simulation result
- The hit ratios under single-copy cache sharing and simple cache sharing are generally the same as, or even higher than, the hit ratio under global cache
  - Global LRU sometimes performs less well than group-wise LRU
  - In the global cache setting, a burst of rapid successive requests from one user might disturb the working sets of many other users
Conclusion
- ICP-style simple cache sharing obtains most of the benefits of more elaborate cooperative caching.
47
Dilemma
- Benefits of ICP
- High overhead of ICP
Summary Cache
49
Summary Cache
Basic idea
- Each proxy stores a summary of the directory of cached documents of every other proxy
- When a user request misses in the local cache, the proxy uses the summaries to find proxies that store the requested document
Two tolerable errors
- False miss: the summary does not reflect that the document is cached at some other proxy
- False hit: the summary wrongly indicates that the document is cached at some other proxy
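The lookup and its two tolerable errors can be illustrated with a small sketch. Here plain Python sets stand in for the summaries, and the proxy names, URLs, and the function `candidate_peers` are hypothetical; the summaries are deliberately out of date so that both kinds of error show up.

```python
def candidate_peers(url: str, summaries: dict) -> list:
    """On a local miss, return the peers whose summary claims to hold url."""
    return [peer for peer, summary in summaries.items() if url in summary]


# What two hypothetical peers actually cache right now.
actual = {"proxy2": {"/a", "/b"}, "proxy3": {"/c"}}
# The local, stale copies of their summaries (sets stand in for the real summaries).
summaries = {"proxy2": {"/a", "/d"}, "proxy3": set()}

for url in ("/a", "/d", "/c"):
    claimed = candidate_peers(url, summaries)
    holders = [peer for peer, docs in actual.items() if url in docs]
    print(url, "query:", claimed, "actually cached at:", holders)
```

In this example `/a` is a correct hit, `/d` is a false hit (a wasted query to proxy2), and `/c` is a false miss (proxy3 holds it but is never asked). Neither error returns a wrong document; each only costs some efficiency.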
50
Summary Cache
Challenges of scalability
- Network overhead: interproxy traffic
- Memory to store summaries: summaries need to be stored in DRAM for performance reasons
Naive summary representations
- Exact-directory approach: the summary is the cache directory, with each URL represented by its 16-byte MD5 signature
- Problem: it consumes too much memory. For example, with 16 proxies of 8 GB each and an average file size of 8 KB, the exact-directory approach consumes $(16-1) \times 16 \times \frac{8\,\text{GB}}{8\,\text{KB}}$ bytes $= 240$ MB at each proxy
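The 240 MB figure can be reproduced step by step. The variable names are illustrative; the second half assumes a budget of 8 bits per cached document, as in the Bloom_filter_8 configuration evaluated later in the lecture, and is included only for comparison.

```python
KB, MB, GB = 2**10, 2**20, 2**30

num_proxies = 16
cache_size = 8 * GB      # per proxy
avg_file_size = 8 * KB
signature_bytes = 16     # one MD5 signature per URL

docs_per_proxy = cache_size // avg_file_size  # about one million documents
exact_dir_bytes = (num_proxies - 1) * signature_bytes * docs_per_proxy
print(exact_dir_bytes // MB, "MB for exact directories of the other 15 proxies")

# Assumption for comparison: a bit array sized at 8 bits per cached document.
bits_per_doc = 8
bit_array_bytes = (num_proxies - 1) * docs_per_proxy * bits_per_doc // 8
print(bit_array_bytes // MB, "MB for 15 bit arrays at 8 bits per document")
```

Each proxy keeps one summary per neighbor, which is where the factor of 16 - 1 comes from.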
51
Summary Cache
Summary representation in summary cache
- Bloom filter: memory-efficient, with a low rate of false hits
Good news
- Summaries do not have to be up-to-date or accurate
- Summaries do not have to be updated every time the cache directory changes
- Updates can occur at regular time intervals
- An experiment illustrates this: delay the update of summaries until the percentage of cached documents that are new reaches a threshold
52
Bloom Filter as Summaries
Construction
- A proxy builds a Bloom filter from the list of URLs of cached documents
- A proxy sends the Bloom filter plus the corresponding hash functions to other proxies
Update
- Use a counting Bloom filter to update the local filter
- To update the Bloom filter held by other proxies, either specify which bits in the bit array are flipped or send the whole array
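One way to realize the "specify which bits are flipped" option is sketched below: the proxy keeps counters locally and records a bit flip whenever a counter moves between 0 and 1. The class `LocalSummary`, the function `apply_delta`, the salted SHA-256 hash family, and the example URLs are all assumptions for illustration, not the protocol's actual message format.

```python
import hashlib


class LocalSummary:
    """Counting Bloom filter a proxy keeps for its own cache directory."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m
        self.pending = {}  # position -> new bit value since the last update

    def _positions(self, url: str):
        # Assumption: h_i(url) is simulated by hashing "i:url" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def _record_flip(self, pos: int, bit: int) -> None:
        if pos in self.pending:
            del self.pending[pos]  # flipped back: nothing to publish
        else:
            self.pending[pos] = bit

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.counters[pos] += 1
            if self.counters[pos] == 1:  # bit goes from 0 to 1
                self._record_flip(pos, 1)

    def remove(self, url: str) -> None:
        # Assumption: only URLs currently in the cache are removed.
        for pos in self._positions(url):
            self.counters[pos] -= 1
            if self.counters[pos] == 0:  # bit goes from 1 to 0
                self._record_flip(pos, 0)

    def take_delta(self) -> dict:
        delta, self.pending = self.pending, {}
        return delta


def apply_delta(bits: list, delta: dict) -> None:
    """Run by a remote proxy on its copy of the sender's bit array."""
    for pos, bit in delta.items():
        bits[pos] = bit


local = LocalSummary(m=64, k=4)
remote_bits = [0] * 64
local.add("http://example.com/a")
local.add("http://example.com/b")
apply_delta(remote_bits, local.take_delta())
local.remove("http://example.com/a")
apply_delta(remote_bits, local.take_delta())
```

Only the local proxy needs counters; the neighbors store plain bits, so the remote copy stays as small as a standard Bloom filter.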
54
Evaluation
Three configurations (m denotes the size of the Bloom filter array)
- Bloom_filter_8/16/32: m is 8/16/32 times the average number of documents in the cache
Four hash functions
- First calculate the MD5 signature of the URL (128 bits)
- Divide the 128 bits into four 32-bit words
- Take each 32-bit word modulo m
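The three hashing steps above translate almost directly into code. This is a sketch under assumptions: the function name `url_hashes`, the big-endian interpretation of the four words, and the example URL and document count are not specified by the slides.

```python
import hashlib
import struct


def url_hashes(url: str, m: int) -> list:
    """Four bit positions for url, derived from its 128-bit MD5 signature."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes = 128 bits
    words = struct.unpack(">4I", digest)  # four 32-bit words (byte order assumed)
    return [word % m for word in words]


avg_docs = 1000   # hypothetical average number of cached documents
m = 8 * avg_docs  # Bloom_filter_8 configuration
print(url_hashes("http://example.com/index.html", m))
```

One MD5 computation yields all four hash values, which is cheaper than running four separate hash functions per URL.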
55
Evaluation
Memory consumption

| Approach | DEC | NLANR |
|---|---|---|
| Exact_dir | 2.8% | 0.70% |
| Server_name | 0.19% | 0.08% |
| Bloom_filter_8 | … | 0.038% |
| Bloom_filter_16 | 0.38% | 0.075% |
| Bloom_filter_32 | 0.75% | 0.15% |

Low memory consumption
56
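Evaluation

Exp1 (hit ratio 25%)
| Setting | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.75 (5%) | 94.42 (5%) | … (6%) | 615 (28%) | 334K (8%) | 355K (7%) |
| ICP | 3.07 (0.7%), 12% | … (5%), 24% | … (5%), 10% | 54774 (0%), 9000% | 328K (4%), -2% | 402K (3%), 13% |
| SC-ICP | 2.85 (1%), 4% | 95.07 (6%), 0.7% | … (6%) | 1079 (0%), 75% | 330K (5%), -1% | 351K (5%) |

Exp2 (hit ratio 45%)
| Setting | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.21 (1%) | 80.83 (2%) | … (2%) | 540 (3%) | 272K (3%) | 290K (3%) |
| ICP | 2.39 (1%), 8% | 97.36 (1%), 20% | … (1%), 7% | 39968 (0%), 7300% | 257K (2%), -1% | 314K (1%) |
| SC-ICP | 2.25 (1%), 2% | 82.03 (3%), 1% | … (3%) | 799 (5%), 48% | 269K (5%) | 287K (5%) |

In the ICP and SC-ICP rows, the percentage after the comma is the overhead relative to no ICP.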
57
Conclusion
Summary Cache enhanced ICP
- Reduces the number of interproxy protocol messages by a factor of 25 to 60
- Reduces bandwidth consumption by over 50%
- Causes almost no degradation in the cache hit ratios
- Reduces CPU overhead by between 30% and 95%
58
Thank You
Chen Qian
cqian12@ucsc.edu
https://users.soe.ucsc.edu/~qian/