Lecture 1: Bloom Filters Chen Qian Department of Computer Engineering qian@ucsc.edu https://users.soe.ucsc.edu/~qian/
Outline
- Pre-Knowledge
- Bloom filters
- Counting Bloom filters
- Applications
Hash Functions
- A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.
- Example: a hash function maps names (keys such as John Smith, Lisa Smith, Sam Doe, Sandra Dee) to integers from 00 to 15.
- A collision occurs when two different keys map to the same value.
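The name-to-integer mapping on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `bucket` and the choice of SHA-256 followed by a modulus are assumptions, not the function used in the slide's figure.

```python
import hashlib

def bucket(key: str, num_buckets: int = 16) -> int:
    """Map a string key of any length to an integer in [0, num_buckets)."""
    # Assumption: SHA-256 stands in for "some hash function" on the slide.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then reduce to the bucket range.
    return int.from_bytes(digest[:8], "big") % num_buckets

names = ["John Smith", "Lisa Smith", "Sam Doe", "Sandra Dee"]
table = {name: bucket(name) for name in names}
for name, value in table.items():
    print(f"{name} -> {value:02d}")
```

With only 16 possible outputs, two keys can land in the same bucket, which is the collision the slide points out.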
Hash Functions
Requirements for hash functions:
- Can be applied to a key $x$ of any size
- Produces a fixed-length output $d$
- Easy to compute: $d = h(x)$
- One-way property: given $d$, it is infeasible to find $x$ such that $h(x) = d$
- Weak collision resistance: given $x$, it is infeasible to find $y \ne x$ such that $h(y) = h(x)$
- Strong collision resistance: it is infeasible to find any pair $x \ne y$ such that $h(x) = h(y)$
- Uniformity: every hash value in the output range should be generated with roughly the same probability
The requirements marked in red on the original slide are required for cryptographic hashes.
Examples of Hash Functions
- mod
- CRC-16, CRC-32 (CRCs are based on the theory of cyclic error-correcting codes)
- Jenkins, CityHash, FarmHash, MetroHash, etc.
- MD5, SHA-1, MD6, SHA-2, SHA-256
Bloom Filters
Membership query: is element P in the set S?
- A Bloom filter is a simple, space-efficient randomized data structure.
- It represents a set in order to support membership queries.
- Burton Bloom introduced Bloom filters in the 1970s.
Bloom Filters
Standard Bloom filters:
- Represent a set $S = \{x_1, x_2, \dots, x_n\}$ by an array of $m$ bits
- Use $k$ independent hash functions $h_1, h_2, \dots, h_k$
- Assumption: the hash functions map each item to a random number uniformly over the range $\{1, 2, \dots, m\}$
How do we construct a Bloom filter?
Bloom Filters
Construction (assume $k = 3$):
- Start with an array of $m$ bits, all initially set to 0.
- Insert $x_1$: set the bits at positions $h_1(x_1)$, $h_2(x_1)$, $h_3(x_1)$ to 1.
- Insert $x_2$: set the bits at positions $h_1(x_2)$, $h_2(x_2)$, $h_3(x_2)$ to 1.
Bloom Filters
Query (assume $k = 3$, after inserting $x_1$ and $x_2$):
- Query $x_2$: the bits at all three hashed positions are 1, so $x_2$ is reported as a member.
- Query $y_1$ (an alien element, not in the set): at least one of its hashed positions holds 0, so $y_1$ is not a member.
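The insert and query steps can be sketched as a small Python class. This is a minimal illustration, not a tuned implementation: deriving the $k$ hash functions by salting SHA-256 with the index $i$ is an assumption of this sketch, and positions run over $0 \dots m-1$ rather than $1 \dots m$.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int = 3):
        self.m, self.k = m, k
        self.bits = [0] * m  # m-bit array, initially all 0

    def _positions(self, item: str):
        # Assumption: h_i(item) is SHA-256 of "i:item", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: str) -> bool:
        # Member only if every hashed position holds 1.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, k=3)
bf.insert("x1")
bf.insert("x2")
print(bf.query("x2"))  # inserted items are always reported as members
print(bf.query("y1"))  # usually False, but a false positive is possible
```

An inserted item can never be missed, because insertion sets exactly the bits that the query checks.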
Bloom Filters
Two kinds of error to consider: false positives and false negatives.
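Bloom Filters
- Bloom filters have false positives.
- Bloom filters have no false negatives.
- There is a clear tradeoff between $m$ and the probability of a false positive.
- Example: for an alien element $y$, if every position that $y$ hashes to was already set to 1 by $x_1$ and $x_2$, the filter wrongly reports that $y$ is a member.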
Bloom Filters
Setting: $n$ keys, an $m$-bit array, $k$ hash functions.
- After inserting $n$ keys, let $p$ denote the probability that a particular bit is still 0:
  $p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$
- Let $f$ denote the probability of a false positive:
  $f = (1 - p)^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$
- $f$ achieves its optimal (minimum) value when $k = \ln 2 \cdot \frac{m}{n}$.
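The two formulas are easy to evaluate numerically. The sketch below uses hypothetical values ($n = 1000$ keys, $m = 8000$ bits) chosen only for illustration; the function names are not from the slides.

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """f = (1 - (1 - 1/m)^(k*n))^k, as on the slide."""
    p_zero = (1 - 1 / m) ** (k * n)  # probability a given bit is still 0
    return (1 - p_zero) ** k

def optimal_k(n: int, m: int) -> float:
    """k = ln(2) * m / n, the real-valued k that minimizes f."""
    return math.log(2) * m / n

n, m = 1000, 8000  # hypothetical sizes: 8 bits per key
print(f"optimal k = {optimal_k(n, m):.3f}")
for k in (2, 4, 6, 8, 12):
    print(f"k = {k:2d}  f = {false_positive_rate(n, m, k):.4f}")
```

Since $k$ must be an integer in practice, one would round the optimal value and compare the neighboring choices.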
Bloom Filters
No deletion:
- Suppose $x_1$ and $x_2$ share a bit position, and we delete $x_1$ by resetting its bits to 0.
- The shared bit is now 0, so a query for $x_2$ wrongly reports that $x_2$ is not a member.
Bloom Filters
Features:
- Low storage requirement
- Fast membership checking
- No false negatives
- Low false positive probability
- No deletion is allowed
Counting Bloom Filters
- Allow deletion.
- Each entry in a counting Bloom filter is a small counter ($l$ bits per counter) instead of a single bit.
- Insert $x_1$, $x_2$: increment the counter at each hashed position. A position shared by $x_1$ and $x_2$ holds 2; the others hold 1.
- Delete $x_1$: decrement the counter at each of its hashed positions. The shared counter drops from 2 to 1, so $x_2$ is still reported as a member.
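A counting Bloom filter can be sketched by replacing each bit with a small counter. The salted SHA-256 hash family and the saturating behavior at the counter maximum are assumptions of this sketch, not details given on the slides.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int, k: int = 3, counter_bits: int = 4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: h_i(item) is SHA-256 of "i:item", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            # Assumption: saturate at the maximum instead of wrapping around.
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def delete(self, item: str) -> None:
        # Only meaningful for items that were inserted earlier.
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def query(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("x1")
cbf.insert("x2")
cbf.delete("x1")
print(cbf.query("x2"))  # x2 survives the deletion of x1
```

Deleting an item that was never inserted would corrupt the counters, so callers must track what they have added.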
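Counting Bloom Filters
How many bits does each counter need? Let $c(i)$ be the count associated with the $i$-th counter.
- $P(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk-j}$
- $P(c(i) \ge j) \le \binom{nk}{j} \frac{1}{m^{j}} \le \left(\frac{enk}{jm}\right)^{j}$
- The probability that any counter is at least $j$ is bounded by $m \, P(c(i) \ge j)$ (a loose bound):
  $P(\max_i c(i) \ge j) \le m \, P(c(i) \ge j) \le m \binom{nk}{j} \frac{1}{m^{j}} \le m \left(\frac{enk}{jm}\right)^{j}$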
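Counting Bloom Filters
- $P(\max_i c(i) \ge j) \le m \binom{nk}{j} \frac{1}{m^{j}} \le m \left(\frac{enk}{jm}\right)^{j}$
- Let $k \le \ln 2 \cdot \frac{m}{n}$. Then $P(\max_i c(i) \ge j) \le m \left(\frac{e \ln 2}{j}\right)^{j}$
- For $j = 16$: $P(\max_i c(i) \ge 16) \le 1.37 \times 10^{-15} \times m$
- If we allow 4 bits per counter, the probability of overflow for practical values of $m$ during the initial insertion in the CBF is minuscule.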
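The constant in the $j = 16$ bound can be checked directly. In the sketch below, the array size `m = 2**20` is a hypothetical value used only to show the scale of the resulting probability.

```python
import math

def overflow_factor(j: int) -> float:
    """Return (e * ln2 / j)^j, so that P(max counter >= j) <= m * this value."""
    return (math.e * math.log(2) / j) ** j

factor16 = overflow_factor(16)  # a 4-bit counter overflows at a count of 16
print(f"(e ln2 / 16)^16 = {factor16:.3e}")

m = 2**20  # hypothetical array size, about one million counters
print(f"bound on overflow probability for m = 2^20: {m * factor16:.3e}")
```

Even multiplied by a million counters, the bound stays many orders of magnitude below 1, which supports the choice of 4 bits per counter.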
Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
- Background and Problems
- Internet cache protocol (ICP)
- Summary cache
- Evaluation
Background and Problems
- Caching has been recognized as one of the most important techniques for reducing bandwidth consumption as the World Wide Web has developed.
- Figure: users connect through proxy caches; the path between the proxy caches and the rest of the Internet is the bottleneck.
Background and Problems
- Proxy caches should cooperate and serve each other's misses.
- If an object misses in the current proxy cache, we can get it from other proxy caches.
- How do we know which proxy caches hold a requested object that missed in the current proxy cache?
- This process is called "Web Cache Sharing".
- The Internet Cache Protocol (ICP) is one of the protocols for web cache sharing.
Overview
- Background and Problems
- Internet cache protocol (ICP)
- Summary cache
- Evaluation
Internet Cache Protocol (ICP)
- ICP supports discovery and retrieval of documents from neighboring caches.
- Proxy caches connect hierarchically, as parents and siblings.
- Step 1: the client sends a web page request to its proxy cache. On a local hit, the proxy returns the result.
- Step 2: on a local miss, the proxy sends an ICP query to its parent and sibling caches.
- Case 2.1: every neighbor answers with an ICP miss. Step 3: the web page request is forwarded toward the web server, and the result is returned to the client.
- Case 2.2: at least one neighbor answers with an ICP hit. Step 3: the web page request is sent to a neighbor that reported a hit, and the result is returned to the client.
Overhead of ICP
- ICP is not a scalable protocol.
- ICP relies on query messages to find cache hits in other proxies.
- Each proxy cache needs to handle $(N - 1) \cdot (1 - H) \cdot R$ inquiries from neighboring caches, where
  - $N$: number of proxy caches
  - $H$: hit rate
  - $R$: average number of requests per proxy cache
- As $N$ increases, the overhead quickly becomes prohibitive.
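To see how the inquiry load grows with the number of proxies, the formula can be evaluated for a few values of $N$. The hit rate and request count below are hypothetical numbers chosen for illustration, not measurements from the slides.

```python
def icp_inquiries(num_proxies: int, hit_rate: float, requests: float) -> float:
    """Inquiries each proxy must handle: (N - 1) * (1 - H) * R."""
    return (num_proxies - 1) * (1 - hit_rate) * requests

H = 0.25    # hypothetical local hit rate
R = 1000.0  # hypothetical average number of requests per proxy

for N in (4, 16, 64, 256):
    print(f"N = {N:3d}: {icp_inquiries(N, H, R):10.0f} inquiries per proxy")
```

The load on each proxy grows linearly in $N$, so the total query traffic across all proxies grows roughly quadratically.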
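Overhead of ICP
Setup: 4 proxy caches. Each Overhead row gives the change of the ICP row relative to the no ICP row.

Exp1 (hit ratio 25%)
| Scheme | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.75 (5%) | 94.42 (5%) | 133.65 (6%) | 615 (28%) | 334K (8%) | 355K (7%) |
| ICP | 3.07 (0.7%) | 116.87 (5%) | 146.50 (5%) | 54774 (0%) | 328K (4%) | 402K (3%) |
| Overhead | 12% | 24% | 10% | 9000% | -2% | 13% |

Exp2 (hit ratio 45%)
| Scheme | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.21 (1%) | 80.83 (2%) | 111.10 (2%) | 540 (3%) | 272K (3%) | 290K (3%) |
| ICP | 2.39 (1%) | 97.36 (1%) | 118.59 (1%) | 39968 (0%) | 257K (2%) | 314K (1%) |
| Overhead | 8% | 20% | 7% | 7300% | -1% | |

ICP incurs considerable overhead even when the number of proxies is as low as four.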
Benefits of Web Cache Sharing
Use simulation results to illustrate. Compare four different cooperation schemes:
- No cache sharing: proxies do not collaborate.
- Simple cache sharing (ICP-style): proxies serve each other's misses but do not coordinate cache replacement (each runs its own LRU).
  - Example: the client requests D from Proxy Cache 1 (holding A B C …), which misses. Proxy Cache 1 requests D from Proxy Cache 2 (holding D E B …), which returns D. Proxy Cache 1 returns D to the client and caches D locally (now A B C D …).
Benefits of Web Cache Sharing
Compare four different cooperation schemes:
- Single-copy cache sharing
  - Example: the client requests D from Proxy Cache 1 (holding A B C …), which misses. Proxy Cache 1 requests D from Proxy Cache 2 (holding D E B …), which returns D. Proxy Cache 1 returns D to the client without caching it locally.
  - Proxy Cache 2 marks D as most-recently-accessed and increases its caching priority.
  - Compared to simple cache sharing, this scheme eliminates the storage of duplicate copies and increases the utilization of available cache space.
Benefits of Web Cache Sharing
Compare four different cooperation schemes:
- Global cache
  - Proxy Cache 1 and Proxy Cache 2 share cache contents and coordinate replacement.
  - Example: the shared contents are A B C D E …; when F is brought in, C is replaced by F, giving A B F D E ….
  - We can regard the global cache as one unified cache with global LRU replacement, as seen by the users.
Benefits of Web Cache Sharing
- Use simulation results to illustrate.
- Compare four different cooperation schemes.
- Collect five sets of traces of HTTP requests: DEC, UCB, Upisa, Questnet, NLANR.
Benefits of Web Cache Sharing
Simulation result: the hit ratios under single-copy cache sharing and simple cache sharing are generally the same as, or even higher than, the hit ratio under global cache.
Benefits of Web Cache Sharing
Simulation results:
- All cache sharing schemes significantly improve the hit ratio over no cache sharing.
- The hit ratios under single-copy cache sharing and simple cache sharing are generally the same.
  - The memory wasted on duplicate copies under simple cache sharing has only a small effect.
  - A smaller effective cache does not make a significant difference in the hit ratio.
Benefits of Web Cache Sharing
Simulation results:
- The hit ratios under single-copy cache sharing and simple cache sharing are generally the same as, or even higher than, the hit ratio under global cache.
  - Global LRU sometimes performs less well than group-wise LRU.
  - In the global cache setting, a burst of rapid successive requests from one user might disturb the working sets of many other users.
Conclusion: ICP-style simple cache sharing obtains most of the benefits of more elaborate cooperative caching.
Dilemma
- The benefits of ICP versus the high overhead of ICP
- Summary Cache addresses this dilemma.
Summary Cache
Basic idea:
- Each proxy stores a summary of the directory of cached documents of every other proxy.
- When a user request misses in the local cache, the proxy uses the summaries to find proxies that store the requested document.
Two tolerable errors:
- False miss: the summary does not reflect that the document is cached at some other proxy.
- False hit: the summary wrongly indicates that the document is cached at some other proxy.
Summary Cache
Challenges of scalability:
- Network overhead: interproxy traffic
- Memory to store summaries: summaries need to be stored in DRAM for performance reasons
Naive summary representations:
- Exact-directory approach: the summary is the cache directory, with each URL represented by its 16-byte MD5 signature.
- Problem: it consumes too much memory. For example, with 16 proxies of 8 GB each and an average file size of 8 KB, the exact-directory approach consumes $(16 - 1) \times 16\ \text{bytes} \times \frac{8\ \text{GB}}{8\ \text{KB}} = 240\ \text{MB}$ at each proxy.
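The 240 MB figure can be reproduced with a few lines of arithmetic. The function name is illustrative, and binary units (1 KB = 1024 bytes) are assumed here.

```python
KB, MB, GB = 1024, 1024**2, 1024**3  # assumption: binary units

def exact_directory_bytes(num_proxies: int, cache_bytes: int,
                          avg_doc_bytes: int, signature_bytes: int = 16) -> int:
    """Memory one proxy needs to hold exact directories of all other proxies."""
    docs_per_proxy = cache_bytes // avg_doc_bytes
    return (num_proxies - 1) * signature_bytes * docs_per_proxy

total = exact_directory_bytes(num_proxies=16, cache_bytes=8 * GB, avg_doc_bytes=8 * KB)
print(total // MB, "MB")  # 15 * 16 bytes * 2^20 documents
```

Each proxy holds about $2^{20}$ documents, so one remote directory costs 16 MB and the 15 remote directories together cost 240 MB.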
Summary Cache
Summary representations in summary cache:
- Bloom filter: memory-efficient, with a low rate of false hits
Good news:
- Summaries do not have to be up-to-date or accurate.
- Summaries do not have to be updated every time the cache directory changes.
- Updates can occur at regular time intervals.
- An experiment illustrates this: the update of summaries is delayed until the percentage of cached documents that are new reaches a threshold.
Bloom Filter as Summaries
Construction:
- A proxy builds a Bloom filter from the list of URLs of its cached documents.
- A proxy sends the Bloom filter, plus the corresponding hash functions, to other proxies.
Update:
- Use a counting Bloom filter to maintain the local filter.
- To update the Bloom filter held by other proxies, either specify which bits in the bit array are flipped or send the whole array.
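The update scheme can be sketched as follows: the owning proxy keeps counters locally, records a flip whenever a counter moves between 0 and 1, and ships only those flips to its peers. The class and function names, the salted SHA-256 hash family, and the `(position, new_bit)` update format are all assumptions of this sketch, not the protocol's wire format.

```python
import hashlib

def positions(url: str, m: int, k: int = 4):
    """Hypothetical hash family: k salted SHA-256 hashes reduced modulo m."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]

class LocalSummary:
    """Counting filter kept by the owning proxy; peers hold only the bit array."""
    def __init__(self, m: int):
        self.m = m
        self.counters = [0] * m
        self.flips = []  # pending (position, new_bit) updates for peers

    def add(self, url: str) -> None:
        for pos in positions(url, self.m):
            self.counters[pos] += 1
            if self.counters[pos] == 1:   # bit goes 0 -> 1
                self.flips.append((pos, 1))

    def remove(self, url: str) -> None:
        # Only valid for URLs that were added earlier.
        for pos in positions(url, self.m):
            self.counters[pos] -= 1
            if self.counters[pos] == 0:   # bit goes 1 -> 0
                self.flips.append((pos, 0))

    def drain_update(self):
        """Return the pending flips and clear the queue."""
        update, self.flips = self.flips, []
        return update

def apply_update(peer_bits, update) -> None:
    for pos, bit in update:
        peer_bits[pos] = bit

local = LocalSummary(m=64)
peer_bits = [0] * 64
local.add("http://example.com/a")
local.add("http://example.com/b")
apply_update(peer_bits, local.drain_update())
local.remove("http://example.com/a")
apply_update(peer_bits, local.drain_update())
```

Sending flips keeps updates small when few documents change; sending the whole array is the simpler fallback when many bits have changed.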
Evaluation
Three configurations ($m$ denotes the size of the Bloom filter array):
- Bloom_filter_8/16/32: $m$ is 8/16/32 times the average number of documents in the cache.
Four hash functions:
- First calculate the MD5 signature of the URL (128 bits).
- Divide the 128 bits into four 32-bit words.
- Take each 32-bit word modulo $m$.
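The three steps for deriving the four hash values can be sketched with Python's standard library. The function name, the example URL, and the big-endian interpretation of each 32-bit word are assumptions; the slides do not specify a byte order.

```python
import hashlib
import struct

def summary_hashes(url: str, m: int):
    """Return four bit positions in [0, m) derived from the MD5 of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 128 bits = 16 bytes
    # Assumption: interpret the digest as four big-endian unsigned 32-bit words.
    words = struct.unpack(">4I", digest)
    return [word % m for word in words]

m = 16 * 1000  # hypothetical: load factor 16 with about 1000 cached documents
print(summary_hashes("http://www.example.com/index.html", m))
```

One MD5 computation yields all four hash values, so the per-URL cost stays at a single digest.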
Evaluation
Memory consumption: low memory consumption.

| Approach | DEC | NLANR |
|---|---|---|
| Exact_dir | 2.8% | 0.70% |
| Server_name | 0.19% | 0.08% |
| Bloom_filter_8 | | 0.038% |
| Bloom_filter_16 | 0.38% | 0.075% |
| Bloom_filter_32 | 0.75% | 0.15% |
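Evaluation
Each Overhead row gives the change of the row above it relative to the no ICP row.

Exp1 (hit ratio 25%)
| Scheme | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.75 (5%) | 94.42 (5%) | 133.65 (6%) | 615 (28%) | 334K (8%) | 355K (7%) |
| ICP | 3.07 (0.7%) | 116.87 (5%) | 146.50 (5%) | 54774 (0%) | 328K (4%) | 402K (3%) |
| Overhead | 12% | 24% | 10% | 9000% | -2% | 13% |
| SC-ICP | 2.85 (1%) | 95.07 (6%) | 134.61 (6%) | 1079 (0%) | 330K (5%) | 351K (5%) |
| Overhead | 4% | 0.7% | | 75% | -1% | |

Exp2 (hit ratio 45%)
| Scheme | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets |
|---|---|---|---|---|---|---|
| no ICP | 2.21 (1%) | 80.83 (2%) | 111.10 (2%) | 540 (3%) | 272K (3%) | 290K (3%) |
| ICP | 2.39 (1%) | 97.36 (1%) | 118.59 (1%) | 39968 (0%) | 257K (2%) | 314K (1%) |
| Overhead | 8% | 20% | 7% | 7300% | | |
| SC-ICP | 2.25 (1%) | 82.03 (3%) | 111.87 (3%) | 799 (5%) | 269K (5%) | 287K (5%) |
| Overhead | 2% | 1% | | 48% | | |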
Conclusion
Summary Cache enhanced ICP:
- Reduces the number of interproxy protocol messages by a factor of 25 to 60
- Reduces bandwidth consumption by over 50%
- Causes almost no degradation in the cache hit ratios
- Reduces CPU overhead by between 30% and 95%
Thank You
Chen Qian cqian12@ucsc.edu https://users.soe.ucsc.edu/~qian/