Lecture 1: Bloom Filters


1 Lecture 1: Bloom Filters
Chen Qian Department of Computer Engineering

2 Outline
Pre-Knowledge
Bloom filters
Counting Bloom filters
Applications

3 Hash Functions
A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.
[Figure: a hash function maps the keys John Smith, Lisa Smith, Sam Doe, and Sandra Dee to integers from 0 to 15 (slots 00 to 15); two of the keys land in the same slot, which is a collision.]
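As a minimal sketch of the mapping in the figure, the snippet below sends names of any length to integers from 0 to 15. Using MD5 as the underlying hash and the helper name `slot` are assumptions made for illustration; the slide does not say which function the figure uses.

```python
import hashlib

NUM_SLOTS = 16  # integers 0..15, as in the slide's figure


def slot(key, num_slots=NUM_SLOTS):
    """Map a string of arbitrary size to a fixed range 0..num_slots-1."""
    # Assumption: MD5 is used only as a convenient, deterministic hash.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_slots


if __name__ == "__main__":
    for name in ["John Smith", "Lisa Smith", "Sam Doe", "Sandra Dee"]:
        print(f"{name!r} -> {slot(name):02d}")
```

Because there are only 16 slots, any 17 distinct keys must produce at least one collision, whatever hash function is chosen.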

4 Hash Functions
Requirements for hash functions:
Can be applied to a key $x$ of any size
Produces a fixed-length output $d$
Easy to compute: $d = h(x)$
One-way property: given $d$, it is infeasible to find $x$ such that $h(x) = d$
Weak collision resistance: given $x$, it is infeasible to find $y \neq x$ such that $h(y) = h(x)$
Strong collision resistance: it is infeasible to find any pair $x \neq y$ such that $h(x) = h(y)$
Uniformity: every hash value in the output range should be generated with roughly the same probability
The requirements shown in red on the slide are required for cryptographic hashes (in standard usage, the one-way property and the two collision-resistance properties).

5 Examples of Hash functions
mod
CRC-16, CRC-32 (CRCs are based on the theory of cyclic error-correcting codes)
Jenkins, CityHash, FarmHash, MetroHash, etc.
MD5, SHA-1, MD6, SHA-2 (including SHA-256)

6 Outline
Pre-Knowledge
Bloom filters
Counting Bloom filters
Applications

7 Bloom Filters
Membership query
A Bloom filter is a simple, space-efficient randomized data structure.
It represents a set in order to support membership queries.
Burton Bloom introduced Bloom filters in the 1970s.
[Figure: a set S containing elements such as A, B, C, E, and Z; the query asks whether element P is in S.]

8 Bloom Filters
Standard Bloom filters:
Represent a set $S = \{x_1, x_2, \ldots, x_n\}$ by an array of $m$ bits
Use $k$ independent hash functions $h_1, h_2, \ldots, h_k$
Assumption: the hash functions map each item to a random number uniformly over the range $\{1, 2, \ldots, m\}$
How is a Bloom filter constructed?

9 Bloom Filters
An m-bit array, initially all set to 0. Assume k is 3. Insert $x_1$, $x_2$.
[Figure: $x_1$ is hashed by $h_1(x_1)$, $h_2(x_1)$, $h_3(x_1)$ to three positions in the array.]

10 Bloom Filters
[Figure: the bits at positions $h_1(x_1)$, $h_2(x_1)$, $h_3(x_1)$ are set to 1.]

11 Bloom Filters
[Figure: $x_2$ is hashed by $h_1(x_2)$, $h_2(x_2)$, $h_3(x_2)$ to three positions in the array.]

12 Bloom Filters
[Figure: the bits at positions $h_1(x_2)$, $h_2(x_2)$, $h_3(x_2)$ are set to 1.]

13 Bloom Filters
After inserting $x_1$ and $x_2$, query $x_2$ and $y_1$ (an alien element, that is, one that was never inserted).
The bits for $x_2$ are all ones, so $x_2$ is a member.
At least one bit for $y_1$ is 0, so $y_1$ is not a member.
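The insert and query steps walked through on slides 9 to 13 can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name `BloomFilter` and the trick of simulating k hash functions by salting SHA-256 with an index are assumptions.

```python
import hashlib


class BloomFilter:
    """An m-bit array with k hash functions; supports insert and query only."""

    def __init__(self, m, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all set to 0

    def _positions(self, item):
        # Assumption: k hash functions are simulated by prefixing the item
        # with the index i before hashing with SHA-256.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest, "big") % self.m)
        return positions

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # True means "possibly a member"; False means "definitely not a member".
        return all(self.bits[pos] == 1 for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter(m=64, k=3)
    bf.insert("x1")
    bf.insert("x2")
    print(bf.query("x2"))  # True: inserted items are always found
    print(bf.query("y1"))  # usually False, but a false positive is possible
```

A query never clears or sets bits, so an inserted element is always reported as a member; only elements that were never inserted can be misreported.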

14 Bloom Filters
False positive
False negative

15 Bloom Filters
Bloom filters have false positives but no false negatives.
There is a clear tradeoff between m and the probability of a false positive.
[Figure: an alien element y hashes to bits that were all set to 1 by $x_1$ and $x_2$, so y is reported as a member.]

16 Bloom Filters
n keys, an m-bit array, k hash functions
After inserting n keys, let $p$ denote the probability that a particular bit is still 0:
$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$
Let $f$ denote the probability of a false positive:
$f = (1 - p)^{k} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$
When $k = \ln 2 \cdot \frac{m}{n}$, $f$ achieves its optimal (minimum) value.
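To make the formulas concrete, the following sketch evaluates $p$, $f$, and the optimal $k$ numerically. The function names and the sizes m = 8000, n = 1000 are hypothetical values chosen for illustration.

```python
import math


def false_positive_rate(m, n, k):
    """f = (1 - p)^k, where p = (1 - 1/m)^(k*n) is the chance a bit is still 0."""
    p = (1.0 - 1.0 / m) ** (k * n)
    return (1.0 - p) ** k


def optimal_k(m, n):
    """k = ln 2 * m / n, the real-valued k that minimizes f."""
    return math.log(2) * m / n


if __name__ == "__main__":
    m, n = 8000, 1000  # hypothetical sizes: 8 bits per key
    print(f"optimal k = {optimal_k(m, n):.2f}")
    for k in range(1, 11):
        print(k, f"{false_positive_rate(m, n, k):.4f}")
```

In practice k must be an integer, so one would evaluate f at the integers on either side of $\ln 2 \cdot m/n$ and keep the smaller result.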

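17 Bloom Filters
No deletion
[Figure: $x_1$ and $x_2$ have been inserted and share a bit that is set to 1; delete $x_1$.]

18 Bloom Filters
No deletion
[Figure: deleting $x_1$ resets its bits to 0, including the bit shared with $x_2$, so a query now reports that $x_2$ is not a member.]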

19 Bloom Filters
Features
Low storage requirement
Fast membership checking
No false negatives
Low false positive probability
No deletion is allowed

20 Outline
Pre-Knowledge
Bloom filters
Counting Bloom filters
Applications

21 Counting Bloom Filters
Allow deletion.
Each entry in the counting Bloom filter is a small counter, with l bits per counter.
[Figure: insert $x_1$, $x_2$; the counters at the positions of $x_1$ are incremented to 1.]

22 Counting Bloom Filters
[Figure: $x_2$ is hashed to its positions in the counter array.]

23 Counting Bloom Filters
[Figure: after both insertions, counters hit by one element hold 1 and a counter hit by both $x_1$ and $x_2$ holds 2.]

24 Counting Bloom Filters
[Figure: delete $x_1$; its counters are decremented, the shared counter drops from 2 to 1, and $x_2$ is still reported as a member.]
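The counter-based insert and delete shown on slides 21 to 24 can be sketched as below. The class name, the salted SHA-256 hashing, and the saturating behavior at the counter maximum are assumptions of this sketch rather than details given on the slides.

```python
import hashlib


class CountingBloomFilter:
    """An array of m small counters with k hash functions; supports deletion."""

    def __init__(self, m, k=3, counter_bits=4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: salted SHA-256 stands in for k independent hash functions.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest, "big") % self.m)
        return positions

    def insert(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:  # saturate, never wrap around
                self.counters[pos] += 1

    def delete(self, item):
        # Only meaningful for items that were actually inserted.
        for pos in self._positions(item):
            # A saturated counter is left alone because its true count is unknown.
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1

    def query(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


if __name__ == "__main__":
    cbf = CountingBloomFilter(m=64)
    cbf.insert("x1")
    cbf.insert("x2")
    cbf.delete("x1")
    print(cbf.query("x2"))  # True: deleting x1 does not remove x2
```

Decrementing instead of clearing is what keeps a counter shared by $x_1$ and $x_2$ above zero after $x_1$ is deleted.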

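25 Counting Bloom Filters
Choosing the number of bits per counter
Let $c(i)$ be the count associated with the $i$-th counter.
$p(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk-j}$
$p(c(i) \ge j) \le \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \le \left(\frac{enk}{jm}\right)^{j}$ (a loose bound)
The probability that any counter is at least $j$ is bounded by $m\,p(c(i) \ge j)$:
$p(\max_i c(i) \ge j) \le m\,p(c(i) \ge j)$
$p(\max_i c(i) \ge j) \le m \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \le m \left(\frac{enk}{jm}\right)^{j}$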

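26 Counting Bloom Filters
$p(\max_i c(i) \ge j) \le m \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \le m \left(\frac{enk}{jm}\right)^{j}$
Let $k \le \ln 2 \cdot \frac{m}{n}$, so that $\frac{nk}{m} \le \ln 2$:
$p(\max_i c(i) \ge j) \le m \left(\frac{e \ln 2}{j}\right)^{j}$
$p(\max_i c(i) \ge 16) \le 1.37 \times 10^{-15} \times m$
If we allow 4 bits per counter, the probability of overflow for practical values of m during the initial insertion in the CBF is minuscule.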
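The constant in the last bound can be checked numerically. The sketch below evaluates $(e \ln 2 / j)^j$ for $j = 16$ and scales it by a few filter sizes; the function name and the values of m are hypothetical.

```python
import math


def overflow_bound_per_m(j):
    """Bound on p(max counter >= j) divided by m, assuming k <= ln 2 * m / n."""
    return (math.e * math.log(2) / j) ** j


if __name__ == "__main__":
    per_m = overflow_bound_per_m(16)  # a 4-bit counter overflows at a count of 16
    print(f"p(max c(i) >= 16) <= {per_m:.3g} * m")
    for m in (10**6, 10**9):  # hypothetical filter sizes
        print(f"m = {m}: bound = {m * per_m:.3g}")
```

Even for a filter with a billion counters the bound stays around $10^{-6}$, which is why 4 bits per counter are considered enough.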

27 Outline
Pre-Knowledge
Bloom filters
Counting Bloom filters
Applications

28 Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
Background and Problems
Internet cache protocol (ICP)
Summary cache
Evaluation

29 Background and Problems
With the growth of the World Wide Web, caching has been recognized as one of the most important techniques for reducing bandwidth consumption.
[Figure: users connect to proxy caches, which reach the rest of the Internet through a bottleneck link.]

30 Background and Problems
Proxy caches should cooperate and serve each other's misses.
If an object misses in the current proxy cache, it can be fetched from another proxy cache.
How do we know which proxy caches hold the requested object that missed in the current proxy cache?
This process is called "Web Cache Sharing".
Internet cache protocol (ICP) is one of the protocols for web cache sharing.

31 Overview
Background and Problems
Internet cache protocol (ICP)
Summary cache
Evaluation

32 Internet Cache Protocol (ICP)
ICP supports discovery and retrieval of documents from neighboring caches.
Proxy caches connect hierarchically.
[Figure: a client sends a web page request (step 1) to its proxy cache; on a miss, that proxy sends ICP queries (step 2) to its parent and sibling proxy caches; on a hit, the result is returned.]

33 Internet Cache Protocol (ICP)
[Figure: case 2.1, the parent and sibling proxy caches all reply with an ICP miss.]

34 Internet Cache Protocol (ICP)
[Figure: step 3, the web page request goes through the parent toward the web server, and the result is returned to the client.]

35 Internet Cache Protocol (ICP)
[Figure: case 2.2, some neighboring proxy caches reply with an ICP hit and another replies with an ICP miss.]

36 Internet Cache Protocol (ICP)
[Figure: step 3, the web page request is sent to a proxy cache that reported a hit, and the result is returned to the client.]
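The query flow in these figures can be sketched as a toy model. This is not the ICP wire protocol: the `Proxy` class, its method names, and the choice to cache a fetched document locally are all assumptions made for illustration.

```python
class Proxy:
    """Toy proxy cache that asks every neighbor on a local miss."""

    def __init__(self, name, cached):
        self.name = name
        self.cache = set(cached)
        self.neighbors = []            # parent and sibling proxies
        self.icp_queries_received = 0  # counts the query load on this proxy

    def icp_query(self, url):
        # Reply "ICP hit" (True) or "ICP miss" (False).
        self.icp_queries_received += 1
        return url in self.cache

    def handle_request(self, url):
        if url in self.cache:
            return f"local hit at {self.name}"
        # Step 2: on a local miss, every neighbor receives a query.
        hits = [p for p in self.neighbors if p.icp_query(url)]
        self.cache.add(url)  # assumption: keep a local copy after fetching
        if hits:
            return f"fetched from {hits[0].name}"      # case 2.2
        return "fetched from web server"               # case 2.1


if __name__ == "__main__":
    a = Proxy("A", {"/x"})
    b = Proxy("B", {"/d"})
    c = Proxy("C", set())
    a.neighbors = [b, c]
    print(a.handle_request("/d"))  # fetched from B
    print(b.icp_queries_received, c.icp_queries_received)  # 1 1
```

Every local miss costs one query per neighbor, whether or not any neighbor holds the document.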

37 Overhead of ICP
ICP is not a scalable protocol.
ICP relies on query messages to find cache hits in other proxies.
Each proxy cache needs to handle $(N-1)(1-H)R$ inquiries from neighboring caches, where
N: number of proxy caches
H: hit rate
R: average number of requests
As N increases, the overhead quickly becomes prohibitive.
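A short calculation shows how the inquiry load $(N-1)(1-H)R$ grows with the number of proxies. The function name and the sample values of N, H, and R are hypothetical.

```python
def icp_inquiries(num_proxies, hit_rate, requests):
    """Inquiries one proxy must handle: (N - 1) * (1 - H) * R."""
    return (num_proxies - 1) * (1 - hit_rate) * requests


if __name__ == "__main__":
    hit_rate = 0.25     # hypothetical H
    requests = 10_000   # hypothetical R
    for n in (4, 16, 64):
        print(f"N = {n}: {icp_inquiries(n, hit_rate, requests):.0f} inquiries")
```

The load grows linearly in N for each proxy, so the total across all N proxies grows quadratically.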

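38 Overhead of ICP
4 proxy caches. A trailing percentage is the overhead relative to no ICP. Cells marked [missing] did not survive in this transcript.

Exp1 (hit ratio 25%)
Configuration | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets
no ICP | 2.75 (5%) | 94.42 (5%) | [missing] (6%) | 615 (28%) | 334K (8%) | 355K (7%)
ICP | 3.07 (0.7%) 12% | [missing] (5%) 24% | [missing] (5%) 10% | 54774 (0%) 9000% | 328K (4%) -2% | 402K (3%) 13%

Exp2 (hit ratio 45%)
Configuration | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets
no ICP | 2.21 (1%) | 80.83 (2%) | [missing] (2%) | 540 (3%) | 272K (3%) | 290K (3%)
ICP | 2.39 (1%) 8% | 97.36 (1%) 20% | [missing] (1%) 7% | 39968 (0%) 7300% | 257K (2%) -1% | 314K (1%) [missing]

ICP incurs considerable overhead even when the number of proxies is as low as four.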

39 Benefits of Web Cache Sharing
Use simulation results to illustrate.
Compare four different cooperation schemes:
No cache sharing: proxies do not collaborate.
Simple cache sharing (ICP-style): proxies do not coordinate cache replacement (each runs LRU).
[Figure: a client requests D from Proxy Cache 1, which misses and requests D from Proxy Cache 2; Proxy Cache 2 returns D, and Proxy Cache 1 caches D locally and returns it to the client.]

40 Benefits of Web Cache Sharing
Compare four different cooperation schemes:
Single-copy cache sharing
[Figure: a client requests D from Proxy Cache 1, which misses and requests D from Proxy Cache 2; Proxy Cache 2 returns D, marks D as most-recently-accessed, and increases its caching priority.]
Compared to simple cache sharing, this scheme eliminates the storage of duplicate copies and increases the utilization of available cache space.

41 Benefits of Web Cache Sharing
Compare four different cooperation schemes:
Global cache: proxies share cache contents and coordinate replacement.
[Figure: a client requests D and D is returned; Proxy Cache 1 and Proxy Cache 2 hold documents A to E between them; C is replaced by F in the cache.]

42 Benefits of Web Cache Sharing
Compare four different cooperation schemes:
Global cache: proxies share cache contents and coordinate replacement.
[Figure: after the replacement, the two proxy caches hold A, B, D, E, and F.]
To the users, the global cache can be regarded as one unified cache with global LRU replacement.

43 Benefits of Web Cache Sharing
Use simulation results to illustrate.
Compare four different cooperation schemes.
Collect five sets of traces of HTTP requests: DEC, UCB, UPisa, Questnet, NLANR.

44 Benefits of Web Cache Sharing
Simulation result
Hit ratios under single-copy cache sharing and simple cache sharing are generally the same as, or even higher than, the hit ratio under global cache.

45 Benefits of Web Cache Sharing
Simulation result
All cache sharing schemes significantly improve the hit ratio over no cache sharing.
Hit ratios under single-copy cache sharing and simple cache sharing are generally the same.
The memory wasted on duplicate copies has only a small effect.
A smaller effective cache does not make a significant difference in the hit ratio.

46 Benefits of Web Cache Sharing
Simulation result
Hit ratios under single-copy cache sharing and simple cache sharing are generally the same as, or even higher than, the hit ratio under global cache.
Global LRU sometimes performs less well than group-wise LRU.
In the global cache setting, a burst of rapid successive requests from one user might disturb the working sets of many other users.
Conclusion
ICP-style simple cache sharing obtains most of the benefits of more elaborate cooperative caching.

47 Dilemma
Benefits of ICP vs. high overhead of ICP
Summary Cache

48 Overview
Background and Problems
Internet cache protocol (ICP)
Summary cache
Evaluation

49 Summary Cache
Basic idea
Each proxy stores a summary of the directory of cached documents of every other proxy.
When a user request misses in the local cache, the proxy uses the summaries to find proxies that store the requested document.
Two tolerable errors
False miss: the summary does not reflect that the document is cached at some other proxy.
False hit: the summary wrongly indicates that the document is cached at some other proxy.

50 Summary Cache
Challenges of scalability
Network overhead: interproxy traffic
Memory to store summaries: summaries need to be stored in DRAM for performance reasons
Naive summary representations
Exact-directory approach: the summary is the cache directory, with each URL represented by its 16-byte MD5 signature
Problem: consumes too much memory
E.g., with 16 proxies of 8 GB each and an average file size of 8 KB, the exact-directory approach consumes $(16-1) \times 16 \times \frac{8\,\mathrm{GB}}{8\,\mathrm{KB}}$ bytes $= 240$ MB.
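The 240 MB figure can be reproduced with a few lines of arithmetic. The function name is hypothetical, and binary units (1 KB = 1024 bytes) are assumed.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def exact_directory_bytes(num_proxies, cache_bytes, avg_file_bytes, signature_bytes=16):
    """Memory one proxy needs to hold the exact directories of all other proxies."""
    docs_per_proxy = cache_bytes // avg_file_bytes
    return (num_proxies - 1) * signature_bytes * docs_per_proxy


if __name__ == "__main__":
    total = exact_directory_bytes(16, 8 * GB, 8 * KB)
    print(total // MB, "MB")  # 240 MB
```

Each proxy caches about one million documents, so one directory costs 16 MB and fifteen of them cost 240 MB.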

51 Summary Cache
Summary representations in summary cache
Bloom filter: memory-efficient and with few false hits
Good news
Summaries do not have to be up-to-date or accurate.
Summaries do not have to be updated every time the cache directory changes.
The update can occur at regular time intervals.
An experiment illustrates this: the update of summaries is delayed until the percentage of cached documents that are new reaches a threshold.

52 Bloom Filter as Summaries
Construction
A proxy builds a Bloom filter from the list of URLs of cached documents.
A proxy sends the Bloom filter plus the corresponding hash functions to other proxies.
Update
Use a counting Bloom filter to update the local filter.
To update the Bloom filter in other proxies, either specify which bits in the bit array are flipped or send the whole array.
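The "send only the flipped bits" update can be sketched as follows. The function names are hypothetical, and the counters are shown as plain Python lists rather than a real counting Bloom filter.

```python
def bit_vector(counters):
    """Derive the Bloom filter bits from local counters: bit is 1 iff count > 0."""
    return [1 if c > 0 else 0 for c in counters]


def flipped_positions(old_bits, new_bits):
    """Indices whose bit changed since the last update was sent."""
    return [i for i, (a, b) in enumerate(zip(old_bits, new_bits)) if a != b]


def apply_flips(bits, positions):
    """Bring a remote copy up to date by flipping the listed bits in place."""
    for i in positions:
        bits[i] ^= 1


if __name__ == "__main__":
    before = bit_vector([0, 2, 1, 0])  # counters when the summary was last sent
    after = bit_vector([1, 2, 0, 0])   # counters after some inserts and deletes
    delta = flipped_positions(before, after)
    remote = list(before)              # the copy held by another proxy
    apply_flips(remote, delta)
    print(delta, remote == after)      # [0, 2] True
```

Sending the list of flipped indices is cheaper than sending the whole array only while few bits have changed, which is why both options are mentioned on the slide.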

53 Overview
Background and Problems
Internet cache protocol (ICP)
Summary cache
Evaluation

54 Evaluation
Three configurations (m denotes the size of the Bloom filter array)
Bloom_filter_8/16/32: m is 8/16/32 times the average number of documents in the cache
Four hash functions
First calculate the MD5 signature of the URL (128 bits)
Divide the 128 bits into four 32-bit words
Take each 32-bit word modulo m
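The four hash functions described on this slide can be sketched with the standard library. The function name `md5_hashes` and the big-endian word order are assumptions; the slide does not specify byte order.

```python
import hashlib
import struct


def md5_hashes(url, m):
    """Return four bit positions in [0, m) derived from the MD5 of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 128-bit signature
    # Assumption: the digest is split into four big-endian 32-bit words.
    words = struct.unpack(">4I", digest)
    return [w % m for w in words]


if __name__ == "__main__":
    num_docs = 1000    # hypothetical average number of cached documents
    m = 16 * num_docs  # the Bloom_filter_16 configuration
    print(md5_hashes("http://example.com/index.html", m))
```

One MD5 computation yields all four positions, so each URL is hashed only once.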

55 Evaluation
Memory consumption
Approach | DEC | NLANR
Exact_dir | 2.8% | 0.70%
Server_name | 0.19% | 0.08%
Bloom_filter_8 | [missing] | 0.038%
Bloom_filter_16 | 0.38% | 0.075%
Bloom_filter_32 | 0.75% | 0.15%
Low memory consumption

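56 Evaluation
A trailing percentage is the overhead relative to no ICP. Cells marked [missing] did not survive in this transcript.

Exp1 (hit ratio 25%)
Configuration | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets
no ICP | 2.75 (5%) | 94.42 (5%) | [missing] (6%) | 615 (28%) | 334K (8%) | 355K (7%)
ICP | 3.07 (0.7%) 12% | [missing] (5%) 24% | [missing] (5%) 10% | 54774 (0%) 9000% | 328K (4%) -2% | 402K (3%) 13%
SC-ICP | 2.85 (1%) 4% | 95.07 (6%) 0.7% | [missing] (6%) [missing] | 1079 (0%) 75% | 330K (5%) -1% | 351K (5%) [missing]

Exp2 (hit ratio 45%)
Configuration | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets
no ICP | 2.21 (1%) | 80.83 (2%) | [missing] (2%) | 540 (3%) | 272K (3%) | 290K (3%)
ICP | 2.39 (1%) 8% | 97.36 (1%) 20% | [missing] (1%) 7% | 39968 (0%) 7300% | 257K (2%) [missing] | 314K (1%) [missing]
SC-ICP | 2.25 (1%) 2% | 82.03 (3%) 1% | [missing] (3%) [missing] | 799 (5%) 48% | 269K (5%) [missing] | 287K (5%) [missing]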

57 Conclusion
Summary Cache enhanced ICP
Reduces the number of interproxy protocol messages by a factor of 25 to 60
Reduces bandwidth consumption by over 50%
Causes almost no degradation in the cache hit ratios
Reduces CPU overhead by between 30% and 95%

58 Thank You
Chen Qian
cqian12@ucsc.edu
https://users.soe.ucsc.edu/~qian/

