Lecture 1: Bloom Filters


Lecture 1: Bloom Filters Chen Qian Department of Computer Engineering qian@ucsc.edu https://users.soe.ucsc.edu/~qian/

Outline
Pre-Knowledge
Bloom filters
Counting Bloom filters
Applications

Hash Functions
A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.
[Figure: a hash function maps the keys John Smith, Lisa Smith, Sam Doe, and Sandra Dee to integers from 0 to 15 (slots 00 to 15); one slot is marked as a collision.]
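
The mapping in the figure can be sketched in a few lines of Python. This is only an illustration: the slide does not say which hash function it uses, so the choice of MD5 truncated to 32 bits and the helper name `bucket` are assumptions.

```python
import hashlib

NUM_BUCKETS = 16  # the slide's output range: integers 0..15


def bucket(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a string of any length to a fixed-size integer in [0, num_buckets)."""
    # MD5 is an assumed stand-in; its digest is 16 bytes whatever the input size.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


names = ["John Smith", "Lisa Smith", "Sam Doe", "Sandra Dee"]
table = {name: bucket(name) for name in names}
for name, slot in table.items():
    print(f"{name!r} -> {slot:02d}")

# Two keys that land in the same slot collide. Whether that happens for
# these four names depends on the hash function chosen.
collisions = len(names) - len(set(table.values()))
print("keys that share a slot with an earlier key:", collisions)
```

With only 16 slots, collisions become unavoidable once more than 16 distinct keys are hashed.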

Hash Functions
Requirements for hash functions:
Can be applied to a key $x$ of any size.
Produces a fixed-length output $d$.
Easy to compute: $d = h(x)$.
One-way property: given $d$, it is infeasible to find $x$ such that $h(x) = d$.
Weak collision resistance: given $x$, it is infeasible to find $y \neq x$ such that $h(y) = h(x)$.
Strong collision resistance: it is infeasible to find any pair $x \neq y$ such that $h(x) = h(y)$.
Uniformity: every hash value in the output range should be generated with roughly the same probability.
The items shown in red on the slide are required for cryptographic hashes.

Examples of Hash Functions
mod
CRC-16, CRC-32 (CRCs are based on the theory of cyclic error-correcting codes)
Jenkins, CityHash, FarmHash, MetroHash, etc.
MD5, SHA-1, MD6, SHA-2, SHA-256


Bloom Filters
Membership query: is element P in the set S?
A Bloom filter is a simple, space-efficient randomized data structure that represents a set in order to support membership queries.
Burton Bloom introduced Bloom filters in the 1970s.
[Figure: a set S containing elements C, A, E, B, Z, ...; the query asks whether P is in S.]

Bloom Filters
Standard Bloom filters:
Represent a set $S = \{x_1, x_2, \ldots, x_n\}$ by an array of $m$ bits.
Use $k$ independent hash functions $h_1, h_2, \ldots, h_k$.
Assumption: the hash functions map each item to a random number uniformly over the range $\{1, 2, \ldots, m\}$.
How is the Bloom filter constructed?

Bloom Filters: inserting $x_1$ and $x_2$
Start with an array of $m$ bits, all initially set to 0. Assume $k = 3$.
Insert $x_1$: set the bits at positions $h_1(x_1)$, $h_2(x_1)$, and $h_3(x_1)$ to 1.
Insert $x_2$: set the bits at positions $h_1(x_2)$, $h_2(x_2)$, and $h_3(x_2)$ to 1. A bit that is already 1 stays 1.

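Bloom Filters: querying $x_2$ and $y_1$
With $k = 3$ and $x_1$, $x_2$ inserted, query $x_2$ and $y_1$ (an alien element, i.e. one that was never inserted).
All $k$ bits that $x_2$ hashes to are 1, so $x_2$ is reported as a member.
At least one bit that $y_1$ hashes to is 0, so $y_1$ is not a member.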
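
The insert and query steps can be written as a minimal Python sketch. The slides assume $k$ independent uniform hash functions without naming them; salting SHA-256 with the function index is an assumption made here purely for illustration, as are the class and method names. Positions run from 0 to $m-1$ rather than 1 to $m$.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all set to 0

    def _positions(self, item: str):
        # Assumed stand-in for h_1..h_k: SHA-256 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # a bit that is already 1 stays 1

    def query(self, item: str) -> bool:
        # Member only if every probed bit is 1.
        return all(self.bits[pos] == 1 for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.insert("x1")
bf.insert("x2")
print(bf.query("x2"))  # True: inserted items are always found
print(bf.query("y1"))  # usually False, but a false positive is possible
```

A query never scans the stored set; it only reads $k$ bits, which is why membership checking is fast.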

Bloom Filters
False positive
False negative

Bloom Filters
False positives are possible; false negatives are not.
There is a clear tradeoff between $m$ and the probability of a false positive.
Example: an alien element $y$ whose $k$ bits were all set to 1 by $x_1$ and $x_2$ is wrongly reported as a member.

Bloom Filters
$n$ keys, an array of $m$ bits, $k$ hash functions.
After inserting $n$ keys, let $p$ denote the probability that a particular bit is still 0:
$$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$
Let $f$ denote the probability of a false positive:
$$f = (1 - p)^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$$
When $k = \ln 2 \cdot \frac{m}{n}$, $f$ achieves its optimal (minimum) value.
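
To see the tradeoff numerically, the formulas can be evaluated directly. The values $n = 1000$ and $m = 10000$ below are arbitrary example values, not taken from the lecture.

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """f = (1 - p)^k with p = (1 - 1/m)^(k*n), as on the slide."""
    p = (1 - 1 / m) ** (k * n)  # probability that a given bit is still 0
    return (1 - p) ** k


def optimal_k(n: int, m: int) -> float:
    """k = ln2 * m / n, the value that minimizes f."""
    return math.log(2) * m / n


n, m = 1000, 10000  # example values: 10 bits per key
k_star = optimal_k(n, m)
print(f"optimal k is about {k_star:.2f}")

for k in range(1, 13):
    print(k, f"{false_positive_rate(n, m, k):.5f}")

# k must be an integer in practice, so pick the best integer value.
best_k = min(range(1, 20), key=lambda k: false_positive_rate(n, m, k))
print("best integer k:", best_k)
```

The rate falls as $k$ grows toward $\ln 2 \cdot m/n$ and rises again beyond it, because each extra hash function also sets more bits.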

Bloom Filters
No deletion: deleting $x_1$ would mean resetting its $k$ bits to 0. If one of those bits is shared with $x_2$, a later query reports that $x_2$ is not a member, which is a false negative.

Bloom Filters
Features:
Low storage requirement
Fast membership checking
No false negatives
Low false positive probability
No deletion is allowed


Counting Bloom Filters
Counting Bloom filters allow deletion.
Each entry in a counting Bloom filter is a small counter rather than a single bit ($l$ bits per counter).
Insert $x_1$, $x_2$: increment each counter the element hashes to. A counter that both elements hash to reaches 2.
Delete $x_1$: decrement each counter that $x_1$ hashes to. The shared counter drops back to 1, so $x_2$ is still reported as a member.
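
A minimal counting Bloom filter sketch follows. The salted SHA-256 hashing and the saturating behavior of full counters are assumptions made for this illustration; the slides specify only that each entry is an $l$-bit counter.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose entries are small counters, so deletion is possible."""

    def __init__(self, m: int, k: int, counter_bits: int = 4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumed stand-in for h_1..h_k: SHA-256 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1  # saturate instead of overflowing

    def delete(self, item: str) -> None:
        # Only meaningful for items that were actually inserted.
        for pos in self._positions(item):
            # A saturated counter is left alone: its true count is unknown.
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1

    def query(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("x1")
cbf.insert("x2")
cbf.delete("x1")
print(cbf.query("x2"))  # True: deleting x1 does not remove x2
```

Deleting an item that was never inserted would corrupt the counters, so callers are expected to delete only known members.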

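Counting Bloom Filters
How many bits does each counter need?
Let $c(i)$ be the count associated with the $i$-th counter. Then
$$p(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk-j}$$
$$p(c(i) \ge j) \le \binom{nk}{j} \frac{1}{m^{j}} \le \left(\frac{enk}{jm}\right)^{j}$$
The probability that any counter is at least $j$ is bounded by $m \, p(c(i) \ge j)$ (a loose bound):
$$p\left(\max_i c(i) \ge j\right) \le m \, p(c(i) \ge j) \le m \binom{nk}{j} \frac{1}{m^{j}} \le m \left(\frac{enk}{jm}\right)^{j}$$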

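Counting Bloom Filters
$$p\left(\max_i c(i) \ge j\right) \le m \binom{nk}{j} \frac{1}{m^{j}} \le m \left(\frac{enk}{jm}\right)^{j}$$
Let $k \le \ln 2 \cdot \frac{m}{n}$. Then
$$p\left(\max_i c(i) \ge j\right) \le m \left(\frac{e \ln 2}{j}\right)^{j}$$
$$p\left(\max_i c(i) \ge 16\right) \le 1.37 \times 10^{-15} \times m$$
A 4-bit counter overflows only when its count reaches 16, so if we allow 4 bits per counter, the probability of overflow for practical values of $m$ during the initial insertion in the CBF is minuscule.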
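
The constant in the bound can be checked numerically. The filter size $m = 2^{20}$ in the second call is an arbitrary example value, and the function name is made up for this sketch.

```python
import math


def max_counter_bound(j: int, m: int) -> float:
    """Upper bound m * (e * ln2 / j)^j on p(max counter >= j), for k <= ln2 * m / n."""
    return m * (math.e * math.log(2) / j) ** j


# Coefficient of m for j = 16, the overflow threshold of a 4-bit counter.
per_m = max_counter_bound(16, 1)
print(f"{per_m:.3g}")  # about 1.37e-15, the constant on the slide

# Example: a filter with 2**20 counters (an assumed size).
print(f"{max_counter_bound(16, 2**20):.3g}")
```

Because the bound grows only linearly in $m$, even very large filters stay far from any realistic chance of overflow with 4-bit counters.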


Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
Background and Problems
Internet cache protocol (ICP)
Summary cache
Evaluation

Background and Problems
With the development of the World Wide Web, caching has been recognized as one of the most important techniques for reducing bandwidth consumption.
[Figure: users connect through proxy caches to the rest of the Internet, with a bottleneck marked between the proxy caches and the rest of the Internet.]

Background and Problems
Proxy caches should cooperate and serve each other's misses: if an object misses in the current proxy cache, it can be fetched from another proxy cache. This process is called "Web Cache Sharing".
How do we know which proxy caches hold the requested object that missed in the current proxy cache?
Internet cache protocol (ICP) is one of the protocols for web cache sharing.


Internet Cache Protocol (ICP)
ICP supports discovery and retrieval of documents from neighboring caches. Proxy caches connect hierarchically: a proxy cache has sibling caches and a parent cache, with the web server beyond them.
1. The client sends a web page request to its proxy cache.
2. On a local hit, the proxy returns the result. On a local miss, it sends an ICP query to its sibling and parent caches.
2.1 If every neighbor answers with an ICP miss, the proxy (step 3) forwards the web page request up the hierarchy, to the parent or web server, and returns the result to the client.
2.2 If some neighbor answers with an ICP hit, the proxy (step 3) sends the web page request to a proxy cache that reported the hit and returns the result to the client.

Overhead of ICP
ICP is not a scalable protocol.
ICP relies on query messages to find cache hits in other proxies.
Each proxy cache needs to handle $(N - 1) \cdot (1 - H) \cdot R$ inquiries from neighboring caches, where
N: number of proxy caches
H: hit rate
R: average number of requests
As N increases, the overhead quickly becomes prohibitive.
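
A short calculation shows how the inquiry load grows with the number of proxies. The request count of 100 is a made-up example value; the 25% hit rate matches the first experiment in the lecture's table.

```python
def icp_inquiries(n_proxies: int, hit_rate: float, requests: float) -> float:
    """Inquiries each proxy handles from its neighbors: (N - 1) * (1 - H) * R."""
    return (n_proxies - 1) * (1 - hit_rate) * requests


# Hypothetical load: R = 100 requests per proxy, hit rate H = 25%.
for n in (4, 16, 64):
    print(n, icp_inquiries(n, hit_rate=0.25, requests=100.0))
# 4 225.0
# 16 1125.0
# 64 4725.0
```

Every local miss triggers a query to each of the other $N - 1$ proxies, so the per-proxy load grows linearly in $N$ and the total across all proxies grows quadratically.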

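Overhead of ICP
4 proxy caches. In the ICP rows, the trailing percentage in each cell is the change relative to no ICP; where none is listed, the transcript does not give one.
Exp | Scheme | Hit Ratio | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets
Exp1 | no ICP | 25% | 2.75 (5%) | 94.42 (5%) | 133.65 (6%) | 615 (28%) | 334K (8%) | 355K (7%)
Exp1 | ICP | | 3.07 (0.7%), 12% | 116.87 (5%), 24% | 146.50 (5%), 10% | 54774 (0%), 9000% | 328K (4%), -2% | 402K (3%), 13%
Exp2 | no ICP | 45% | 2.21 (1%) | 80.83 (2%) | 111.10 (2%) | 540 (3%) | 272K (3%) | 290K (3%)
Exp2 | ICP | | 2.39 (1%), 8% | 97.36 (1%), 20% | 118.59 (1%), 7% | 39968 (0%), 7300% | 257K (2%), -1% | 314K (1%)
ICP incurs considerable overhead even when the number of proxies is as low as four.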

Benefits of Web Cache Sharing
Use simulation results to illustrate. Compare four different cooperation schemes:
No cache sharing: proxies do not collaborate.
Simple cache sharing (ICP-style): proxies serve each other's misses but do not coordinate cache replacement (each runs its own LRU). Example: the client requests D from Proxy Cache 1 (holding A, B, C, ...), which misses; Proxy Cache 1 requests D from Proxy Cache 2 (holding D, E, B, ...), which returns D; Proxy Cache 1 caches D locally and returns D to the client.

Benefits of Web Cache Sharing
Compare four different cooperation schemes:
Single-copy cache sharing: when Proxy Cache 1 misses on D and fetches it from Proxy Cache 2, it returns D to the client without caching it locally. Instead, Proxy Cache 2 marks D as most-recently-accessed and increases its caching priority.
Compared to simple cache sharing, this scheme eliminates the storage of duplicate copies and increases the utilization of available cache space.

Benefits of Web Cache Sharing
Compare four different cooperation schemes:
Global cache: the proxies share cache contents and coordinate replacement. Example from the figure: C is replaced by F in the cache.
We can regard the global cache as one unified cache that applies global LRU replacement across all users.

Benefits of Web Cache Sharing
Use simulation results to illustrate.
Compare four different cooperation schemes.
Collect five sets of traces of HTTP requests: DEC, UCB, Upisa, Questnet, NLANR.

Benefits of Web Cache Sharing
Simulation result:
Hit ratios under single-copy cache sharing and simple cache sharing are generally the same as or even higher than the hit ratio under global cache.

Benefits of Web Cache Sharing
Simulation result:
All cache sharing schemes significantly improve the hit ratio over no cache sharing.
Hit ratios under single-copy cache sharing and simple cache sharing are generally the same.
The memory wasted on duplicate copies has only a small effect: a smaller effective cache does not make a significant difference in the hit ratio.

Benefits of Web Cache Sharing
Simulation result:
Hit ratios under single-copy cache sharing and simple cache sharing are generally the same as or even higher than the hit ratio under global cache.
Global LRU sometimes performs less well than group-wise LRU: in the global cache setting, a burst of rapid successive requests from one user might disturb the working set of many other users.
Conclusion: ICP-style simple cache sharing obtains most of the benefits of more elaborate cooperative caching.

Dilemma
Benefits of ICP versus high overhead of ICP
Summary Cache


Summary Cache
Basic idea:
Each proxy stores a summary of the directory of cached documents of every other proxy.
When a user request misses in the local cache, the proxy uses the summaries to find proxies that store the requested document.
Two kinds of errors are tolerated:
False miss: the summary does not reflect that the document is cached at some other proxy.
False hit: the summary wrongly indicates that the document is cached at some other proxy.

Summary Cache
Challenges of scalability:
Network overhead: interproxy traffic.
Memory to store summaries: summaries need to be stored in DRAM for performance reasons.
Naive summary representations:
Exact-directory approach: the summary is the cache directory, with each URL represented by its 16-byte MD5 signature.
Problem: it consumes too much memory. E.g., with 16 proxies of 8 GB each and an average file size of 8 KB, the exact-directory approach consumes, at each proxy,
$$(16 - 1) \times 16\ \text{bytes} \times \frac{8\ \text{GB}}{8\ \text{KB}} = 240\ \text{MB}$$
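
The arithmetic behind the 240 MB figure can be reproduced step by step. The function and parameter names are made up for this sketch, and binary units (1 KB = 1024 bytes) are assumed.

```python
KB, MB, GB = 2**10, 2**20, 2**30  # binary units (an assumption)


def exact_directory_bytes(num_proxies: int, cache_bytes: int,
                          avg_file_bytes: int, signature_bytes: int = 16) -> int:
    """Memory one proxy needs to hold the exact directories of all other proxies."""
    docs_per_proxy = cache_bytes // avg_file_bytes  # about 1M documents here
    return (num_proxies - 1) * signature_bytes * docs_per_proxy


total = exact_directory_bytes(num_proxies=16, cache_bytes=8 * GB, avg_file_bytes=8 * KB)
print(total // MB, "MB")  # 240 MB
```

Each proxy holds 8 GB / 8 KB = $2^{20}$ documents, so one directory costs 16 MB and the 15 remote directories together cost 240 MB.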

Summary Cache
Summary representation in summary cache: a Bloom filter, which is memory-efficient and has a low false hit rate.
Good news:
Summaries do not have to be up-to-date or accurate.
Summaries do not have to be updated every time the cache directory changes.
Updates can occur at regular time intervals.
An experiment illustrates this: the update of summaries is delayed until the percentage of cached documents that are new reaches a threshold.

Bloom Filter as Summaries
Construction:
A proxy builds a Bloom filter from the list of URLs of its cached documents.
A proxy sends the Bloom filter plus the corresponding hash functions to the other proxies.
Update:
Use a counting Bloom filter to update the local filter.
To update the Bloom filter held by other proxies, either specify which bits in the bit array are flipped or send the whole array.
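
One way to picture the update path is the sketch below: the owning proxy keeps counters locally, records which bits flip, and ships only those flips to its peers. All names (`LocalSummary`, `take_update`, `apply_update`), the salted SHA-256 hashing, and the example URLs are assumptions for illustration, not the protocol's actual message format.

```python
import hashlib


def positions(url: str, m: int, k: int = 4) -> list:
    """Hypothetical hash scheme: k salted SHA-256 digests reduced mod m."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()[:8],
                           "big") % m
            for i in range(k)]


class LocalSummary:
    """Counters stay at the owning proxy; peers only hold the derived bit array."""

    def __init__(self, m: int, k: int = 4):
        self.m, self.k = m, k
        self.counters = [0] * m
        self.pending = {}  # bit index -> new bit value, not yet sent to peers

    def _record(self, pos: int, new_bit: int) -> None:
        if self.pending.get(pos) == 1 - new_bit:
            del self.pending[pos]  # flipped back since the last update: nothing to send
        else:
            self.pending[pos] = new_bit

    def add(self, url: str) -> None:
        for pos in positions(url, self.m, self.k):
            self.counters[pos] += 1
            if self.counters[pos] == 1:  # bit goes 0 -> 1
                self._record(pos, 1)

    def remove(self, url: str) -> None:
        # Assumes the URL was added earlier.
        for pos in positions(url, self.m, self.k):
            self.counters[pos] -= 1
            if self.counters[pos] == 0:  # bit goes 1 -> 0
                self._record(pos, 0)

    def take_update(self) -> dict:
        update, self.pending = self.pending, {}
        return update


def apply_update(peer_bits: list, update: dict) -> None:
    for pos, bit in update.items():
        peer_bits[pos] = bit


m = 64
local = LocalSummary(m)
peer_copy = [0] * m  # the summary another proxy holds for this cache
local.add("http://example.com/a")
local.add("http://example.com/b")
apply_update(peer_copy, local.take_update())
local.remove("http://example.com/a")
apply_update(peer_copy, local.take_update())
print(peer_copy == [1 if c > 0 else 0 for c in local.counters])  # True
```

Sending only the flipped bits keeps update messages small when few documents change; sending the whole array is the simpler alternative when many bits have changed.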


Evaluation
Three configurations ($m$ denotes the size of the Bloom filter array):
Bloom_filter_8/16/32: $m$ is 8/16/32 times the average number of documents in the cache.
Four hash functions:
First calculate the MD5 signature of the URL (128 bits).
Divide the 128 bits into four 32-bit words.
Take each 32-bit word modulo $m$.
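
The four hash functions can be sketched as follows. The slide does not state the byte order used when splitting the signature, so big-endian is an assumption here, as are the function name, the example URL, and the document count.

```python
import hashlib
import struct


def md5_hash_positions(url: str, m: int) -> list:
    """Four Bloom filter positions derived from one MD5 signature of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 128-bit signature
    words = struct.unpack(">4I", digest)  # four 32-bit words (byte order assumed)
    return [w % m for w in words]


avg_docs = 1000   # hypothetical average number of cached documents
m = 8 * avg_docs  # the Bloom_filter_8 configuration
print(md5_hash_positions("http://example.com/index.html", m))
```

Computing a single MD5 and slicing it yields four hash values for the cost of one digest.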

Evaluation
Memory consumption: low.
Approach | DEC | NLANR
Exact_dir | 2.8% | 0.70%
Server_name | 0.19% | 0.08%
Bloom_filter_8 | | 0.038%
Bloom_filter_16 | 0.38% | 0.075%
Bloom_filter_32 | 0.75% | 0.15%

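Evaluation
In the ICP and SC-ICP rows, the trailing percentage in each cell is the change relative to no ICP; where none is listed, the transcript does not give one.
Exp | Scheme | Hit Ratio | Client Latency | User CPU | System CPU | UDP Msgs | TCP Msgs | Total Packets
Exp1 | no ICP | 25% | 2.75 (5%) | 94.42 (5%) | 133.65 (6%) | 615 (28%) | 334K (8%) | 355K (7%)
Exp1 | ICP | | 3.07 (0.7%), 12% | 116.87 (5%), 24% | 146.50 (5%), 10% | 54774 (0%), 9000% | 328K (4%), -2% | 402K (3%), 13%
Exp1 | SC-ICP | | 2.85 (1%), 4% | 95.07 (6%), 0.7% | 134.61 (6%) | 1079 (0%), 75% | 330K (5%), -1% | 351K (5%)
Exp2 | no ICP | 45% | 2.21 (1%) | 80.83 (2%) | 111.10 (2%) | 540 (3%) | 272K (3%) | 290K (3%)
Exp2 | ICP | | 2.39 (1%), 8% | 97.36 (1%), 20% | 118.59 (1%), 7% | 39968 (0%), 7300% | 257K (2%) | 314K (1%)
Exp2 | SC-ICP | | 2.25 (1%), 2% | 82.03 (3%), 1% | 111.87 (3%) | 799 (5%), 48% | 269K (5%) | 287K (5%)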

Conclusion
Summary Cache enhanced ICP:
Reduces the number of interproxy protocol messages by a factor of 25 to 60.
Reduces the bandwidth consumption by over 50%.
Causes almost no degradation in the cache hit ratios.
Reduces CPU overhead by 30% to 95%.

Thank You
Chen Qian cqian12@ucsc.edu https://users.soe.ucsc.edu/~qian/