Bloom filters Probability and Computing Michael Mitzenmacher Eli Upfal

Slides:



Advertisements
Similar presentations
Many-to-one Trapdoor Functions and their Relations to Public-key Cryptosystems M. Bellare S. Halevi A. Saha S. Vadhan.
Advertisements

Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan.
Cuckoo Filter: Practically Better Than Bloom
Indian Statistical Institute Kolkata
An Improved Construction for Counting Bloom Filters Flavio Bonomi Michael Mitzenmacher Rina Panigrahy Sushil Singh George Varghese Presented by: Sailesh.
SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.
Hit or Miss ? !!!.  Small size.  Simple and fast.  Implementable with hardware.  Does not need too much power.  Does not predict miss if we have.
Bloom Filters Kira Radinsky Slides based on material from:
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Fast Statistical Spam Filter by Approximate Classifications Authors: Kang Li Zhenyu Zhong University of Georgia Reader: Deke Guo.
Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan.
Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines By F. Bonomi et al. Presented by Kenny Cheng, Tonny Mak Yui Kuen.
Look-up problem IP address did we see the IP address before?
The Goldreich-Levin Theorem: List-decoding the Hadamard code
Bloom filters Probability and Computing Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
1 Verification Codes Michael Luby, Digital Fountain, Inc. Michael Mitzenmacher Harvard University and Digital Fountain, Inc.
1 Slides by Asaf Shapira & Michael Lewin & Boaz Klartag & Oded Schwartz. Adapted from things beyond us.
Decision Procedures An Algorithmic Point of View
Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.
Hash Tables Universal Families of Hash Functions Bloom Filters Wednesday, July 23 rd 1.
1 Lecture 11: Bloom Filters, Final Review December 7, 2011 Dan Suciu -- CSEP544 Fall 2011.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.
Spatial Issues in DBGlobe Dieter Pfoser. Location Parameter in Services Entering the harbor (x,y position)… …triggers information request.
Interactive proof systems Section 10.4 Giorgi Japaridze Theory of Computability.
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
Bloom Filters. Lecture on Bloom Filters Not described in the textbook ! Lecture based in part on: Broder, Andrei; Mitzenmacher, Michael (2005), "Network.
Cuckoo Filter: Practically Better Than Bloom Author: Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher Publisher: ACM CoNEXT 2014 Presenter:
Conditional Independence As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used: P(X|Y,
Fast Pseudo-Random Fingerprints Yoram Bachrach, Microsoft Research Cambridge Ely Porat – Bar Ilan-University.
Theory of Computational Complexity Yusuke FURUKAWA Iwama Ito lab M1.
Author: Heeyeol Yu; Mahapatra, R.; Publisher: IEEE INFOCOM 2008
Chapter 6: Discrete Probability
Probabilistic Algorithms
Updating SF-Tree Speaker: Ho Wai Shing.
New Characterizations in Turnstile Streams with Applications
Hashing, Hash Function, Collision & Deletion
Simplifying Fractions
Lower bounds for approximate membership dynamic data structures
The Variable-Increment Counting Bloom Filter
CS 332: Algorithms Hash Tables David Luebke /19/2018.
i) Two way ANOVA without replication
Streaming & sampling.
Introduction to Hashing - Hash Functions
563.10: Bloom Cookies Web Search Personalization without User Tracking
Mining Data Streams (Part 2)
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.
Algorithms Analysis Section 3.3 of Rosen Spring 2017
Introduction to Algorithms 6.046J/18.401J
Y. Kotidis, S. Muthukrishnan,
Introduction to PCP and Hardness of Approximation
Network Applications of Bloom Filters: A Survey
Chapter 11 Limitations of Algorithm Power
Chapter 5: Probabilistic Analysis and Randomized Algorithms
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Cryptographic Hash Functions Part I
How to use hash tables to solve olympiad problems
CS21 Decidability and Tractability
By: Ran Ben Basat, Technion, Israel
Instructor: Aaron Roth
Bloom filters From Probability and Computing
Hash Functions for Network Applications (II)
Podcast Ch21a Title: Hash Functions
Chapter 5: Probabilistic Analysis and Randomized Algorithms
Cryptography Lecture 20.
Randomized Algorithms
Algorithms Analysis Section 3.3 of Rosen Spring 2018
Randomized Algorithms
Lecture-Hashing.
Presentation transcript:

Bloom filters Probability and Computing Michael Mitzenmacher Eli Upfal Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal

Introduction Approximate set membership problem . Trade-off between the space and the false positive probability . Generalize the hashing ideas.

Approximate set membership problem Suppose we have a set S = {s1,s2,...,sm}  universe U Represent S in such a way we can quickly answer “Is x an element of S ?” To take as little space as possible ,we allow false positive (i.e. xS , but we answer yes ) If xS , we must answer yes .

Bloom filters Consist of an arrays A[n] of n bits (space) , and k independent random hash functions h1,…,hk : U --> {0,1,..,n-1} 1. Initially set the array to 0 2.  sS, A[hi(s)] = 1 for 1 i  k (an entry can be set to 1 multiple times, only the first times has an effect ) 3. To check if xS , we check whether all location A[hi(x)] for 1 i  k are set to 1 If not, clearly xS. If all A[hi(x)] are set to 1 ,we assume xS

Initial with all 0 Each element of S is hashed k times 1 x1 x2 Each element of S is hashed k times Each hash location set to 1 1 y To check if y is in S, check the k hash location. If a 0 appears , y is not in S 1 y If only 1s appear, conclude that y is in S This may yield false positive Initial with all 0

The probability of a false positive We assume the hash function are random. After all the elements of S are hashed into the bloom filters ,the probability that a specific bit is still 0 is

To simplify the analysis ,we can assume a fraction p of the entries are still 0 after all the elements of S are hashed into bloom filters. In fact,let X be the random variable of number of those 0 positions. By Chernoff bound It implies X/n will be very close to p with a very high probability

The probability of a false positive f is To find the optimal k to minimize f . Minimize f iff minimize g=ln(f) k=ln(2)*(n/m) f = (1/2)k = (0.6185..)n/m The false positive probability falls exponentially in n/m ,the number bits used per item !!

Conclusion A Bloom filters is like a hash table ,and simply uses one bit to keep track whether an item hashed to the location. If k=1 , it’s equivalent to a hashing based fingerprint system. If n=cm for small constant c,such as c=8 ,then k=5 or 6 ,the false positive probability is just over 2% . It’s interesting that when k is optimal k=ln(2)*(n/m) , then p= 1/2. An optimized Bloom filters looks like a random bit-string