Bloom filters Probability and Computing Michael Mitzenmacher Eli Upfal

Slides:

Advertisements

Similar presentations

Many-to-one Trapdoor Functions and their Relations to Public-key Cryptosystems M. Bellare S. Halevi A. Saha S. Vadhan.

Advertisements

Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan.

Cuckoo Filter: Practically Better Than Bloom

Indian Statistical Institute Kolkata

An Improved Construction for Counting Bloom Filters Flavio Bonomi Michael Mitzenmacher Rina Panigrahy Sushil Singh George Varghese Presented by: Sailesh.

SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.

Hit or Miss ? !!!.  Small size.  Simple and fast.  Implementable with hardware.  Does not need too much power.  Does not predict miss if we have.

Bloom Filters Kira Radinsky Slides based on material from:

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006

Fast Statistical Spam Filter by Approximate Classifications Authors: Kang Li Zhenyu Zhong University of Georgia Reader: Deke Guo.

Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan.

Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines By F. Bonomi et al. Presented by Kenny Cheng, Tonny Mak Yui Kuen.

Look-up problem IP address did we see the IP address before?

The Goldreich-Levin Theorem: List-decoding the Hadamard code

Bloom filters Probability and Computing Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

1 Verification Codes Michael Luby, Digital Fountain, Inc. Michael Mitzenmacher Harvard University and Digital Fountain, Inc.

1 Slides by Asaf Shapira & Michael Lewin & Boaz Klartag & Oded Schwartz. Adapted from things beyond us.

Decision Procedures An Algorithmic Point of View

Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.

Hash Tables Universal Families of Hash Functions Bloom Filters Wednesday, July 23 rd 1.

1 Lecture 11: Bloom Filters, Final Review December 7, 2011 Dan Suciu -- CSEP544 Fall 2011.

1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.

A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.

Spatial Issues in DBGlobe Dieter Pfoser. Location Parameter in Services Entering the harbor (x,y position)… …triggers information request.

Interactive proof systems Section 10.4 Giorgi Japaridze Theory of Computability.

DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.

Bloom Filters. Lecture on Bloom Filters Not described in the textbook ! Lecture based in part on: Broder, Andrei; Mitzenmacher, Michael (2005), "Network.

Cuckoo Filter: Practically Better Than Bloom Author: Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher Publisher: ACM CoNEXT 2014 Presenter:

Conditional Independence As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used: P(X|Y,

Fast Pseudo-Random Fingerprints Yoram Bachrach, Microsoft Research Cambridge Ely Porat – Bar Ilan-University.

Theory of Computational Complexity Yusuke FURUKAWA Iwama Ito lab M1.

Author: Heeyeol Yu; Mahapatra, R.; Publisher: IEEE INFOCOM 2008

Chapter 6: Discrete Probability

Probabilistic Algorithms

Updating SF-Tree Speaker: Ho Wai Shing.

New Characterizations in Turnstile Streams with Applications

Hashing, Hash Function, Collision & Deletion

Simplifying Fractions

Lower bounds for approximate membership dynamic data structures

The Variable-Increment Counting Bloom Filter

CS 332: Algorithms Hash Tables David Luebke /19/2018.

i) Two way ANOVA without replication

Streaming & sampling.

Introduction to Hashing - Hash Functions

563.10: Bloom Cookies Web Search Personalization without User Tracking

Mining Data Streams (Part 2)

COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.

Algorithms Analysis Section 3.3 of Rosen Spring 2017

Introduction to Algorithms 6.046J/18.401J

Y. Kotidis, S. Muthukrishnan,

Introduction to PCP and Hardness of Approximation

Network Applications of Bloom Filters: A Survey

Chapter 11 Limitations of Algorithm Power

Chapter 5: Probabilistic Analysis and Randomized Algorithms

Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures

Cryptographic Hash Functions Part I

How to use hash tables to solve olympiad problems

CS21 Decidability and Tractability

By: Ran Ben Basat, Technion, Israel

Instructor: Aaron Roth

Bloom filters From Probability and Computing

Hash Functions for Network Applications (II)

Podcast Ch21a Title: Hash Functions

Chapter 5: Probabilistic Analysis and Randomized Algorithms

Cryptography Lecture 20.

Randomized Algorithms

Algorithms Analysis Section 3.3 of Rosen Spring 2018

Randomized Algorithms

Lecture-Hashing.

Presentation transcript:

Bloom filters Probability and Computing Michael Mitzenmacher Eli Upfal Randomized algorithms and probabilistic analysis P109~P111 Michael Mitzenmacher Eli Upfal

Introduction Approximate set membership problem . Trade-off between the space and the false positive probability . Generalize the hashing ideas.

Approximate set membership problem Suppose we have a set S = {s1,s2,...,sm}  universe U Represent S in such a way we can quickly answer “Is x an element of S ?” To take as little space as possible ,we allow false positive (i.e. xS , but we answer yes ) If xS , we must answer yes .

Bloom filters Consist of an arrays A[n] of n bits (space) , and k independent random hash functions h1,…,hk : U --> {0,1,..,n-1} 1. Initially set the array to 0 2.  sS, A[hi(s)] = 1 for 1 i  k (an entry can be set to 1 multiple times, only the first times has an effect ) 3. To check if xS , we check whether all location A[hi(x)] for 1 i  k are set to 1 If not, clearly xS. If all A[hi(x)] are set to 1 ,we assume xS

Initial with all 0 Each element of S is hashed k times 1 x1 x2 Each element of S is hashed k times Each hash location set to 1 1 y To check if y is in S, check the k hash location. If a 0 appears , y is not in S 1 y If only 1s appear, conclude that y is in S This may yield false positive Initial with all 0

The probability of a false positive We assume the hash function are random. After all the elements of S are hashed into the bloom filters ,the probability that a specific bit is still 0 is

To simplify the analysis ,we can assume a fraction p of the entries are still 0 after all the elements of S are hashed into bloom filters. In fact,let X be the random variable of number of those 0 positions. By Chernoff bound It implies X/n will be very close to p with a very high probability

The probability of a false positive f is To find the optimal k to minimize f . Minimize f iff minimize g=ln(f) k=ln(2)*(n/m) f = (1/2)k = (0.6185..)n/m The false positive probability falls exponentially in n/m ,the number bits used per item !!

Conclusion A Bloom filters is like a hash table ,and simply uses one bit to keep track whether an item hashed to the location. If k=1 , it’s equivalent to a hashing based fingerprint system. If n=cm for small constant c,such as c=8 ,then k=5 or 6 ,the false positive probability is just over 2% . It’s interesting that when k is optimal k=ln(2)*(n/m) , then p= 1/2. An optimized Bloom filters looks like a random bit-string