Hash Tables, Universal Families of Hash Functions, and Bloom Filters
Wednesday, July 23rd
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Hash Tables
A randomized data structure implementing the "Dictionary" abstract data type (ADT):
- Insert
- Delete
- Lookup
We'll assume no duplicates.
Applications:
- Symbol tables/compilers: which variables have already been declared?
- ISPs: is an IP address spam/blacklisted?
- many others
Setup
- Universe U of all possible elements, e.g., all 2^32 possible IP addresses, or all possible variable names that can be declared.
- Maintain a possibly evolving subset S ⊆ U, with |S| = m and |U| >> m.
Naïve Dictionary Implementations (1)
Option 1: Bit Vectors
- An array A keeping one bit (0/1) for each element of U.
- Insert element i => A[i] = 1
- Delete element i => A[i] = 0
- Lookup element i => return A[i] == 1
- Time complexity of every operation: O(1)
- Space: O(|U|) (e.g., 2^32 bits for IP addresses)
Quick but not scalable!
Naïve Dictionary Implementations (2)
Option 2: Linked List
- One entry for each element in S.
- Insert element i => check if i exists; if not, append it to the list
- Delete element i => find i in the list and remove it
- Lookup element i => scan through the list
- Time complexity of every operation: O(|S|)
- Space: O(|S|)
Scalable but not quick!
Hash Tables: Best of Both Worlds
A randomized dictionary that is:
- Quick: O(1) expected time for each operation
- Scalable: O(|S|) space
Hash Tables: High-level Idea
- Buckets: distinct locations in the hash table. Let n be the number of buckets.
- Choose n ≈ m (recall m = |S|), i.e., load factor m/n = O(1).
- Hash function h: U -> {0, 1, …, n-1}.
- We store each element x in bucket h(x).
(Notation throughout: U = universe, m = |S|, n = # buckets.)
Hash Tables: High-level Idea
[Diagram: h maps elements of U into buckets 0, 1, …, n-2, n-1.]
Collisions
Multiple elements are hashed to the same bucket. Assume we are about to insert a new element x and bucket h(x) is already occupied.
Resolving collisions:
- Chaining: keep a linked list per bucket; append x to the list.
- Open addressing: if bucket h(x) is occupied, deterministically probe for another empty bucket. Saves space.
Chaining (example)
[Diagram sequence: elements inserted one by one into buckets 0, 1, …, n-2, n-1. h(e3) = 1; h(e7) = n-2; h(e5) = n-1; h(e1) = 1, so e1 is chained after e3; h(e4) = 1, so e4 is chained after e1.]
Operations (With Chaining)
- Insert(x): go to bucket h(x); if x is not in the list, append it.
- Delete(x): go to bucket h(x); if x is in the list, delete it.
- Lookup(x): go to bucket h(x); return true if x is in the list.
Running Time of Operations
Assume evaluating the hash function takes constant time (may not be true for all hash functions).
Consider an element x:
- Lookup(x): O(|linked list at bucket h(x)|)
- Insert(x): O(|linked list at bucket h(x)|)
- Delete(x): O(|linked list at bucket h(x)|)
Worst & Best Scenarios
Recall m = # elements in the hash table.
- Worst case: O(m) per operation (all elements in one bucket).
- Best case: O(1) per operation.
The lengths of the linked lists depend on the quality of the hash function!
Fundamental question: how can we choose "good" hash functions?
Bad Hash Functions
Recall our IP address example: 32-bit addresses, # buckets n = 2^8.
Idea: use the most significant 8 bits.
Problem: big correlations with the geography of how IP addresses are assigned; e.g., 171 or 172 as the first 8 bits is common, so lots of addresses would be mapped to the same bucket.
In practice, be very careful when picking hash functions!
Is There A Single Good Hash Function?
Idea: design a clever hash function h that spreads every data set evenly across the buckets.
Problem: such a function cannot exist! Recall |U| >> m ≈ n. By the pigeonhole principle, for any fixed h there is a bucket i such that at least |U|/n elements of U map to it. If S is drawn entirely from those elements, every operation takes O(m) time!
No Single Good Hash Function!
Claim: for every hash function h, there is a pathological data set.
Proof: by the pigeonhole principle.
Solution: Pick a Hash Function Randomly
Design a set, or "family," H of hash functions such that for every data set S, if we pick h ∈ H at random, then almost always h spreads S out evenly across the buckets.
Question: why can't you just put the randomness inside a single hash function?
Clarification on Proposed Analysis
Pick h randomly from H, then run the hash table on input S and measure its performance. We'll analyze the expected performance on an arbitrary but fixed input S.
Clarification on Proposed Analysis (continued)
The expectation is over repeated random choices of the hash function: pick h_1 randomly from H and measure performance 1; pick h_2 and measure performance 2; …; pick h_t and measure performance t — all on the same fixed input S.
Roadmap
1. Define what it means for H to be "universal."
2. Show that if H is universal and we pick h ∈ H randomly, our hash table has O(1) expected cost per operation.
3. Show that simple and practical universal families H exist.
1. Universal Family of Hash Functions
Let H be a set of functions from U to {0, 1, …, n-1}.
Definition: H is universal if for all x, y ∈ U with x ≠ y, when h is chosen uniformly at random from H:
Pr(h(x) = h(y)) ≤ 1/n
I.e., the fraction of hash functions in H that make x and y collide is at most 1/n.
Why 1/n? It is as if we were mapping x and y to buckets independently and uniformly at random.
2. Universality => Operations Are O(1)
Let H be a universal family of hash functions from U to {0, 1, …, n-1}, and recall m = O(n).
Claim: if h is picked randomly from H, then for any data set S, hash table operations take O(1) expected time.
2. Universality => Operations Are O(1)
Proof: Pick h randomly from H and build the table on S. A new element x arrives, and say we perform Lookup(x). The cost is O(# elements in bucket h(x)). This quantity is a random variable; call it Z.
2. Universality => Operations Are O(1)
Proof continued: Z = # elements in bucket h(x). For each element y ∈ S, let X_y be 1 if h(y) = h(x) and 0 otherwise. Then
Z ≤ 1 + Σ_{y ∈ S, y ≠ x} X_y
(the 1 covers the case that x itself is already in the bucket). By linearity of expectation and universality,
E[Z] ≤ 1 + Σ_{y ≠ x} Pr(h(y) = h(x)) ≤ 1 + m/n = O(1). Q.E.D.
3. Universal Families of Hash Functions Exist (1)
Let n = 2^b, |U| = 2^t, and t > b. Represent each x ∈ U as a t-bit binary vector.
Example: |U| = 2^7 = 128, hash table of size 2^4 = 16.
h(x) = Mx, where M is a random 0/1 b x t matrix and the multiplication is mod 2.
In the slide's example, a 4 x 7 matrix M multiplied by the 7-bit vector for element 52 yields bucket 12.
3. Universal Families of Hash Functions Exist (2)
h(x) = Mx maps {0,1}^t -> {0,1}^b, i.e., U -> {0, 1, …, n-1}.
H = the set of all possible 0/1 b x t matrices M.
Proof that H is Universal (1)
We need to prove that for all x ≠ y, Pr(h(x) = h(y)) ≤ 1/n = 1/2^b when M is picked uniformly at random from H — equivalently, when each cell of M is picked uniformly at random.
Proof that H is Universal (2)
x and y differ in at least one bit; say, w.l.o.g., the last bit. Let z = x - y (mod 2), so the last bit of z is 1. Then h(x) = h(y) iff Mx = My iff Mz = 0.
Question: what is Pr(Mz = 0)?
Proof that H is Universal (3)
Pr(Mz = 0) = Pr(Mz[1] = 0 & Mz[2] = 0 & … & Mz[b] = 0).
The event Mz[i] = 0 is independent of Mz[j] = 0, since the coin flips for row i of M are independent of the coin flips for row j. Therefore
Pr(Mz = 0) = Pr(Mz[1] = 0) · Pr(Mz[2] = 0) · … · Pr(Mz[b] = 0).
Question: what is Pr(Mz[i] = 0)?
Proof that H is Universal (4)
Pr(Mz[i] = 0)?
Mz[i] = m_i1·z_1 + m_i2·z_2 + … + m_it·1 (mod 2), since z's last bit is 1.
Let y be the (mod 2) sum of the first t-1 terms. Then Mz[i] = 0 iff m_it = y (so that the two cancel mod 2).
Proof that H is Universal (5)
Since m_it is a uniform coin flip independent of the first t-1 terms, Pr(Mz[i] = 0) = 1/2, irrespective of the first t-1 coin flips — it all depends on the last coin flip.
Therefore Pr(Mx = My) = Pr(Mz = 0) = (1/2)^b = 1/n. Q.E.D.
Storing and Evaluating Hash Function h (M)
Q: How much space do we need to store the random matrix M?
A: bt bits = O(log|U| · log n).
Q: How much time does it take to evaluate Mz?
A: Naïvely, O(bt) = O(log|U| · log n) bit operations.
Summary: H is a relatively fast and practical universal family of hash functions.
Another Possible Family
We're hashing from U to {0, 1, …, n-1}. Let H be the set of all such functions.
Question: is H universal?
Another Possible Family
We're hashing from U to {0, 1, …, n-1}.
Q1: How many such functions are there? A1: n^|U|
Q2: How many functions have h(x) = h(y) = j for a fixed bucket j? A2: n^(|U|-2)
Q3: How many functions have h(x) = h(y)? A3: n · n^(|U|-2) = n^(|U|-1)
Q4: Pr(h(x) = h(y))? A4: n^(|U|-1) / n^|U| = 1/n => H is universal!
Why is H Impractical?
There are n^|U| functions in H. What is the cost of storing a function h from H?
log|H| = O(|U| · log n) bits. Not practical!
Summary
1. Hash tables
2. Defined universal families of hash functions
3. Universal family => hash table operations take O(1) expected time
4. Universal families exist
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Bloom Filters
A randomized data structure implementing a limited version of the Dictionary ADT:
- Insert
- Lookup (no Delete)
Compared to hash tables: more space-efficient, but lookups can return false positives.
Applications: website caches for ISPs, among others.
Same Setup As Hash Tables
- Universe U of all possible elements, e.g., all 2^32 possible IP addresses.
- Maintain a subset S ⊆ U with |S| = m and |U| >> m.
Bloom Filters
A Bloom filter consists of:
- A bit array of size n, initially all 0 (bits, not buckets)
- k hash functions h_1, …, h_k
Space cost per element: n/m bits.
Insertions
Insert(a): set bits h_1(a), …, h_k(a) to 1 => O(k) time.
Example with k = 3:
- h_1(x) = 2, h_2(x) = 9, h_3(x) = 0
- h_1(y) = 1, h_2(y) = 5, h_3(y) = 9
- h_1(z) = 10, h_2(z) = 11, h_3(z) = 5
Do you see why there would be false positives?
Lookup
Lookup(a): return true if all of bits h_1(a), …, h_k(a) are 1 => O(k) time.
Example (bit array of size 16, indexed 0-15, after the insertions above):
- x: h_1(x) = 2, h_2(x) = 9, h_3(x) = 0 => Lookup(x) = true
- z: h_1(z) = 3, h_2(z) = 9, h_3(z) = 4 => Lookup(z) = false
- t: h_1(t) = 0, h_2(t) = 1, h_3(t) = 2 => Lookup(t) = true (a false positive: t was never inserted!)
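Insert and Lookup can be sketched as a small Bloom filter class in Python. The k hash functions here are simulated by salting SHA-256 — an illustrative assumption, not a construction from the slides:

```python
import hashlib

class BloomFilter:
    """Bit array of size n with k hash functions; Insert and Lookup only."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _positions(self, a):
        # Derive k bit positions by hashing a with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{a}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, a):
        # O(k): set all h_i(a) to 1.
        for j in self._positions(a):
            self.bits[j] = 1

    def lookup(self, a):
        # O(k): true iff all h_i(a) are 1; may be a false positive.
        return all(self.bits[j] for j in self._positions(a))
```

Note that an inserted element is always found (no false negatives), while a never-inserted element may still see all k of its bits set by other insertions — the false positives analyzed next.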
Can Bloom Filters Be Useful?
Can Bloom filters be both space-efficient and have a low false-positive rate?
What is the probability of a false positive, as a function of n, m, and k?
Probability of False Positive
We have inserted m elements into the Bloom filter. A new element z arrives that has not been inserted before.
Q: What is Pr(false positive for z)? Assume h_1(z) = j_1, …, h_k(z) = j_k.
Simplifying (unjustified) assumption: all hashing is totally random. For every h_i and every x, h_i(x) is uniform over {0, 1, …, n-1} and independent of all other hash values h_j(y).
Warning: this assumption is made to simplify the analysis; it won't hold in practice.
Pr(bit j is 1 after m insertions)?
Consider a particular bit j in the array.
Q1: Fix h_i and an element x. Pr(h_i(x) sets bit j to 1)? A1: 1/n
Q2: Pr(x sets bit j to 1), i.e., at least one of h_1(x), …, h_k(x) equals j? A2: 1 - Pr(x does not set j to 1) = 1 - (1 - 1/n)^k
Q3: Pr(bit j = 1 after m insertions)? A3: 1 - Pr(no element sets j to 1) = 1 - (1 - 1/n)^km
Pr(false positive for x)?
Recall for x we check k bits: h_1(x) = j_1, …, h_k(x) = j_k.
Pr(bit j_i = 1) = 1 - (1 - 1/n)^km
Pr(false positive) = Pr(all j_i = 1) = (1 - (1 - 1/n)^km)^k
Recall the calculus fact: (1 + x) ≤ e^x; moreover, around x = 0, (1 + x) ≈ e^x. Applying it with x = -1/n:
Pr(false positive) ≈ (1 - e^(-km/n))^k
How Does Failure Rate Change With k, n?
Observation 1: as n increases, the failure rate decreases.
Observation 2: as k (the # of hash functions) increases:
- more bits to check => less likely to fail
- more bits set per object => more likely to fail
- so it is unclear whether the rate increases or decreases.
Question: what is the optimal k for fixed n/m?
Answer (by taking derivatives): k = ln(2)·n/m ≈ 0.69·n/m
At this k, e^(-km/n) = 1/2, so the failure rate is (1/2)^k = (1/2)^(ln(2)·n/m).
How Does Failure Rate Change With k, n?
For fixed n/m, with the optimal k = ln(2)·n/m:
Failure rate = (1/2)^(ln(2)·n/m) ≈ (0.6185)^(n/m)
Already at n = 8m the rate is 1-2%, and it decreases exponentially with n/m.
Next Week
Dynamic Programming