Hash Tables, Universal Families of Hash Functions, and Bloom Filters
Wednesday, July 23rd
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Hash Tables
A randomized data structure implementing the "Dictionary" abstract data type (ADT): Insert, Delete, Lookup. We'll assume no duplicates.
Applications: symbol tables in compilers (which variables are already declared?), ISPs (is an IP address spam/blacklisted?), and many others.
Setup
Universe U of all possible elements, e.g., all possible 2^32 IP addresses, or all possible variable names that can be declared.
Maintain a possibly evolving subset S ⊆ U, with |S| = m and |U| >> m.
Naïve Dictionary Implementations (1)
Option 1: Bit Vectors
An array A keeping one bit (0/1) for each element of U.
Insert element i => A[i] = 1
Delete element i => A[i] = 0
Lookup element i => return A[i] == 1
Time complexity of every operation is O(1). Space: O(|U|) (e.g., 2^32 bits for IP addresses).
Quick but Not Scalable!
Naïve Dictionary Implementations (2)
Option 2: Linked List
One entry for each element in S.
Insert element i => check if i exists; if not, append it to the list
Delete element i => find i in the list and remove it
Lookup element i => scan the list for i
Time complexity of every operation is O(|S|). Space: O(|S|).
Scalable but Not Quick!
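The two naive options above can be sketched as follows (a minimal Python sketch; the class names are mine):

```python
class BitVectorDict:
    """Option 1: one bit per element of U. O(1) ops, O(|U|) space."""
    def __init__(self, universe_size):
        self.A = [0] * universe_size   # the whole universe, mostly zeros

    def insert(self, i): self.A[i] = 1
    def delete(self, i): self.A[i] = 0
    def lookup(self, i): return self.A[i] == 1


class LinkedListDict:
    """Option 2: store only S. O(|S|) ops, O(|S|) space."""
    def __init__(self):
        self.items = []

    def insert(self, i):
        if i not in self.items:        # no duplicates, per the setup
            self.items.append(i)

    def delete(self, i):
        if i in self.items:
            self.items.remove(i)

    def lookup(self, i):
        return i in self.items
```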
Hash Tables: Best of Both Worlds
A randomized dictionary that is:
Quick: O(1) expected time for each operation
Scalable: O(|S|) space
Hash Tables: High-level Idea
Buckets: distinct locations in the hash table. Let n be the # of buckets, with n ≈ m (recall m = |S|), i.e., load factor m/n = O(1).
Hash function h: U -> {0, 1, …, n-1}. We store each element x in bucket h(x).
(Notation throughout: U = universe, m = size of S, n = # buckets.)
[Diagram: h maps elements of U into buckets 0, 1, …, n-1.]
Collisions
A collision occurs when multiple elements are hashed to the same bucket. Suppose we are about to insert a new element x and bucket h(x) is already occupied.
Resolving collisions:
Chaining: keep a linked list per bucket; append x to the list at h(x).
Open addressing: if bucket h(x) is occupied, deterministically probe for another empty bucket for x. Saves space.
Chaining: Example
Each bucket holds a linked list (initially Null). Insert e3, e7, e5, e1, e4 in order, with h(e3) = 1, h(e7) = n-2, h(e5) = n-1, h(e1) = 1, h(e4) = 1.
Result: bucket 1 holds the chain e3 -> e1 -> e4; bucket n-2 holds e7; bucket n-1 holds e5.
Operations (With Chaining)
Insert(x): go to bucket h(x); if x is not in the list, append it.
Delete(x): go to bucket h(x); if x is in the list, delete it.
Lookup(x): go to bucket h(x); return true if x is in the list.
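A minimal sketch of these three operations in Python, using Python's built-in `hash` mod n as a stand-in hash function (the lecture's point is precisely that the choice of h matters):

```python
class ChainedHashTable:
    """Dictionary via chaining: one list per bucket.
    `hash(x) % n` is only a placeholder for a real hash function h."""
    def __init__(self, n=16):
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _h(self, x):
        return hash(x) % self.n

    def insert(self, x):
        b = self.buckets[self._h(x)]
        if x not in b:            # no duplicates, per the setup
            b.append(x)

    def delete(self, x):
        b = self.buckets[self._h(x)]
        if x in b:
            b.remove(x)

    def lookup(self, x):
        return x in self.buckets[self._h(x)]
```

Each operation scans only the one list at bucket h(x), which is where the running-time analysis on the next slides comes from.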
Running Time of Operations
Assume evaluating the hash function takes constant time (this may not be true for all hash functions). Consider an element x:
Lookup: O(|linked list at h(x)|)
Insert: O(|linked list at h(x)|)
Delete: O(|linked list at h(x)|)
Worst & Best Scenarios
With m elements in the hash table:
Worst case: O(m) per operation (all elements in one bucket).
Best case: O(1) per operation (elements spread evenly).
The lengths of the linked lists depend on the quality of the hash function!
Fundamental question: How can we choose "good" hash functions?
Bad Hash Functions
Recall our IP address example: 32 bits, # buckets n = 2^8.
Idea: use the most significant 8 bits.
Problem: strong correlation with the geography of how IP addresses are assigned; e.g., 171 or 172 as the first 8 bits is common, so lots of addresses would get mapped to the same bucket.
In practice, be very careful when picking hash functions!
Is There A Single Good Hash Function?
Idea: design a single clever hash function h that spreads every data set evenly across the buckets.
Problem: such an h cannot exist! Recall |U| >> m ≈ n. By pigeonhole, there exists a bucket i such that at least |U|/n elements of U map to i. If S is drawn entirely from those elements, every operation takes O(m)!
No Single Good Hash Function!
Claim: for every single hash function h, there is a pathological data set.
Proof: by the pigeonhole principle, as above: some bucket receives at least |U|/n ≥ m elements of U; take S to be m of them.
Solution: Pick a Hash Function Randomly
Design a set, or "family," H of hash functions such that for every data set S, if we pick an h ∈ H at random, then almost always h spreads S out evenly across the buckets.
Question: why couldn't we just put randomness inside a single hash function? (Because h must be deterministic: Lookup(x) has to recompute the same bucket that Insert(x) used.)
Clarification on Proposed Analysis
Pick h randomly from H, build the hash table on input S, and measure performance. We'll analyze the expected performance on an arbitrary but fixed input S.
[Diagram: the experiment repeated t times. Pick h_1, …, h_t randomly from H, run each on the same fixed input S, and observe performance 1, …, performance t.]
Roadmap
1. Define what it means for H to be "universal."
2. Show that if H is universal and we pick h ∈ H randomly, our hash table has O(1) expected cost per operation.
3. Show that simple and practical universal families H exist.
1. Universal Family of Hash Functions
Let H be a set of functions from U to {0, 1, …, n-1}.
Definition: H is universal if for all x, y ∈ U with x ≠ y, when h is chosen uniformly at random from H:
Pr(h(x) = h(y)) ≤ 1/n
I.e., the fraction of hash functions in H that make x and y collide is at most 1/n.
Why 1/n? It is as if we were mapping x and y to buckets independently and uniformly at random.
2. Universality => Operations Are O(1)
Let H be a universal family of hash functions from U to {0, 1, …, n-1}, and recall m = O(n).
Claim: if h is picked randomly from H, then for any data set S, hash table operations take expected O(1) time.
2. Universality => Operations Are O(1)
Proof: pick h randomly from H and build the hash table on S. A new element x arrives; say we want to perform Lookup(x). The cost is O(# elements in bucket h(x)). This quantity is a random variable; call it Z.
2. Universality => Operations Are O(1)
Proof continued: Z = # elements in bucket h(x). For each element y ∈ S, let X_y be 1 if h(y) = h(x) and 0 otherwise. Then
Z ≤ 1 + Σ_{y ∈ S} X_y
(the 1 covers the case that x itself is already in the table). By linearity of expectation and universality:
E[Z] ≤ 1 + Σ_{y ∈ S} Pr(h(y) = h(x)) ≤ 1 + m/n = O(1). Q.E.D.
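A small simulation of this bound, modeling h as a totally random function (an idealization of picking from a universal family; the function name and parameters are mine):

```python
import random

def avg_bucket_load(m, n, trials=200):
    """Estimate E[Z]: 1 (for x itself) plus the number of the m stored
    elements landing in x's bucket, when every element is hashed
    uniformly at random into n buckets."""
    total = 0
    for _ in range(trials):
        buckets = [0] * n
        for _ in range(m):                 # hash the m stored elements
            buckets[random.randrange(n)] += 1
        total += 1 + buckets[random.randrange(n)]  # x's own bucket
    return total / trials

# The bound says E[Z] <= 1 + m/n; with m = n the estimate should be near 2.
print(avg_bucket_load(m=1000, n=1000))
```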
3. Universal Families of HF Exist (1)
Let n = 2^b and |U| = 2^t with t > b. Represent each x ∈ U as a t-bit binary vector.
Construction: h(x) = Mx, where M is a random 0/1 b × t matrix and the multiplication is mod 2; the b-bit result is the bucket.
Example: |U| = 2^7 = 128 and a hash table of size 2^4 = 16; the 7-bit element 52 is multiplied by a 4 × 7 matrix M, landing, say, in bucket 12.
3. Universal Families of HF Exist (2)
h(x) = Mx maps {0,1}^t -> {0,1}^b, i.e., U -> {0, 1, …, n-1}.
H = the set of all possible 0/1 b × t matrices M.
Proof that H is Universal (1)
Need to prove: for all x ≠ y, Pr(h(x) = h(y)) ≤ 1/n = 1/2^b when M is picked uniformly at random from H; equivalently, when each cell of M is an independent fair coin flip.
Proof that H is Universal (2)
x and y differ in at least one bit; say, w.l.o.g., the last bit. Let z = x - y (mod 2), so z_t = 1. Since Mx = My iff Mz = 0, we ask: what is Pr(Mz = 0)?
Proof that H is Universal (3)
Pr(Mz = 0) = Pr(Mz[1] = 0 & Mz[2] = 0 & … & Mz[b] = 0).
The events Mz[i] = 0 and Mz[j] = 0 are independent for i ≠ j, since the coin flips in row i of M are independent from those in row j. Hence
Pr(Mz = 0) = Pr(Mz[1] = 0) · Pr(Mz[2] = 0) · … · Pr(Mz[b] = 0).
Q: What is Pr(Mz[i] = 0)?
Proof that H is Universal (4)
Mz[i] = m_i1·z_1 + m_i2·z_2 + … + m_it·z_t (mod 2), and since z_t = 1 the last term is just m_it.
Let s be the (mod 2) sum of the first t-1 terms. Then Mz[i] = s + m_it (mod 2), so Mz[i] = 0 iff m_it = s.
Proof that H is Universal (5)
Irrespective of the first t-1 coin flips of row i, the event Mz[i] = 0 comes down to the last coin flip m_it, so Pr(Mz[i] = 0) = 1/2. Therefore
Pr(Mx = My) = Pr(Mz = 0) = (1/2)^b = 1/n. Q.E.D.
Storing and Evaluating Hash Function h (M)
Q: How much space do we need to store the random matrix M?
A: bt bits = O(log|U| · log n).
Q: How much time to evaluate Mz?
A: Naïvely, bt bit operations = O(log|U| · log n).
Summary: H is a relatively fast and practical universal family of hash functions.
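A sketch of this construction in Python (the function name is mine): h treats x as a t-bit vector and multiplies by a random 0/1 matrix mod 2:

```python
import random

def random_matrix_hash(b, t):
    """Sample h from the universal family: a random 0/1 b-by-t matrix M,
    with h(x) = M x (mod 2). Buckets are {0, ..., 2^b - 1}."""
    M = [[random.randrange(2) for _ in range(t)] for _ in range(b)]

    def h(x):
        bits = [(x >> (t - 1 - j)) & 1 for j in range(t)]   # x as t bits
        out = 0
        for i in range(b):                                   # row i of M z
            r = sum(M[i][j] * bits[j] for j in range(t)) % 2
            out = (out << 1) | r
        return out

    return h

# |U| = 2^7 = 128, table size 2^4 = 16, as in the slides' example
h = random_matrix_hash(b=4, t=7)
print(h(52))   # some bucket in 0..15, depending on the sampled M
```

Storing M is b·t = 28 bits here, and evaluating h is one b × t matrix-vector product mod 2, matching the costs above.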
Another Possible Family
We're hashing from U -> {0, 1, …, n-1}. Let H be the set of ALL such functions.
Question: is H universal?
Another Possible Family
We're hashing from U -> {0, 1, …, n-1}.
Q1: # of such functions? A1: n^|U|
Q2: # of functions in which h(x) = h(y) = j, for a fixed bucket j? A2: n^(|U|-2)
Q3: # of functions in which h(x) = h(y)? A3: n · n^(|U|-2) = n^(|U|-1)
Q4: Pr(h(x) = h(y))? A4: n^(|U|-1) / n^|U| = 1/n => H is universal!
Why is H Impractical?
There are n^|U| functions in H. What is the cost of storing a function h from H?
log(|H|) = O(|U| · log n) bits. Not practical!
Summary
1. Hash tables
2. Defined universal families of hash functions
3. Universal family => hash table ops are expected O(1) time
4. Universal families exist
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Bloom Filters
A randomized data structure implementing a limited version of the Dictionary ADT: Insert and Lookup only (no Delete).
Compared to hash tables: much more space-efficient, but Lookup can return false positives.
Applications: website caches for ISPs, among others.
Same Setup As Hash Tables
Universe U of all possible elements (e.g., all possible 2^32 IP addresses). Maintain a subset S ⊆ U with |S| = m and |U| >> m.
Bloom Filters
A Bloom Filter consists of:
A bit array of size n, initially all 0 (bits, not buckets).
k hash functions h_1, …, h_k.
Space cost per element = n/m bits.
Insertions
Insert(a): set bits h_1(a), …, h_k(a) to 1 => O(k).
Example with k = 3:
Insert x: h_1(x)=2, h_2(x)=9, h_3(x)=0
Insert y: h_1(y)=1, h_2(y)=5, h_3(y)=9
Insert z: h_1(z)=10, h_2(z)=11, h_3(z)=5
Do you see why there would be false positives?
Lookup
Lookup(a): return true if all of bits h_1(a), …, h_k(a) are 1 => O(k).
Continuing the example:
x: h_1(x)=2, h_2(x)=9, h_3(x)=0 => Lookup(x) = true
w (never inserted): h_1(w)=3, h_2(w)=9, h_3(w)=4 => Lookup(w) = false (bits 3 and 4 are 0)
t (never inserted): h_1(t)=0, h_2(t)=1, h_3(t)=2 => Lookup(t) = true, a false positive: its three bits were all set by x and y.
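A sketch of Insert and Lookup in Python. The k hash functions here are derived from salted SHA-256, which is only a stand-in for the idealized hash functions in the upcoming analysis:

```python
import hashlib

class BloomFilter:
    """n-bit array plus k hash functions; supports Insert and Lookup only."""
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _hashes(self, a):
        # h_i(a): hash a with salt i, reduce into {0, ..., n-1}
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{a}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.n

    def insert(self, a):
        """Set all k bits h_1(a), ..., h_k(a) to 1 -- O(k)."""
        for j in self._hashes(a):
            self.bits[j] = 1

    def lookup(self, a):
        """True iff all k bits h_1(a), ..., h_k(a) are 1 -- O(k)."""
        return all(self.bits[j] for j in self._hashes(a))
```

Note there is no delete: clearing a bit could erase evidence of other inserted elements that share it.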
Can Bloom Filters Be Useful?
Can Bloom Filters be both space-efficient and have a low false positive rate? What is the probability of a false positive as a function of n, m, and k?
Probability of False Positive
We have inserted m elements into the Bloom filter. A new element z arrives that has not been inserted before.
Q: What is Pr(false positive for z)? Say h_1(z) = j_1, …, h_k(z) = j_k.
Simplifying (unjustified) assumption: all hashing is totally random. For every h_i and every x, h_i(x) is uniform over the n array positions and independent of all other hash values h_j(y).
Warning: this is only to simplify the analysis; it won't hold in practice.
Pr(bit j is 1 after m insertions)?
Consider a particular bit j in the array.
Q1: Fix h_i and an element x. Pr(h_i(x) sets bit j)? A1: 1/n
Q2: Pr(x sets bit j)? (I.e., some one of h_1(x), …, h_k(x) equals j.) A2: 1 - Pr(x does not set bit j) = 1 - (1 - 1/n)^k
Q3: Pr(bit j = 1 after m insertions)? A3: 1 - Pr(no element sets bit j) = 1 - (1 - 1/n)^(km)
Pr(false positive for x)?
Recall for x we check k bits: h_1(x) = j_1, …, h_k(x) = j_k.
Pr(bit j_i = 1) = 1 - (1 - 1/n)^(km)
Pr(false positive) = Pr(all j_i = 1) = (1 - (1 - 1/n)^(km))^k (treating the k bits as independent, another simplification)
Recall the calculus fact (1 + x) ≤ e^x; moreover, around x = 0, (1 + x) ≈ e^x. So
Pr(false positive) ≈ (1 - e^(-km/n))^k.
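The approximation above can be computed directly (a sketch; the function name is mine):

```python
from math import exp

def false_positive_rate(n, m, k):
    """Approximate Pr(false positive) = (1 - e^{-km/n})^k, under the
    totally-random-hashing assumption from the slides."""
    return (1 - exp(-k * m / n)) ** k

# 8 bits per element (n = 8m), k = 5 hash functions
print(false_positive_rate(n=8000, m=1000, k=5))   # roughly 0.02
```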
How Does Failure Rate Change With k, n?
Failure rate ≈ (1 - e^(-km/n))^k.
Observation 1: as n increases, the failure rate decreases.
Observation 2: as k (the # of hash functions) increases, there are more bits to check (less likely to fail) but also more bits set per object (more likely to fail); it is unclear whether the rate increases or decreases.
Question: what is the optimal k for a fixed n/m?
Answer (by taking derivatives): k = ln(2) · n/m ≈ 0.69 · n/m.
How Does Failure Rate Change With k, n?
For fixed n/m, with the optimal k = ln(2) · n/m, the failure rate is
(1/2)^k = (1/2)^(ln(2) · n/m) ≈ (0.6185)^(n/m).
Already at n = 8m the rate is about 2%, and it decreases exponentially with n/m.
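The optimal k and the resulting failure rate, computed as above (a sketch; the function names are mine):

```python
from math import log

def optimal_k(n, m):
    """The k minimizing (1 - e^{-km/n})^k is ln(2) * n/m (~0.69 * n/m)."""
    return log(2) * n / m

def rate_at_optimal_k(n, m):
    """At the optimal k the failure rate is (1/2)^{ln(2) n/m}."""
    return 0.5 ** optimal_k(n, m)

# 8 bits per element: about a 2% false-positive rate
print(rate_at_optimal_k(n=8, m=1))
```

(In practice k must be rounded to an integer, which perturbs the rate slightly.)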
Next Week: Dynamic Programming