Hash Tables, Universal Families of Hash Functions, and Bloom Filters
Wednesday, July 23rd
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Hash Tables
A randomized data structure implementing the "Dictionary" abstract data type (ADT):
- Insert
- Delete
- Lookup
We'll assume no duplicates.
Applications:
- Symbol tables/compilers: which variables have already been declared?
- ISPs: is an IP address spam/blacklisted?
- many others
Setup
- Universe U of all possible elements, e.g., all 2^32 possible IP addresses, or all possible variable names that can be declared.
- Maintain a possibly evolving subset S ⊆ U, with |S| = m and |U| >> m.
Naïve Dictionary Implementations (1)
Option 1: Bit Vectors
- An array A keeping one bit (0/1) for each element of U.
- Insert element i => A[i] = 1
- Delete element i => A[i] = 0
- Lookup element i => return A[i] == 1
- Time complexity of every operation: O(1)
- Space: O(|U|) (e.g., 2^32 bits for IP addresses)
Quick but not scalable!
Naïve Dictionary Implementations (2)
Option 2: Linked List
- One entry for each element in S.
- Insert element i => check if i exists; if not, append it to the list
- Delete element i => find i in the list and remove it
- Lookup element i => scan through the list
- Time complexity of every operation: O(|S|)
- Space: O(|S|)
Scalable but not quick!
Hash Tables: Best of Both Worlds
A randomized dictionary that is:
- Quick: O(1) expected time for each operation
- Scalable: O(|S|) space
Hash Tables: High-level Idea
- Buckets: distinct locations in the hash table. Let n be the number of buckets.
- Choose n ≈ m (recall m = |S|), i.e., load factor m/n = O(1).
- Hash function h: U -> {0, 1, …, n-1}.
- We store each element x in bucket h(x).
(Notation throughout: U = universe, m = |S|, n = # buckets.)
Hash Tables: High-level Idea
[Diagram: h maps elements of U into buckets 0, 1, …, n-2, n-1.]
Collisions
Multiple elements are hashed to the same bucket. Assume we are about to insert a new element x and bucket h(x) is already occupied.
Resolving collisions:
- Chaining: keep a linked list per bucket; append x to the list.
- Open addressing: if bucket h(x) is occupied, deterministically probe for another empty bucket. Saves space.
Chaining (example)
[Diagram sequence: elements inserted one by one into buckets 0, 1, …, n-2, n-1. h(e3) = 1; h(e7) = n-2; h(e5) = n-1; h(e1) = 1, so e1 is chained after e3; h(e4) = 1, so e4 is chained after e1.]
Operations (With Chaining)
- Insert(x): go to bucket h(x); if x is not in the list, append it.
- Delete(x): go to bucket h(x); if x is in the list, delete it.
- Lookup(x): go to bucket h(x); return true if x is in the list.
Running Time of Operations
Assume evaluating the hash function takes constant time (may not be true for all hash functions).
Consider an element x:
- Lookup(x): O(|linked list at bucket h(x)|)
- Insert(x): O(|linked list at bucket h(x)|)
- Delete(x): O(|linked list at bucket h(x)|)
Worst & Best Scenarios
Recall m = # elements in the hash table.
- Worst case: O(m) per operation (all elements in one bucket).
- Best case: O(1) per operation.
The lengths of the linked lists depend on the quality of the hash function!
Fundamental question: how can we choose "good" hash functions?
Bad Hash Functions
Recall our IP address example: 32-bit addresses, # buckets n = 2^8.
Idea: use the most significant 8 bits.
Problem: big correlations with the geography of how IP addresses are assigned; e.g., 171 or 172 as the first 8 bits is common, so lots of addresses would be mapped to the same bucket.
In practice, be very careful when picking hash functions!
Is There A Single Good Hash Function?
Idea: design a clever hash function h that spreads every data set evenly across the buckets.
Problem: such a function cannot exist! Recall |U| >> m ≈ n. By the pigeonhole principle, for any fixed h there is a bucket i such that at least |U|/n elements of U map to it. If S is drawn entirely from those elements, every operation takes O(m) time!
No Single Good Hash Function!
Claim: for every hash function h, there is a pathological data set.
Proof: by the pigeonhole principle.
Solution: Pick a Hash Function Randomly
Design a set, or "family," H of hash functions such that for every data set S, if we pick h ∈ H at random, then almost always h spreads S out evenly across the buckets.
Question: why can't you just put the randomness inside a single hash function?
Clarification on Proposed Analysis
Pick h randomly from H, then run the hash table on input S and measure its performance. We'll analyze the expected performance on an arbitrary but fixed input S.
Clarification on Proposed Analysis (continued)
The expectation is over repeated random choices of the hash function: pick h_1 randomly from H and measure performance 1; pick h_2 and measure performance 2; …; pick h_t and measure performance t — all on the same fixed input S.
Roadmap
1. Define what it means for H to be "universal."
2. Show that if H is universal and we pick h ∈ H randomly, our hash table has O(1) expected cost per operation.
3. Show that simple and practical universal families H exist.
1. Universal Family of Hash Functions
Let H be a set of functions from U to {0, 1, …, n-1}.
Definition: H is universal if for all x, y ∈ U with x ≠ y, when h is chosen uniformly at random from H:
Pr(h(x) = h(y)) ≤ 1/n
I.e., the fraction of hash functions in H that make x and y collide is at most 1/n.
Why 1/n? It is as if we were mapping x and y to buckets independently and uniformly at random.
2. Universality => Operations Are O(1)
Let H be a universal family of hash functions from U to {0, 1, …, n-1}, and recall m = O(n).
Claim: if h is picked randomly from H, then for any data set S, hash table operations take O(1) expected time.
2. Universality => Operations Are O(1)
Proof: Pick h randomly from H and build the table on S. A new element x arrives, and say we perform Lookup(x). The cost is O(# elements in bucket h(x)). This quantity is a random variable; call it Z.
2. Universality => Operations Are O(1)
Proof continued: Z = # elements in bucket h(x). For each element y ∈ S, let X_y be 1 if h(y) = h(x) and 0 otherwise. Then
Z ≤ 1 + Σ_{y ∈ S, y ≠ x} X_y
(the 1 covers the case that x itself is already in the bucket). By linearity of expectation and universality,
E[Z] ≤ 1 + Σ_{y ≠ x} Pr(h(y) = h(x)) ≤ 1 + m/n = O(1). Q.E.D.
3. Universal Families of Hash Functions Exist (1)
Let n = 2^b, |U| = 2^t, and t > b. Represent each x ∈ U as a t-bit binary vector.
Example: |U| = 2^7 = 128, hash table of size 2^4 = 16.
h(x) = Mx, where M is a random 0/1 b x t matrix and the multiplication is mod 2.
In the slide's example, a 4 x 7 matrix M multiplied by the 7-bit vector for element 52 yields bucket 12.
3. Universal Families of Hash Functions Exist (2)
h(x) = Mx maps {0,1}^t -> {0,1}^b, i.e., U -> {0, 1, …, n-1}.
H = the set of all possible 0/1 b x t matrices M.
Proof that H is Universal (1)
We need to prove that for all x ≠ y, Pr(h(x) = h(y)) ≤ 1/n = 1/2^b when M is picked uniformly at random from H — equivalently, when each cell of M is picked uniformly at random.
Proof that H is Universal (2)
x and y differ in at least one bit; say, w.l.o.g., the last bit. Let z = x - y (mod 2), so the last bit of z is 1. Then h(x) = h(y) iff Mx = My iff Mz = 0.
Question: what is Pr(Mz = 0)?
Proof that H is Universal (3)
Pr(Mz = 0) = Pr(Mz[1] = 0 & Mz[2] = 0 & … & Mz[b] = 0).
The event Mz[i] = 0 is independent of Mz[j] = 0, since the coin flips for row i of M are independent of the coin flips for row j. Therefore
Pr(Mz = 0) = Pr(Mz[1] = 0) · Pr(Mz[2] = 0) · … · Pr(Mz[b] = 0).
Question: what is Pr(Mz[i] = 0)?
Proof that H is Universal (4)
Pr(Mz[i] = 0)?
Mz[i] = m_i1·z_1 + m_i2·z_2 + … + m_it·1 (mod 2), since z's last bit is 1.
Let y be the (mod 2) sum of the first t-1 terms. Then Mz[i] = 0 iff m_it = y (so that the two cancel mod 2).
Proof that H is Universal (5)
Since m_it is a uniform coin flip independent of the first t-1 terms, Pr(Mz[i] = 0) = 1/2, irrespective of the first t-1 coin flips — it all depends on the last coin flip.
Therefore Pr(Mx = My) = Pr(Mz = 0) = (1/2)^b = 1/n. Q.E.D.
Storing and Evaluating Hash Function h (M)
Q: How much space do we need to store the random matrix M?
A: bt bits = O(log|U| · log n).
Q: How much time does it take to evaluate Mz?
A: Naïvely, O(bt) = O(log|U| · log n) bit operations.
Summary: H is a relatively fast and practical universal family of hash functions.
Another Possible Family
We're hashing from U to {0, 1, …, n-1}. Let H be the set of all such functions.
Question: is H universal?
Another Possible Family
We're hashing from U to {0, 1, …, n-1}.
Q1: How many such functions are there? A1: n^|U|
Q2: How many functions have h(x) = h(y) = j for a fixed bucket j? A2: n^(|U|-2)
Q3: How many functions have h(x) = h(y)? A3: n · n^(|U|-2) = n^(|U|-1)
Q4: Pr(h(x) = h(y))? A4: n^(|U|-1) / n^|U| = 1/n => H is universal!
Why is H Impractical?
There are n^|U| functions in H. What is the cost of storing a function h from H?
log|H| = O(|U| · log n) bits. Not practical!
Summary
1. Hash tables
2. Defined universal families of hash functions
3. Universal family => hash table operations take O(1) expected time
4. Universal families exist
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Bloom Filters
A randomized data structure implementing a limited version of the Dictionary ADT:
- Insert
- Lookup (no Delete)
Compared to hash tables: more space-efficient, but lookups can return false positives.
Applications: website caches for ISPs, among others.
Same Setup As Hash Tables
- Universe U of all possible elements, e.g., all 2^32 possible IP addresses.
- Maintain a subset S ⊆ U with |S| = m and |U| >> m.
Bloom Filters
A Bloom filter consists of:
- A bit array of size n, initially all 0 (bits, not buckets)
- k hash functions h_1, …, h_k
Space cost per element: n/m bits.
Insertions
Insert(a): set bits h_1(a), …, h_k(a) to 1 => O(k) time.
Example with k = 3:
- h_1(x) = 2, h_2(x) = 9, h_3(x) = 0
- h_1(y) = 1, h_2(y) = 5, h_3(y) = 9
- h_1(z) = 10, h_2(z) = 11, h_3(z) = 5
Do you see why there would be false positives?
Lookup
Lookup(a): return true if all of bits h_1(a), …, h_k(a) are 1 => O(k) time.
Example (bit array of size 16, indexed 0-15, after the insertions above):
- x: h_1(x) = 2, h_2(x) = 9, h_3(x) = 0 => Lookup(x) = true
- z: h_1(z) = 3, h_2(z) = 9, h_3(z) = 4 => Lookup(z) = false
- t: h_1(t) = 0, h_2(t) = 1, h_3(t) = 2 => Lookup(t) = true (a false positive: t was never inserted!)
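Insert and Lookup can be sketched as a small Bloom filter class in Python. The k hash functions here are simulated by salting SHA-256 — an illustrative assumption, not a construction from the slides:

```python
import hashlib

class BloomFilter:
    """Bit array of size n with k hash functions; Insert and Lookup only."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _positions(self, a):
        # Derive k bit positions by hashing a with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{a}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, a):
        # O(k): set all h_i(a) to 1.
        for j in self._positions(a):
            self.bits[j] = 1

    def lookup(self, a):
        # O(k): true iff all h_i(a) are 1; may be a false positive.
        return all(self.bits[j] for j in self._positions(a))
```

Note that an inserted element is always found (no false negatives), while a never-inserted element may still see all k of its bits set by other insertions — the false positives analyzed next.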
Can Bloom Filters Be Useful?
Can Bloom filters be both space-efficient and have a low false-positive rate?
What is the probability of a false positive, as a function of n, m, and k?
Probability of False Positive
We have inserted m elements into the Bloom filter. A new element z arrives that has not been inserted before.
Q: What is Pr(false positive for z)? Assume h_1(z) = j_1, …, h_k(z) = j_k.
Simplifying (unjustified) assumption: all hashing is totally random. For every h_i and every x, h_i(x) is uniform over {0, 1, …, n-1} and independent of all other hash values h_j(y).
Warning: this assumption is made to simplify the analysis; it won't hold in practice.
Pr(bit j is 1 after m insertions)?
Consider a particular bit j in the array.
Q1: Fix h_i and an element x. Pr(h_i(x) sets bit j to 1)? A1: 1/n
Q2: Pr(x sets bit j to 1), i.e., at least one of h_1(x), …, h_k(x) equals j? A2: 1 - Pr(x does not set j to 1) = 1 - (1 - 1/n)^k
Q3: Pr(bit j = 1 after m insertions)? A3: 1 - Pr(no element sets j to 1) = 1 - (1 - 1/n)^km
Pr(false positive for x)?
Recall for x we check k bits: h_1(x) = j_1, …, h_k(x) = j_k.
Pr(bit j_i = 1) = 1 - (1 - 1/n)^km
Pr(false positive) = Pr(all j_i = 1) = (1 - (1 - 1/n)^km)^k
Recall the calculus fact: (1 + x) ≤ e^x; moreover, around x = 0, (1 + x) ≈ e^x. Applying it with x = -1/n:
Pr(false positive) ≈ (1 - e^(-km/n))^k
How Does Failure Rate Change With k, n?
Observation 1: as n increases, the failure rate decreases.
Observation 2: as k (the # of hash functions) increases:
- more bits to check => less likely to fail
- more bits set per object => more likely to fail
- so it is unclear whether the rate increases or decreases.
Question: what is the optimal k for fixed n/m?
Answer (by taking derivatives): k = ln(2)·n/m ≈ 0.69·n/m
At this k, e^(-km/n) = 1/2, so the failure rate is (1/2)^k = (1/2)^(ln(2)·n/m).
How Does Failure Rate Change With k, n?
For fixed n/m, with the optimal k = ln(2)·n/m:
Failure rate = (1/2)^(ln(2)·n/m) ≈ (0.6185)^(n/m)
Already at n = 8m the rate is 1-2%, and it decreases exponentially with n/m.
Next Week
Dynamic Programming