Published by Johnathan Warner. Modified over 9 years ago.
1
Hashing
  CSE 331 Section 2
  James Daly
2
Reminders
  Homework 3 is out
    Due Thursday in class
  Spring Break is next week
  Homework 4 is out
    Due after Spring Break
3
Review: Sets
  Containers for determining membership in a group
  Elements are unique
  Two main types
    Ordered tree sets
    Unordered hash sets

  Language  Ordered    Unordered
  C++       set        unordered_set
  Java      TreeSet    HashSet
  C#        SortedSet  HashSet
4
Review: Set Operations
  Add / Insert
  Remove / Delete
  Exists / Find
  Size / IsEmpty
  Iterator
  Clear / RemoveAll
  Sometimes
    Union (AddAll) / Intersection (RetainAll)
5
Direct Addressing Table
  An element with key k is stored in slot k
  Search(T, k) = O(1)
  Insertion(T, k) = O(1)
  Deletion(T, k) = O(1)
  Problem: number of keys can be large (2^32)
  [Figure: direct-address table T]
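A minimal sketch of a direct-address table, assuming small non-negative integer keys. The class name `DirectAddressTable`, the universe size of 10, and the stored values are illustrative choices, not taken from the slides.

```python
class DirectAddressTable:
    """Direct-address table for integer keys in range(universe_size)."""

    def __init__(self, universe_size):
        # One slot per possible key; None marks an empty slot.
        self.slots = [None] * universe_size

    def insert(self, key, value):
        self.slots[key] = value   # O(1): the key is the index

    def search(self, key):
        return self.slots[key]    # O(1): None if the key is absent

    def delete(self, key):
        self.slots[key] = None    # O(1)


table = DirectAddressTable(10)    # hypothetical universe {0, ..., 9}
for key in (1, 2, 6, 7, 9):
    table.insert(key, f"item-{key}")
```

The table needs one slot per possible key, which is why a universe of 2^32 keys makes this approach impractical.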
6
Hashing
  Store an element with key k in slot h(k)
  h maps the universe U of keys into the slots of a hash table
  Example
    T with slots [0, 1, …, m – 1]
    h: U → {0, 1, …, m – 1}
  Key → O(1) hash → address
7
Diagram
  [Figure: h(k) maps the actual keys, drawn from the universe U, into slots 0, 1, 2, …, m – 1 of table T]
8
Example
  Students with unique IDs
    A: 10001
    B: 10002
    C: 10003
  h(s) = s.id % 10
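The student example can be sketched as follows. The `Student` dataclass and the 10-slot list are assumptions made for illustration; only the IDs and h(s) = s.id % 10 come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    id: int


def h(s, m=10):
    """Hash a student to a slot in {0, ..., m - 1} using the ID."""
    return s.id % m


students = [Student("A", 10001), Student("B", 10002), Student("C", 10003)]

table = [None] * 10
for s in students:
    table[h(s)] = s   # A -> slot 1, B -> slot 2, C -> slot 3
```

With these three IDs no two students share a slot, so no collision handling is needed yet.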
9
Problem
  What if several keys hash to the same value?
  Several solutions
    Knock out the old
    Discard the new
    Chaining (keep a list)
    Probing (try another location)
10
Chaining
  [Figure: table T with slots 0, 1, 2, …, m – 1 and chains holding keys A, B, C]
11
Chained Hash
  Insert(T, k)
    Insert k into the list T[h(k)]: O(1)
  Search(T, k)
    Search for an element with key k in list T[h(k)]
    O(|T[h(k)]|): the size of the chain at h(k)
  Deletion
    Delete element with key k in list T[h(k)]: O(|T[h(k)]|)
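A minimal sketch of a chained hash set, assuming Python lists as the chains and the built-in `hash` as h. The class name `ChainedHashSet` and the default table size are illustrative.

```python
class ChainedHashSet:
    """Hash set that resolves collisions by chaining."""

    def __init__(self, m=8):
        # One (initially empty) chain per slot.
        self.buckets = [[] for _ in range(m)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        chain = self._chain(key)
        # The membership check keeps elements unique but walks the chain;
        # appending without the check is the O(1) insert from the slide.
        if key not in chain:
            chain.append(key)

    def search(self, key):
        return key in self._chain(key)      # O(chain length)

    def delete(self, key):
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)               # O(chain length)


s = ChainedHashSet(m=4)
for key in (1, 5, 9):                       # all three land in slot 1
    s.insert(key)
```

Because 1, 5 and 9 are congruent mod 4, they all end up in the same chain, which is the situation the search and delete costs describe.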
12
Chained Hash
  [Figure: two chained tables T with slots 0, 1, 2, …, m – 1, labeled Bad (lots of stuff in one chain) and Good]
13
Analysis
  Assumption: simple uniform hashing
    Each key is equally likely to be hashed to any slot
    Independent of the other keys
  Load Factor: average number of keys per slot
    α = n / m
  Expected search cost: Θ(1 + α): hash cost + search through the list
    Θ(1) if α = O(1)
14
Analysis
  Load factor is more important than the table size!
15
Birthday Problem
  What is the probability that there will be no collisions?
  Approximately 45 people in the room
    Probability everyone has a different birthday?
    Load factor: 12.3%
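The birthday question can be worked out numerically. This sketch assumes 365 equally likely birthdays and treats them as hash slots; the function name `prob_no_collision` is illustrative.

```python
def prob_no_collision(n, m):
    """Probability that n keys hashed uniformly into m slots are all distinct."""
    p = 1.0
    for i in range(n):
        # The (i + 1)-th key must avoid the i slots already taken.
        p *= (m - i) / m
    return p


people, days = 45, 365
load_factor = people / days                  # about 0.123
p_distinct = prob_no_collision(people, days)
```

Under these assumptions `p_distinct` comes out at roughly 6%, so a collision is very likely even though the load factor is only about 12%.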
16
Hashing
  Two central problems
    Design a good hash function
      Distributes keys uniformly into the table
      Regularity in the key distribution should not affect the uniformity (e.g. if all keys are even numbers, they shouldn’t end up using only half the slots)
    Resolve collisions
17
Hash Functions
  A good hash function:
    Has equal probability of hashing a key in each slot
    Must be fast
18
Sample Hash Function
  Hash function for integers
    h(x) = x mod b
    For some constant b
  Consider b = 2^r
    1011101_2 mod 2^3 = 101_2 = 5
    h(x) returns the last r bits
    Not good! Too easy to game.
  Typically b is chosen to be prime
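A short sketch of why b = 2^r is a weak choice. The key set (multiples of 8) and the prime 7 are illustrative values picked for this example, not from the slides.

```python
def h(x, b):
    return x % b


x = 0b1011101
r = 3
low_bits = x & ((1 << r) - 1)   # mask that keeps only the last r bits
assert h(x, 2**r) == low_bits == 0b101

# Keys that share their last 3 bits all collide when b = 8 ...
keys = [8, 16, 24, 32, 40]
slots_pow2 = {h(k, 8) for k in keys}    # {0}
# ... but spread out under a prime modulus such as 7.
slots_prime = {h(k, 7) for k in keys}   # {1, 2, 3, 4, 5}
```

Any regular pattern in the low bits of the keys carries straight through a power-of-two modulus, while a prime modulus mixes in the higher bits as well.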
19
String hash functions
  “pt” = (ascii values)
  Sum of ascii values
    112 + 116 = 228
    Same as for “tp” (bad)
  Weighted sum
    112 * 1 + 116 * 2 = 344
    Same as “rs”: 114 * 1 + 115 * 2 = 344
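Both string hashes from the slide are easy to reproduce. The function names `sum_hash` and `weighted_hash` are illustrative; the weights 1, 2, … follow the slide's arithmetic.

```python
def sum_hash(s):
    """Sum of character codes: ignores character order."""
    return sum(ord(c) for c in s)


def weighted_hash(s):
    """Weight the i-th character (1-based) by i."""
    return sum(i * ord(c) for i, c in enumerate(s, start=1))


assert sum_hash("pt") == sum_hash("tp") == 228            # anagrams collide
assert weighted_hash("pt") == weighted_hash("rs") == 344  # still collides
assert weighted_hash("tp") == 340                         # but order now matters
```

The weighted sum separates anagrams, yet small weights still leave many short strings with equal totals.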
20
String hash functions
21
Other hash functions
  Lots of them!
    Murmur Hash
    Fowler-Noll-Vo
  Some have different purposes
    Cryptographic (non-invertible)
      Used to validate integrity of message
      SHA-1
      MD5
22
Open vs Closed Addressing
  Talked about Closed Addressing
    Item always ends up in the same slot
    Uses chaining or similar structure
  Open Addressing
    Item may end up in different location
    Probes alternate locations if the item isn’t found
23
Open Addressing
  No storage is used outside of the table itself
  Insertion systematically probes the table until an empty slot is found
  Hash function depends on both the keys and the probe number
    h : U × {0, 1, …, m – 1} → {0, 1, …, m – 1}
  Probe sequence should be a permutation of {0, 1, …, m – 1}
24
Linear Probing
  Given ordinary hash function h’(k),
    h(k, i) = (h’(k) + i) mod m
  Example:
    h’(k) = k mod 11
    h(k, i) = ((k mod 11) + i) mod 11
25
Example
  [Figure: table with slots 0–10; 15, 4 and 16 end up in slots 4, 5 and 6]
  Insert 15: h’(15) = 15 mod 11 = 4
    h(15, 0) = 4
  Insert 4: h’(4) = 4
    h(4, 0) = 4
    h(4, 1) = 5
  Insert 16: h’(16) = 16 mod 11 = 5
    h(16, 0) = 5
    h(16, 1) = 5 + 1 = 6
  Primary Clustering
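The linear probing example can be traced in code. This is a sketch assuming integer keys, h’(k) = k mod m and `None` for empty slots; the function name `linear_probe_insert` is illustrative.

```python
def linear_probe_insert(table, key):
    """Insert key using h(k, i) = (k mod m + i) mod m; return the slot used."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


table = [None] * 11
slots = [linear_probe_insert(table, k) for k in (15, 4, 16)]
# slots == [4, 5, 6]: three keys form one contiguous run
```

The run in slots 4 to 6 is a small instance of primary clustering: any later key that hashes anywhere into the run has to walk to its end, which makes the run longer still.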
26
Double Hashing
  Given two ordinary hash functions h1(k) and h2(k)
    h(k, i) = (h1(k) + i * h2(k)) mod m
  h2 must be non-zero
27
Example
  [Figure: hash table with slots 0–12 holding keys including 79, 69, 98 and 50]
  h1(k) = k mod 13
  h2(k) = 1 + (k mod 11)
  Insert 14
    h1(14) = 14 mod 13 = 1
    h2(14) = 1 + (14 mod 11) = 4
    h(14, 0) = 1
    h(14, 1) = 1 + 4 = 5
  Delete 72
    h1(72) = 72 mod 13 = 7
    h(72, 0) = 7
28
Example
  [Figure: the same hash table with slots 0–12, now also holding 14]
  h1(k) = k mod 13
  h2(k) = 1 + (k mod 11)
  Delete 98
    h1(98) = 7
    h2(98) = 11
    h(98, 0) = 7
    h(98, 1) = 5
    h(98, 2) = 3
  When can we stop?
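One common answer to "when can we stop?" is to mark deleted slots with a tombstone and stop a search only at a slot that was never used. The sketch below assumes that approach, which the slides do not name; the `DELETED` sentinel, the function names, and the insertion order of the initial keys are all illustrative.

```python
M = 13
EMPTY = None
DELETED = object()   # tombstone: slot was used, then deleted


def probes(k):
    """Probe sequence h(k, i) = (h1(k) + i * h2(k)) mod 13."""
    h1 = k % 13
    h2 = 1 + (k % 11)
    return [(h1 + i * h2) % M for i in range(M)]


def insert(table, k):
    for slot in probes(k):
        if table[slot] is EMPTY or table[slot] is DELETED:
            table[slot] = k
            return slot
    raise RuntimeError("table is full")


def search(table, k):
    for slot in probes(k):
        if table[slot] is EMPTY:
            return None      # never-used slot: k cannot be further along
        if table[slot] == k:
            return slot      # tombstones fail this test and are skipped
    return None


def delete(table, k):
    slot = search(table, k)
    if slot is not None:
        table[slot] = DELETED
    return slot


table = [EMPTY] * M
for k in (79, 69, 72, 98, 50):   # assumed insertion order
    insert(table, k)
slot_14 = insert(table, 14)      # probes 1, 5, then lands in 9
delete(table, 72)                # slot 7 becomes a tombstone
slot_98 = search(table, 98)      # probes 7 (tombstone), then finds 98 in 5
```

If deleting 72 had simply emptied slot 7, the search for 98 would stop at its first probe and wrongly report that 98 is missing.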
29
Rehashing
  Efficiency degrades as load factor increases
    Dependent on number of items and table size
  Need to increase the table size occasionally when adding items
    Need to move items
    Requires slight adjustments to the hash function
      Mod by new table size
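A minimal sketch of moving items into a larger chained table, assuming integer keys and Python's built-in `hash`. The function name `rehash` and the sample buckets are illustrative.

```python
def rehash(buckets, new_m):
    """Move every key into a fresh table with new_m slots."""
    new_buckets = [[] for _ in range(new_m)]
    for chain in buckets:
        for key in chain:
            # Same hash, but reduced modulo the new table size.
            new_buckets[hash(key) % new_m].append(key)
    return new_buckets


old = [[0, 2, 4], [1, 3]]    # 5 keys in m = 2 slots
new = rehash(old, 4)         # [[0, 4], [1], [2], [3]]
```

Every key has to be visited once, so a single rehash costs time proportional to the number of stored items.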
30
Rehashing
  Rehashing requires Θ(n) time
  Don’t increase size by a fixed amount
    Causes average time to also be Θ(n)
  Grow by a multiplicative factor instead (double)
    Θ(n) once every Θ(n) inserts
    Amortized Θ(1) time per insert
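The difference between the two growth policies shows up when counting how many items get moved in total. This sketch assumes the table is resized exactly when it is full; `total_moves`, the starting capacity of 1, and the fixed increment of 16 are illustrative choices.

```python
def total_moves(n, grow):
    """Count the items moved by rehashing while inserting n items."""
    capacity, size, moves = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            moves += size              # a rehash moves every stored item
            capacity = grow(capacity)
        size += 1
    return moves


n = 1024
doubling = total_moves(n, lambda c: 2 * c)    # 1 + 2 + 4 + ... + 512 = 1023
fixed = total_moves(n, lambda c: c + 16)      # grows quadratically with n
```

With doubling, the total stays below 2n moves, so the rehashing work averages out to a constant per insert. With a fixed increment, the total grows quadratically, which is the Θ(n) average cost the slide warns about.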
31
Rehashing
  [Figure: rehashing example on a hash table with slots 0–12]
32
Applications – Pattern Matching
  For a given string, sub, test whether it is a substring of another (larger) string S
  [Figure: Sub = “ACGT” matched against a position in S]
  |S| = n
  |Sub| = m
  Cost = O(m n)
33
RabinKarp(s[1..n], sub[1..m])
  hsub ← hash(sub[1..m])
  For i = 1 to n – m + 1
    hs ← hash(s[i..i+m-1])
    If hs = hsub
      If s[i..i+m-1] = sub
        Return i
  Return not found

  String comparison: h(s1) = h(s2) does not mean s1 = s2
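A runnable sketch of the pseudocode above, using 0-based indices and returning -1 for "not found". The polynomial rolling hash, the `base` and `mod` values, and the O(1) window update are assumptions added here; the slide does not say which hash it uses.

```python
def rabin_karp(s, sub, base=256, mod=1_000_003):
    """Return the first index where sub occurs in s, or -1 if it does not."""
    n, m = len(s), len(sub)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(base, m - 1, mod)   # weight of the window's leading character
    hsub = hs = 0
    for i in range(m):             # hash sub and the first window of s
        hsub = (hsub * base + ord(sub[i])) % mod
        hs = (hs * base + ord(s[i])) % mod

    for i in range(n - m + 1):
        # Equal hashes only suggest a match, so compare the strings too.
        if hs == hsub and s[i:i + m] == sub:
            return i
        if i < n - m:
            # Slide the window: drop s[i], append s[i + m].
            hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1


index = rabin_karp("TTACGTAA", "ACGT")   # 2
```

The explicit string comparison after a hash match is what guards against collisions, since equal hashes do not guarantee equal strings.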
34
Bloom-Filter
  Set membership detection
  Space-efficient data structure used to test for membership of a set
  Uses several hash functions and a bitset
    The bit at each h_i(k) must be set for k to be in the set
  Allows false positives, but not false negatives
    Probably “yes”, definitely “no”
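A minimal Bloom filter sketch. Deriving the several hash functions by salting SHA-256 with an index is an assumption made to keep the example short; the class name, the 64-bit size and the choice of 3 hashes are illustrative.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Simulate h_1, ..., h_k by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "probably yes"; False means "definitely no".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for item in ("X", "Y", "Z"):
    bf.add(item)
```

An added item always tests positive because its bits are never cleared. An item that was never added can still test positive if other items happened to set all of its bits.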
35
Bloom-Filter
  [Figure: bitset with the bits set for the elements {X, Y, Z}]
36
Map / Dictionary
  Abstract data type representing a partial function
  Relates keys to values
    Keys are unique
    Values might not be
  Two main types (like sets)
    Ordered tree maps
    Unordered hash maps

  Language  Ordered           Unordered
  C++       map               unordered_map
  Java      TreeMap           HashMap
  C#        SortedDictionary  Dictionary
37
Map / Dictionary
  [Figure: keys mapped to values]
38
Map / Dictionary Methods
  Insert / Put: inserts tuple
  Get / At: gets value from key
  Often indexer (operator[]) to do both put and get
  Remove / Delete: removes tuple by key
  KeyExists
  Iterator
  Size / IsEmpty
  Clear
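These operations map directly onto Python's built-in `dict`, which is a hash map. The sketch below is only an illustration in Python; the names and numbers are sample phone-book entries.

```python
phone_book = {}

phone_book["Mary"] = "555-1243"        # Insert / Put via the indexer
phone_book["John"] = "555-3612"

marys_number = phone_book["Mary"]      # Get / At via the indexer
has_luke = "Luke" in phone_book        # KeyExists
lukes_number = phone_book.get("Luke")  # None instead of an error when absent

del phone_book["John"]                 # Remove / Delete by key
size = len(phone_book)                 # Size
names = list(phone_book)               # Iterator over the keys

phone_book.clear()                     # Clear
is_empty = not phone_book              # IsEmpty
```

Reading a missing key with the indexer raises `KeyError` in Python, which is why the sketch uses `get` for the lookup that may fail.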
39
TreeMap
  Key: Name, Value: Cell #
    John    555-3612
    Jacob   555-3147
    Mary    555-1243
    Mathew  555-2179
    Luke    555-7293
    Mark    555-3479
    Sarah   555-5394
  Mary? → 555-1243
40
Hash Map
  Slot  Entry
  0     Sarah, 555-5394
  1
  2     John, 555-3612
  3     Jacob, 555-3147
  4
  5     Mary, 555-1243
  6     Mathew, 555-2179
  7     Mark, 555-3479
  8
  9     Luke, 555-7293

  Mary? h(Mary) = 5 → 555-1243