Hashing CSE 331 Section 2 James Daly. Reminders Homework 3 is out Due Thursday in class Spring Break is next week Homework 4 is out Due after Spring Break.


Review: Sets Containers for determining membership in a group Elements are unique Two main types: ordered tree sets and unordered hash sets

Language  Ordered    Unordered
C++       set        unordered_set
Java      TreeSet    HashSet
C#        SortedSet  HashSet

Review: Set Operations Add / Insert Remove / Delete Exists / Find Size / IsEmpty Iterator Clear / RemoveAll Sometimes Union (AddAll) / Intersection (RetainAll)

Direct Addressing Table An element with key k is stored in slot k Search(T, k) = O(1) Insertion(T, k) = O(1) Deletion(T, k) = O(1) Problem: the number of possible keys can be large (2^32) Figure: table T indexed directly by key
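A minimal sketch of a direct-address table, assuming small non-negative integer keys. The class name `DirectAddressTable` and the `universe_size` parameter are illustrative and do not come from the slides.

```python
class DirectAddressTable:
    """Direct-address table for integer keys in range(universe_size)."""

    def __init__(self, universe_size):
        # One slot per possible key; None marks an empty slot.
        self.slots = [None] * universe_size

    def insert(self, key, value):
        self.slots[key] = value   # O(1): the key is the index

    def search(self, key):
        return self.slots[key]    # O(1); None means "not present"

    def delete(self, key):
        self.slots[key] = None    # O(1)


table = DirectAddressTable(10)
table.insert(6, "six")
table.insert(9, "nine")
print(table.search(6))   # six
table.delete(6)
print(table.search(6))   # None
```

The memory cost is one slot per possible key, which is why a universe of 2^32 keys makes this approach impractical.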

Hashing Store an element with key k in h(k) h(k) maps the universe U of keys into slots of a hash table Example T with slots [0, 1, …, m – 1] h: U → {0, 1, …, m – 1} Key → O(1) hash → address

Diagram: h(k) maps the actual keys, a subset of the universe U, to slots 0, 1, 2, …, m – 1 of table T

Example Students with unique IDs A: 10001 B: 10002 C: 10003 h(s) = s.id % 10
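The student example can be sketched in a few lines. The table size of 10 follows from `% 10`; the extra ID 10011 is a hypothetical value added here only to show two keys landing in the same slot.

```python
def h(student_id, m=10):
    """Hash a student ID to a slot in a table with m slots."""
    return student_id % m


students = {"A": 10001, "B": 10002, "C": 10003}

table = [None] * 10
for name, student_id in students.items():
    table[h(student_id)] = name

print(table)      # [None, 'A', 'B', 'C', None, None, None, None, None, None]

# Hypothetical ID, not from the slides: it hashes to the same slot as A.
print(h(10011))   # 1
```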

Problem What if several keys hash to the same value? Several solutions Knock out the old Discard the new Chaining (keep a list) Probing (try another location)

Chaining Figure: table T with slots 0, 1, 2, …, m – 1, with elements A, B, and C stored in chains attached to the slots

Chained Hash Insert(T, k) Insert k into the list T[h(k)]: O(1) Search(T, k) Search for an element with key k in list T[h(k)] O(|T[h(k)]|): the size of the chain at h(k) Deletion Delete element with key k in list T[h(k)]: O(|T[h(k)]|)
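The three chained operations could look roughly like this. This is a sketch that assumes Python's built-in `hash` as h and a fixed number of slots `m`; the class name `ChainedHashSet` is illustrative. Unlike the O(1) insert above, this version scans the chain first so that elements stay unique.

```python
class ChainedHashSet:
    """Hash set that resolves collisions by keeping a list per slot."""

    def __init__(self, m=8):
        self.m = m
        self.table = [[] for _ in range(m)]   # one empty chain per slot

    def _chain(self, key):
        return self.table[hash(key) % self.m]

    def insert(self, key):
        chain = self._chain(key)
        if key not in chain:      # duplicate check costs O(chain length)
            chain.append(key)

    def search(self, key):
        return key in self._chain(key)        # O(chain length)

    def delete(self, key):
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)                 # O(chain length)


s = ChainedHashSet(m=4)
for k in (1, 5, 9):       # all three land in slot 1 when m = 4
    s.insert(k)
print(s.table[1])         # [1, 5, 9]
s.delete(5)
print(s.search(5))        # False
print(s.search(9))        # True
```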

Chained Hash Figure: two tables T with slots 0, 1, 2, …, m – 1, one labeled Bad, with lots of stuff piled into a chain, and one labeled Good

Analysis Assumption: simple uniform hashing Each key is equally likely to be hashed to any slot Independent of the other keys Load Factor: average number of keys per slot α = n / m Expected search cost: Θ(1 + α): hash cost + search through the list Θ(1) if α = O(1)

Analysis Load factor is more important than the table size!

Birthday Problem What is the probability that there will be no collisions? Approximately 45 people in the room Probability everyone has a different birthday? Load factor: 12.3%
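The question on this slide can be checked numerically. The sketch below treats birthdays as n keys hashed uniformly into m = 365 slots; the function name `prob_no_collision` is illustrative.

```python
def prob_no_collision(n, m=365):
    """Probability that n uniformly hashed keys all land in distinct slots."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m   # the (i+1)-th key must avoid the i occupied slots
    return p


print(f"load factor: {45 / 365:.1%}")
print(f"P(no collision among 45 keys): {prob_no_collision(45):.3f}")
```

Even at a load factor of about 12%, the chance of having no collision at all is well under 10%, so a hash table has to handle collisions.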

Hashing Two central problems Design a good hash function Distributes keys uniformly into the table Regularity in distribution should not affect the uniformity (shouldn’t use only half the slots with even numbers) Resolve collisions

Hash Functions A good hash function: Has equal probability of hashing a key in each slot Must be fast

Sample Hash Function Hash function for integers h(x) = x mod b For some constant b Consider b = 2^r: 1011101_2 mod 2^3 = 101_2 = 5 h(x) returns the last r bits Not good! Too easy to game. Typically b is chosen to be prime
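A short check of the power-of-two case. The list of keys is a made-up example chosen so that every key is a multiple of 8.

```python
x = 0b1011101   # 93 in decimal
r = 3

print(x % 2**r)             # 5
print(x & ((1 << r) - 1))   # 5: mod 2^r just keeps the last r bits

# Hypothetical keys that are all multiples of 8.
keys = [8, 16, 24, 32, 40]
print([k % 8 for k in keys])   # [0, 0, 0, 0, 0]  every key collides
print([k % 7 for k in keys])   # [1, 2, 3, 4, 5]  a prime modulus spreads them
```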

String hash functions “pt” = (ascii values) Sum of ascii values 112 + 116 = 228 Same as for “tp” (bad) Weighted sum 112 * 1 + 116 * 2 = 344 Same as “rs”: 114 * 1 + 115 * 2 = 344
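Both string hashes and their collisions can be reproduced directly. The function names `sum_hash` and `weighted_hash` are illustrative.

```python
def sum_hash(s):
    """Sum of the character codes; ignores character order."""
    return sum(ord(c) for c in s)


def weighted_hash(s):
    """Character codes weighted by 1-based position."""
    return sum(ord(c) * (i + 1) for i, c in enumerate(s))


print(sum_hash("pt"), sum_hash("tp"))             # 228 228
print(weighted_hash("pt"), weighted_hash("tp"))   # 344 340
print(weighted_hash("rs"))                        # 344
```

The weighted sum separates "pt" from "tp", but it still collides with "rs".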


Other hash functions Lots of them! Murmur Hash Fowler-Noll-Vo Some have different purposes Cryptographic (non-invertible) Used to validate integrity of message SHA-1 MD5

Open vs Closed Addressing Talked about Closed Addressing Item always ends up in the same slot Uses chaining or similar structure Open Addressing Item may end up in different location Probes alternate locations if the item isn’t found

Open Addressing No storage is used outside of the table itself Insertion systematically probes the table until an empty slot is found Hash function depends on both the key and the probe number h : U × {0, 1, …, m – 1} → {0, 1, …, m – 1} Probe sequence should be a permutation of {0, 1, …, m – 1}

Linear Probing Given ordinary hash function h’(k), h(k, i) = (h’(k) + i) mod m Example: h’(k) = k mod 11 h(k, i) = ((k mod 11) + i) mod 11

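Example Table T with 11 slots (0 to 10), initially empty Insert 15: h’(15) = 15 mod 11 = 4 h(15, 0) = 4, so 15 goes in slot 4 Insert 4: h’(4) = 4 h(4, 0) = 4 is occupied h(4, 1) = 5, so 4 goes in slot 5 Insert 16: h’(16) = 16 mod 11 = 5 h(16, 0) = 5 is occupied h(16, 1) = 5 + 1 = 6, so 16 goes in slot 6 Primary Clustering: occupied slots build up into long runs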
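The three insertions can be replayed with a small sketch of linear probing over an 11-slot table. The function name `linear_probe_insert` is illustrative.

```python
def linear_probe_insert(table, key):
    """Insert key using h(k, i) = ((k mod m) + i) mod m; return the slot used."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


table = [None] * 11
for key in (15, 4, 16):
    print(key, "-> slot", linear_probe_insert(table, key))
# 15 -> slot 4
# 4 -> slot 5
# 16 -> slot 6
```

The three keys end up in adjacent slots 4, 5, and 6, which is the run that primary clustering refers to.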

Double Hashing Given two ordinary hash functions h1(k) and h2(k) h(k, i) = (h1(k) + i * h2(k)) mod m h2(k) must be non-zero

Example Table T with 13 slots (0 to 12) holding the keys 79, 69, 98, 72, and 50 h1(k) = k mod 13 h2(k) = 1 + (k mod 11) Insert 14: h1(14) = 14 mod 13 = 1 h2(14) = 1 + (14 mod 11) = 4 h(14, 0) = 1 h(14, 1) = 1 + 4 = 5, so 14 goes in slot 5 Delete 72: h1(72) = 72 mod 13 = 7 h(72, 0) = 7, so 72 is removed from slot 7

Example Table T with 13 slots (0 to 12) holding the keys 79, 69, 14, 98, and 50 h1(k) = k mod 13 h2(k) = 1 + (k mod 11) Delete 98: h1(98) = 7 h2(98) = 11 h(98, 0) = 7 h(98, 1) = 5 h(98, 2) = 3 When can we stop?
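The probe sequences in the two examples can be reproduced with the same pair of hash functions. The helper name `probe` is illustrative. One common answer to the closing question, which is an assumption here rather than something stated on the slide, is to stop after m probes or on reaching a slot that has never been occupied.

```python
M = 13


def h1(k):
    return k % 13


def h2(k):
    return 1 + (k % 11)   # always between 1 and 11, so never zero


def probe(k, i, m=M):
    """Slot examined on the i-th probe for key k."""
    return (h1(k) + i * h2(k)) % m


print([probe(14, i) for i in range(3)])   # [1, 5, 9]
print([probe(98, i) for i in range(3)])   # [7, 5, 3]

# With m prime, the first m probes visit every slot exactly once.
print(sorted(probe(98, i) for i in range(M)) == list(range(M)))   # True
```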

Rehashing Efficiency degrades as load factor increases Dependent on number of items and table size Need to increase the table size occasionally when adding items Need to move items Requires slight adjustments to the hash function Mod by new table size

Rehashing Rehashing requires Θ(n) time Don’t increase size by a fixed amount Causes average time per insert to also be Θ(n) Grow by a multiplicative factor instead (double) Θ(n) once every Θ(n) inserts Amortized Θ(1) time per insert
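The difference between the two growth policies shows up when counting how many items get moved in total. This is a rough simulation that assumes the table is rehashed whenever it is completely full; the function name `total_moves` and the fixed increment of 10 are illustrative choices.

```python
def total_moves(n_inserts, grow):
    """Count items moved by rehashing while inserting n_inserts items."""
    capacity, size, moves = 1, 0, 0
    for _ in range(n_inserts):
        if size == capacity:
            moves += size              # a rehash moves every stored item
            capacity = grow(capacity)
        size += 1
    return moves


n = 10_000
fixed = total_moves(n, lambda c: c + 10)   # grow by a fixed amount
doubled = total_moves(n, lambda c: c * 2)  # grow by a multiplicative factor

print(fixed / n)     # moves per insert grow with n
print(doubled / n)   # moves per insert stay below 2
```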

Rehashing Figure: example of keys redistributed into a larger table with slots 0 to 12

Applications – Pattern Matching For a given string, sub, test whether it is a substring of another (larger) string S Example: Sub = “ACGT” |S| = n |Sub| = m Cost = O(m n)

RabinKarp(s[1..n], sub[1..m])
    hsub ← hash(sub[1..m])
    For i = 1 to n – m + 1
        hs ← hash(s[i..i+m-1])
        If hs = hsub
            If s[i..i+m-1] = sub
                Return i
    Return not found
String comparison is still needed: h(s1) = h(s2) does not mean s1 = s2
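A runnable sketch of the same idea with 0-based indexing. It uses a polynomial rolling hash so that each window's hash is updated in O(1) instead of being recomputed from scratch; the rolling update and the values of `base` and `mod` are assumptions made for this sketch, not details taken from the slide.

```python
def rabin_karp(s, sub, base=256, mod=1_000_003):
    """Return the first index where sub occurs in s, or -1 if it does not."""
    n, m = len(s), len(sub)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(base, m - 1, mod)   # weight of the window's leading character
    h_sub = h_s = 0
    for i in range(m):             # hash the pattern and the first window
        h_sub = (h_sub * base + ord(sub[i])) % mod
        h_s = (h_s * base + ord(s[i])) % mod

    for i in range(n - m + 1):
        # Equal hashes only suggest a match, so compare the strings too.
        if h_s == h_sub and s[i:i + m] == sub:
            return i
        if i < n - m:
            # Slide the window: drop s[i], shift, append s[i + m].
            h_s = ((h_s - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1


print(rabin_karp("TTACGTAA", "ACGT"))   # 2
print(rabin_karp("AAAA", "ACGT"))       # -1
```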

Bloom-Filter Set membership detection Space-efficient data structure used to test for membership in a set Uses several hash functions and a bitset Every bit hi(k) must be set for k to be in the set Allows false positives, but not false negatives Probably “yes”, definitely “no”
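A minimal Bloom filter sketch. Deriving the several hash functions by salting SHA-256 with an index is an assumption made here for simplicity, and the class name, `num_bits`, and `num_hashes` are illustrative.

```python
import hashlib


class BloomFilter:
    """Bitset plus several hash functions; answers 'probably' or 'no'."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Simulate independent hash functions by salting with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for item in ("X", "Y", "Z"):
    bf.add(item)
print(bf.might_contain("X"))   # True
```

An item that was added always reports True, so there are no false negatives. An item that was never added can still report True if other items happen to have set all of its bits.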

Bloom-Filter Figure: bitset with six bits set to 1, representing the set {X, Y, Z}

Map / Dictionary Abstract data type representing a partial function Relates keys to values Keys are unique Values might not be Two main types (like sets): ordered tree maps and unordered hash maps

Language  Ordered           Unordered
C++       map               unordered_map
Java      TreeMap           HashMap
C#        SortedDictionary  Dictionary

Map / Dictionary Figure: keys mapped to values

Map / Dictionary Methods Insert / Put: inserts tuple Get / At: gets value from key Often indexer (operator[]) to do both put and get Remove / Delete: removes tuple by key KeyExists Iterator Size / IsEmpty Clear
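As one concrete illustration, Python's built-in `dict` is an unordered hash map that offers these operations. The names and numbers below are sample data.

```python
phone_book = {}

phone_book["Mary"] = "555-1243"   # Insert / Put through the indexer
phone_book["John"] = "555-3612"

print(phone_book["Mary"])         # Get: 555-1243
print("Luke" in phone_book)       # KeyExists: False

del phone_book["John"]            # Remove by key
print(len(phone_book))            # Size: 1

for name, number in phone_book.items():   # Iterator over (key, value) tuples
    print(name, number)
```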

TreeMap Key: Name Value: Cell #

John    555-3612
Jacob   555-3147
Mary    555-1243
Mathew  555-2179
Luke    555-7293
Mark    555-3479
Sarah   555-5394

Mary? 555-1243

Hash Map

Slot  Entry
0     Sarah, 555-5394
1     (empty)
2     John, 555-3612
3     Jacob, 555-3147
4     (empty)
5     Mary, 555-1243
6     Mathew, 555-2179
7     Mark, 555-3479
8     (empty)
9     Luke, 555-7293

Mary? h(Mary) = 5 555-1243
