CSE 326: Data Structures Hashing


CSE 326: Data Structures Hashing Ben Lerner Summer 2007

Collision Resolution
Collision: when two keys map to the same location in the hash table.
Two ways to resolve collisions:
- Separate Chaining
- Open Addressing (linear probing, quadratic probing, double hashing)

Separate Chaining
Separate chaining: all keys that map to the same hash value are kept in a list (or "bucket").
Example: insert 10, 22, 107, 12, 42 into a ten-slot table, then try Find(42), Find(16), FindMax().
What is a bucket? Typically a linked list with insert-at-front (optionally moving found items to the front, in the spirit of a splay tree), or any other dictionary (BST, another hash table). A sketch follows below.
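A minimal sketch of separate chaining in Python (not from the original slides; the class name and the ten-slot size are illustrative):

    # Separate chaining: one list ("bucket") per table slot.
    class ChainedHashTable:
        def __init__(self, size=10):
            self.buckets = [[] for _ in range(size)]

        def _hash(self, key):
            return key % len(self.buckets)

        def insert(self, key):
            # Insert at the front of the chain, as the slide suggests.
            self.buckets[self._hash(key)].insert(0, key)

        def find(self, key):
            # Walk the chain at h(key); cost is the chain length.
            return key in self.buckets[self._hash(key)]

    t = ChainedHashTable()
    for k in [10, 22, 107, 12, 42]:
        t.insert(k)
    print(t.find(42), t.find(16))   # True False

FindMax, by contrast, would have to scan every bucket: a hash table does not keep keys in order.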

Separate Chaining
[Figure: a ten-slot table where a and b share one chain (h(a) = h(b)) and d and g share another (h(d) = h(g)); f and c sit alone in their buckets.]
Chains may be ordered or unordered. There is little advantage to ordering.

How big to make the table? Analysis of find
Definition: the load factor λ of a hash table is the ratio λ = (number of elements) / (table size).
For separate chaining, λ = the average number of elements in a bucket.
Unsuccessful find: λ probes (the average length of the list at hash(k)).
Successful find: 1 + λ/2 probes (one node, plus half the average length of the list, not counting the item found).
For example, at λ = 1 an unsuccessful find walks one node on average, and a successful find about 1.5.

How big should the hash table be?
For separate chaining, aim for λ = (number of elements) / (table size) ≈ 1. It could be as small as 1/2, but anything less wastes space.
So TableSize ≈ N, and make it a prime number (unlike the size-10 examples here).

Closed Hashing (Open Addressing)
No chaining: every key lives directly in the hash table. Probe sequence:
h(k)
(h(k) + f(1)) mod HSize
(h(k) + f(2)) mod HSize
...
Insertion: follow the probe sequence to the first empty slot.
Find: follow the probe sequence until a slot equals the query key (found) or is empty (not found); stop after HSize probes in any case.
Deletion: lazy deletion is needed; that is, mark a location as deleted rather than emptying it when its key is removed.

Open Addressing
Linear probing: after checking spot h(k), try spot h(k)+1; if that is full, try h(k)+2, then h(k)+3, etc. (all mod the table size).
Example (h(k) = k mod 10): insert 38, 19, 8, 109, 10. 38 and 19 go straight to slots 8 and 9. Then 8 collides with 38 and with 19 before landing in slot 0; 109 collides with 19 and with 8 before landing in slot 1; and 10 hashes to slot 0, which nothing else hashed to, yet still suffers two conflicts before landing in slot 2. A bad hash function can send everything to one spot, but even a good one piles up conflicts this way: keys placed earlier get in the way.
Try Find(8) and Find(29). And deletion? With separate chaining it is easy; here we need lazy deletion (otherwise we would have to rehash everything occasionally).

Linear Probing
f(i) = i. Probe sequence:
h(k)
(h(k) + 1) mod HSize
(h(k) + 2) mod HSize
...
Insertion (assuming λ < 1):
  h := h(k)
  while T(h) not empty do
      h := (h + 1) mod HSize
  insert k at T(h)
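The same insertion as runnable Python, a sketch assuming integer keys, a fixed HSIZE of 7, and None as the empty marker:

    HSIZE = 7
    T = [None] * HSIZE           # None marks an empty slot

    def insert(k):
        h = k % HSIZE            # h(k)
        while T[h] is not None:  # probe h, h+1, h+2, ... (mod HSize)
            h = (h + 1) % HSIZE
        T[h] = k

    for k in [76, 93, 40, 47, 10, 55]:
        insert(k)
    print(T)   # [47, 55, 93, 10, None, 40, 76]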

Terminology Alert!
"Open Hashing" equals "Separate Chaining"; "Closed Hashing" equals "Open Addressing" (Weiss's terminology).

Linear Probing Example
Insert 76, 93, 40, 47, 10, 55 with h(k) = k mod 7:
76 → slot 6; 93 → slot 2; 40 → slot 5 (1 probe each).
47 → slot 5 full, slot 6 full, lands in slot 0 (3 probes).
10 → slot 3 (1 probe).
55 → slot 6 full, slot 0 full, lands in slot 1 (3 probes).
Final table: [47, 55, 93, 10, empty, 40, 76]; probes per insert: 1, 1, 1, 3, 1, 3.

Write pseudocode for Find(k) for open addressing with linear probing.
Find(k) returns i, the index where key k is actually stored.
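One possible answer, sketched in Python against the table T and HSIZE from the insertion sketch above (returns -1 for "not found"; lazy deletion is ignored here):

    def find(k):
        h = k % HSIZE
        for _ in range(HSIZE):   # stop after HSize probes, in any case
            if T[h] is None:
                return -1        # empty slot: k cannot be in the table
            if T[h] == k:
                return h         # the index where k is actually stored
            h = (h + 1) % HSIZE
        return -1

    print(find(47))   # 0
    print(find(29))   # -1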

Linear Probing – Clustering
[Figure from R. Sedgewick: a new key either hits no collision, a collision in a small cluster, or a collision in a large cluster of occupied slots.]

Load Factor in Linear Probing
The math is complex because of clustering, but: for any λ < 1, linear probing will find an empty slot.
Expected number of probes (for large table sizes):
- successful search: (1/2)(1 + 1/(1 − λ))
- unsuccessful search, and likewise insertion: (1/2)(1 + 1/(1 − λ)²)
Linear probing suffers from primary clustering, so performance degrades quickly for λ > 1/2:
2.5 probes per unsuccessful search at λ = 0.5, but 50.5 at λ = 0.9.
Keep λ < 1/2 for linear probing. (Weiss has a nice graph of this, p. 179.)

Quadratic Probing
f(i) = i²: add a quadratic function of i to the original hash value to resolve the collision. Less likely to encounter primary clustering. Probe sequence:
0th probe: h(k) mod TableSize
1st probe: (h(k) + 1) mod TableSize
2nd probe: (h(k) + 4) mod TableSize
3rd probe: (h(k) + 9) mod TableSize
...
ith probe: (h(k) + i²) mod TableSize
Keys that hash to the same initial location still follow the same sequence of probes and conflict with each other (secondary clustering).
How big to make the table? Aim for λ = 1/2, i.e., a table about twice as big as the number of elements expected.
Implementation trick: since (i + 1)² − i² = 2i + 1, the next offset is the current one plus 2i + 1, i.e., f(i + 1) = f(i) + 2i + 1. No multiplication! (See the sketch below.)
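A sketch of the trick in Python (the helper name is illustrative; the data matches the next slide's example):

    def qp_insert(T, k):
        size = len(T)
        h = k % size
        i = 0
        while T[h] is not None:
            h = (h + 2 * i + 1) % size   # step from h(k)+i^2 to h(k)+(i+1)^2
            i += 1                       # ... without ever multiplying
        T[h] = k

    T = [None] * 10
    for k in [89, 18, 49, 58, 79]:
        qp_insert(T, k)
    print(T)   # [49, None, 58, 79, None, None, None, None, 18, 89]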

Quadratic Probing
Insert 89, 18, 49, 58, 79 into a ten-slot table (h(k) = k mod 10):
89 → slot 9. 18 → slot 8.
49 → slot 9 full; 9 + 1 = slot 0.
58 → slot 8 full; 8 + 1 = slot 9 full; 8 + 4 = slot 2.
79 → slot 9 full; 9 + 1 = slot 0 full; 9 + 4 = slot 3.
Final table: 0: 49, 2: 58, 3: 79, 8: 18, 9: 89.

Quadratic Probing Example
Table of size 7, h(k) = k mod 7:
insert(76): 76 % 7 = 6 → slot 6.
insert(40): 40 % 7 = 5 → slot 5.
insert(48): 48 % 7 = 6, full; 6 + 1 = slot 0.
insert(5): 5 % 7 = 5, full; 5 + 1 = 6, full; 5 + 4 = slot 2.
insert(55): 55 % 7 = 6, full; 6 + 1 = 0, full; 6 + 4 = slot 3.
It works reasonably well: here it does fine until the table is more than half full.
insert(47): 47 % 7 = 5. But the probe offsets lead to 0 → slot 5, 1 → slot 6, 4 → slot 2, 9 → slot 0, 16 → slot 0, 25 → slot 2, ... all full. It never finds a spot!

Quadratic Probing: Success guarantee for λ < 1/2
If the table size is prime and λ < 1/2, then quadratic probing will find an empty slot in size/2 probes or fewer: the first size/2 probes are distinct, and if the table is less than half full, one of those probed slots must be empty.
To show the probes are distinct, show for all 0 ≤ i, j ≤ size/2 with i ≠ j:
(h(x) + i²) mod size ≠ (h(x) + j²) mod size.
By contradiction: suppose that for some i ≠ j:
(h(x) + i²) mod size = (h(x) + j²) mod size
⇒ i² mod size = j² mod size
⇒ (i² − j²) mod size = 0
⇒ [(i + j)(i − j)] mod size = 0.
Since size is prime, size would have to divide (i + j) or (i − j). But how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ size/2? The same argument rules out (i − j) mod size = 0. Contradiction. (A quick check follows below.)
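A quick empirical check of the distinctness claim, as a Python sketch:

    # For a prime table size, the first size//2 + 1 quadratic probes
    # (i = 0 .. size//2) from any home slot h are all distinct.
    for size in [3, 5, 7, 11, 13]:
        for h in range(size):
            probes = [(h + i * i) % size for i in range(size // 2 + 1)]
            assert len(probes) == len(set(probes))
    print("distinct probes confirmed")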

Quadratic Probing may Fail if λ > 1/2
Table of size 7 holding 16, 45, 59, 76 in slots 2, 3, 4, 6 (λ = 4/7 > 1/2). Try to insert 51:
51 mod 7 = 2; i = 0
(2 + 1) mod 7 = 3; i = 1
(3 + 3) mod 7 = 6; i = 2
(6 + 5) mod 7 = 4; i = 3
(4 + 7) mod 7 = 4; i = 4
(4 + 9) mod 7 = 6; i = 5
(6 + 11) mod 7 = 3; i = 6
(3 + 13) mod 7 = 2; i = 7
...
Every probe hits an occupied slot; the empty slots 0, 1, and 5 are never reached.

Quadratic Probing: Properties
For any λ < 1/2, quadratic probing will find an empty slot; for bigger λ, it may or may not.
Quadratic probing does not suffer from primary clustering: keys hashing to the same area do not pile up into one long run.
But keys that hash to exactly the same spot all follow the same probe sequence: secondary clustering. It is not obvious from looking at the table, but it is still clustering.
At λ = 1/2, expect about 1.5 probes per successful search, which is close to perfect. Two problems remain: insertion might fail above λ = 1/2, and secondary clustering. How can we solve that?

Double Hashing
f(i) = i·g(k), where g is a second hash function. Probe sequence:
h(k)
(h(k) + g(k)) mod HSize
(h(k) + 2·g(k)) mod HSize
(h(k) + 3·g(k)) mod HSize
...
In choosing g, care must be taken so that it never evaluates to 0. A good choice for g is to pick a prime R < HSize and let g(k) = R − (k mod R).
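A Python sketch, using the h and g from the example on the next slide:

    HSIZE = 7
    def h(k): return k % HSIZE
    def g(k): return 5 - (k % 5)     # R = 5: g(k) is in 1..5, never 0

    def dh_insert(T, k):
        pos, step = h(k), g(k)
        while T[pos] is not None:
            pos = (pos + step) % HSIZE   # h(k) + i*g(k), mod HSize
        T[pos] = k

    T = [None] * HSIZE
    for k in [76, 93, 40, 47, 10, 55]:
        dh_insert(T, k)
    print(T)   # [None, 47, 93, 10, 55, 40, 76]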

Double Hashing Example
h(k) = k mod 7 and g(k) = 5 − (k mod 5). Insert 76, 93, 40, 47, 10, 55:
76 → slot 6; 93 → slot 2; 40 → slot 5 (1 probe each).
47 → slot 5 full; g(47) = 3, so try (5 + 3) mod 7 = slot 1 (2 probes).
10 → slot 3 (1 probe).
55 → slot 6 full; g(55) = 5, so try (6 + 5) mod 7 = slot 4 (2 probes).
Final table: [empty, 47, 93, 10, 55, 40, 76]; probes per insert: 1, 1, 1, 2, 1, 2.

Resolving Collisions with Double Hashing
Hash functions, with M = 10 (the slide shows a ten-slot table, 0-9):
H(K) = K mod M
H2(K) = 1 + ((K/M) mod (M − 1))
Insert these values into the hash table in this order, resolving any collisions with double hashing: 13, 28, 33, 147, 43.

Double Hashing is Safe for λ < 1
Let h(k) = k mod p and g(k) = q − (k mod q), where 2 < q < p and p and q are primes. The probe sequence (h(k) + i·g(k)) mod p probes every entry of the hash table.
Let 0 ≤ m < p, h = h(k), and g = g(k). We show that (h + i·g) mod p = m for some i.
0 < g < p, so g and p are relatively prime. By the extended Euclidean algorithm there are s and t such that s·g + t·p = 1.
Choose i = (m − h)·s mod p. Then
(h + i·g) mod p = (h + (m − h)·s·g) mod p
= (h + (m − h)·s·g + (m − h)·t·p) mod p
= (h + (m − h)·(s·g + t·p)) mod p
= (h + (m − h)) mod p = m mod p = m.

Deletion in Hashing
Open hashing (chaining): no problem.
Closed hashing: must do lazy deletion. Deleted keys are marked as deleted.
Find: done normally, probing past marked slots.
Insert: treat a marked slot as an empty slot and fill it.
Example (h(k) = k mod 7, linear probing): the table holds 16 (slot 2), 59 (slot 4), 76 (slot 6), and slot 3 is marked deleted (it held 23). Insert 30: 30 mod 7 = 2 is full, so 30 reuses the deleted slot 3. Find 59: 59 mod 7 = 3 now holds 30, so probe on to slot 4 and find 59.
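A sketch of lazy deletion in Python (the DELETED marker is illustrative; the example reproduces the slide's table):

    EMPTY, DELETED = None, "DEL"

    def insert(T, k):
        h = k % len(T)
        while T[h] not in (EMPTY, DELETED):   # a deleted slot is reusable
            h = (h + 1) % len(T)
        T[h] = k

    def find(T, k):
        h = k % len(T)
        for _ in range(len(T)):
            if T[h] == k:
                return h
            if T[h] is EMPTY:     # only a truly empty slot ends the search
                return -1
            h = (h + 1) % len(T)  # probe past DELETED slots
        return -1

    T = [EMPTY] * 7
    for k in [16, 23, 59, 76]:    # land in slots 2, 3, 4, 6
        insert(T, k)
    T[find(T, 23)] = DELETED      # delete 23: mark the slot, don't empty it
    insert(T, 30)                 # 30 mod 7 = 2 is full; reuses slot 3
    print(find(T, 59))            # 4: the probe walks past 30 in slot 3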

Rehashing
Idea: when the table gets too full, create a bigger table (usually 2x as large) and hash all the items from the original table into the new table.
When to rehash?
- When half full (λ = 0.5)
- When an insertion fails
- At some other threshold
Cost of rehashing:
- Go through the old hash table, ignoring items marked deleted
- Recompute the hash value for each non-deleted key and put the item in its new position in the new table
- Cannot just copy data from the old table, because the bigger table has a new hash function
Running time: O(N), but infrequent. Not good for real-time, safety-critical applications. (A sketch follows below.)
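A sketch of rehashing a linear-probing table in Python (a real version would also skip slots carrying a lazy-deletion marker, exactly like the empty ones here):

    def rehash(old):
        # The bigger table has a new hash function (k mod new size),
        # so every key must be re-placed, not copied slot-for-slot.
        new = [None] * (2 * len(old))
        for k in old:
            if k is None:              # skip empty (and deleted) slots
                continue
            h = k % len(new)
            while new[h] is not None:  # linear probing in the new table
                h = (h + 1) % len(new)
            new[h] = k
        return new

    T = [47, 55, 93, 10, None, 40, 76]
    T = rehash(T)   # now size 14, with every key re-placed at k mod 14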

Rehashing Example
Open hashing with h1(x) = x mod 5 rehashes to h2(x) = x mod 11.
Before: a size-5 table at λ = 1, whose elements include 37, 83, 52, and 98.
After: the same five elements rehashed into a size-11 table, λ = 5/11.

Rehashing Picture
Starting with a table of size 2, double when the load factor exceeds 1.
[Figure: the cost of inserts 1 through 25, split into ordinary hashes and rehashes; most inserts are cheap, with rehash spikes at the doubling points.]

Amortized Analysis of Rehashing
The cost of inserting n keys is < 3n. Suppose 2^k < n ≤ 2^(k+1). Then:
Hashes = n
Rehashes = 2 + 2² + … + 2^k = 2^(k+1) − 2
Total = n + 2^(k+1) − 2 < 3n, since 2^(k+1) < 2n.
Example: n = 33, Total = 33 + 64 − 2 = 95 < 99.

Case Study
Spelling dictionary: 30,000 words.
Goals: fast spell checking with minimal storage.
Possible solutions:
- Sorted array and binary search
- Open hashing (chaining)
- Closed hashing with linear probing
Notes: almost all searches are successful; 30,000 words at an average of 8 bytes per word is 240,000 bytes; pointers are 4 bytes.

Storage
Assume words are stored as strings and the entries in the arrays are pointers to the strings.
Binary search: N pointers
Open hashing: 2N + N/λ pointers
Closed hashing: N/λ pointers

Analysis
Binary search:
- Storage = N pointers + words = 360,000 bytes
- Time = log₂ N, under 15 probes in the worst case
Open hashing:
- Storage = (2N + N/λ) pointers + words; λ = 1 implies 600,000 bytes
- Time = 1 + λ/2 probes per access; λ = 1 implies 1.5 probes per access
Closed hashing:
- Storage = (N/λ) pointers + words; λ = 1/2 implies 480,000 bytes
- Time = (1/2)(1 + 1/(1 − λ)) probes; λ = 1/2 implies 1.5 probes per access
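The arithmetic, as a quick Python check (N = 30,000 words, 8 bytes per word, 4-byte pointers):

    N, WORD, PTR = 30_000, 8, 4
    words = N * WORD                                  # 240,000 bytes of strings
    lam_open, lam_closed = 1.0, 0.5
    print(N * PTR + words)                            # binary search:  360,000
    print(int((2 * N + N / lam_open) * PTR) + words)  # open hashing:   600,000
    print(int((N / lam_closed) * PTR) + words)        # closed hashing: 480,000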

Extendible Hashing
Extendible hashing is a technique for storing large data sets that do not fit in memory; it is an alternative to B-trees.
[Figure: 3 bits of the hash value index an in-memory directory (000 through 111); each directory entry points to a disk page of keys, and each page is labeled with the number of leading bits its keys share, e.g., a page marked (2) holding 00001, 00011, 00100, 00110.]

Splitting
[Figure: inserting 11000 overflows a page; the page splits into two pages distinguished by one more prefix bit, so that afterwards 11000, 11001, 11011 share one page marked (3) and 11100, 11110 another, while the directory itself is unchanged.]

Rehashing
[Figure: inserting 00111 into a table with a 2-bit directory (00, 01, 10, 11) overflows the page marked (2) holding 00001, 00011, 00100, 00110. Since that page is already at the directory's depth, the directory doubles to 3 bits (000 through 111) and the page splits into (3) 00001, 00011 and (3) 00100, 00110, 00111.]

Fingerprints
Given a string x, we want a fingerprint x′ with these properties:
- x′ is short, say 128 bits
- Given x ≠ y, the probability that x′ = y′ is infinitesimal (almost zero)
- Computing x′ is very fast
MD5 (Message Digest Algorithm 5) is a recognized standard, with applications in databases and cryptography.
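A sketch using Python's standard hashlib, which produces the 128-bit digests described above:

    import hashlib

    def fingerprint(s: str) -> str:
        # 128-bit MD5 digest, rendered as 32 hex characters.
        # (Fine as a fingerprint, but note MD5 is no longer considered
        # collision-resistant against deliberate attack.)
        return hashlib.md5(s.encode("utf-8")).hexdigest()

    print(fingerprint("hello"))   # 5d41402abc4b2a76b9719d911017c592
    print(fingerprint("hello") == fingerprint("hellp"))   # False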

Fingerprint Math
Given 128-bit fingerprints and N strings, what is the probability that the fingerprints of two strings coincide? By the birthday bound it is roughly N(N − 1)/2^129, which is essentially zero for N < 2^40 (about 2^−49 even at N = 2^40).

Hashing Summary
Hashing is one of the most important data structures.
Hashing has many applications where the operations are limited to find, insert, and delete.
Dynamic hash tables have good amortized complexity.
Extendible hashing is useful in databases.
Fingerprints are good for databases and cryptography.