CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions

Slides:



Advertisements
Similar presentations
Preliminaries Advantages –Hash tables can insert(), remove(), and find() with complexity close to O(1). –Relatively easy to program Disadvantages –There.
Advertisements

Hash Tables.
Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
CSE332: Data Abstractions Lecture 11: Hash Tables Dan Grossman Spring 2010.
1 CSE 326: Data Structures Hash Tables Autumn 2007 Lecture 14.
Introduction to Hashing CS 311 Winter, Dictionary Structure A dictionary structure has the form: (Key, Data) Dictionary structures are organized.
Hashing General idea: Get a large array
Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.
Hash Tables. Container of elements where each element has an associated key Each key is mapped to a value that determines the table cell where element.
Data Structures and Algorithm Analysis Hashing Lecturer: Jing Liu Homepage:
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture8.
CSE332: Data Abstractions Lecture 11: Hash Tables Dan Grossman Spring 2012.
CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions Kevin Quinn Fall 2015.
Searching Given distinct keys k 1, k 2, …, k n and a collection of n records of the form »(k 1,I 1 ), (k 2,I 2 ), …, (k n, I n ) Search Problem - For key.
Data Structures and Algorithms Hashing First Year M. B. Fayek CUFE 2010.
Chapter 5: Hashing Part I - Hash Tables. Hashing  What is Hashing?  Direct Access Tables  Hash Tables 2.
Hashing Chapter 7 Section 3. What is hashing? Hashing is using a 1-D array to implement a dictionary o This implementation is called a "hash table" Items.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
CSE 373 Data Structures and Algorithms Lecture 17: Hashing II.
Tirgul 11 Notes Hash tables –reminder –examples –some new material.
Chapter 5: Hashing Collision Resolution: Open Addressing Extendible Hashing Mark Allen Weiss: Data Structures and Algorithm Analysis in Java Lydia Sinapova,
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
CSE373: Data Structures & Algorithms Lecture 13: Hash Collisions Lauren Milne Summer 2015.
Slide 1 Hashing Slides adapted from various sources.
Hash Tables Ellen Walker CPSC 201 Data Structures Hiram College.
DS.H.1 Hashing Chapter 5 Overview The General Idea Hash Functions Separate Chaining Open Addressing Rehashing Extendible Hashing Application Example: Geometric.
Fundamental Structures of Computer Science II
Instructor: Lilian de Greef Quarter: Summer 2017
CE 221 Data Structures and Algorithms
Hashing (part 2) CSE 2011 Winter March 2018.
Data Structures Using C++ 2E
Hash table CSC317 We have elements with key and satellite data
CSE373: Data Structures & Algorithms Lecture 6: Hash Tables
Hashing CSE 2011 Winter July 2018.
Hashing - resolving collisions
Hashing Alexandra Stefan.
Hash Tables (Chapter 13) Part 2.
CSE 373: Hash Tables Hunter Zahn Summer 2016 UW CSE 373, Summer 2016.
Handling Collisions Open Addressing SNSCT-CSE/16IT201-DS.
Hashing Alexandra Stefan.
Data Structures Using C++ 2E
Hash functions Open addressing
Quadratic probing Double hashing Removal and open addressing Chaining
Hash tables Hash table: a list of some fixed size, that positions elements according to an algorithm called a hash function … hash function h(element)
Hash Table.
Instructor: Lilian de Greef Quarter: Summer 2017
CSE 2331/5331 Topic 8: Hash Tables CSE 2331/5331.
Hash Tables.
Data Structures and Algorithms
Collision Resolution Neil Tang 02/18/2010
Resolving collisions: Open addressing
CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions
Searching Tables Table: sequence of (key,information) pairs
Data Structures and Algorithms
CSE332: Data Abstractions Lecture 11: Hash Tables
CSE 326: Data Structures Hashing
Hashing Alexandra Stefan.
CS202 - Fundamental Structures of Computer Science II
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Tree traversal preorder, postorder: applies to any kind of tree
Pseudorandom number, Universal Hashing, Chaining and Linear-Probing
Collision Resolution Neil Tang 02/21/2008
Ch Hash Tables Array or linked list Binary search trees
Collision Handling Collisions occur when different elements are mapped to the same cell.
Ch. 13 Hash Tables  .
Data Structures and Algorithm Analysis Hashing
DATA STRUCTURES-COLLISION TECHNIQUES
Collision Resolution: Open Addressing Extendible Hashing
Lecture-Hashing.
CSE 373: Data Structures and Algorithms
Presentation transcript:

CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions Linda Shapiro Spring 2016

CSE373: Data Structures & Algorithms Announcements Friday: Review List and go over answers to Practice Problems Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Hash Tables: Review Aim for constant-time (i.e., O(1)) find, insert, and delete “On average” under some reasonable assumptions A hash table is an array of some fixed size But growable as we’ll see hash table … E int table-index collision? collision resolution client hash table library TableSize –1 Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Collision resolution Collision: When two keys map to the same location in the hash table We try to avoid it, but number-of-keys exceeds table size So hash tables should support collision resolution Ideas? Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Separate Chaining Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10 / 1 2 3 4 5 6 7 8 9 Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Separate Chaining Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10 1 / 2 3 4 5 6 7 8 9 10 / Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Separate Chaining Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10 1 / 2 3 4 5 6 7 8 9 10 / 22 / Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Separate Chaining Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10 1 / 2 3 4 5 6 7 8 9 10 / 22 / 107 / Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Separate Chaining Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10 1 / 2 3 4 5 6 7 8 9 10 / 12 22 / 107 / Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Separate Chaining Chaining: All keys that map to the same table location are kept in a list (a.k.a. a “chain” or “bucket”) As easy as it sounds Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10 1 / 2 3 4 5 6 7 8 9 10 / 42 12 22 / 107 / Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Thoughts on chaining Worst-case time for find? Linear But only with really bad luck or bad hash function So not worth avoiding (e.g., with balanced trees at each bucket) Beyond asymptotic complexity, some “data-structure engineering” may be warranted Linked list vs. array vs. tree Move-to-front upon access Maybe leave room for 1 element (or 2?) in the table itself, to optimize constant factors for the common case A time-space trade-off… Spring 2016 CSE373: Data Structures & Algorithms

Time vs. space (constant factors only here) 10 / 1 X 2 42 3 4 5 6 7 107 8 9 1 / 2 3 4 5 6 7 8 9 10 / 12 22 / 42 12 22 / 107 / Spring 2016 CSE373: Data Structures & Algorithms

More rigorous chaining analysis Definition: The load factor, , of a hash table is  number of elements Under chaining, the average number of elements per bucket is ___ Spring 2016 CSE373: Data Structures & Algorithms

More rigorous chaining analysis Definition: The load factor, , of a hash table is  number of elements Under chaining, the average number of elements per bucket is  ie. The average list has length  Spring 2016 CSE373: Data Structures & Algorithms

More rigorous chaining analysis Definition: The load factor, , of a hash table is  number of elements Under chaining, the average number of elements per bucket is  ie. The average list has length  So if some inserts are followed by random finds, then on average: Each unsuccessful find compares against ____ items Spring 2016 CSE373: Data Structures & Algorithms

More rigorous chaining analysis Definition: The load factor, , of a hash table is  number of elements Under chaining, the average number of elements per bucket is  ie. The average list has length  So if some inserts are followed by random finds, then on average: Each unsuccessful find compares against  items Each successful find compares against _____ items Spring 2016 CSE373: Data Structures & Algorithms

More rigorous chaining analysis Definition: The load factor, , of a hash table is  number of elements Under chaining, the average number of elements per bucket is  ie. The average list has length  So if some inserts are followed by random finds, then on average: Each unsuccessful find compares against  items Each successful find compares against  / 2 items So we like to keep  fairly low (e.g., 1 or 1.5 or 2) for chaining Spring 2016 CSE373: Data Structures & Algorithms

Alternative: No lists; Use empty space in the table Another simple idea: If h(key) is already full, try (h(key) + 1) % TableSize. If full, try (h(key) + 2) % TableSize. If full, try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10 / 1 2 3 4 5 6 7 8 38 9 Spring 2016 CSE373: Data Structures & Algorithms

Alternative: Use empty space in the table Another simple idea: If h(key) is already full, try (h(key) + 1) % TableSize. If full, try (h(key) + 2) % TableSize. If full, try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10 / 1 2 3 4 5 6 7 8 38 9 19 Spring 2016 CSE373: Data Structures & Algorithms

Alternative: Use empty space in the table Another simple idea: If h(key) is already full, try (h(key) + 1) % TableSize. If full, try (h(key) + 2) % TableSize. If full, try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10 8 1 / 2 3 4 5 6 7 38 9 19 Spring 2016 CSE373: Data Structures & Algorithms

Alternative: Use empty space in the table Another simple idea: If h(key) is already full, try (h(key) + 1) % TableSize. If full, try (h(key) + 2) % TableSize. If full, try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10 8 1 109 2 / 3 4 5 6 7 38 9 19 Spring 2016 CSE373: Data Structures & Algorithms

Alternative: Use empty space in the table Another simple idea: If h(key) is already full, try (h(key) + 1) % TableSize. If full, try (h(key) + 2) % TableSize. If full, try (h(key) + 3) % TableSize. If full… Example: insert 38, 19, 8, 109, 10 8 1 109 2 10 3 / 4 5 6 7 38 9 19 Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Probing hash tables Trying the next spot is called probing (also called open addressing) We just did linear probing ith probe was (h(key) + i) % TableSize In general have some probe function f and use h(key) + f(i) % TableSize Open addressing does poorly with high load factor  So want larger tables Too many probes means no more O(1) Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Other operations insert finds an open table position using a probe function What about find? Must use same probe function to “retrace the trail” for the data Unsuccessful search when reach empty position What about delete? Must use “lazy” deletion. Why? Marker indicates “no data here, but don’t stop probing” Note: delete with chaining is plain-old list-remove Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms (Primary) Clustering It turns out linear probing is a bad idea, even though the probe function is quick to compute (which is a good thing) Tends to produce clusters, which lead to long probing sequences Called primary clustering Saw this starting in our example [R. Sedgewick] Spring 2016 CSE373: Data Structures & Algorithms

Analysis of Linear Probing Trivial fact: For any  < 1, linear probing will find an empty slot It is “safe” in this sense: no infinite loop unless table is full Non-trivial facts we won’t prove: Average # of probes given  (in the limit as TableSize → ) Unsuccessful search: Successful search: This is pretty bad: need to leave sufficient empty space in the table to get decent performance (see chart) Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms In a chart Linear-probing performance degrades rapidly as table gets full (Formula assumes “large table” but point remains) By comparison, chaining performance is linear in  and has no trouble with >1 Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Quadratic probing We can avoid primary clustering by changing the probe function (h(key) + f(i)) % TableSize A common technique is quadratic probing: f(i) = i2 So probe sequence is: 0th probe: h(key) % TableSize 1st probe: (h(key) + 1) % TableSize 2nd probe: (h(key) + 4) % TableSize 3rd probe: (h(key) + 9) % TableSize … ith probe: (h(key) + i2) % TableSize Intuition: Probes quickly “leave the neighborhood” Spring 2016 CSE373: Data Structures & Algorithms

Quadratic Probing Example 1 2 3 4 5 6 7 8 9 TableSize=10 Insert: 89 18 49 58 79 Spring 2016 CSE373: Data Structures & Algorithms

Quadratic Probing Example 1 2 3 4 5 6 7 8 9 89 TableSize=10 Insert: 89 18 49 58 79 Spring 2016 CSE373: Data Structures & Algorithms

Quadratic Probing Example 1 2 3 4 5 6 7 8 18 9 89 TableSize=10 Insert: 89 18 49 58 79 Spring 2016 CSE373: Data Structures & Algorithms

Quadratic Probing Example 49 1 2 3 4 5 6 7 8 18 9 89 TableSize=10 Insert: 89 18 49 58 79 Spring 2016 CSE373: Data Structures & Algorithms

Quadratic Probing Example 49 1 2 58 3 4 5 6 7 8 18 9 89 TableSize=10 Insert: 89 18 49 58 79 Spring 2016 CSE373: Data Structures & Algorithms

Quadratic Probing Example 49 1 2 58 3 79 4 5 6 7 8 18 9 89 TableSize=10 Insert: 89 18 49 58 79 Spring 2016 CSE373: Data Structures & Algorithms

Another Quadratic Probing Example TableSize = 7 Insert: (76 % 7 = 6) (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) 1 2 3 4 5 6 Spring 2016 CSE373: Data Structures & Algorithms

Another Quadratic Probing Example TableSize = 7 Insert: (76 % 7 = 6) (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) 1 2 3 4 5 6 76 Spring 2016 CSE373: Data Structures & Algorithms

Another Quadratic Probing Example TableSize = 7 Insert: (76 % 7 = 6) (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) 1 2 3 4 5 40 6 76 Spring 2016 CSE373: Data Structures & Algorithms

Another Quadratic Probing Example TableSize = 7 Insert: (76 % 7 = 6) (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) 48 1 2 3 4 5 40 6 76 Spring 2016 CSE373: Data Structures & Algorithms

Another Quadratic Probing Example TableSize = 7 Insert: (76 % 7 = 6) (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) 48 1 2 5 3 4 40 6 76 Spring 2016 CSE373: Data Structures & Algorithms

Another Quadratic Probing Example TableSize = 7 Insert: (76 % 7 = 6) (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) 48 1 2 5 3 55 4 40 6 76 Spring 2016 CSE373: Data Structures & Algorithms

Another Quadratic Probing Example TableSize = 7 Insert: (76 % 7 = 6) (40 % 7 = 5) 48 (48 % 7 = 6) 5 ( 5 % 7 = 5) 55 (55 % 7 = 6) 47 (47 % 7 = 5) 48 1 2 5 3 55 4 40 6 76 Doh!: For all n, ((n*n) +5) % 7 is 0, 2, 5, or 6 No where to put the 47! Spring 2016 CSE373: Data Structures & Algorithms

From Bad News to Good News Quadratic probing can cycle through the same full indices, never terminating despite table not being full Good news: If TableSize is prime and  < ½, then quadratic probing will find an empty slot in at most TableSize/2 probes So: If you keep  < ½ and TableSize is prime, no need to detect cycles Spring 2016 CSE373: Data Structures & Algorithms

Clustering reconsidered Quadratic probing does not suffer from primary clustering: no problem with keys initially hashing to the same neighborhood But it’s no help if keys initially hash to the same index Called secondary clustering Can avoid secondary clustering with a probe function that depends on the key: double hashing… Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Double hashing Idea: Given two good hash functions h and g, it is very unlikely that for some key, h(key) == g(key) So make the probe function f(i) = i*g(key) Probe sequence: 0th probe: h(key) % TableSize 1st probe: (h(key) + g(key)) % TableSize 2nd probe: (h(key) + 2*g(key)) % TableSize 3rd probe: (h(key) + 3*g(key)) % TableSize … ith probe: (h(key) + i*g(key)) % TableSize Detail: Make sure g(key) cannot be 0 Spring 2016 CSE373: Data Structures & Algorithms

Double-hashing analysis Intuition: Because each probe is “jumping” by g(key) each time, we “leave the neighborhood” and “go different places from other initial collisions” But we could still have a problem like in quadratic probing where we are not “safe” (infinite loop despite room in table) It is known that this cannot happen in at least one case: h(key) = key % p g(key) = q – (key % q) 2 < q < p p and q are prime Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Practice ith probe: (h(key) + i*g(key)) % TableSize, i=0, 1, 2, … h(key) = key % p g(key) = q – (key % q) 2 < q < p p and q are prime With p=13 and q=11, TableSize=7 Insert 76 40 48 5 55 47 Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Practice ith probe: (h(key) + i*g(key)) % TableSize, i=0, 1, 2, … h(key) = key % p g(key) = q – (key % q) 2 < q < p p and q are prime With p=13 and q=11, TableSize=7 Insert 76 40 48 5 55 47 (76%13)%7=11%7=4 (40%13)%7=1%7=1 (48%13)%7=9%7=2 (5%13)%7=5%7=5 (55%13)%7=3%7=3 (47%13)%7=8%7=1 X (47%13)+1(11-(47%11))%7= (8+1x8)%7=2 X (8+2x8)%7=3 X (8+3x8)%7=4 X (8+4x8)%7=5 X (8+5x8)%7=6 DONE Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Rehashing As with array-based stacks/queues/lists, if table gets too full, create a bigger table and copy everything With chaining, we get to decide what “too full” means Keep load factor reasonable (e.g., < 1)? Consider average or max size of non-empty chains? For probing, half-full is a good rule of thumb New table size Twice-as-big is a good idea, except that won’t be prime! So go about twice-as-big Can have a list of prime numbers in your code since you won’t grow more than 20-30 times Spring 2016 CSE373: Data Structures & Algorithms

CSE373: Data Structures & Algorithms Summary Hashing gives us approximately O(1) behavior for both insert and find. Collisions are what ruin it. There are several different collision strategies. Chaining just uses linked lists pointed to by the hash table bins. Probing uses various methods for computing the next index to try if the first one is full. Rehashing makes a new, bigger table. If the table is kept reasonably empty (small load factor), and the hash function works well, we will get the kind of behavior we want. Spring 2016 CSE373: Data Structures & Algorithms