Instructor: Lilian de Greef Quarter: Summer 2017

CSE 373: Data Structures and Algorithms Lecture 7: Hash Table Collisions

Today
Announcements
Hash Table Collisions
Collision Resolution Schemes
Separate Chaining
Open Addressing / Probing: Linear Probing, Quadratic Probing, Double Hashing
Rehashing

Announcements
Reminder: homework 2 due tomorrow.
Homework 3: Hash Tables. Will be out tomorrow night.
Pair-programming opportunity! (work with a partner) Ideas for finding a partner: before/after class, section, Piazza.
Pair-programming: write code together. 2 people, 1 keyboard. One is the “navigator,” the other the “driver.” Regularly switch off to spend equal time in both roles. (Side note: our brains tend to edit out our own typos.)
You need to be in the same physical space for the entire assignment, so pick a partner and plan accordingly!

Review: Hash Tables & Collisions

Hash Tables: Review
A data structure for the dictionary ADT: average-case O(1) find, insert, and delete (under some often-reasonable assumptions).
An array storing (key, value) pairs. The hash value is calculated from the key, and the hash value and table size give the array index:
apply hash function: h(key) = hash value
index = hash value % table size
array[index] = (key, value)
if collision, apply collision resolution

Hash Table Collisions: Review
We try to avoid them by choosing a good hash function and a large enough table. Unfortunately, collisions are unavoidable in practice:
Number of possible keys >> table size
No perfect hash function & table-index combo

Collision Resolution Schemes: your ideas


Separate Chaining One of several collision resolution schemes

Separate Chaining
All keys that map to the same table location (aka “bucket”) are kept in a list (a “chain”).
Example: insert 10, 22, 107, 12, 42 with TableSize = 10 (for illustrative purposes, we’re inserting hash values directly). (Table diagram: array indices 0–9.)
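The chaining example above can be sketched in code. This is a minimal illustration, not the course's implementation; the class and method names are my own, and since the slide inserts hash values directly, h(key) is just the identity here.

```python
# Minimal separate-chaining sketch: each bucket holds a Python list
# (the "chain") of all keys that hash to that index.

class ChainedHashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def insert(self, key):
        # The slide inserts hash values directly, so h(key) = key.
        index = key % self.size
        if key not in self.buckets[index]:   # keep keys unique
            self.buckets[index].append(key)

    def find(self, key):
        # Scans one chain; worst case all n keys share a bucket -> O(n).
        return key in self.buckets[key % self.size]

table = ChainedHashTable(10)
for k in [10, 22, 107, 12, 42]:
    table.insert(k)

print(table.buckets[0])  # [10]
print(table.buckets[2])  # [22, 12, 42]  <- 22, 12, 42 all collide at index 2
print(table.buckets[7])  # [107]
```

Note how 22, 12, and 42 all land in bucket 2 and simply chain together, matching the slide's example.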

Separate Chaining: Worst-Case
What’s the worst-case scenario for find? Every key hashes to the same bucket, so all n elements end up in a single chain.
What’s the worst-case running time for find? O(n).
But only with really bad luck or a really bad hash function.

Separate Chaining: Further Analysis How can find become slow when we have a good hash function? How can we reduce its likelihood?

Rigorous Analysis: Load Factor
Definition: The load factor (λ) of a hash table with N elements is λ = N / TableSize.
Under separate chaining, the average number of elements per bucket is λ.
For a random find, on average:
an unsuccessful find compares against λ items
a successful find compares against about λ/2 items

Rigorous Analysis: Load Factor
Definition: The load factor (λ) of a hash table with N elements is λ = N / TableSize.
To choose a good load factor, what are our goals? Fast operations (short chains) without wasting too much table space.
So for separate chaining, a good load factor is a small constant, around λ ≈ 1.

Open Addressing / Probing Another family of collision resolution schemes

Idea: use empty space in the table
If h(key) is already full, try (h(key) + 1) % TableSize. If full, try (h(key) + 2) % TableSize. If full, try (h(key) + 3) % TableSize. If full…
Example: insert 38, 19, 8, 109, 10. (Table diagram: array indices 0–9.)
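The probe-the-next-slot idea above can be sketched directly. This is my own illustrative helper (the slides work the example by hand); as before, the inserted numbers are treated as hash values, so h(key) = key.

```python
# Open addressing with linear probing: if slot h(key) is full,
# try (h(key)+1) % TableSize, then +2, +3, ... (the slide's rule).

EMPTY = None
SIZE = 10
table = [EMPTY] * SIZE

def insert_linear(table, key):
    for i in range(len(table)):            # at most TableSize probes
        index = (key + i) % len(table)     # h(key) = key for this example
        if table[index] is EMPTY:
            table[index] = key
            return index
    raise RuntimeError("table is full")

for k in [38, 19, 8, 109, 10]:
    insert_linear(table, k)

print(table)
# [8, 109, 10, None, None, None, None, None, 38, 19]
# 8 collided at index 8, probed 9 (taken by 19), then wrapped around to 0;
# 109 probed 9 and 0, settling at 1; 10 probed 0 and 1, settling at 2.
```

Notice how 8, 109, and 10 pile up in consecutive slots 0, 1, 2: the beginning of the clustering problem discussed below.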

Open Addressing Terminology
Trying the next spot is called probing (this family of schemes is also called open addressing).
We just did linear probing: the ith probe was (h(key) + i) % TableSize.
In general we have some probe function f and use (h(key) + f(i)) % TableSize.

Dictionary Operations with Open Addressing
insert finds an open table position using a probe function.
What about find? Follow the same probe sequence until we find the key or hit an empty slot.
What about delete? We can’t simply empty the slot, or later probe sequences would stop early; a common fix is lazy deletion (mark the slot as deleted). Note: delete with separate chaining is plain-old list-remove.

Practice: The keys 12, 18, 13, 2, 3, 23, 5, and 15 are inserted into an initially empty hash table of length 10 using open addressing with hash function h(k) = k mod 10 and linear probing. What is the resultant hash table? (Indices 0–9; empty slots omitted.)
(A) 3: 23, 5: 15, 8: 18
(B) 2: 12, 3: 13, 8: 18
(C) 2: 12, 3: 13, 4: 2, 5: 3, 6: 23, 7: 5, 8: 18, 9: 15
(D) 2: (12, 2), 3: (13, 3, 23), 5: (5, 15), 8: 18
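You should work the practice question by hand first, but the result can be checked mechanically. The helper below is my own sketch of the stated rules (h(k) = k mod 10, linear probing):

```python
# Check the practice question: insert the keys with h(k) = k % size
# and linear probing, then print the resulting table.

def build_table(keys, size=10):
    table = [None] * size
    for k in keys:
        i = k % size
        while table[i] is not None:    # linear probing: try next slot
            i = (i + 1) % size
        table[i] = k
    return table

result = build_table([12, 18, 13, 2, 3, 23, 5, 15])
print(result)
# [None, None, 12, 13, 2, 3, 23, 5, 18, 15]
# e.g. key 2 hashes to 2, finds 12 and 13 in the way, lands at index 4.
```

The output matches option (C): each collided key slides forward to the next open slot rather than overwriting (A), being dropped (B), or chaining (D).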

Open Addressing: Linear Probing
Quick to compute! But mostly a bad idea. Why?

(Primary) Clustering [R. Sedgewick]
Linear probing tends to produce clusters, which lead to long probing sequences. This is called primary clustering. We saw this starting in our example.

Analysis of Linear Probing
For any λ < 1, linear probing will find an empty slot. It is “safe” in this sense: no infinite loop unless the table is full.
Non-trivial facts we won’t prove: average # of probes given λ (in the limit as TableSize → ∞):
Unsuccessful search: ½ (1 + 1/(1−λ)²)
Successful search: ½ (1 + 1/(1−λ))
This is pretty bad: we need to leave sufficient empty space in the table to get decent performance (see chart). As λ increases (gets closer to 1), so does the number of probes; λ must be less than one for probing to work at all. If λ = 0.75, expect about 8.5 probes on an unsuccessful search; if λ = 0.9, about 50.
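Plugging the load factors from the speaker notes into these formulas confirms the quoted probe counts. The function names below are my own:

```python
# Average probe counts for linear probing (formulas from the analysis above).

def probes_unsuccessful(lam):
    # ½ (1 + 1/(1−λ)²)
    return 0.5 * (1 + 1 / (1 - lam) ** 2)

def probes_successful(lam):
    # ½ (1 + 1/(1−λ))
    return 0.5 * (1 + 1 / (1 - lam))

print(probes_unsuccessful(0.75))  # 8.5
print(probes_unsuccessful(0.9))   # ~50.5, the "50 probes" from the notes
print(probes_successful(0.75))    # 2.5: successful search degrades more slowly
```

The blow-up as λ → 1 is the quantitative version of "leave sufficient empty space in the table."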

Analysis: Linear Probing
Linear-probing performance degrades rapidly as the table gets full. (The formula assumes a “large table,” but the point remains.)
By comparison, chaining performance is linear in λ and has no trouble with λ > 1.

Any ideas for alternatives?

Open Addressing: Quadratic Probing
We can avoid primary clustering by changing the probe function in (h(key) + f(i)) % TableSize.
A common technique is quadratic probing: f(i) = i². So the probe sequence is:
0th probe: h(key) % TableSize
1st probe: (h(key) + 1) % TableSize
2nd probe: (h(key) + 4) % TableSize
3rd probe: (h(key) + 9) % TableSize
…
ith probe: (h(key) + i²) % TableSize
Intuition: Probes quickly “leave the neighborhood.”

Quadratic Probing Example #1
TableSize = 10. Insert: 89, 18, 49, 58, 79.
ith probe: (h(key) + i²) % TableSize. (Table diagram: array indices 0–9.)

Quadratic Probing Example #2
TableSize = 7. Insert: 76 (76 % 7 = 6), 40 (40 % 7 = 5), 48 (48 % 7 = 6), 5 (5 % 7 = 5), 55 (55 % 7 = 6), 47 (47 % 7 = 5).
ith probe: (h(key) + i²) % TableSize. (Table diagram: array indices 0–6.)
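Example #2 is worth tracing in code, because it demonstrates the failure mode covered on the next slide: after the first five inserts the load factor exceeds ½, and key 47 cycles through the same full slots forever even though two slots remain empty. This sketch (helper name mine) caps probing at TableSize attempts so it can report the failure:

```python
# Quadratic probing: probe i goes to (h(key) + i*i) % TableSize.

def insert_quadratic(table, key):
    size = len(table)
    for i in range(size):                  # give up after TableSize probes
        index = (key % size + i * i) % size
        if table[index] is None:
            table[index] = key
            return index
    return None                            # probing cycled; insert failed

table = [None] * 7
for k in [76, 40, 48, 5, 55]:
    insert_quadratic(table, k)

print(table)
# [48, None, 5, 55, None, 40, 76] -- indices 1 and 4 are still free

print(insert_quadratic(table, 47))
# None: 47's probe sequence (5, 6, 2, 0, 0, 2, 6, ...) revisits
# full slots and never reaches index 1 or 4.
```

This is exactly why the next slide requires a prime TableSize and λ < ½ for the termination guarantee.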

Quadratic Probing: Bad News, Good News
Bad news: quadratic probing can cycle through the same full indices, never terminating despite the table not being full.
Good news: If TableSize is prime and λ < ½, then quadratic probing will find an empty slot in at most TableSize/2 probes.
So: if you keep λ < ½ and TableSize prime, there is no need to detect cycles.
The proof is posted online next to the lecture slides; a slightly less detailed proof is in the textbook.
Key fact: For prime T and 0 < i, j < T/2 where i ≠ j, (k + i²) % T ≠ (k + j²) % T (i.e., no index repeats).

Clustering Part 2
Quadratic probing does not suffer from primary clustering: no problem with keys initially hashing to the same neighborhood.
But it’s no help if keys initially hash to the same index: this is called secondary clustering.
We can avoid secondary clustering with a probe function that depends on the key: double hashing.

Open Addressing: Double Hashing
Idea: Given two good hash functions h and g, it is very unlikely that for some key, h(key) == g(key). So make the probe function f(i) = i*g(key).
Probe sequence:
0th probe: h(key) % TableSize
1st probe: (h(key) + g(key)) % TableSize
2nd probe: (h(key) + 2*g(key)) % TableSize
3rd probe: (h(key) + 3*g(key)) % TableSize
…
ith probe: (h(key) + i*g(key)) % TableSize

Double Hashing Analysis
Intuition: Because each probe is “jumping” by g(key) each time, we “leave the neighborhood” and “go different places from other initial collisions.”
Requirements for the second hash function: g(key) must never evaluate to 0 (or we would never advance), and the probe sequence must be able to reach every slot.
Example of a double hash function pair that works:
h(key) = key % p
g(key) = q − (key % q)
where 2 < q < p, and p and q are prime.
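The h/g pair above can be sketched as follows. The concrete primes p = 11, q = 7 and the colliding keys are my own illustrative choices: all four keys hash to the same home index, yet their different g values send them to different slots, which is precisely how double hashing avoids secondary clustering.

```python
# Double hashing with the slide's example pair:
# h(key) = key % p, g(key) = q - (key % q), primes 2 < q < p.
P, Q = 11, 7   # p is also the TableSize

def insert_double(table, key):
    h = key % P
    g = Q - (key % Q)                # lies in [1, Q], so never 0
    for i in range(P):
        index = (h + i * g) % P      # ith probe: (h + i*g) % TableSize
        if table[index] is None:
            table[index] = key
            return index
    return None

table = [None] * P
# 13, 2, 24, 46 all have h(key) = 2, but g = 1, 5, 4, 3 respectively.
for k in [13, 2, 24, 46]:
    insert_double(table, k)

print(table)
# [None, None, 13, None, None, 46, 24, 2, None, None, None]
# 13 takes slot 2; 2 jumps by 5 to slot 7; 24 jumps by 4 to slot 6;
# 46 jumps by 3 to slot 5 -- no shared probe path despite identical h.
```

With quadratic probing these four keys would all have followed the identical probe sequence from index 2.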

More Double Hashing Facts
Assume “uniform hashing”: the probability that g(key1) % p == g(key2) % p is 1/p.
Non-trivial facts we won’t prove: average # of probes given λ (in the limit as TableSize → ∞):
Unsuccessful search (intuitive): 1/(1−λ)
Successful search (less intuitive): (1/λ) ln(1/(1−λ))
Bottom line: unsuccessful search is bad (though not as bad as with linear probing), but successful search is not nearly as bad.
If items are distributed uniformly, the probability of a collision is N/TableSize, i.e., the load factor. The expected number of probes for a successful search is as good as random. For an unsuccessful search with λ = 0.8, expect to check 1/(1 − 0.8) = 1/0.2 = 5 slots.

Charts

Rehashing

Rehashing
What do we do if the table gets too full? Make a bigger table and move everything over: rehashing.
How do we copy over elements? Reinsert each one, recomputing its index with the new table size (the hash value stays the same, but index = hash % TableSize changes).

Rehashing What’s “too full” in Separate Chaining? “Too full” for Open Addressing / Probing

Rehashing
How big do we want to make the new table? Roughly twice as big, keeping the size prime.
You can keep a list of prime numbers in your code, since you likely won’t grow more than 20–30 times (2^30 = 1,073,741,824).
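Putting the rehashing ideas together, here is a sketch of growing a linear-probing table to the next prime at least twice the old size and reinserting every element. The prime list and helper names are my own illustrative choices, not the course's code:

```python
# Rehashing: grow to ~2x the size (next prime) and reinsert everything,
# because index = hash % TableSize changes when TableSize changes.

PRIMES = [11, 23, 47, 97, 197, 397, 797]   # roughly-doubling primes

def next_prime_at_least(n):
    return next(p for p in PRIMES if p >= n)

def insert_probing(table, key):
    i = key % len(table)
    while table[i] is not None:            # linear probing, as before
        i = (i + 1) % len(table)
    table[i] = key

def rehash(table):
    new_table = [None] * next_prime_at_least(2 * len(table))
    for key in table:
        if key is not None:
            insert_probing(new_table, key)  # index recomputed with new size
    return new_table

old = [None] * 11
for k in [12, 23, 34]:     # all collide at 12 % 11 = 23 % 11 = 34 % 11 = 1
    insert_probing(old, k)

new = rehash(old)
print(len(new))                  # 23 (next prime >= 22)
print(new[12], new[0], new[11])  # 12 23 34 -- the old cluster dissolves
```

Note that the three keys, which formed one cluster at indices 1, 2, 3 in the old table, spread out completely under the new modulus: copying the array verbatim would have left every element at the wrong index.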