Data Structures and Algorithms

Presentation transcript:

Data Structures and Algorithms (PLSD210)
Searching: Hash Tables

Hash Tables
All search structures so far relied on a comparison operation, giving performance of O(n) or O(log n).
Now assume I have a function f(key) → integer, i.e. one that maps a key to an integer.
What performance might I expect now?

Hash Tables - Structure
Simplest case: assume items have integer keys in the range 1 .. m.
Use the value of the key itself to select a slot in a direct-access table in which to store the item.
To search for an item with key k, just look in slot k:
- if there's an item there, you've found it,
- if the slot is empty (its tag is 0), the key isn't in the table.
Constant time, O(1).
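
As a concrete illustration, here is a minimal direct-address table sketch in C. The names, the table size M, the 0 .. M-1 key range and the use of key == -1 as an "empty" marker are illustrative assumptions, not part of the slides.

#include <stddef.h>

enum { M = 100 };                               /* assumed table size */

typedef struct { int key; const char *data; } Item;

void da_init(Item t[])                          /* mark every slot empty */
{
    for (int i = 0; i < M; i++) t[i].key = -1;
}

void da_insert(Item t[], Item x) { t[x.key] = x; }           /* O(1) */

Item *da_search(Item t[], int k)                             /* O(1): just look in slot k */
{
    return (t[k].key == k) ? &t[k] : NULL;
}

void da_delete(Item t[], int k) { t[k].key = -1; }           /* O(1) */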

Hash Tables - Constraints
- Keys must be unique.
- Keys must lie in a small range.
- For storage efficiency, keys must be dense in the range; if they're sparse (lots of gaps between values), a lot of space is used to obtain speed.
Space-for-speed trade-off.

Hash Tables - Relaxing the constraints
Keys must be unique?
Construct a linked list of duplicates "attached" to each slot.
If a search can be satisfied by any item with key k, performance is still O(1).
But if the item has some other distinguishing feature which must be matched, we get O(n_max), where n_max is the largest number of duplicates, i.e. the length of the longest chain.

Hash Tables - Relaxing the constraints
Keys are integers?
Need a hash function h(key) → integer, i.e. one that maps a key to an integer.
Applying this function to the key produces an address.
If h maps each key to a unique integer in the range 0 .. m-1, then search is O(1).

Hash Tables - Hash functions
Form of the hash function. Example, using an n-character key:

int hash( char *s, int n )
{
    int sum = 0;
    while ( n-- )
        sum = sum + *s++;
    return sum % 256;
}

This returns a value in 0 .. 255. The xor function is also commonly used:

    sum = sum ^ *s++;

But any function that generates integers in 0 .. m-1, for some suitable (not too large) m, will do, as long as the hash function itself is O(1)!

Hash Tables - Collisions
With this hash function:

int hash( char *s, int n )
{
    int sum = 0;
    while ( n-- )
        sum = sum + *s++;
    return sum % 256;
}

hash( "AB", 2 ) and hash( "BA", 2 ) return the same value!
This is called a collision.
A variety of techniques are used for resolving collisions.
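
A quick way to see the collision the slide describes: summing character codes is order-independent, so any two strings that are permutations of each other collide. A small self-contained check (the main() wrapper is just for illustration):

#include <stdio.h>

int hash(const char *s, int n)
{
    int sum = 0;
    while (n--) sum = sum + *s++;
    return sum % 256;
}

int main(void)
{
    printf("hash(\"AB\") = %d\n", hash("AB", 2));   /* 'A' + 'B' = 65 + 66 = 131 */
    printf("hash(\"BA\") = %d\n", hash("BA", 2));   /* 'B' + 'A' = 66 + 65 = 131 */
    return 0;
}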

Hash Tables - Collision handling
Collisions occur when the hash function maps two different keys to the same address.
The table must be able to recognise and resolve this.
Recognise:
- store the actual key with the item in the hash table,
- compute the address k = h( key ),
- check for a hit: if ( table[k].key == key ) then hit, else try next entry.
Resolution:
- a variety of techniques; we'll look at various "try next entry" schemes.

Hash Tables - Linked lists
Collision resolution: a linked list is attached to each primary table slot.
Example (from the slide's diagram): h(i) == h(i1), and h(k) == h(k1) == h(k2).
Searching for i1:
- calculate h(i1),
- the item in the table, i, doesn't match,
- follow the linked list to i1,
- if NULL is found, the key isn't in the table.
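
A minimal separate-chaining sketch, assuming string keys and the additive hash from the earlier slide; the slot count, key-length limit and function names are illustrative only.

#include <stdlib.h>
#include <string.h>

enum { NSLOTS = 256 };

typedef struct Node {
    char key[32];
    struct Node *next;
} Node;

static Node *slot[NSLOTS];                   /* chain heads, all NULL initially */

static unsigned hash_str(const char *s)      /* additive hash, as on the earlier slide */
{
    unsigned sum = 0;
    while (*s) sum += (unsigned char)*s++;
    return sum % NSLOTS;
}

void chain_insert(const char *key)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL) return;                   /* sketch: ignore allocation failure */
    strncpy(n->key, key, sizeof n->key - 1);
    n->key[sizeof n->key - 1] = '\0';
    unsigned h = hash_str(key);
    n->next = slot[h];                       /* push onto the front of this slot's chain */
    slot[h] = n;
}

Node *chain_search(const char *key)
{
    for (Node *n = slot[hash_str(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)        /* stored key lets us recognise a true hit */
            return n;
    return NULL;                             /* reached NULL: key isn't in the table */
}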

Hash Tables - Overflow area
The linked list is constructed in a special area of the table called the overflow area.
Example: h(k) == h(j), and k is stored first.
Adding j:
- calculate h(j),
- find k,
- get the first free slot in the overflow area,
- put j in it,
- k's pointer points to this slot.
Searching: same as for linked lists.
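
One possible way to code the overflow-area idea, assuming non-negative integer keys (with -1 reserved as the "empty" marker) and illustrative sizes; chains are threaded through a "next" index rather than pointers.

enum { PRIMARY = 256, OVERFLOW_SLOTS = 64, EMPTY = -1 };

typedef struct { int key; int next; } Slot;  /* next: index into the overflow area, or EMPTY */

static Slot table[PRIMARY + OVERFLOW_SLOTS];
static int next_free = PRIMARY;              /* first unused slot in the overflow area */

void of_init(void)
{
    for (int i = 0; i < PRIMARY + OVERFLOW_SLOTS; i++) {
        table[i].key = EMPTY;
        table[i].next = EMPTY;
    }
}

int of_insert(int key)                       /* returns 1 on success, 0 if the overflow area is full */
{
    int i = key % PRIMARY;                   /* h(key); assumes key >= 0 */
    if (table[i].key == EMPTY) { table[i].key = key; return 1; }
    if (next_free == PRIMARY + OVERFLOW_SLOTS) return 0;
    int j = next_free++;                     /* get the first free slot in the overflow area */
    table[j].key = key;
    table[j].next = table[i].next;           /* splice the new slot into this chain */
    table[i].next = j;
    return 1;
}

int of_search(int key)                       /* returns the slot index, or EMPTY if absent */
{
    for (int i = key % PRIMARY; i != EMPTY; i = table[i].next)
        if (table[i].key == key) return i;
    return EMPTY;
}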

Hash Tables - Re-hashing
Use a second hash function, h'(x). Many variations; the general term is re-hashing.
Example: h(k) == h(j), and k is stored first.
Adding j:
- calculate h(j),
- find k already there,
- calculate h'(j) and repeat until we find an empty slot,
- put j in it.
Searching: use h(x), then h'(x).
h'(x) is the second hash function.

Hash Tables - Re-hash functions
The re-hash function: many variations.
Linear probing: h'(x) is +1, i.e. go to the next slot until you find an empty one.
Can lead to bad clustering: re-hashed keys fill in the gaps between other keys and exacerbate the collision problem.
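
A minimal open-addressing sketch with the "+1" re-hash (linear probing), assuming non-negative integer keys and -1 as the empty marker; the table size 101 is just an illustrative prime.

enum { LP_SIZE = 101, LP_EMPTY = -1 };

static int lp_table[LP_SIZE];

void lp_init(void)
{
    for (int i = 0; i < LP_SIZE; i++) lp_table[i] = LP_EMPTY;
}

int lp_insert(int key)                       /* returns the slot used, or -1 if the table is full */
{
    int i = key % LP_SIZE;                   /* h(key); assumes key >= 0 */
    for (int probes = 0; probes < LP_SIZE; probes++) {
        if (lp_table[i] == LP_EMPTY) { lp_table[i] = key; return i; }
        i = (i + 1) % LP_SIZE;               /* h'(x) is "+1": try the next slot */
    }
    return -1;
}

int lp_search(int key)                       /* returns the slot index, or -1 if absent */
{
    int i = key % LP_SIZE;
    for (int probes = 0; probes < LP_SIZE; probes++) {
        if (lp_table[i] == key) return i;
        if (lp_table[i] == LP_EMPTY) return -1;   /* an empty slot ends the probe sequence */
        i = (i + 1) % LP_SIZE;
    }
    return -1;
}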

Hash Tables - Re-hash functions
The re-hash function: many variations.
Quadratic probing: h'(x) is c·i² on the i-th probe.
Avoids primary clustering, but secondary clustering occurs: all keys which collide on h(x) follow the same sequence.
First a = h(j) = h(k), then a + c, a + 4c, a + 9c, ....
Secondary clustering is generally less of a problem.
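
The only change from linear probing is the probe sequence. A sketch of the position examined on the i-th probe, with the constant c and the table size as illustrative choices:

enum { QP_SIZE = 101, QP_C = 1 };

/* Position examined on the i-th probe after an initial collision at slot h:
   h, h + c, h + 4c, h + 9c, ... (mod table size). Every key that collides at
   the same h follows exactly this sequence, hence secondary clustering. */
int qp_probe(int h, int i)
{
    return (h + QP_C * i * i) % QP_SIZE;
}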

Hash Tables - Collision Resolution Summary
Chaining:
- unlimited number of elements and collisions,
- overhead of multiple linked lists.
Re-hashing:
- fast re-hashing, fast access through use of the main table space,
- maximum number of elements must be known,
- multiple collisions become probable.
Overflow area:
- fast access, collisions don't use primary table space,
- two parameters which govern performance need to be estimated.

Hash Tables - Summary so far ...
Potential O(1) search time, if a suitable function h(key) → integer can be found.
Space-for-speed trade-off: "full" hash tables don't work (more later!).
Collisions are inevitable: the hash function reduces the amount of information in the key.
Various resolution strategies:
- linked lists,
- overflow areas,
- re-hash functions:
  - linear probing, h' is +1,
  - quadratic probing, h' is +c·i²,
  - any other hash function, or even a sequence of functions!

Hash Tables - Choosing the Hash Function
"Almost any function will do", but some functions are definitely better than others!
Key criterion: a minimum number of collisions,
- keeps chains short,
- maintains the O(1) average.

Hash Tables - Choosing the Hash Function
Uniform hashing is the ideal. Let P(k) = probability that a key k occurs.
If there are m slots in our hash table, a uniform hashing function h(k) would ensure:

    ∑_{k | h(k)=0} P(k)  =  ∑_{k | h(k)=1} P(k)  =  ...  =  ∑_{k | h(k)=m-1} P(k)  =  1/m

(read "∑_{k | h(k)=0}" as the sum over all k such that h(k) = 0)
or, in plain English, the number of keys that map to each slot is equal.

Hash Tables - A Uniform Hash Function
If the keys are integers randomly distributed in [ 0, r ) (read as 0 ≤ k < r), then

    h(k) = ⌊m·k / r⌋

is a uniform hash function.
Most hashing functions can be made to map the keys to [ 0, r ) for some r, e.g. adding the ASCII codes for characters mod 256 will give values in [ 0, 256 ), i.e. [ 0, 255 ].
Replacing + by xor gives the same range without the mod operation.
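
A one-line sketch of this function, assuming k < r and that the product m·k fits in the chosen integer type:

#include <stdint.h>

/* h(k) = floor(m * k / r) for keys uniformly distributed in [0, r) */
uint32_t uniform_hash(uint64_t k, uint64_t r, uint64_t m)
{
    return (uint32_t)((m * k) / r);          /* assumes m * k fits in 64 bits */
}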

Hash Tables - Reducing the range to [ 0, m )
We've mapped the keys to a range of integers, 0 ≤ k < r.
Now we must reduce this range to [ 0, m ), where m is a reasonable size for the hash table.
Strategies:
- division (use a mod function),
- multiplication,
- universal hashing.

Hash Tables - Reducing the range to [ 0, m )
Division: use a mod function, h(k) = k mod m.
Choice of m? Powers of 2 are generally not good: h(k) = k mod 2^n just selects the last n bits of k, and all combinations of those bits are not generally equally likely.
Prime numbers close to 2^n seem to be good choices, e.g. for a ~4000-entry table, choose m = 4093.
(In the slide's figure, k mod 2^8 selects only the last 8 bits of the key 0110010111000011010.)
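
In code, the division method is a single mod, with m taken from the slide's ~4000-entry example (4093 is prime, close to 2^12 = 4096):

enum { DIV_M = 4093 };                       /* prime, close to 2^12 = 4096 */

unsigned div_hash(unsigned long k)
{
    return (unsigned)(k % DIV_M);            /* h(k) = k mod m */
}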

Hash Tables - Reducing the range to [ 0, m )
Multiplication method:
- multiply the key by a constant A, 0 < A < 1,
- extract the fractional part of the product, kA - ⌊kA⌋,
- multiply this by m:

    h(k) = ⌊m * ( kA - ⌊kA⌋ )⌋

Now m is not critical and a power of 2 can be chosen, so this procedure is fast on a typical digital computer:
- set m = 2^p,
- multiply k (w bits) by ⌊A·2^w⌋, giving a 2w-bit product,
- extract the p most significant bits of the lower half.
A = (√5 - 1)/2 seems to be a good choice (see Knuth).
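
A sketch of the fast power-of-two version for w = 32-bit keys; 2654435769 is ⌊A·2^32⌋ for A = (√5 - 1)/2, and p is the number of hash bits wanted (m = 2^p).

#include <stdint.h>

uint32_t mul_hash(uint32_t k, unsigned p)    /* returns a value in [0, 2^p); assumes 0 < p <= 32 */
{
    const uint32_t s = 2654435769u;          /* floor(A * 2^32), A = (sqrt(5) - 1) / 2 */
    uint32_t lower = k * s;                  /* low 32 bits of the 64-bit product k * s */
    return lower >> (32 - p);                /* the p most significant bits of that lower half */
}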

Hash Tables - Reducing the range to [ 0, m )
Universal hashing: a determined "adversary" can always find a set of data that will defeat any single hash function, hashing all keys to the same slot and giving an O(n) search.
So select the hash function randomly (at run time) from a set of hash functions: this reduces the probability of poor performance.
A set of functions H, each of which maps keys to [ 0, m ), is universal if, for each pair of distinct keys x and y, the number of functions h ∈ H for which h(x) = h(y) is |H|/m.

Hash Tables - Reducing the range to [ 0, m )
Universal hashing (continued): because the functions are selected at run time, each run can give different results, even with the same data, but good average performance is obtainable.

Hash Tables - Reducing the range to [ 0, m )
Universal hashing: can we design a set of universal hash functions? Quite easily.
View the key as a sequence of n-bit "bytes": x = <x_0, x_1, x_2, ..., x_r>.
Choose a = <a_0, a_1, a_2, ..., a_r>, a sequence of elements chosen randomly from { 0, 1, ..., m-1 }, and define

    h_a(x) = ( ∑ a_i·x_i ) mod m

There are m^(r+1) sequences a, so there are m^(r+1) functions h_a(x).
Theorem: the h_a form a set of universal hash functions. Proof: see Cormen.
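
A sketch of this family for byte-wise keys, with m = 257 (a prime, so every byte x_i < m) and at most 16 key bytes; the names and sizes are illustrative choices, not from the slides.

#include <stdlib.h>

enum { UH_M = 257, UH_R = 15 };              /* prime m; keys split into at most R+1 bytes */

static unsigned uh_a[UH_R + 1];              /* the randomly chosen sequence a */

void uh_choose_function(void)                /* pick h_a at run time, as the slide suggests */
{
    for (int i = 0; i <= UH_R; i++)
        uh_a[i] = (unsigned)(rand() % UH_M);
}

unsigned uh_hash(const unsigned char *x, int len)
{
    unsigned long sum = 0;
    for (int i = 0; i < len && i <= UH_R; i++)
        sum += (unsigned long)uh_a[i] * x[i];   /* each byte x_i is <= 255 < m */
    return (unsigned)(sum % UH_M);              /* h_a(x) = (sum of a_i * x_i) mod m */
}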

Collision Frequency
Birthdays, or the von Mises paradox: there are 365 days in a normal year, so surely birthdays on the same day are unlikely?
How many people do I need before "it's an even bet" (i.e. the probability is > 50%) that two have the same birthday?
View the days of the year as the slots in a hash table and the "birthday function" as mapping people to slots.
Answering von Mises' question answers the question about the probability of collisions in a hash table.

Distinct Birthdays
Let Q(n) = probability that n people have distinct birthdays. Q(1) = 1.
With two people, the 2nd has only 364 "free" birthdays, the 3rd has only 363, and so on:

    Q(2) = Q(1) * 364/365
    Q(n) = Q(1) * 364/365 * 363/365 * ... * (365-n+1)/365

Coincident Birthdays
The probability of having two identical birthdays is P(n) = 1 - Q(n), and P(23) = 0.507.
With 23 entries, the table is only 23/365 = 6.3% full!
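
These figures are easy to check; a minimal computation of Q(23) and P(23), assuming 365 equally likely birthdays:

#include <stdio.h>

int main(void)
{
    double q = 1.0;                          /* Q(1) = 1 */
    for (int n = 2; n <= 23; n++)
        q *= (365.0 - (n - 1)) / 365.0;      /* the n-th person must avoid the n-1 used days */
    printf("P(23) = %.3f\n", 1.0 - q);       /* prints 0.507 */
    return 0;
}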

Hash Tables - Load factor
Collisions are very probable! The table load factor,

    α = n / m   (n = number of items, m = number of slots)

must be kept low.
Detailed analyses of the average chain length (or number of comparisons per search) are available.
Separate chaining (linked lists attached to each slot) gives the best performance, but uses more space!

Hash Tables - General Design
Choose the table size: large tables reduce the probability of collisions!
With table size m and n items, the collision probability depends on the load factor α = n / m.
Choose a table organisation:
- does the collection keep growing? Linked lists (....... but consider a tree!),
- is the size relatively static? Overflow area or re-hash.
Choose a hash function ....

Hash Tables - General Design
Choose a hash function: a simple (and fast) one may well be fine ... read your text for some ideas!
Check the hash function against your data:
- fixed data: try various h, m until the maximum collision chain is acceptable; this gives known performance,
- changing data: choose some representative data and try various h, m until the collision chain is OK; this usually gives predictable performance.

Hash Tables - Review
If you can meet the constraints, hash tables will generally give good performance: O(1) search.
Like radix sort, they rely on calculating an address from a key.
But, unlike radix sort, it's relatively easy to get good performance with a little experimentation, so they're not advisable for unknown data.
They suit a collection whose size is relatively static, and memory management is actually simpler: all memory is pre-allocated!