1 Dictionaries and Hash Tables

2 Dictionary In computer science, a dictionary is a container that stores key-element pairs, called items, and allows them to be retrieved quickly. – Items must be stored in a way that allows them to be located by their key – The items do not have to be stored in key order: an unordered dictionary keeps them in arbitrary order, while an ordered dictionary keeps them sorted by key

3 Dictionary ADT Operations in a Dictionary ADT:
int size()
bool isEmpty()
iter elements()
iter keys()
pos find( key )
iter findAll( key )
void insertItem( key, elem )
void removeElement( key )
void removeAllElements( key )
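
A minimal C++ interface sketch of this ADT, assuming int keys and string elements (the exact signatures, such as find returning its result through a reference instead of a position, are illustrative choices rather than part of the slides):

#include <list>
#include <string>

// Sketch of the Dictionary ADT; a real container would be templated on key and element types.
class Dictionary {
public:
    virtual ~Dictionary() {}
    virtual int  size() const = 0;                              // number of stored items
    virtual bool isEmpty() const = 0;                           // true when size() == 0
    virtual std::list<int> keys() const = 0;                    // all keys
    virtual std::list<std::string> elements() const = 0;        // all elements
    virtual bool find(int key, std::string& elem) const = 0;    // locate one item with this key
    virtual std::list<std::string> findAll(int key) const = 0;  // every element with this key
    virtual void insertItem(int key, const std::string& elem) = 0;
    virtual void removeElement(int key) = 0;                    // remove one item with this key
    virtual void removeAllElements(int key) = 0;                // remove every item with this key
};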

4 Dictionary Examples Natural language dictionary – the word is the key; the element contains the word, its definition, pronunciation, etc. Web pages – the URL is the key; the HTML or other file is the element. Any typical database (e.g. student records) – has one or more search keys; each key may require its own organizational dictionary.

5 Implementing a Dictionary There are many ways a dictionary can be implemented. Some of them are: – Log file or Audit Trail – Ordered Dictionary and Binary search trees – Hash table

6 Log File or Audit Trail This is the simplest way to implement a dictionary. It uses an unordered vector, list or sequence to store the key-element pairs. void insertItem( key, elem ) – each new item is appended at the end – O(1) pos find( key ) – scan the entire list and examine each key – O(n) void removeElement( key ) – scan the entire list to find the item, then remove it – O(n) This allows for fast insertions; however, find and removal are slow. – A good solution for items that are inserted frequently but retrieved rarely, such as archived database and operating-system transactions – Also the natural structure for storing a log file
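
A minimal sketch of this log-file approach, assuming int keys and string elements held in a std::vector (the struct and method names are illustrative):

#include <string>
#include <vector>

// Log-file dictionary: an unordered vector of key-element pairs.
struct Item { int key; std::string elem; };

struct LogFileDictionary {
    std::vector<Item> items;

    // O(1): simply append at the end.
    void insertItem(int key, const std::string& elem) {
        items.push_back({key, elem});
    }

    // O(n): scan every item until the key matches.
    const Item* find(int key) const {
        for (const Item& it : items)
            if (it.key == key) return &it;
        return nullptr;                    // not found
    }

    // O(n): scan for the key, then erase the first match.
    void removeElement(int key) {
        for (std::size_t i = 0; i < items.size(); ++i)
            if (items[i].key == key) { items.erase(items.begin() + i); return; }
    }
};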

7 Ordered Dictionary ADT All of the Dictionary operations, e.g. find(k), insertItem(k,e), removeElement(k) Additional operations: pos closestBefore( key ) pos closestAfter( key )

8 Look-Up Tables A look-up table is an implementation of an ordered dictionary (e.g. a trigonometry table). All items are stored in a vector, in ascending order of their keys. [Figure: a vector A with index positions 0–10 holding the stored items]

9 Lookup Table Performance In a look-up table, inserting or removing an item may require shifting elements to keep the keys in order. Example: inserting an item with a key of 2 shifts n elements to make room. [Figure: the vector A, indices 0–10, before and after the insertion] insertItem(k,e) takes O(n) time in the worst case; removeElement(k) takes O(n) time in the worst case.

10 Lookup Table – find(k) However, since the items in a lookup table are ordered, we can implement find(k) with a binary search algorithm. A binary search (or binary chop) is a technique for finding a particular value in a sorted array by ruling out half of the remaining data at each step. It examines the middle element, compares it with the desired value to determine whether that value lies before or after it, and then searches the remaining half in the same manner. Binary search is an example of a divide and conquer algorithm.

11 Binary Search Example: find(22) [Figure: a sorted 16-element vector A, indices 0–15; at each step the low, mid and high markers narrow the search to the half that can still contain 22, until low = mid = high at the position holding 22]

12 Binary Search Algorithm
Algorithm BinarySearch( A, k, low, high )
    if low > high then
        return Null
    else
        mid = (low + high) / 2
        if k == key(mid) then
            return Position(mid)
        else if k < key(mid) then
            return BinarySearch( A, k, low, mid – 1 )
        else
            return BinarySearch( A, k, mid + 1, high )
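
A runnable C++ version of the same recursive algorithm, returning the index of the key or -1 when it is absent (the vector-of-int signature is an assumption for this sketch):

#include <vector>

// Recursive binary search over a vector sorted in ascending order.
// Returns the index of k, or -1 if k is not present.
int binarySearch(const std::vector<int>& A, int k, int low, int high) {
    if (low > high) return -1;                                      // empty range: not found
    int mid = (low + high) / 2;
    if (k == A[mid])     return mid;                                // found at the midpoint
    else if (k < A[mid]) return binarySearch(A, k, low, mid - 1);   // continue in the left half
    else                 return binarySearch(A, k, mid + 1, high);  // continue in the right half
}
// Example call: binarySearch(A, 22, 0, (int)A.size() - 1) on a sorted vector A.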

13 Hash Tables In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that the hash table uses to locate the desired value. This is considered the most efficient way to implement a dictionary.

14 Hash Table

15 Bucket Arrays A bucket array for a hash table is an array A of size N, where each cell of A is thought of as a ‘bucket’ and N defines the capacity of the array. Example – Small company with fewer than 100 employees – Each employee has an ID number in the range 0–99 – Store employee records in an array, so that the employee ID number matches the array index [Figure: array A, indices 0–4: EMPTY, 01 Turing, A., 02 Babbage, C., EMPTY, 04 Gates, W., …]

16 Bucket Arrays If the keys are unique, then searches, insertions and removals in the bucket array take worst-case O(1) time. However, bucket arrays have two drawbacks: – They require a capacity of N, the maximum number of elements possible – The key has to be an integer in the range [0, N-1]

17 Hash Functions A good hash function is essential for good hash table performance. If a hash function tends to produce similar values, slow searches will result. Example – Small company with fewer than 100 employees – Employees already have a 5-digit ID number – A simple hash function for this example is ( ID % 100 ) [Figure: array A, indices 0–4: EMPTY, 55301 Turing, A., 81202 Babbage, C., EMPTY, 77404 Gates, W., …]

18 Hash Functions A hash function is a way of creating a small digital "fingerprint" from any kind of data. The function chops and mixes the data to create the fingerprint, often called a hash value. A good hash function is one that yields few hash collisions in expected input domains. To do this, the index into the hash table's array is generally calculated in two steps: – A generic hash value is calculated to map the key to an integer ( hash code ) – This value is reduced to a valid array index ( compression map )

19 Hash Code Take an arbitrary key k and assign it an integer value h. Then h is known as the hash code or hash value of k: key -> integer. This integer h does not need to be in the range of the array that is being used for hashing, and may even be negative, but we want the set of hash codes assigned to our keys to avoid collisions as much as possible. Hash coding can be done in many ways: – Integer cast – Summing components – Polynomial accumulation

20 Hash code – Integer Cast
int hashCode( int key ) {
    return key;
}

int hashCode( char key ) {
    return hashCode( int(key) );   // cast it to an integer
}

21 Hash code – Summing Components If the long int type has twice as many bits as the int datatype (e.g. 32 bits for int, 64 bits for long), treat the high-order bits as one integer and the low-order bits as another, then sum them:
int hashCode( long key ) {
    typedef unsigned long ulong;
    return hashCode( int( ulong(key) >> 32 ) + int( key ) );
}

22 Hash code – Summing Components Applied to Strings One approach is to sum the ASCII values of all the chars in the string – Problem: too many collisions, because many different words will produce the same result – For example, stop, tops, pots and spot all sum to the same value: s = 115, t = 116, o = 111, p = 112; hashcode = 115 + 116 + 111 + 112 = 454
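
A sketch of this character-sum approach (the function name is illustrative); running it on "stop", "tops", "pots" and "spot" returns 454 for all four, which is exactly the collision problem described above:

#include <string>

// Sum of the character codes of a string: simple, but anagrams collide.
int sumHashCode(const std::string& s) {
    int h = 0;
    for (char c : s)
        h += static_cast<unsigned char>(c);   // add each character's code
    return h;
}
// sumHashCode("stop") == sumHashCode("pots") == 454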

23 Hash code – Polynomial Accumulation A better approach for string keys: – Modify each char's ASCII value by a number based on its position in the string – Then sum the results – Where x_i is the char at position i, k is the total number of chars, and a is a constant (but not 1), the following formula can be used: x_0 * a^(k-1) + x_1 * a^(k-2) + … + x_(k-2) * a + x_(k-1) Example, assume that the string is "stop" and a = 10: s = 115 * 10^3 = 115000, t = 116 * 10^2 = 11600, o = 111 * 10^1 = 1110, p = 112 * 10^0 = 112; hashcode = 127822
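
A sketch of polynomial accumulation in C++, evaluated with Horner's rule so that each character needs only one multiply and one add (the default constant a = 33 is an assumption; the slide's worked example uses a = 10):

#include <string>

// Polynomial hash code: x_0*a^(k-1) + x_1*a^(k-2) + ... + x_(k-1),
// computed incrementally with Horner's rule.
int polyHashCode(const std::string& s, int a = 33) {
    int h = 0;
    for (char c : s)
        h = h * a + static_cast<unsigned char>(c);   // shift earlier chars up one power of a
    return h;
}
// polyHashCode("stop", 10) == 127822, matching the worked example above.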

24 Compression Maps This is the second part of the hash function's action. Once we have a hash code, we need to map it to an integer in the range of valid array indexes. This can be accomplished in many ways: – Truncation – Truncation and Summation – Division method – MAD method

25 Compression Maps - Truncation One way is to simply ignore part of the key and use the remaining digits. E.g. employee number: 15436578, bucket size: 1000 – possibility 1: k = last 3 digits = 578 – possibility 2: k = digits 4, 6 and 8 = 358 This is a fast scheme, but it fails to give an even distribution of keys throughout the table.

26 Compression Maps – Truncation and Summation This method uses a combination of truncating and summing parts of the key. E.g. employee number: 15496578, bucket size: 1000 – possibility: partition the key into groups of 3 digits, add the groups together, and truncate if necessary: k = 154 + 965 + 78 = 1197 → 197 This provides a better spread than simple truncation, but it still does not prevent collisions.

27 Compression Maps - Division Method
int k = hashCode( key );
int index = abs(k) % ARRAY_SIZE;
It has been found that the size of the array should be a prime number. This reduces the number of collisions and spreads out the distribution of hashed values. Example: Keys = {200, 210, 220, 230, …, 600} – If the array size is 100 (not prime), every key lands on one of only ten indexes (0, 10, …, 90), producing many collisions – If the array size is 101 (prime), the keys spread over distinct indexes and produce far fewer collisions
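
A sketch of the division method on the slide's example keys (compress is a hypothetical helper name; std::abs handles a possibly negative hash code):

#include <cstdlib>

const int ARRAY_SIZE = 101;   // a prime table size spreads the keys out

// Division-method compression map: hash code -> valid array index.
int compress(int hashCode) {
    return std::abs(hashCode) % ARRAY_SIZE;
}
// Keys 200, 210, 220, ..., 600 land on 41 distinct indexes with ARRAY_SIZE = 101,
// but collide heavily with ARRAY_SIZE = 100 (only indexes 0, 10, ..., 90 are used).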

28 Compression Maps - MAD Method This is another method to convert the hash code into a known range. MAD stands for “Multiply, Add, and Divide”:
int k = hashCode( key );
int i = abs( a * k + b ) % ARRAY_SIZE;
where a and b are non-negative integers, (a % ARRAY_SIZE) must not be 0, and a and b are chosen at random when the program is written. Example: Keys = {200, 210, 220, 230, …, 600}, a = 8, b = 7, array size = 100
200 => (8*200+7) % 100 => 7
210 => (8*210+7) % 100 => 87
220 => (8*220+7) % 100 => 67
230 => (8*230+7) % 100 => 47
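
A sketch of the MAD compression map with the slide's parameters (a = 8, b = 7, table size 100; the function and constant names are illustrative):

#include <cstdlib>

const int ARRAY_SIZE = 100;
const int A_MULT = 8;   // multiplier: A_MULT % ARRAY_SIZE must not be 0
const int B_ADD  = 7;   // additive constant

// MAD ("Multiply, Add, and Divide") compression map.
int madCompress(int hashCode) {
    return std::abs(A_MULT * hashCode + B_ADD) % ARRAY_SIZE;
}
// madCompress(200) == 7, madCompress(210) == 87, madCompress(220) == 67, madCompress(230) == 47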

29 Collisions There is no guarantee that keys are unique, or that the hash function generates a unique value for each key. This means there is a chance that more than one element will map to the same position. This is called a collision.

30 Collisions Two different keys are mapped to the same location in the array Best approach – minimize collisions by picking a good hash function Example – A bad hash function is ( key % 100 ) because it is too likely to cause collisions. key % 101 is better.

31 Collisions If two keys hash to the same index, the corresponding records cannot be stored in the same location. So, if the location is already occupied, we must find another location to store the new record, and do it in a way that lets us find it again when we look it up later. Example – The previous hash function of ( ID % 100 ) is too likely to cause collisions: 38104 McNealy, S. hashes to index 4, which is already occupied by 77404 Gates, W. [Figure: array A, indices 0–4: EMPTY, 55301 Turing, A., 81202 Babbage, C., EMPTY, 77404 Gates, W., with 38104 McNealy, S. colliding at index 4]

32 Collision Handling There are a number of collision resolution techniques, but the most popular are chaining and open addressing. Two different approaches – Chaining – Open addressing

33 Chaining Separate chaining is a method for dealing with collisions. The hash table is an array of linked lists: data elements that hash to the same value are stored in the linked list that starts at the array index given by their hash value. – Each location in the hash table holds a pointer to a list – Each list can hold many items – As long as the hash function is good, the lists will be small because there will be few collisions

34 Separate Chaining Example [Figure: a 13-bucket hash table A, indices 0–12; each bucket holds a pointer to a linked list (possibly NULL) of the items that hash to it, with the keys 90, 12, 38, 25, 36, 10, 41, 28, 54 and 18 distributed across the chains]
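
A minimal sketch of a separate-chaining table in C++, using std::list for the per-bucket chains and the division method for compression (the class and method names are illustrative, and non-negative integer keys are assumed):

#include <list>
#include <vector>

// Separate chaining: an array of linked lists ("buckets") of integer keys.
class ChainedHashTable {
    std::vector<std::list<int>> buckets;
public:
    explicit ChainedHashTable(std::size_t capacity) : buckets(capacity) {}

    std::size_t bucketFor(int key) const { return key % buckets.size(); }   // division method

    void insert(int key) {
        buckets[bucketFor(key)].push_back(key);   // colliding keys simply share a list
    }

    bool find(int key) const {
        for (int k : buckets[bucketFor(key)])     // only this bucket's chain is scanned
            if (k == key) return true;
        return false;
    }
};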

35 Open Addressing This is a method where at most one item is stored in any bucket. If multiple elements map to the same bucket, some method must be used to find another empty bucket:
Linear probing: h’(k) = ( h(k) + j ) mod N, where j = 0, 1, 2, 3, … – keep adding 1 to the index until an empty bucket is found
Quadratic probing: h’(k) = ( h(k) + j² ) mod N, where j = 0, 1, 2, 3, …
Double hashing: h’(k) = ( h(k) + j * h’’(k) ) mod N, where j = 0, 1, 2, 3, … and h’’(k) = q – ( k mod q )

36 Linear Probing If a bucket is already occupied, then try the next available bucket. [Figure: array A, indices 0–4: EMPTY, 55301 Turing, A., 81202 Babbage, C., EMPTY, 77404 Gates, W.; 38104 McNealy, S. collides at index 4]

37 Linear Probing If a bucket is already occupied, then try the next available bucket. [Figure: 38104 McNealy, S. collides with 77404 Gates, W. at index 4 and is placed in the next empty bucket along the probe sequence]

38 Linear Probing – insertItem(k,e) If a location is already occupied, then try the next available location. Example: – h(k) = ( (k % cap) + j ) mod cap, where j = 0, 1, 2, 3, … – Insert the following keys into hash table A (capacity 11): {13, 26, 5, 37, 16, 21, 15} [Figure: array A, indices 0–10, showing where each key ends up after probing]
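
A sketch of linear-probing insertion for this example, assuming integer keys, a fixed-capacity table and a sentinel value marking empty buckets (all illustrative choices):

#include <vector>

const int EMPTY_SLOT = -1;            // sentinel: no key stored in this bucket

// Insert a key with linear probing: start at k % cap and step forward one
// bucket at a time (wrapping around) until an empty bucket is found.
bool insertLinearProbing(std::vector<int>& table, int k) {
    const int cap = static_cast<int>(table.size());
    for (int j = 0; j < cap; ++j) {
        int index = (k % cap + j) % cap;
        if (table[index] == EMPTY_SLOT) { table[index] = k; return true; }
    }
    return false;                     // table is full
}
// Example: std::vector<int> A(11, EMPTY_SLOT);
//          for (int k : {13, 26, 5, 37, 16, 21, 15}) insertLinearProbing(A, k);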

39 Linear Probing – Using Lazy Deletes Problem: – If the find() operation is looking for a key, it stops looking when it gets to an empty location and assumes the key isn’t there – If several items were stored along the same probe sequence and one of them is then deleted, a “hole” is created, and find() might stop prematurely [Figure: the table from the previous slide with the item 5 removed, leaving a hole in the probe sequence] Solution: Implement removeElement so that it never actually deletes an item; it just marks the location “FREE” rather than “EMPTY”, so find() keeps probing past it while insertItem may reuse it.

40 Quadratic Probing Quadratic Probing is another open addressing strategy to deal with collisions. It uses the following formula: h(k) = ( (k % cap) + j² ) mod cap, where j = 0, 1, 2, 3, … Example: insert {13, 26, 5, 37, 16, 21, 15}; for key 37:
((37 % 11) + 0²) % 11 = 4   // collision
((37 % 11) + 1²) % 11 = 5   // collision
((37 % 11) + 2²) % 11 = 8   // OK
[Figure: array A, indices 0–10, after the quadratic-probing insertions]
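
A sketch of the j-th quadratic probe index for a given key (capacity 11 as in the example; the function name is illustrative):

// j-th bucket tried when inserting key k with quadratic probing.
int quadraticProbe(int k, int j, int cap) {
    return (k % cap + j * j) % cap;
}
// quadraticProbe(37, 0, 11) == 4, quadraticProbe(37, 1, 11) == 5, quadraticProbe(37, 2, 11) == 8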

41 Quadratic Probing Pros and Cons Advantages – Avoids the primary clustering that linear probing suffers from Disadvantages – Creates secondary clustering – a different pattern of filled array locations – If the load factor is 0.5 or more, an empty location may not be found even if one exists

42 Double Hashing Double hashing is another alternative to linear probing: if there is a collision, a second, different hash function h’’ determines the step size. h’(k) = ( h(k) + j * h’’(k) ) mod N, where j = 0, 1, 2, 3, … and h’’(k) = q – ( k mod q ), i.e. h’(k) = ( (key % cap) + j * ( q – ( key % q ) ) ) % cap Example: insert {13, 26, 5, 37} with cap = 11 and q = 7; for key 37:
j = 0: 37 % 11 = 4   // collision
j = 1: ( 4 + 1 * ( 7 – ( 37 % 7 ) ) ) % 11 = 9   // OK
[Figure: array A, indices 0–10: 13 at index 2, 26 at index 4, 5 at index 5, 37 at index 9]
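
A sketch of the double-hashing probe sequence with the slide's parameters (cap = 11, q = 7; the function names are illustrative):

const int CAP = 11;   // table capacity
const int Q   = 7;    // prime used by the secondary hash function

int h2(int k) { return Q - (k % Q); }           // secondary hash: the step size, never 0

// j-th bucket tried when inserting key k with double hashing.
int doubleHashProbe(int k, int j) {
    return (k % CAP + j * h2(k)) % CAP;
}
// doubleHashProbe(37, 0) == 4 (collision in the example), doubleHashProbe(37, 1) == 9 (OK)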

43 Load Factor The load factor of a hash table is the ratio of the number of items in the table to the number of buckets, and is written λ (lambda). – It expresses how “full” the hash table has become – It should generally be kept below 0.75 – Example: capacity = 11, items stored = 7, load factor = 7/11 ≈ 0.64

44 Rehashing Maximum load factor, based on experimental data: – 0.5 for open addressing schemes – 0.9 for separate chaining If the load factor rises above that threshold, then the table should be resized: – The new table should be at least double the size of the old one, so that the cost of rehashing can be amortized over later operations – The hash function (its compression map) must be modified to use the new capacity – Rehash the data: take each item out of the old array and insert it into the new one using the new hash function
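
A sketch of a rehash step for a separate-chaining table like the one sketched earlier, moving everything to a caller-chosen larger capacity (choosing the next prime is left out for brevity; non-negative keys assumed):

#include <cstddef>
#include <list>
#include <vector>

// Rehash: move every key from the old buckets into a new, larger bucket array,
// re-applying the compression map with the new capacity.
std::vector<std::list<int>> rehash(const std::vector<std::list<int>>& old,
                                   std::size_t newCapacity) {
    std::vector<std::list<int>> fresh(newCapacity);
    for (const auto& bucket : old)
        for (int key : bucket)
            fresh[key % newCapacity].push_back(key);   // new index under the new capacity
    return fresh;
}
// Typically triggered when the load factor passes the threshold, with newCapacity ≈ 2 * old.size().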

