Hashing Vishnu Kotrajaras, PhD Nattee Niparnan, PhD.

Flashback Still recall something before midterm?

Recall the previous ADTs
- List, Stack, Queue
- BST
- AVL
- We wish to do Insert, Delete and Find

Are we happy with the AVL tree?
- We want something FASTER!
- Possible with "hashing"

Overview What is hash?

Hashing property (average case)
- Insert O(1)
- Delete O(1)
- Find O(1)
- Lacks order and traversal

Hashing Idea
- What happens if all possible data is the day of the month?
- Post office analogy
- Addressing!

Hash Table
- Table (storage)
- Key (address)
- And possibly the Value (things to store)

Realization of Hash Table Put it into a practical use

What is the problem?
- What if the data is not only 1 to 31?
- What if the data is not a number?

Hash function
- Mash the key into something else

Hash function
- We use it to try to distribute values evenly throughout our table. We may use: key number % tableSize
- But if tableSize is 10, 20, 30, … this function is a poor choice: for example, keys that all end in 0 would land on only a few indices.
- What if keys are Strings?
- Let's see some examples.

Hash function (1st example)
- Sum the ASCII values of all characters in the key.

    public static int hash(String key, int tableSize) {
        int hashVal = 0;
        // Add up the character codes of the key.
        for (int i = 0; i < key.length(); i++)
            hashVal += key.charAt(i);
        return hashVal % tableSize;
    }

The method on the previous slide is not good if the table is large:
- What if each key is short (e.g. 8 characters)?
- An ASCII character normally has a maximum value of 127.
- Therefore the sum of all 8 characters will not exceed 127*8 = 1016.
- If the table is big (say, about 10,000 slots), data will not be distributed evenly: indices will concentrate at the front.
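The bound above can be checked with a small sketch. The class name `AdditiveHashDemo` and the helper `maxIndex` are made up for illustration; the `hash` method repeats the first example, and the table size 10007 is the one the slides use later.

```java
final class AdditiveHashDemo {
    // The additive hash from the first example: sum of character codes, mod tableSize.
    static int hash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal += key.charAt(i);
        }
        return hashVal % tableSize;
    }

    // Hypothetical helper: the largest index an ASCII key of at most maxLen characters can reach.
    static int maxIndex(int maxLen, int tableSize) {
        int maxSum = 127 * maxLen;              // every character is at most 127
        return Math.min(maxSum, tableSize - 1);
    }

    public static void main(String[] args) {
        int tableSize = 10007;
        // Only indices 0..maxIndex can ever be used by keys of up to 8 characters.
        System.out.println("largest reachable index: " + maxIndex(8, tableSize));
        System.out.println("hash of zzzzzzzz: " + hash("zzzzzzzz", tableSize));
    }
}
```

With 8-character keys, roughly nine tenths of a 10007-slot table can never be reached by this function, however many keys are inserted.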

Hash function (2nd example)
- Assume we have a big table, and each key is made from at least 3 random letters.
- We look at the first 3 characters only.

    public static int hash(String key, int tableSize) {
        // 27 = number of letters, including space; 729 = 27 * 27
        return (key.charAt(0) + 27 * key.charAt(1) + 729 * key.charAt(2)) % tableSize;
    }

- This distributes well in a table of size 10000. (10007 is the first prime after 10000; we will use this number. You will see why.)

Wait, actual keys will never be random like this:
- There will be a lot of repetition.

Hash function (3rd example)
- We calculate a polynomial function of 37, using Horner's Rule.
- We can calculate $k_0 + 37 k_1 + 37^2 k_2$ by using $[(k_2 \cdot 37) + k_1] \cdot 37 + k_0$.
- Horner's rule is to repeat this step n times. In fact, it is a calculation of $\sum_{i=0}^{n-1} k_i \cdot 37^{i}$.

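    public static int hash(String key, int tableSize) {
        int hashVal = 0;
        // Horner's rule: multiply the running value by 37, then add the next character.
        for (int i = 0; i < key.length(); i++)
            hashVal = 37 * hashVal + key.charAt(i);
        hashVal %= tableSize;
        // Possible overflow: the int may have wrapped around to a negative value.
        if (hashVal < 0)
            hashVal += tableSize;
        return hashVal;
    }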
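The overflow handling above can also be written with `Math.floorMod`, which always returns a non-negative result for a positive divisor. The sketch below is only an illustration: the class name `HornerHashDemo` and the method names are assumptions, and `slideHash` repeats the code from the slide so the two can be compared.

```java
final class HornerHashDemo {
    // The version from the slide: % followed by a fix-up for negative values.
    static int slideHash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal = 37 * hashVal + key.charAt(i);
        }
        hashVal %= tableSize;
        if (hashVal < 0) {
            hashVal += tableSize;
        }
        return hashVal;
    }

    // Alternative sketch: floorMod folds the sign fix-up into one call.
    static int floorModHash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal = 37 * hashVal + key.charAt(i); // may wrap around; that is acceptable here
        }
        return Math.floorMod(hashVal, tableSize);
    }

    public static void main(String[] args) {
        String[] keys = {"abc", "hashing", "a much longer key that overflows int"};
        for (String key : keys) {
            System.out.println(key + " -> " + slideHash(key, 10007)
                    + " / " + floorModHash(key, 10007));
        }
    }
}
```

Both methods should agree on every key, because adding tableSize to a negative remainder is exactly what `floorMod` does when the divisor is positive.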

What is a "GOOD" hash function?
- Low cost
- Determinism
- Uniformity
- Variable range
- Injective
- Perfect hash function?

Side Note eMule, BitTorrent, all P2P MD5

Still more problems: collision

Collision Resolution
- Separate chaining (toolbox analogy)
- Open addressing (library shelf analogy)

Separate Chaining
- Try to put it into the same position
- Use another "data structure"

Fixing collision: separate chaining
- Store colliding elements in a linked list.
- If you want to search for an element, use the hash function, then search in the list given by that hash function.
- If you want to insert an element, use the hash function to find a list to put that element in. After that, check the list to see whether it already contains the element. If the list does not have that element, insert the element at the front.
- Statistically, a newly inserted element is often accessed again soon after the insertion.

Code for an object that has a hash function.

    public interface Hashable
    {
        /**
         * Compute a hash function for this object.
         * @param tableSize the hash table size.
         * @return (deterministically) a number between
         *     0 and tableSize-1, distributed equitably.
         */
        int hash( int tableSize );
    }

How we use a Hashable object.

    public class Student implements Hashable {
        private String name;
        private double number;
        private int year;

        public int hash(int tableSize) {
            // hash is a static method from our HashTable class.
            return SeparateChainingHashTable.hash(name, tableSize);
        }

        // Two students are treated as equal when their names match.
        public boolean equals(Object rhs) {
            return rhs instanceof Student && name.equals(((Student) rhs).name);
        }
    }

    public class SeparateChainingHashTable
    {
        /**
         * Construct the hash table.
         */
        public SeparateChainingHashTable( )
        {
            this( DEFAULT_TABLE_SIZE );
        }

        /**
         * Construct the hash table.
         * @param size approximate table size.
         */
        public SeparateChainingHashTable( int size )
        {
            theLists = new LinkedList[ nextPrime( size ) ];
            for( int i = 0; i < theLists.length; i++ )
                theLists[ i ] = new LinkedList( );
        }

        /**
         * Insert into the hash table. If the item is
         * already present, then do nothing.
         * @param x the item to insert (for example, a Student).
         */
        public void insert( Hashable x )
        {
            LinkedList whichList = theLists[ x.hash( theLists.length ) ];
            LinkedListItr itr = whichList.find( x );
            // Not found in the list, so insert at the front.
            if( itr.isPastEnd( ) )
                whichList.insert( x, whichList.zeroth( ) );
        }

        /**
         * Remove from the hash table.
         * @param x the item to remove.
         */
        public void remove( Hashable x )
        {
            theLists[ x.hash( theLists.length ) ].remove( x );
        }

        /**
         * Find an item in the hash table.
         * @param x the item to search for.
         * @return the matching item, or null if not found.
         */
        public Hashable find( Hashable x )
        {
            return (Hashable)theLists[ x.hash( theLists.length ) ].find( x ).retrieve( );
        }

        /**
         * Make the hash table logically empty.
         */
        public void makeEmpty( )
        {
            for( int i = 0; i < theLists.length; i++ )
                theLists[ i ].makeEmpty( );
        }

        /**
         * A hash routine for String objects.
         * @param key the String to hash.
         * @param tableSize the size of the hash table.
         * @return the hash value.
         */
        public static int hash( String key, int tableSize )
        {
            int hashVal = 0;
            for( int i = 0; i < key.length( ); i++ )
                hashVal = 37 * hashVal + key.charAt( i );
            hashVal %= tableSize;
            if( hashVal < 0 )
                hashVal += tableSize;
            return hashVal;
        }

        private static final int DEFAULT_TABLE_SIZE = 101;

        /** The array of Lists. */
        private LinkedList [ ] theLists;

        /**
         * Internal method to find a prime number at least as large as n.
         * @param n the starting number (must be positive).
         * @return a prime number larger than or equal to n.
         */
        private static int nextPrime( int n )
        {
            if( n % 2 == 0 )
                n++;
            for( ; !isPrime( n ); n += 2 )
                ;
            return n;
        }

        /**
         * Internal method to test if a number is prime.
         * Not an efficient algorithm.
         * @param n the number to test.
         * @return the result of the test.
         */
        private static boolean isPrime( int n )
        {
            if( n == 2 || n == 3 )
                return true;
            if( n == 1 || n % 2 == 0 )
                return false;
            for( int i = 3; i * i <= n; i += 2 )
                if( n % i == 0 )
                    return false;
            return true;
        }

        // Simple main
        public static void main( String [ ] args )
        {
            SeparateChainingHashTable H = new SeparateChainingHashTable( );
            final int NUMS = 4000;
            final int GAP = 37;
            System.out.println( "Checking... (no more output means success)" );
            // Insert 1..NUMS-1 in a scrambled order.
            for( int i = GAP; i != 0; i = ( i + GAP ) % NUMS )
                H.insert( new MyInteger( i ) );
            // Remove the odd numbers.
            for( int i = 1; i < NUMS; i+= 2 )
                H.remove( new MyInteger( i ) );
            // Every even number must still be found.
            for( int i = 2; i < NUMS; i+=2 )
                if( ((MyInteger)(H.find( new MyInteger( i ) ))).intValue( ) != i )
                    System.out.println( "Find fails " + i );
            // No odd number may be found.
            for( int i = 1; i < NUMS; i+=2 )
            {
                if( H.find( new MyInteger( i ) ) != null )
                    System.out.println( "OOPS!!! " + i );
            }
        }
    }
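The class above depends on the course's own `LinkedList`, `LinkedListItr` and `MyInteger` classes, which are not shown in these slides. As a self-contained sketch of the same idea, the following uses `java.util.LinkedList` and String keys only. The class name `ChainedStringSet` and its method names are made up for illustration and are not part of the course code.

```java
import java.util.LinkedList;

final class ChainedStringSet {
    // One linked list per table slot; colliding keys share a list.
    private final LinkedList<String>[] lists;

    @SuppressWarnings("unchecked")
    ChainedStringSet(int tableSize) {
        lists = new LinkedList[tableSize];
        for (int i = 0; i < lists.length; i++) {
            lists[i] = new LinkedList<>();
        }
    }

    // Polynomial hash with multiplier 37, as in the third hash function example.
    private int index(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 37 * h + key.charAt(i);
        }
        return Math.floorMod(h, lists.length);
    }

    // Inserts at the front of the list; returns false if the key was already present.
    boolean insert(String key) {
        LinkedList<String> list = lists[index(key)];
        if (list.contains(key)) {
            return false;
        }
        list.addFirst(key);
        return true;
    }

    boolean contains(String key) {
        return lists[index(key)].contains(key);
    }

    // Returns true if the key was present and has been removed.
    boolean remove(String key) {
        return lists[index(key)].remove(key);
    }

    public static void main(String[] args) {
        ChainedStringSet set = new ChainedStringSet(101);
        set.insert("ann");
        set.insert("bob");
        System.out.println(set.contains("ann"));
        set.remove("ann");
        System.out.println(set.contains("ann"));
    }
}
```

The sketch takes the table size as given; the course code additionally rounds it up to a prime with `nextPrime`.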

Separate Chaining: more variations
- Can we use a BST instead?
- AVL?
- B-Tree?

Some Analysis
- Load factor (lambda): the number of elements divided by the number of lists. It is the average length of a linked list.
- Search time = time to do hashing + time to search the list = constant + time to search the list
- Unsuccessful search: search time == average list length == load factor

Successful search
- In the list that we will search, there is one node that contains the object that we want to find. There are other nodes too (0 or more).
- In a table, we have N members, distributed into M lists.
- There are N-1 nodes that do not have what we want.
- If we distribute these nodes evenly among the lists, each list will have $(N-1)/M = \lambda - 1/M \approx \lambda$ of them, because M is large.
- On average, half of these nodes will be searched before we find what we want. That is, $\lambda/2$ steps will be executed.
- Therefore the average time to find the required element is $1 + \lambda/2$ steps.
- The tableSize is not important. What really matters is the load factor.
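The two estimates above are easy to evaluate for concrete numbers. The sketch below is only an illustration with made-up names (`ChainingCost` and its methods); it uses the exact term $(N-1)/M$ rather than the approximation $\lambda$, and it counts list nodes examined, not the constant hashing cost.

```java
final class ChainingCost {
    // Load factor: number of elements divided by number of lists.
    static double loadFactor(int n, int m) {
        return (double) n / m;
    }

    // Unsuccessful search: the whole list is scanned, about lambda nodes on average.
    static double unsuccessfulSearch(int n, int m) {
        return loadFactor(n, m);
    }

    // Successful search: the target node plus half of the other (n - 1) / m nodes.
    static double successfulSearch(int n, int m) {
        return 1.0 + ((n - 1) / (double) m) / 2.0;
    }

    public static void main(String[] args) {
        int n = 200; // elements stored (example value)
        int m = 101; // number of lists (example value)
        System.out.println("load factor:         " + loadFactor(n, m));
        System.out.println("unsuccessful search: " + unsuccessfulSearch(n, m));
        System.out.println("successful search:   " + successfulSearch(n, m));
    }
}
```

Doubling both n and m leaves all three numbers essentially unchanged, which is the point of the last bullet: the load factor matters, not the table size.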

Open Addressing
- Try to use another slot: "probing"
- Try $h_0(x), h_1(x), \ldots$
- $h_i(x) = [\mathrm{hash}(x) + f(i)] \bmod \mathrm{tableSize}$, with $f(0) = 0$
- "i" is the collision count
- Uses no extra space
- Load factor is very important

Open Addressing Techniques
- Linear probing
- Quadratic probing
- Double hashing

Fixing collision by using open addressing
- No list.
- If there is a collision, keep calculating a new index until an empty slot is found.
- The new index is at $h_0(x), h_1(x), \ldots$
- $h_i(x) = [\mathrm{hash}(x) + f(i)] \bmod \mathrm{tableSize}$, with $f(0) = 0$
- All data must be put into our table. Therefore the table must be large enough to distribute the data.
- Load factor <= 0.5

Open addressing: linear probing
- f is a linear function of i.
- Normally we have $f(i) = i$.
- It is "looking ahead one slot at a time."
- This may take time.
- There will be runs of consecutive filled slots, called primary clustering. If a new collision takes place inside such a run, it will take some time before we can find another empty slot.
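Linear probing with $f(i) = i$ can be sketched in a few lines. This is a minimal illustration, not the course implementation: the class name `LinearProbingDemo`, the use of an `Integer[]` with null for empty slots, and the example keys 89, 18, 49, 58, 69 are all assumptions, and the sketch handles only non-negative keys and does not check for duplicates.

```java
final class LinearProbingDemo {
    // Inserts a non-negative key and returns how many collisions happened on the way.
    static int insert(Integer[] table, int key) {
        int home = key % table.length;
        for (int i = 0; i < table.length; i++) {
            int pos = (home + i) % table.length; // h_i(x) with f(i) = i
            if (table[pos] == null) {
                table[pos] = key;
                return i;
            }
        }
        throw new IllegalStateException("table is full");
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[10];
        int[] keys = {89, 18, 49, 58, 69};
        for (int key : keys) {
            int collisions = insert(table, key);
            System.out.println(key + " placed after " + collisions + " collision(s)");
        }
        System.out.println(java.util.Arrays.toString(table));
    }
}
```

In this run the keys pile up around indices 8, 9, 0, 1 and 2, so the later keys need several probes each, which is the primary clustering described above.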

Open addressing: quadratic probing
- There is no primary clustering with this method.
- We usually have $f(i) = i^2$.
- $h_i(x) = [\mathrm{hash}(x) + f(i)] \bmod \mathrm{tableSize}$
- If b collides with a, we add $1^2$ to find a new empty slot. If c also collides with a, adding $1^2$ only finds b, so we need to go further by adding $2^2$ instead.

- However, if our table is more than half full or the tableSize is not prime, this method does not guarantee an empty slot.
- But if the table is not yet half full and the tableSize is prime, it is proven that we can always find an empty slot for a new value.

Proof
- Let the tableSize be a prime number greater than 3.
- Let $(h(x) + i^2) \bmod \mathrm{tableSize}$ and $(h(x) + j^2) \bmod \mathrm{tableSize}$ be 2 probe positions, with $0 \le i, j \le \lfloor \mathrm{tableSize}/2 \rfloor$.
- Prove by contradiction: assume both positions are the same and $i \ne j$.

- Then $i^2 \equiv j^2 \pmod{\mathrm{tableSize}}$, so $(i - j)(i + j) \equiv 0 \pmod{\mathrm{tableSize}}$. Because tableSize is prime, it must divide $i - j$ or $i + j$.
- $i - j \equiv 0$ is impossible because we assumed they are not equal (and they differ by less than tableSize).
- $i + j \equiv 0$ is also impossible, because $0 < i + j < \mathrm{tableSize}$.
- Therefore our assumption that the two positions are the same is wrong.
- Thus the two positions are always different: the first $\lceil \mathrm{tableSize}/2 \rceil$ probes all land on distinct slots.
- So there is always a slot for a new value, if the table is not yet half full and the tableSize is prime.

Why prime?
- If not, the number of available slots will be greatly reduced.
- Example: tableSize == 16. Assume normal hashing gives index == 0 (quadratic probing).
- The probes $1^2, 2^2, 3^2, 4^2, 5^2, 6^2, 7^2$ land on indices 1, 4, 9, 0, 9, 4, 1.
- You can see that they keep falling on the same positions.
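The effect of a non-prime table size can be checked by listing every slot that quadratic probing can reach from one starting index. The sketch below is an illustration only; the class name `QuadraticProbeSlots` and the method `reachable` are made up, and 17 is simply chosen as the nearest prime to 16 for comparison.

```java
import java.util.TreeSet;

final class QuadraticProbeSlots {
    // Collects the distinct slots visited by (start + i*i) % tableSize for i = 0..tableSize-1.
    static TreeSet<Integer> reachable(int start, int tableSize) {
        TreeSet<Integer> slots = new TreeSet<>();
        for (int i = 0; i < tableSize; i++) {
            slots.add((start + i * i) % tableSize);
        }
        return slots;
    }

    public static void main(String[] args) {
        System.out.println("tableSize 16: " + reachable(0, 16));
        System.out.println("tableSize 17: " + reachable(0, 17));
    }
}
```

With 16 slots only 4 of them are ever probed, while with 17 slots 9 are, which is more than half of the table and matches the guarantee proved earlier.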

Deleting in open addressing

Open addressing implementation

    class HashEntry {
        Hashable element;   // the element
        boolean isActive;   // false means deleted

        public HashEntry( Hashable e ){
            this( e, true );
        }

        public HashEntry( Hashable e, boolean i ){
            element = e;
            isActive = i;
        }
    }

    public class QuadraticProbingHashTable{
        private static final int DEFAULT_TABLE_SIZE = 11;

        /** The array of elements. Each cell is null, active, or non-active (deleted). */
        private HashEntry [ ] array;
        private int currentSize;   // The number of occupied cells

        public QuadraticProbingHashTable( ){
            this( DEFAULT_TABLE_SIZE );
        }

        /**
         * Construct the hash table.
         * @param size the approximate initial size.
         */
        public QuadraticProbingHashTable( int size ){
            allocateArray( size );
            makeEmpty( );
        }

        /**
         * Internal method to allocate array.
         * @param arraySize the size of the array.
         */
        private void allocateArray( int arraySize ){
            array = new HashEntry[ arraySize ];
        }

        /**
         * Make the hash table logically empty.
         */
        public void makeEmpty( ){
            currentSize = 0;
            for( int i = 0; i < array.length; i++ )
                array[ i ] = null;
        }

        /**
         * Return true if currentPos exists and is active.
         * @param currentPos the result of a call to findPos.
         * @return true if currentPos is active.
         */
        private boolean isActive( int currentPos ){
            return array[ currentPos ] != null && array[ currentPos ].isActive;
        }

        /**
         * Method that performs quadratic probing resolution.
         * @param x the item to search for.
         * @return the position where the search terminates.
         */
        private int findPos( Hashable x ) {
            int collisionNum = 0;
            int currentPos = x.hash( array.length );
            while( array[ currentPos ] != null &&
                    !array[ currentPos ].element.equals( x ) ){
                // Compute ith probe, using f(i) = i^2 = f(i-1) + 2i - 1
                currentPos += 2 * ++collisionNum - 1;
                if( currentPos >= array.length )   // Implement the mod
                    currentPos -= array.length;
            }
            return currentPos;
        }

        /**
         * Find an item in the hash table.
         * @param x the item to search for.
         * @return the matching item, or null if not found.
         */
        public Hashable find( Hashable x ){
            int currentPos = findPos( x );
            return isActive( currentPos ) ? array[ currentPos ].element : null;
        }

        /**
         * Insert into the hash table. If the item is
         * already present, do nothing.
         * @param x the item to insert.
         */
        public void insert( Hashable x )
        {
            // Insert x as active
            int currentPos = findPos( x );
            if( isActive( currentPos ) )
                return;   // x is already inside, so do nothing
            array[ currentPos ] = new HashEntry( x, true );
            // Rehash; see Section 5.5
            if( ++currentSize > array.length / 2 )
                rehash( );
        }

hash, nextPrime, isPrime are the same as before.

        /**
         * Remove from the hash table.
         * @param x the item to remove.
         */
        public void remove( Hashable x )
        {
            int currentPos = findPos( x );
            // Lazy deletion: mark the cell as non-active instead of clearing it.
            if( isActive( currentPos ) )
                array[ currentPos ].isActive = false;
        }

Rehashing
- Rehashing can be triggered in 3 situations:
- Do it immediately when the table is half full.
- Do it when an insert starts to fail.
- Do it when the load factor reaches some value (does not have to be 0.5).
- Do not forget that the higher the load factor, the more difficult it is to insert.

        /**
         * Expand the hash table.
         */
        private void rehash( )
        {
            HashEntry [ ] oldArray = array;
            // Create a new double-sized, empty table
            allocateArray( nextPrime( 2 * oldArray.length ) );
            currentSize = 0;
            // Copy table over; indices are recalculated because this is a new array
            for( int i = 0; i < oldArray.length; i++ )
                if( oldArray[ i ] != null && oldArray[ i ].isActive )
                    insert( oldArray[ i ].element );
            return;
        }

- O(N), because there are N members to be rehashed.
- This is not done often, because the table has to be half filled first.

Downside of quadratic probing
- Secondary clustering
- Fixed by double hashing: $f(i) = i \cdot \mathrm{hash}_2(x)$
- We probe at $\mathrm{hash}_2(x)$, $2 \cdot \mathrm{hash}_2(x)$, … and so on from the original index.
- Must be careful when choosing the function.
- If our array has 9 slots and $\mathrm{hash}_2(x) = x \bmod 9$, then inserting 99 gives $\mathrm{hash}_2(99) = 0$, so every probe lands on the same slot.
- $\mathrm{hash}_2(x)$ must not give 0.

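Example of hash_2
- Assume hash(x) = x % tableSize.
- $\mathrm{hash}_2(x) = R - (x \bmod R)$, where R is prime and R < tableSize.
- Let our tableSize be 16 (and R = 13). We insert 9, 25, 26, 41, 42, 58 in that order.
- 9 goes to index 9.
- 25 collides, so we add 13-(25%13)=1; it goes to index 10.
- 26 collides, so we add 13-(26%13)=13; it goes to index 7.
- Table so far: index 7 holds 26, index 9 holds 9, index 10 holds 25.

- 41 collides, so we add 13-(41%13)=11; it goes to index 4.
- 42 collides, so we add 13-(42%13)=10, but 42 still collides (with 41 at index 4), so we add 2*10 from its original index; it goes to index 14.
- Table so far: index 4 holds 41, index 7 holds 26, index 9 holds 9, index 10 holds 25, index 14 holds 42.

- 58 collides, so we add 13-(58%13)=7; it goes to index 1.
- Final table: index 1 holds 58, index 4 holds 41, index 7 holds 26, index 9 holds 9, index 10 holds 25, index 14 holds 42.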
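The double hashing example above can be replayed in code. This is a minimal sketch under the example's own assumptions (tableSize 16, R = 13, hash(x) = x % tableSize); the class name `DoubleHashingDemo` and the `Integer[]` table with null for empty slots are made up for illustration.

```java
final class DoubleHashingDemo {
    static final int R = 13; // prime smaller than the table size, as in the example

    // Second hash function: never returns 0, so the probe always moves.
    static int hash2(int x) {
        return R - (x % R);
    }

    // Inserts x with double hashing and returns the index where it was stored.
    static int insert(Integer[] table, int x) {
        int home = x % table.length;
        int step = hash2(x);
        for (int i = 0; i < table.length; i++) {
            int pos = (home + i * step) % table.length; // f(i) = i * hash2(x)
            if (table[pos] == null) {
                table[pos] = x;
                return pos;
            }
        }
        throw new IllegalStateException("no empty slot found for " + x);
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[16];
        int[] keys = {9, 25, 26, 41, 42, 58};
        for (int key : keys) {
            System.out.println(key + " stored at index " + insert(table, key));
        }
    }
}
```

Because 16 is not prime, a step that shares a factor with 16 (such as the step 10 used for 42) cannot reach every slot, so the loop may give up before the table is actually full; a prime table size avoids this.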
