Advanced Algorithms Analysis and Design
Lecture 10: Hashing, Heaps, and Binomial Trees
HASHING
Hash Tables
All search structures so far have relied on a comparison operation, giving performance O(n) or O(log n). Assume instead that we have a function f(key) → integer, i.e., one that maps a key to an integer. What performance might we expect now?
Hash Tables - Keys are integers
We need a hash function h(key) → integer, i.e., one that maps a key to an integer. Applying this function to a key produces an address. If h maps each key to a unique integer in the range 0..m-1, then search is O(1).
Hash Tables - Hash functions
Form of the hash function. Example, using an n-character key:

int hash( char *s, int n ) {
    int sum = 0;
    while( n-- )
        sum = sum + *s++;
    return sum % 256;
}

This returns a value in the range 0..255. An xor accumulation is also commonly used:

    sum = sum ^ *s++;

Example: hash( "AB", 2 ) and hash( "BA", 2 ) return the same value! This is called a collision. A variety of techniques are used for resolving collisions.
Hashing: Collision Resolution Schemes
- Collision Resolution Techniques
- Separate Chaining
- Separate Chaining with String Keys
- Separate Chaining versus Open Addressing
- Implementation of Separate Chaining
- Introduction to Collision Resolution using Open Addressing
- Linear Probing
Collision Resolution Techniques
There are two broad ways of resolving collisions:
1. Separate Chaining: an array-of-linked-lists implementation.
2. Open Addressing: an array-based implementation, with
   (i) Linear probing (linear search)
   (ii) Quadratic probing (nonlinear search)
   (iii) Double hashing (uses two hash functions)
Separate Chaining
The hash table is implemented as an array of linked lists. Inserting an item r that hashes to index i is simply an insertion into the linked list at position i. Keys that collide at the same index are chained in the same linked list.
Separate Chaining (cont'd)
Retrieval of an item r with hash address i is simply retrieval from the linked list at position i. Deletion of an item r with hash address i is simply deleting r from the linked list at position i.
Example: Load the keys 23, 13, 21, 14, 7, 8, and 15, in this order, into a hash table of size 7 using separate chaining with the hash function h(key) = key % 7:
h(23) = 23 % 7 = 2
h(13) = 13 % 7 = 6
h(21) = 21 % 7 = 0
h(14) = 14 % 7 = 0 (collision)
h(7)  = 7 % 7  = 0 (collision)
h(8)  = 8 % 7  = 1
h(15) = 15 % 7 = 1 (collision)
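As an illustration, here is a minimal separate-chaining table in Java that loads the example keys. The class and method names are ours, not the lecture's; it is a sketch of the scheme just described.

import java.util.LinkedList;

// A minimal separate-chaining hash table for integer keys (illustrative sketch).
class ChainedHashTable {
    private final LinkedList<Integer>[] table;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) table[i] = new LinkedList<>();
    }

    private int h(int key) { return key % table.length; }          // h(key) = key % m

    void insert(int key) { table[h(key)].add(key); }               // append to the chain at h(key)

    boolean find(int key) { return table[h(key)].contains(key); }  // search only one chain

    void delete(int key) { table[h(key)].remove(Integer.valueOf(key)); }

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(7);
        for (int key : new int[] {23, 13, 21, 14, 7, 8, 15}) t.insert(key);  // keys from the example
        System.out.println(t.find(14));  // true: 14 chains at index 0 together with 21 and 7
    }
}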
Separate Chaining with String Keys
Recall that search keys can be numbers, strings, or some other object. A hash function for a string s = c0 c1 c2 ... c(n-1) can be defined as:

   hash = (c0 + c1 + c2 + ... + c(n-1)) % tableSize

This can be implemented as:

public static int hash(String key, int tableSize) {
   int hashValue = 0;
   for (int i = 0; i < key.length(); i++) {
      hashValue += key.charAt(i);
   }
   return hashValue % tableSize;
}

Example: The following class describes commodity items:

class CommodityItem {
   String name;     // commodity name
   int quantity;    // commodity quantity needed
   double price;    // commodity price
}
Separate Chaining with String Keys (cont'd)
Use the hash function hash to load the following commodity items into a hash table of size 13 using separate chaining:
onion, tomato, cabbage, carrot, okra, mellon, potato, banana, olive, salt, cucumber, mushroom, orange
Solution (summing ASCII character codes):
hash(onion)  = (111 + 110 + 105 + 111 + 110) % 13 = 547 % 13 = 1
hash(salt)   = (115 + 97 + 108 + 116) % 13 = 436 % 13 = 7
hash(orange) = (111 + 114 + 97 + 110 + 103 + 101) % 13 = 636 % 13 = 12
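These values can be checked by reusing the hash method given above; a small illustrative driver:

public class HashDemo {
    // Same hash function as above: sum of character codes mod tableSize.
    public static int hash(String key, int tableSize) {
        int hashValue = 0;
        for (int i = 0; i < key.length(); i++) hashValue += key.charAt(i);
        return hashValue % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(hash("onion", 13));   // 547 % 13 = 1
        System.out.println(hash("salt", 13));    // 436 % 13 = 7
        System.out.println(hash("orange", 13));  // 636 % 13 = 12
    }
}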
Separate Chaining with String Keys (cont'd)
[Figure: the hash table of size 13 with the 13 commodity items chained at their hash addresses h(key); each list node stores Item, Qty, and Price.]
Separate Chaining versus Open Addressing

Organization | Advantages                                                   | Disadvantages
Chaining     | Unlimited number of elements; unlimited number of collisions | Overhead of multiple linked lists
Introduction to Open Addressing
All items are stored in the hash table itself. In addition to the cell data (if any), each cell keeps one of three states: EMPTY, OCCUPIED, DELETED. While inserting, if a collision occurs, alternative cells are tried until an empty cell is found.
Deletion (lazy deletion): when a key is deleted, the slot is marked DELETED rather than EMPTY; otherwise subsequent searches that hash at the deleted cell would be unsuccessful.
Probe sequence: a probe sequence is the sequence of array indexes that is followed in searching for an empty cell during an insertion, or in searching for a key during find or delete operations. The most common probe sequences are of the form:
   h_i(key) = [h(key) + c(i)] % n,  for i = 0, 1, ..., n-1,
where h is a hash function and n is the size of the hash table. The function c(i) is required to have the following two properties:
Property 1: c(0) = 0.
Property 2: The set of values {c(0) % n, c(1) % n, c(2) % n, ..., c(n-1) % n} must be a permutation of {0, 1, 2, ..., n-1}; that is, it must contain every integer between 0 and n-1 inclusive.
Introduction to Open Addressing (cont'd)
The function c(i) is used to resolve collisions. To insert item r, we examine array location h_0(r) = h(r). If there is a collision, array locations h_1(r), h_2(r), ..., h_(n-1)(r) are examined until an empty slot is found. Similarly, to find item r, we examine the same sequence of locations in the same order.
Note: For a given hash function h(key), the only difference between the open addressing collision resolution techniques (linear probing, quadratic probing, and double hashing) is in the definition of the function c(i):

Collision resolution technique | c(i)
Linear probing                 | i
Quadratic probing              | ± i^2
Double hashing                 | i * h_p(key), where h_p(key) is another hash function
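The three techniques can be sketched side by side; only c(i) changes. This is illustrative Java; the secondary hash hp below is an assumed example, since the lecture does not fix one.

// Probe sequences h_i(key) = (h(key) + c(i)) % n for the three techniques (sketch).
class ProbeSequences {
    static int n = 13;                                   // table size
    static int h(int key)  { return key % n; }           // primary hash function
    static int hp(int key) { return 7 - (key % 7); }     // assumed secondary hash for double hashing

    static int linear(int key, int i)     { return (h(key) + i) % n; }           // c(i) = i
    static int quadratic(int key, int i)  { return (h(key) + i * i) % n; }       // c(i) = i^2 (the +i^2 branch)
    static int doubleHash(int key, int i) { return (h(key) + i * hp(key)) % n; } // c(i) = i * hp(key)

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++)
            System.out.printf("i=%d: linear=%d quadratic=%d double=%d%n",
                i, linear(18, i), quadratic(18, i), doubleHash(18, i));
    }
}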
Introduction to Open Addressing (cont'd)
Advantages of open addressing:
- All items are stored in the hash table itself; there is no need for another data structure (no linked lists).
- Open addressing is more efficient storage-wise.
Disadvantages of open addressing:
- The keys of the objects to be hashed must be distinct.
- Performance depends on choosing a proper table size.
- Requires the use of a three-state (OCCUPIED, EMPTY, or DELETED) flag in each cell.
Open Addressing Facts
In general, choosing a suitable table size is the most important decision. With any open addressing method of collision resolution, as the table fills, there can be a severe degradation in table performance.
Hashing has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased; i.e., when the number of entries in the hash table exceeds the product of the load factor and the current capacity, the capacity is roughly doubled by rehashing. As a general rule, a default load factor of 0.75 offers a good trade-off between time and space costs.
The load factor of the table is m/N, where m is the number of records currently in the table (equivalently, the number of occupied cells) and N is the size of the array used to implement it. Load factors between 0.6 and 0.7 are common; load factors above 0.7 are undesirable.
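The growth policy described above can be stated in a few lines. This is a sketch of the rule only, not any particular library's implementation:

// Rehash policy sketch: grow when entries exceed loadFactor * capacity.
class RehashPolicy {
    static final double LOAD_FACTOR = 0.75;   // default time/space trade-off

    static boolean needsRehash(int entries, int capacity) {
        return entries > LOAD_FACTOR * capacity;
    }

    public static void main(String[] args) {
        int capacity = 16;
        System.out.println(needsRehash(12, capacity));  // false: 12 == 0.75 * 16
        System.out.println(needsRehash(13, capacity));  // true: capacity would roughly double to 32
    }
}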
Open Addressing: Linear Probing (cont'd)
Example: Perform the operations given below, in the given order, on an initially empty hash table of size 13, using linear probing with c(i) = i and the hash function h(key) = key % 13:
insert(18), insert(26), insert(35), insert(9), find(15), find(48), delete(35), delete(40), find(9), insert(64), insert(47), find(35)
The required probe sequences are given by:
   h_i(key) = (h(key) + i) % 13,  i = 0, 1, 2, ..., 12
Linear Probing (cont'd)
The resulting table (O = OCCUPIED, E = EMPTY, D = DELETED):

Index | Status | Value
0     | O      | 26
1     | E      |
2     | E      |
3     | E      |
4     | E      |
5     | O      | 18
6     | E      |
7     | E      |
8     | O      | 47
9     | D      | 35
10    | O      | 9
11    | E      |
12    | O      | 64
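A sketch of such a table in Java, with the three-state flags and the probe sequence used in this example. Keys are assumed distinct, as open addressing requires, and the names are ours:

// Linear probing with lazy deletion (EMPTY / OCCUPIED / DELETED), as in the example.
class LinearProbingTable {
    enum State { EMPTY, OCCUPIED, DELETED }
    private final int[] value;
    private final State[] state;

    LinearProbingTable(int n) {
        value = new int[n];
        state = new State[n];
        java.util.Arrays.fill(state, State.EMPTY);
    }

    private int h(int key) { return key % state.length; }

    void insert(int key) {
        for (int i = 0; i < state.length; i++) {
            int j = (h(key) + i) % state.length;          // h_i(key) = (h(key) + i) % n
            if (state[j] != State.OCCUPIED) {             // reuse EMPTY or DELETED cells
                value[j] = key; state[j] = State.OCCUPIED; return;
            }
        }
        throw new IllegalStateException("table full");
    }

    int find(int key) {                                   // returns the index, or -1
        for (int i = 0; i < state.length; i++) {
            int j = (h(key) + i) % state.length;
            if (state[j] == State.EMPTY) return -1;       // an EMPTY cell ends the probe
            if (state[j] == State.OCCUPIED && value[j] == key) return j;
        }                                                 // DELETED cells are skipped, not stopped at
        return -1;
    }

    void delete(int key) {
        int j = find(key);
        if (j >= 0) state[j] = State.DELETED;             // lazy deletion
    }
}

Replaying the twelve operations above on this sketch produces exactly the table shown; in particular, find(9) still succeeds because the DELETED cell at index 9 is skipped rather than treated as the end of the probe sequence.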
Disadvantage of Linear Probing: Primary Clustering
Linear probing is subject to a primary clustering phenomenon: elements tend to cluster around the table locations they originally hash to, and primary clusters can combine to form larger clusters. This leads to long search sequences and hence deterioration in hash table efficiency.
Example of a primary cluster: insert the keys 18, 41, 22, 44, 59, 32, 31, 73, in this order, into an originally empty hash table of size 13, using the hash function h(key) = key % 13 and c(i) = i:
h(18) = 5
h(41) = 2
h(22) = 9
h(44) = 5 (collision; placed at 5 + 1 = 6)
h(59) = 7
h(32) = 6 (collision; probes 7, then placed at 8)
h(31) = 5 (collision; probes 6, 7, 8, 9, then placed at 10)
h(73) = 8 (collision; probes 9, 10, then placed at 11)
The keys 18, 44, 59, 32, 22, 31, 73 now form a single cluster occupying slots 5 through 11.
HEAPS
Heaps
A heap is a special kind of rooted tree that can be implemented efficiently in an array without any explicit pointers. It can be used for heapsort and for the efficient representation of certain dynamic priority lists, such as the event list in a simulation or the list of tasks to be scheduled by an operating system. A heap is an essentially complete binary tree.
Heaps
The figure below illustrates an essentially complete binary tree containing 10 nodes. The five internal nodes occupy level 3 (the root), level 2, and the left side of level 1; the five leaves fill the right side of level 1 and then continue at the left of level 0. If an essentially complete binary tree has height k, then there is one node (the root) on level k, there are two nodes on level k-1, and so on; there are 2^(k-1) nodes on level 1, and at least 1 and not more than 2^k on level 0. A heap is an essentially complete binary tree, each of whose nodes includes an element of information called the value of the node, and which has the property that the value of each internal node is greater than or equal to the values of its children.
An essentially complete binary tree
[Figure: the 10 nodes are stored in array order T[1] through T[10]; the children of T[i] are T[2i] and T[2i+1].]
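This array mapping needs no pointers at all: the parent and children of a node are found by index arithmetic alone. A small illustrative helper, using 1-based indices as in the figure:

// 1-based index arithmetic for a heap stored in an array T[1..n].
class HeapIndex {
    static int parent(int i) { return i / 2; }   // the parent of T[i] is T[i/2]
    static int left(int i)   { return 2 * i; }   // the children of T[i] are T[2i] and T[2i+1]
    static int right(int i)  { return 2 * i + 1; }

    public static void main(String[] args) {
        System.out.println(parent(9));  // 4: the parent of T[9] is T[4]
        System.out.println(left(3));    // 6: the left child of T[3] is T[6]
    }
}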
A heap
The figure shows an example of a heap with 10 nodes.
Heaps
Now we have marked each node with its value; this same heap can be represented by an array, stored level by level. The crucial characteristic of this data structure is that the heap property can be restored efficiently if the value of a node is modified. If the value of a node increases to the extent that it becomes greater than the value of its parent, it suffices to exchange these two values, and then to continue the same process upwards in the tree if necessary until the heap property is restored. The modified value is percolated up to its new position in the heap; this operation is often called sifting up. If the value 1 in the figure is modified so that it becomes 8, we can restore the heap property by exchanging the 8 with its parent 4, and then exchanging it again with its new parent 7.
The heap, after percolating 8 to its place
Heaps
If, on the contrary, the value of a node is decreased so that it becomes less than the value of at least one of its children, it suffices to exchange the modified value with the larger of the values in the children, and then to continue this process downwards in the tree if necessary until the heap property is restored. The modified value has been sifted down to its new position.
The heap, after sifting 3 (originally 10) down to its place
Heaps
The following procedures describe more formally the basic processes for manipulating a heap.

procedure alter-heap(T[1..n], i, v)
{T[1..n] is a heap. The value of T[i] is set to v and the heap property is re-established. Suppose that 1 ≤ i ≤ n.}
   x ← T[i]
   T[i] ← v
   if v < x then sift-down(T, i)
            else percolate(T, i)
procedure sift-down(T[1..n], i)
{This procedure sifts node i down so as to re-establish the heap property in T[1..n]. Suppose that T would be a heap if T[i] were sufficiently large and that 1 ≤ i ≤ n.}
   k ← i
   repeat
      j ← k
      {find the larger child of node j}
      if 2j ≤ n and T[2j] > T[k] then k ← 2j
      if 2j < n and T[2j+1] > T[k] then k ← 2j+1
      exchange T[j] and T[k]
      {if j = k, then the node has arrived at its final position}
   until j = k
procedure percolate(T[1..n], i)
{This procedure percolates node i up so as to re-establish the heap property in T[1..n]. Suppose that T would be a heap if T[i] were sufficiently small and that 1 ≤ i ≤ n. The parameter n is not used here.}
   k ← i
   repeat
      j ← k
      if j > 1 and T[j ÷ 2] < T[k] then k ← j ÷ 2
      exchange T[j] and T[k]
      {if j = k, then the node has arrived at its final position}
   until j = k
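For readers who prefer running code, here is a direct, illustrative Java transcription of the two procedures, for a max-heap stored in T[1..n] with index 0 unused. The class and method names are ours, not the lecture's:

// Java transcription of sift-down and percolate for a max-heap in T[1..n] (sketch).
class HeapOps {
    static void siftDown(int[] T, int n, int i) {
        int k = i, j;
        do {
            j = k;
            if (2 * j <= n && T[2 * j] > T[k]) k = 2 * j;          // left child larger?
            if (2 * j < n && T[2 * j + 1] > T[k]) k = 2 * j + 1;   // right child larger still?
            int tmp = T[j]; T[j] = T[k]; T[k] = tmp;               // exchange T[j] and T[k]
        } while (j != k);                                          // stop when the node settles
    }

    static void percolate(int[] T, int i) {
        int k = i, j;
        do {
            j = k;
            if (j > 1 && T[j / 2] < T[k]) k = j / 2;               // parent smaller: move up
            int tmp = T[j]; T[j] = T[k]; T[k] = tmp;
        } while (j != k);
    }
}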
Heaps
The heap is an ideal data structure for finding the largest element of a set, removing it, adding a new node, or modifying a node. These are exactly the operations we need to implement dynamic priority lists efficiently: the value of a node gives the priority of the corresponding event, the event with highest priority is always found at the root of the heap, and the priority of an event can be changed dynamically at any time. This is particularly useful in computer simulations and in the design of schedulers for an operating system. Some typical procedures are illustrated below.
function find-max(T[1..n])
{Returns the largest element of the heap T[1..n]}
   return T[1]

procedure delete-max(T[1..n])
{Removes the largest element of the heap T[1..n] and restores the heap property in T[1..n-1]}
   T[1] ← T[n]
   sift-down(T[1..n-1], 1)
procedure insert-node(T[1..n], v)
{Adds an element whose value is v to the heap T[1..n] and restores the heap property in T[1..n+1]}
   T[n+1] ← v
   percolate(T[1..n+1], n+1)
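The priority-queue operations translate the same way. Again an illustrative sketch building on the HeapOps class above; n is passed and returned explicitly because Java arrays have a fixed size:

// Priority-queue operations built on siftDown/percolate (illustrative; max-heap in T[1..n]).
class HeapQueue {
    static int findMax(int[] T) { return T[1]; }

    static int deleteMax(int[] T, int n) {      // returns the new heap size n-1
        T[1] = T[n];                            // move the last element to the root
        HeapOps.siftDown(T, n - 1, 1);          // restore the heap property in T[1..n-1]
        return n - 1;
    }

    static int insert(int[] T, int n, int v) {  // assumes T has room for index n+1
        T[n + 1] = v;
        HeapOps.percolate(T, n + 1);            // sift the new value up to its place
        return n + 1;
    }
}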
Heaps
There exists a cleverer algorithm for making a heap. Suppose, for example, that our starting point is the array represented by the tree in the figure below.
The starting situation
Heaps
We first make each of the subtrees whose roots are at level 1 into a heap; this is done by sifting down these roots, as illustrated in the figure.
The level 1 subtrees are made into heaps
Heaps
This figure shows the process for the left subtree at level 2; the other subtree at level 2 is already a heap. This results in an essentially complete binary tree corresponding to the array shown.
One level 2 subtree is made into a heap (the other already is a heap)
It only remains to sift down its root to obtain the desired heap. This process thus goes as follows:
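In code, the bottom-up construction can be sketched as follows, reusing the siftDown helper transcribed earlier (an illustrative Java version of the lecture's make-heap idea, not its literal pseudocode):

// Bottom-up heap construction: sift down each internal node, deepest first (sketch).
class MakeHeap {
    static void makeHeap(int[] T, int n) {
        for (int i = n / 2; i >= 1; i--)   // internal nodes are T[1..n/2]
            HeapOps.siftDown(T, n, i);
    }
}

This construction runs in O(n) time overall, which is what makes it cleverer than performing n successive insertions.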
How to Sort a Heap
Construct the heap using the array A = (16, 4, 10, 14, 7, 9, 3, 2, 8, 1).
[Figure: the initial configuration, and maintaining the heap property.]
procedure heapsort(T[1..n])
{Sorts the array T into ascending order}
   make-heap(T)
   for i ← n downto 2 do
      exchange T[1] and T[i]
      sift-down(T[1..i-1], 1)
i = 10: exchange T[1] & T[10] and sift-down(T[1..9], 1)
i = 9:  exchange T[1] & T[9] and sift-down(T[1..8], 1)
i = 8:  exchange T[1] & T[8] and sift-down(T[1..7], 1)
i = 7:  exchange T[1] & T[7] and sift-down(T[1..6], 1)
i = 6:  exchange T[1] & T[6] and sift-down(T[1..5], 1)
i = 5:  exchange T[1] & T[5] and sift-down(T[1..4], 1)
i = 4:  exchange T[1] & T[4] and sift-down(T[1..3], 1)
i = 3:  exchange T[1] & T[3] and sift-down(T[1..2], 1)
i = 2:  exchange T[1] & T[2] and sift-down(T[1..1], 1)
End of sorting.
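Putting the pieces together, here is a runnable end-to-end version of this trace, assuming the HeapOps and MakeHeap sketches above are compiled alongside:

// Heapsort on A = (16, 4, 10, 14, 7, 9, 3, 2, 8, 1), following the trace above.
class HeapSortDemo {
    public static void main(String[] args) {
        int[] T = {0, 16, 4, 10, 14, 7, 9, 3, 2, 8, 1};  // 1-based storage; index 0 unused
        int n = 10;
        MakeHeap.makeHeap(T, n);
        for (int i = n; i >= 2; i--) {
            int tmp = T[1]; T[1] = T[i]; T[i] = tmp;     // exchange T[1] and T[i]
            HeapOps.siftDown(T, i - 1, 1);               // sift-down(T[1..i-1], 1)
        }
        for (int i = 1; i <= n; i++) System.out.print(T[i] + " ");  // 1 2 3 4 7 8 9 10 14 16
    }
}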
Sorted heap
BINOMIAL TREES
A binomial tree Bk is defined recursively: B0 is a single node, and Bk is formed by linking two B(k-1) trees together, so Bk has exactly 2^k nodes.
[Figure: binomial trees B0 to B4.]
Max binomial heap
A binomial heap is a collection of binomial trees, at most one of each order, each of which satisfies the heap property.
[Figure: a max binomial heap containing 11 items; each parent node is greater than or equal to its children.]
Linking two B2's to make a B3
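The linking step can be written in a few lines. This is an illustrative max-heap sketch; the node fields follow the usual left-child, next-sibling representation, and the names are ours:

// Linking two binomial trees of equal degree k-1 into one of degree k (max-heap order).
class BinomialNode {
    int key, degree;
    BinomialNode child, sibling;   // leftmost child and next sibling

    BinomialNode(int key) { this.key = key; }

    // Makes the tree with the smaller root the leftmost child of the other root.
    static BinomialNode link(BinomialNode a, BinomialNode b) {
        BinomialNode larger = a.key >= b.key ? a : b;   // max-heap: larger key stays on top
        BinomialNode smaller = (larger == a) ? b : a;
        smaller.sibling = larger.child;
        larger.child = smaller;
        larger.degree++;
        return larger;                                  // root of the new B_k
    }
}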
Merging two binomial heaps
[Figure: one binomial heap merged with another yields the combined binomial heap.]
BINOMIAL-HEAP-MERGE
Note: check the heap type (max/min) before starting.
[Figure: two binomial heaps H1 and H2, accessed through head[H1] and head[H2], whose root lists are to be merged.]
[Figure: (a) the node with value 1 is to be deleted; (b) the heap is separated into two heaps.]
[Figure: the node with value 1 has been deleted, leaving two heaps H and H'; the heaps H and H' are then merged.]
[Figure: the value of node y is decreased from 26 to 7, and the heap property is then restored.]