SORTING, SEARCHING AND HASHING TECHNIQUES Sorting is the process of arranging records in order of their keys. A sorting algorithm is an algorithm that puts the elements of a list in a certain order. Applications: dictionaries, telephone directories.
Sorting Theme
Internal sort and External sort Internal sort takes place in the main memory of a computer. External sort takes place in the secondary memory of a computer.
Insertion sort The list is divided into two sublists: sorted and unsorted. In each pass of the sort, one or more pieces of data are inserted into their correct location in the sorted sublist. Variants: the straight insertion sort and the shell sort.
In each pass, the first element of the unsorted sublist is transferred to the sorted sublist by inserting it at the appropriate place.
Working of Insertion Sort
Algorithm for Insertion Sort

void insertion_sort( input_type a[ ], unsigned int n )
{
    unsigned int j, p;
    input_type tmp;

/*1*/   a[0] = MIN_DATA;                  /* sentinel */
/*2*/   for( p = 2; p <= n; p++ )
        {
/*3*/       tmp = a[p];
/*4*/       for( j = p; tmp < a[j-1]; j-- )
/*5*/           a[j] = a[j-1];
/*6*/       a[j] = tmp;
        }
}
Shell Sort The creator of this algorithm is Donald L. Shell. The list is divided into K segments, where K is known as the increment value. Each segment contains N/K or fewer elements.
Shell Sort
Increment sequence for a 10-element list: k = last/2 = 10/2 = 5, then k = 5/2, then k = 3/2, ending with k = 1.
Algorithm for Shell Sort

void shellsort( input_type a[ ], unsigned int n )
{
    unsigned int i, j, increment;
    input_type tmp;

/*1*/   for( increment = n/2; increment > 0; increment /= 2 )
/*2*/       for( i = increment+1; i <= n; i++ )
            {
/*3*/           tmp = a[i];
/*4*/           for( j = i; j > increment; j -= increment )
/*5*/               if( tmp < a[j-increment] )
/*6*/                   a[j] = a[j-increment];
                    else
/*7*/                   break;
/*8*/           a[j] = tmp;
            }
}
Selection Sort The list is divided into two sublists, sorted and unsorted. Select the smallest element from the unsorted sublist and exchange it with the element at the beginning of the unsorted sublist. The wall between the two sublists then moves one element to the right.
Working of Selection Sort
Selection Sort Algorithm algorithm selectionSort (ref list <array>, val last <index>) Sort list[1..last] by selecting the smallest element in the unsorted portion of the array and exchanging it with the element at the beginning of the unsorted portion. Pre list must contain at least one item. last is an index to the last record in the array. Post list has been rearranged smallest to largest.
current = 1
loop (current < last)
    smallest = current
    walker = current + 1
    loop (walker <= last)
        if (list[walker] < list[smallest])
            smallest = walker
        walker = walker + 1
    (smallest selected: exchange it with the current element)
    exchange (list, current, smallest)
    current = current + 1
return
end selectionSort
Exchange Sort Bubble Sort The list is divided into two sublists: sorted and unsorted. The smallest element is bubbled from the unsorted sublist to the sorted sublist. The wall then moves one element to the right. This sort requires at most n-1 passes to sort the data.
Exchange Sort Bubble Sort algorithm bubbleSort (ref list <array>, val last <index>) Sort an array, list[1..last], using bubble sort. Adjacent elements are compared and exchanged until the list is completely ordered. Pre list must contain at least one item. last is an index to the last record in the array. Post list has been rearranged smallest to largest.
Exchange Sort Bubble Sort

current = 1
sorted = false
loop (current <= last AND sorted false)
    walker = last
    sorted = true
    loop (walker > current)
        if (list[walker] < list[walker-1])
            exchange (list, walker, walker-1)
            sorted = false
        walker = walker - 1
    current = current + 1
return
end bubbleSort

The outer loop executes up to n times and the inner loop (n+1)/2 times on average: f(n) = n((n+1)/2), so the sort is O(n2).
Exchange Sort In the bubble sort, consecutive items are compared and possibly exchanged on each pass through the list, which means that many exchanges may be needed to move an element to its correct position. In quick sort, an exchange involves elements that are far apart, so fewer exchanges are required to correctly position an element.
Exchange Sort Quick Sort Selects an element, called the "pivot", for each iteration. Divides the list into three groups: the elements of the first group have key values less than the key of the pivot; the second group is the pivot element itself; the elements of the third group have key values greater than the key of the pivot. The sorting continues by quick sorting the left partition followed by a quick sort of the right partition.
Exchange Sort Quick Sort Example with pivot 62, scanning from the left and from the right: 21 < 62, 78 > 62, 84 > 62, 14 < 62, 45 < 62, 97 > 62.
Exchange Sort Quick Sort Operation Figure 11-16
MergeSort is a divide-and-conquer method of sorting. Procedure Divide the list into two smaller lists of about equal size. Sort each smaller list recursively. Merge the two sorted lists to get one sorted list.
MergeSort Algorithm MergeSort is a recursive sorting procedure. It uses at most O(n lg(n)) comparisons. To sort an array of n elements, we perform the following. If n < 2 then the array is already sorted. Otherwise (n > 1): Sort the left half of the array using MergeSort. Sort the right half of the array using MergeSort. Merge the sorted left and right halves.
Example
Outline Linear List Searches: Sequential Search (the sentinel search, the probability search, the ordered search); Binary Search. Hashed List Searches: Collision Resolution.
Linear List Searches We study searches that work with arrays. Figure 2-1
Linear List Searches There are two basic searches for arrays The sequential search. It can be used to locate an item in any array. The binary search. It requires an ordered list.
Linear List Searches Sequential Search The list is not ordered! We will use this technique only for small arrays. We start searching at the beginning of the list and continue until we find the target entity. Either we find it, or we reach the end of the list!
Locating data in unordered list.
Sequential Search Algorithm algorithm SeqSearch (val list <array>, val last <index>, val target <keyType>, ref locn <index>) Locate the target in an unordered list. PRE list must contain at least one element. last is the index to the last element in the list. target contains the data to be located. locn is the address of an index in the calling algorithm. POST if found: matching index stored in locn and found is TRUE. If not found: last stored in locn and found is FALSE. RETURN found <boolean>
Sequential Search Algorithm

looker = 1
loop (looker < last AND target not equal list[looker])
    looker = looker + 1
locn = looker
if (target equal list[looker])
    found = true
else
    found = false
return found
end SeqSearch

Big-O(n)
Binary Search Test the data in the element at the middle of the array to determine whether the target is in the first half or the second half. Then test the middle element of the chosen half, and repeat, halving the search area at each step.
mid = (first + last) / 2. If target > list[mid], then first = mid + 1; if target < list[mid], then last = mid - 1. Figure 2-4
first becomes larger than last! Figure 2-5
Binary Search Algorithm algorithm BinarySearch (val list <array>, val last <index>, val target <keyType>, ref locn <index>) Search an ordered list using binary search. PRE list is ordered; it must contain at least one element. last is the index to the largest element in the list. target is the value of the element being sought. locn is the address of an index in the calling algorithm. POST Found: locn assigned the index of the target element; found set true. Not found: locn = element below or above target; found set false. RETURN found <boolean>
Binary Search Algorithm

first = 1
last = end
loop (first <= last)
    mid = (first + last) / 2
    if (target > list[mid])
        first = mid + 1          (look in upper half)
    else if (target < list[mid])
        last = mid - 1           (look in lower half)
    else
        first = last + 1         (found equal: force exit)
locn = mid
if (target equal list[mid])
    found = true
else
    found = false
return found
end BinarySearch

Big-O(log2 n)
Hashed List Searches In an ideal search, we would know exactly where the data are and go directly there. We use a hashing algorithm to transform the key into the index of the array element that contains the data we need to locate.
Hashing Methods Figure 2-8
Direct Hashing Method The key is the address, without any algorithmic manipulation. The data structure must contain an element for every possible key. It guarantees that there are no synonyms. Direct hashing is therefore of very limited use!
It is a key-to-address transformation! Figure 2-6
We call a set of keys that hash to the same location in our list synonyms. A collision is the event that occurs when a hashing algorithm produces an address for an insertion key and that address is already occupied. Each calculation of an address and test for success is known as a probe. Figure 2-7
Subtraction Hashing Method The keys are consecutive and do not start from one. Example: a company has 100 employees, with employee numbers from 1001 to 1100. The hash function subtracts 1000 from the key (address = x - 1000), so Ali Esin with x = 1001 maps to address 1, Sema Metin with x = 1002 to address 2, and Filiz Yılmaz with x = 1100 to address 100.
Modulo Division Hashing Method The modulo-division method divides the key by the list size and uses the remainder plus one for the address: address = key mod listSize + 1. If the list size is a prime number, this produces fewer collisions than other list sizes.
Modulo Division Hashing Method We have 300 employees, and the first prime greater than 300 is 307. 121267 / 307 = 395 with remainder 2, so hash(121267) = 2 + 1 = 3. Figure 2-10
Digit Extraction Method Selected digits are extracted from the key and used as the address. Example (extracting the first, third and fourth digits): 379452 → 394, 121267 → 112, 378845 → 388, 526842 → 568.
Midsquare Hashing Method The key is squared and the address is selected from the middle of the squared number. The most obvious limitation of this method is the size of the key. Example: 9452 * 9452 = 89340304, and 3403 (the middle four digits) is the address. For large keys, only part of the key is squared: for key 379452, 379 * 379 = 143641, giving address 364.
Folding Hashing Method Figure 2-11
Pseudorandom Hashing Method The key is used as the seed in a pseudorandom-number generator, and the resulting random number is then scaled into a possible address range using modulo division. Use a function such as: y = ((ax + b) mod m) + 1, where x is the key value, a is a coefficient, b is a constant, m is the number of elements in the list, and y is the address.
Pseudorandom Hashing Method
y = ((ax + b) mod m) + 1
With a = 17, b = 7, m = 307 and key x = 121267:
y = ((17 * 121267 + 7) mod 307) + 1
y = ((2061539 + 7) mod 307) + 1
y = (2061546 mod 307) + 1
y = 41 + 1
y = 42
Rotation Hashing Method Rotation is often used in combination with folding and pseudorandom hashing. Figure 2-12
Collision Resolution Methods All of the following methods of handling collisions are independent of the hashing algorithm. Figure 2-13
Collision Resolution Concepts "Load Factor" We define a full list as a list in which all elements except one contain data. Rule: A hashed list should not be allowed to become more than 75% full!

Load Factor = (the number of filled elements in the list / total number of elements in the list) x 100

α = (k / n) x 100, where k is the number of filled elements and n is the total number of elements.
Collision Resolution Concepts "Clustering" Some hashing algorithms tend to cause data to group within the list. This is known as clustering. Clustering is created by collisions. If the list contains a high degree of clustering, then the number of probes to locate an element grows and the processing efficiency of the list is reduced.
Collision Resolution Concepts “Clustering” Clustering types are: Primary clustering; clustering around a home address in our list. Secondary clustering; the data are widely distributed across the whole list so that the list appears to be well distributed, however, the time to locate a requested element of data can become large.
Collision Resolution Methods Open Addressing When a collision occurs, the home area addresses are searched for an open or unoccupied element where the new data can be placed. We have four different methods: linear probe, quadratic probe, double hashing, and key offset.
Open Addressing "Linear Probe" When data cannot be stored in the home address, we resolve the collision by adding one to the current address. Advantages: simple implementation, and data tend to remain near their home address. Disadvantages: it tends to produce primary clustering, and the search algorithm may become more complex, especially after data have been deleted!
Open Addressing "Linear Probe" 15352 / 307 = 50 with remainder 2, so hash(15352) = 2 + 1 = 3, which is already occupied. New address = 3 + 1 = 4.
Open Addressing “Linear Probe” Figure 2-14
Open Addressing "Quadratic Probe" Clustering can be eliminated by adding a value other than one to the current address. The increment is the collision probe number squared: 1² for the first probe, 2² for the second probe, 3² for the third probe, and so on, until we either find an empty element or exhaust the possible elements. We use the modulo of the quadratic sum for the new address.
Open Addressing "Quadratic Probe" The probe² increment grows by two more at each probe (+3, +5, +7, +9, +11, +13):

Probe number | Collision location | Probe² increment | New address
1            | 1                  | 1*1 = 1          | 2
2            | 2                  | 2*2 = 4          | 6
3            | 6                  | 3*3 = 9          | 15
4            | 15                 | 4*4 = 16         | 31
5            | 31                 | 5*5 = 25         | 56
6            | 56                 | 6*6 = 36         | 92
7            | 92                 | 7*7 = 49         | ...
Open Addressing – Double Hashing "Pseudorandom Collision Resolution" In this method, rather than using an arithmetic probe function, the address is rehashed: y = ((ax + c) mod listSize) + 1. With a = 3, c = -1 and old address x = 2: y = ((3 * 2 + (-1)) mod 307) + 1 = 6. Figure 2-15
Open Addressing – Double Hashing "Key Offset Collision Resolution" Key offset is another double hashing method and produces different collision paths for different keys. Key offset calculates the new address as a function of the old address and the key.
Open Addressing – Double Hashing "Key Offset Collision Resolution"
offSet = ⌊key / listSize⌋
address = ((offSet + old address) mod listSize) + 1
offSet = ⌊166702 / 307⌋ = 543
1st probe: address = ((543 + 2) mod 307) + 1 = 239
2nd probe: address = ((543 + 239) mod 307) + 1 = 169

Key     | Home Address | Key Offset | Probe 1 | Probe 2
166702  | 2            | 543        | 239     | 169
572556  | 2            | 1865       | 26      | 50
67234   | 2            | 219        | 222     | 135
Collision Resolution Open Addressing Resolution A major disadvantage to open addressing is that each collision resolution increases the probability of future collisions!
Collision Resolution Linked List Resolution A linked list is an ordered collection of data in which each element contains the location of the next element; the home area holds a link head pointer. Figure 2-16
Collision Resolution Bucket Hashing Resolution Figure 2-17
Rehashing If the table gets too full, the running time for the operations will start taking too long, and insertions might fail for closed hashing with quadratic resolution. This can also happen if there are too many deletions intermixed with insertions. A solution, then, is to build another table that is about twice as big (with an associated new hash function), scan down the entire original hash table, compute the new hash value for each (non-deleted) element, and insert it in the new table.
As an example, suppose the elements 13, 15, 24, and 6 are inserted into a closed hash table of size 7. The hash function is h(x) = x mod 7. Suppose linear probing is used to resolve collisions.
Rehashing If 23 is inserted into the table, the resulting table in Figure will be over 70 percent full. Because the table is so full, a new table is created. The size of this table is 17, because this is the first prime that is more than twice the old table size. The new hash function is then h(x) = x mod 17. The old table is scanned, and elements 6, 15, 23, 24, and 13 are inserted into the new table. This entire operation is called rehashing.
Closed hash table after rehashing
Extendible Hashing If either open hashing or closed hashing is used, the major problem is that collisions could cause several blocks to be examined during a find, even for a well-distributed hash table. A clever alternative is known as extendible hashing. Let us suppose, for the moment, that our data consist of several six-bit integers. The figure shows an extendible hashing scheme for this data. The root of the "tree" contains four pointers determined by the leading two bits of the data. Each leaf has up to m = 4 elements.
D will represent the number of bits used by the root, which is sometimes known as the directory. The number of entries in the directory is thus 2^D. d_l is the number of leading bits that all the elements of some leaf l have in common.
Suppose that we want to insert the key 100100. This would go into the third leaf, but the third leaf is already full, so there is no room. We thus split this leaf into two leaves, which are now determined by the first three bits. This requires increasing the directory size to 3.
If the key 000000 is now inserted, then the first leaf is split, generating two leaves with d_l = 3. Since D = 3, the only change required in the directory is the updating of the 000 and 001 pointers.