Heapsort. Idea: two phases: (1) construction of the heap, (2) output of the heap. To sort numbers in ascending order, use a heap with reversed order (a max-heap): the maximum, not the minimum, is at the root. Heapsort is an in-situ (in-place) procedure.
Recalling heaps: change the definition. Heap with reversed order: for each node x and each successor y of x, $m(x) \ge m(y)$ holds. The tree is left-complete, which means the levels are filled starting from the root and each level from left to right. Implementation: in an array, where the nodes are stored in this order (level by level, from left to right).
Second phase: output of the heap. Take the maximum n times (it is at the root; deletemax): exchange it with the element at the end of the heap, then let the new root sink. Each time, the heap shrinks by one element and the subsequence of ordered elements at the end of the array grows by one element. Cost: O(n log n). Array layout during this phase: [ heap | ordered elements ].
First phase: construction of the heap. Simple method: insert n times. Cost: O(n log n). Better: regard the array a[1 … n] as an already left-complete binary tree and let the elements sink in the order a[n div 2], …, a[2], a[1]. (The elements a[n], …, a[n div 2 + 1] are already leaves of the heap.)
Formally: heap segment. An array segment a[i..k] ($1 \le i \le k \le n$) is called a heap segment if for all j in {i, …, k}: $m(a[j]) \ge m(a[2j])$ if $2j \le k$, and $m(a[j]) \ge m(a[2j+1])$ if $2j+1 \le k$. If a[i+1..n] is already a heap segment, a[i..n] can be converted into a heap segment by letting a[i] sink.
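The two phases and the sink operation can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: it uses 0-based indexing (children of i are 2i + 1 and 2i + 2, where the slides use 2j and 2j + 1), compares the elements directly instead of through a key function m, and the names `sink`, `build_heap` and `heapsort` are chosen here.

```python
def sink(a, i, end):
    """Let a[i] sink within a[0..end-1] until the max-heap condition holds."""
    while True:
        child = 2 * i + 1              # left child (0-based indexing)
        if child >= end:
            return                     # a[i] is a leaf
        # Pick the larger of the two children.
        if child + 1 < end and a[child + 1] > a[child]:
            child += 1
        if a[i] >= a[child]:
            return                     # heap condition already holds
        a[i], a[child] = a[child], a[i]
        i = child


def build_heap(a):
    """Phase 1: sink the inner nodes from the last one up to the root."""
    for i in range(len(a) // 2 - 1, -1, -1):
        sink(a, i, len(a))


def heapsort(a):
    """Sort the list a in place in ascending order and return it."""
    build_heap(a)
    # Phase 2: move the maximum to the end of the heap, shrink the heap by one.
    for end in range(len(a) - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sink(a, 0, end)
    return a
```

Apart from a constant number of local variables, the sketch uses only the input array itself, which is what makes the procedure in-situ.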
Cost calculation. Let $k = \lfloor \log_2(n+1) \rfloor$ (the height of the complete portion of the heap). Cost for an element at level j from the root: k − j. Altogether: $\sum_{j=0}^{k} (k-j)\,2^j = 2^k \sum_{i=0}^{k} \frac{i}{2^i} \le 2 \cdot 2^k = O(n)$.
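The last step of the cost calculation relies on the sum of i/2^i staying below 2 regardless of k. A short derivation, added here as a reminder and using only the geometric series, is:

```latex
\sum_{i=0}^{k} \frac{i}{2^i}
  \;\le\; \sum_{i=1}^{\infty} \frac{i}{2^i}
  \;=\; \sum_{i=1}^{\infty} \sum_{l=i}^{\infty} \frac{1}{2^l}
  \;=\; \sum_{i=1}^{\infty} \frac{2}{2^i}
  \;=\; 2
```

The middle step counts each term 1/2^l exactly l times, once for every i from 1 to l. Since k is logarithmic in n, 2^k is O(n), and so is the total.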
Advantage: the new construction strategy is more efficient. Use case: when only the m largest elements are required: (1) construction in O(n) steps; (2) output of the m largest elements in O(m log n) steps. Total cost: O(n + m log n).
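The "m largest elements" use case can be sketched with Python's standard `heapq` module. This is an illustration under two assumptions: `heapq` implements a min-heap, so the sketch stores negated numbers to simulate the max-heap, and the function name `m_largest` is made up here.

```python
import heapq


def m_largest(data, m):
    """Return the m largest numbers of data in descending order."""
    heap = [-x for x in data]          # negate: heapq is a min-heap
    heapq.heapify(heap)                # linear-time construction, O(n)
    count = min(m, len(heap))
    # Each pop costs O(log n), so the output phase costs O(m log n).
    return [-heapq.heappop(heap) for _ in range(count)]
```

For example, `m_largest([64, 17, 3, 99, 79, 78], 2)` returns `[99, 79]`.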
Addendum: sorting with search trees. Algorithm: (1) construct a search tree (e.g. an AVL tree) from the elements to be sorted using n insert operations; (2) output the elements in in-order sequence, which yields the ordered sequence. Cost: (1) O(n log n) with AVL trees, (2) O(n). In total: O(n log n), which is asymptotically optimal for comparison-based sorting.
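The two steps can be sketched as follows. Note the simplification: to stay short, this sketch uses a plain, unbalanced binary search tree instead of an AVL tree, so its worst case is O(n^2) (for example on already sorted input) rather than the O(n log n) stated above. The names `Node`, `insert`, `in_order` and `tree_sort` are chosen for this illustration.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced search tree and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:                          # equal keys go to the right subtree
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def in_order(node, out):
    """Append the keys of the subtree at node to out in ascending order."""
    if node is not None:
        in_order(node.left, out)
        out.append(node.key)
        in_order(node.right, out)


def tree_sort(data):
    root = None
    for key in data:                   # step 1: n insert operations
        root = insert(root, key)
    out = []
    in_order(root, out)                # step 2: in-order output, O(n)
    return out
```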
7.2 External Sorting. Problem: sorting large amounts of data that, as in external searching, are stored in blocks (pages). Efficiency: the number of page accesses should be kept low. Strategy: use a sorting algorithm that processes the data sequentially (no frequent page exchanges): MergeSort.
General form of MergeSort:
mergesort(S)   # returns the set S in sorted order
{
  if (S is empty or has only 1 element) return(S);
  else {
    split S into two halves A and B;
    A' = mergesort(A);
    B' = mergesort(B);
    return(merge(A', B'));
  }
}
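A runnable version of this general form, as a minimal Python sketch. The pseudocode leaves `merge` unspecified; the two-pointer merge below is one common way to write it and is an assumption of this example.

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:               # "<=" keeps equal elements in order
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])               # at most one of these is non-empty
    result.extend(b[j:])
    return result


def mergesort(s):
    """Return a new sorted list with the elements of s."""
    if len(s) <= 1:
        return list(s)
    mid = len(s) // 2                  # split S into two halves A and B
    return merge(mergesort(s[:mid]), mergesort(s[mid:]))
```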
MergeSort on files. Start: n data items are stored in a file g1, divided into pages of size b: Page 1: s_1, …, s_b; Page 2: s_{b+1}, …, s_{2b}; …; Page k: s_{(k-1)b+1}, …, s_n (with $k = \lceil n/b \rceil$). If the data are processed sequentially, only k page accesses are made, not n.
Variant of MergeSort for external sorting. MergeSort is a divide-and-conquer algorithm. For external sorting: no divide step, only merging. Definition: run := a sorted subsequence within a file. Strategy: merge the runs into increasingly longer runs until everything is sorted.
Algorithm. Step 1: from the input file g1, generate starting runs and distribute them over two files f1 and f2, with the same number of runs (±1) in each (there are several strategies for this; see later). From now on: use four files f1, f2, g1, g2.
Step 2 (main step):
while (number of runs > 1) {
  Merge pairs of runs, one from f1 and one from f2, into runs of double length, written alternately to g1 and g2, until no runs remain in f1 and f2.
  Merge pairs of runs, one from g1 and one from g2, into runs of double length, written alternately to f1 and f2, until no runs remain in g1 and g2.
}
Each loop iteration = two phases.
Example.
Start:
g1: 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50
Step 1 (length of starting runs = 1):
f1: 64 | 3 | 79 | 19 | 67 | 8 | 50
f2: 17 | 99 | 78 | 13 | 34 | 12
Main step, 1st loop, part 1 (1st phase):
g1: 17, 64 | 78, 79 | 34, 67 | 50
g2: 3, 99 | 13, 19 | 8, 12
1st loop, part 2 (2nd phase):
f1: 3, 17, 64, 99 | 8, 12, 34, 67
f2: 13, 19, 78, 79 | 50
Example (continued).
2nd loop, part 1 (3rd phase):
g1: 3, 13, 17, 19, 64, 78, 79, 99
g2: 8, 12, 34, 50, 67
2nd loop, part 2 (4th phase):
f1: 3, 8, 12, 13, 17, 19, 34, 50, 64, 67, 78, 79, 99
f2: (empty)
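The distribution step and the merge phases can be simulated in memory. In the sketch below each "file" is modelled as a Python list of runs, which is an assumption made for illustration (real files would be read and written page-wise), and the function names are made up. `heapq.merge` from the standard library merges two sorted runs.

```python
import heapq


def distribute(data, run_length=1):
    """Step 1: cut the input into starting runs, dealt alternately to two files."""
    runs = [sorted(data[i:i + run_length])
            for i in range(0, len(data), run_length)]
    return runs[0::2], runs[1::2]


def merge_phase(in1, in2):
    """One phase: merge the i-th runs of both inputs, alternating the output file."""
    out = ([], [])
    for i in range(max(len(in1), len(in2))):
        a = in1[i] if i < len(in1) else []   # a run may have no partner
        b = in2[i] if i < len(in2) else []
        out[i % 2].append(list(heapq.merge(a, b)))
    return out


def external_mergesort(data, run_length=1):
    """Return the sorted data and the number of merge phases used."""
    f1, f2 = distribute(data, run_length)
    phases = 0
    while len(f1) + len(f2) > 1:
        f1, f2 = merge_phase(f1, f2)         # input and output files swap roles
        phases += 1
    return (f1[0] if f1 else []), phases
```

Called on the 13 numbers of the example with `run_length=1`, the simulation passes through the same intermediate files as shown above and finishes after four phases.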
Implementation: for each of the files f1, f2, g1, g2, at least one page is kept in main memory (RAM); better still, a second page per file is kept as a buffer. Read and write operations are performed page-wise.
Costs. Page accesses during step 1 and during each phase: O(n/b). Each phase halves the number of runs, thus the total number of page accesses is O((n/b) log n) when starting with runs of length 1. Internal computing time in step 1 and in each phase: O(n). Total internal computing time: O(n log n).
Two variants of the first step: creation of the starting runs. A) Direct mixing: sort as many data items as possible in primary memory ("internally"), for example m data items. The starting runs then have a fixed length m, giving r := n/m starting runs. The total number of page accesses is then O((n/b) log r).
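To get a feeling for what longer starting runs save, the number of merge phases can be computed directly. The sketch below is an idealized model: it assumes every phase exactly halves the number of runs (rounding up) and that step 1 and each phase read and write every page once, ignoring partially filled pages. The function names and the sample sizes are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounded up, for positive b."""
    return -(-a // b)


def merge_phases(n, m=1):
    """Number of 2-way merge phases for n items and starting runs of length m."""
    r = ceil_div(n, m)                 # number of starting runs
    return (r - 1).bit_length()        # equals ceil(log2(r)) for r >= 1


def page_transfers(n, b, m=1):
    """Rough count of page reads plus writes: step 1 and every merge phase."""
    pages = ceil_div(n, b)
    return 2 * pages * (1 + merge_phases(n, m))
```

With n = 1,000,000 items, starting runs of length 1 need 20 phases, while starting runs of length m = 10,000 need only 7.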
Two variants of the first step: creation of the starting runs. B) Natural mixing: creates starting runs of variable length. Advantage: already ordered subsequences contained in the file can be exploited. Noteworthy: with the replacement-selection method, starting runs can be made longer than the capacity of primary storage.
Replacement-Selection.
Read m data items from the input file into primary memory (array).
repeat {
  mark all data in the array as "now";
  start a new run;
  while there is "now"-marked data in the array {
    select the item with the smallest key among all "now"-marked data;
    write it to the output file;
    replace it in the array with the next item read from the input file (if any remain);
    mark the new item "now" if it is greater than or equal to the item just output, else mark it "not now";
  }
} until the input file is exhausted and the array is empty.
Example: array in primary storage with a capacity of 3.
The input file contains: 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50
Runs: 3, 17, 64, 78, 79, 99 | 13, 19, 34, 67 | 8, 12, 50
In the array ("not now" data written in parentheses):
(19) 79 99
(19) (13) 99
(19) (13) (67)
(8) 34 67
(8) (12) 67
(8) (12) (50)
Implementation: in a single array. At the front: a heap of the "now"-marked data. At the back: the refilled "not now" data. Note: all "now" elements go into the run currently being generated.
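A compact way to sketch replacement selection in Python is to keep one heap of (run number, key) pairs: items of the current run play the role of "now" data, and items tagged with the next run number are the "not now" data. This differs from the single-array layout described above and is an assumption made for brevity; the function name is made up, and in-memory lists stand in for the input and output files.

```python
import heapq

_END = object()                        # marks the end of the input


def replacement_selection(data, m):
    """Split data into sorted starting runs using a buffer of m items."""
    source = iter(data)
    # Fill the buffer with the first m items; all belong to run 0 ("now").
    heap = [(0, x) for _, x in zip(range(m), source)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run, key = heapq.heappop(heap) # smallest "now" item comes first
        if run == len(runs):
            runs.append([])            # all "now" data used up: start a new run
        runs[run].append(key)
        item = next(source, _END)
        if item is not _END:
            # "now" if it still fits into the current run, else "not now".
            heapq.heappush(heap, (run if item >= key else run + 1, item))
    return runs
```

With m = 3 and the 13 numbers of the example, the sketch produces the three runs listed there.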
Expected length of the starting runs using replacement selection: 2m (m = size of the array in primary storage = number of data items that fit into primary storage), assuming uniformly distributed keys. Even longer if the input is already partially sorted.
Multi-way merging. Instead of two input and two output files (alternating between f1, f2 and g1, g2), use k input and k output files, so that k runs can always be merged into one. In each step: take the smallest key among the heads of the k runs and write it to the current output file.
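The step "take the smallest among the k runs" can be sketched with a heap that holds the current head of each run, so every output step costs O(log k). Using a heap here is an assumption of this example (the slide only says to take the smallest), the runs are in-memory lists standing in for files, and the function name is made up.

```python
import heapq


def k_way_merge(runs):
    """Merge k sorted runs into one sorted list."""
    # One heap entry per non-empty run: (key, run index, position in run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, pos = heapq.heappop(heap)    # smallest head among the k runs
        out.append(key)
        if pos + 1 < len(runs[i]):
            # Refill from the same run so that each run keeps one head entry.
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out
```

The standard library offers the same operation as `heapq.merge(*runs)`, which returns an iterator over the merged output.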
Cost. In each phase the number of runs is divided by k. Thus, with r starting runs, only $\log_k r$ phases are needed (instead of $\log_2 r$). Total number of page accesses: $O((n/b) \log_k r)$. Internal computing time for each phase: $O(n \log_2 k)$. Total internal computing time: $O(n \log_2 k \cdot \log_k r) = O(n \log_2 r)$.
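The final equality for the internal computing time is the change-of-base rule for logarithms. Written out:

```latex
\log_2 k \cdot \log_k r
  \;=\; \log_2 k \cdot \frac{\log_2 r}{\log_2 k}
  \;=\; \log_2 r
```

So the internal computing time does not depend on k, while the number of page accesses drops by a factor of $\log_2 k$ compared with two-way merging, since $\log_k r = \log_2 r / \log_2 k$.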