Published by Shannon Alexander. Modified over 8 years ago.
1
CENG 351: External Sorting
2
Outline
– Introduction
– Heapsort
– Multi-way merging
– Multi-step merging
– Replacement selection in heapsort
3
External Sort
External sorting is used when the file to be sorted is too big to fit in main memory in full. In this case the file is sorted using hard disk storage as well as main memory. External sorting can be done using one disk or more than one, depending on the size of the file and the availability of hard disk drives.
What sort algorithm is suitable for external sorting? One that allows large files to be sorted:
– Heap sort is appropriate for external sorting, with average complexity of O(n log n).
– Quick sort is appropriate for in-memory sorting, with average complexity of O(n log n).
4
Why Heap Sort
Large files are handled easily:
– The first phase sorts the large file in memory-sized chunks (file segments).
– The second phase merges the sorted segments on disk, each pass creating sorted segments twice as large.
– Processing can be overlapped with I/O in heap sort.
– With two disk drives, input and output can also be overlapped with each other, which is a further advantage.
5
Sorting Segments
1. Producing sorted segments of memory size each time
– Heap sort is based on a priority queue.
– It can be executed by overlapping the input/output with processing.
– Each sorted segment can be the size of the available memory.
2. Producing sorted segments of twice the memory size (known as replacement selection)
– Sorted segments are, on average, twice the size of memory.
– Reading in and writing out can be overlapped.
6
Heapsort Algorithm
What is a heap? A heap is a binary tree with the following properties:
1. Each node has a single key, and that key is greater than or equal to the key at its parent node.
2. It is a complete binary tree, i.e. all leaves are on at most 2 levels, and the leaves on the lowest level are in the leftmost positions.
3. It can be stored in an array: the root is at index 1, and the children of node i are at indexes 2*i and 2*i+1. Conversely, the parent of node j is stored at index floor(j/2). This is very compact: there is no need to store pointers.
7
How big is a heap?
As big as the largest heap that can fit in memory, called one segment.
What is the complexity of the heap sort?
– Each time a new record is added to the tree, at most log2(n) comparisons are needed to reestablish the heap, where n is the number of nodes in the tree at that moment. For example, for a tree of 1024 records, log2(n) is 10.
8
How big is a heap?
Overlapping block input with in-memory heap processing is normal: if a block contains 10 records, then while the next block is being read, the 100 (= 10*10) comparisons needed to insert those 10 records into a tree of 1024 nodes can be carried out. So, the larger the block, the better the concurrency.
How is the second stage of the heap sort, i.e. writing the segment out to disk, conducted?
– Write the root of the heap established during the first stage to the output buffer.
– Replace the root with the last record in the tree.
– Reestablish the heap, which has complexity O(log n).
9
Example
Heap as a binary tree (height = log n):
– Level 0: 10
– Level 1: 35, 20
– Level 2: 45, 40, 25, 30
– Level 3: 60, 50, 55
Heap as an array: 10, 35, 20, 45, 40, 25, 30, 60, 50, 55
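The array layout can be checked mechanically. What follows is a minimal Python sketch, not part of the original slides; the helper names `parent`, `children`, and `is_min_heap` are illustrative assumptions. It stores the example heap with the root at index 1 (index 0 is unused padding) and verifies that every key is greater than or equal to the key of its parent.

```python
def parent(j):
    """Index of the parent of node j (the root is at index 1)."""
    return j // 2


def children(i, n):
    """Indexes of the children of node i in a heap holding n keys."""
    return [c for c in (2 * i, 2 * i + 1) if c <= n]


def is_min_heap(a):
    """Check the heap property; a[0] is padding, keys live in a[1..n]."""
    n = len(a) - 1
    return all(a[parent(j)] <= a[j] for j in range(2, n + 1))


# Example heap from the slide, padded so that the root sits at index 1.
example = [None, 10, 35, 20, 45, 40, 25, 30, 60, 50, 55]
print(is_min_heap(example))            # True
print(children(2, len(example) - 1))   # [4, 5], i.e. the keys 45 and 40
```

No pointers are stored anywhere: the tree shape is implied entirely by the index arithmetic.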
10
Heapsort Algorithm
First stage: building the heap while reading the file
– While there is available space:
  – Get the next record from the current input buffer.
  – Put the new record at the end of the heap.
  – Reestablish the heap by exchanging the new node with its parent if it is smaller than the parent; otherwise leave it where it is. Repeat this step as long as the heap property is violated.
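The insertion step of the first stage can be sketched as follows. This is an illustrative Python version under stated assumptions: the function name `heap_insert` is hypothetical, keys are plain integers rather than records, and the list is 0-based, so the parent of index c is (c - 1) // 2 instead of the 1-based j/2 used in the slides.

```python
def heap_insert(heap, key):
    """Append key to a 0-based list-backed min-heap and sift it up."""
    heap.append(key)                 # put the new record at the end
    child = len(heap) - 1
    while child > 0:
        par = (child - 1) // 2
        if heap[child] >= heap[par]:
            break                    # heap property holds, stop
        heap[child], heap[par] = heap[par], heap[child]
        child = par                  # continue from the parent's position


heap = []
for key in [40, 25, 60, 10]:
    heap_insert(heap, key)
print(heap)  # [10, 25, 60, 40]
```

Each insertion walks at most one root-to-leaf path, which is where the log n bound per record comes from.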
11
Heapsort Algorithm
Second stage: sorting while writing the heap out to the file
– While there are records in the heap:
  – Put the root record (the record with the smallest key) in the current output buffer.
  – Replace the root with the last record in the heap.
  – Restore the heap, which has complexity O(log n).
– This is repeated until the heap is empty, so the overall complexity is O(n log n).
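The second stage can be sketched in the same spirit. The function name `heap_remove_min` is a hypothetical choice, the list is 0-based (children of index i at 2*i+1 and 2*i+2), and a Python list stands in for the output buffer; this is one possible rendering, not the slides' own code.

```python
def heap_remove_min(heap):
    """Remove and return the root of a 0-based list-backed min-heap."""
    root = heap[0]
    last = heap.pop()                # take the last record out of the heap
    if heap:
        heap[0] = last               # replace the root by the last record
        i, n = 0, len(heap)
        while True:                  # sift down to restore the heap
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and heap[left] < heap[smallest]:
                smallest = left
            if right < n and heap[right] < heap[smallest]:
                smallest = right
            if smallest == i:
                break
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
    return root


# The example heap from the earlier slide, stored 0-based here.
heap = [10, 35, 20, 45, 40, 25, 30, 60, 50, 55]
output = []
while heap:
    output.append(heap_remove_min(heap))
print(output)  # [10, 20, 25, 30, 35, 40, 45, 50, 55, 60]
```

Because the root always holds the smallest remaining key, the records leave the heap in ascending key order.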
12
Example
Trace the algorithm with the following input: 48, 70, 30, 19, 50, 45, 100, 15
First establish the heap, then write it out in ascending order (the root always holds the smallest key).
13
Sorting Large Files
Large files can be sorted in two ways:
– Use keysort when all the keys can fit into main memory, with pointers to the actual locations of the records on disk. This is inefficient for record access and does not work when memory is not large enough even for the keys. This method is not discussed further because of its limitations.
– Use the merge operation. This method can work with memory of any size.
14
Merge of sorted lists
15
Multiway Merging
K-way merge: we want to merge K sorted input lists to create a single sorted output list. (K is called the order of a K-way merge.)
K-way merge algorithm:
– Keep an array of lists: list[0], list[1], … list[k-1]
– Keep an array of the items currently being used from each list: item[0], item[1], … item[k-1]
– The merge process calls a function (say MinIndex) to find the index of the minimum of the k current input items.
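A minimal Python sketch of this K-way merge is given below. The names `min_index` and `k_way_merge` are illustrative (the slides only suggest "MinIndex"), in-memory lists stand in for the sorted files, and `None` is assumed never to occur as a key because it is used here to mark an exhausted list.

```python
def min_index(items):
    """Index of the smallest current item; None marks an exhausted list."""
    best = None
    for i, item in enumerate(items):
        if item is not None and (best is None or item < items[best]):
            best = i
    return best


def k_way_merge(lists):
    """Merge K sorted lists into one sorted list."""
    pos = [0] * len(lists)                             # read position per list
    item = [lst[0] if lst else None for lst in lists]  # current item per list
    out = []
    while True:
        i = min_index(item)
        if i is None:                # every list is exhausted
            return out
        out.append(item[i])
        pos[i] += 1                  # advance only the list that was used
        item[i] = lists[i][pos[i]] if pos[i] < len(lists[i]) else None


print(k_way_merge([[1, 5, 9], [2, 3, 10], [4, 6]]))
# [1, 2, 3, 4, 5, 6, 9, 10]
```

The linear scan in `min_index` costs up to K comparisons per output item; for large K, keeping the K current items in a heap would reduce that to about log K.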
16
Merging as a Way of Sorting Large Files
Consider the following example.
File to be sorted:
– 8,000,000 records
– Record size = 100 bytes
– Key size = 10 bytes
Memory available as a work area: 10MB (not counting memory used to hold programs, O.S., I/O buffers, etc.)
Total file size = 800MB
Total number of bytes for all keys = 80MB
So we can do neither internal sorting nor keysorting.
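The figures in this example follow from simple arithmetic, sketched below in Python. The variable names are illustrative, and 1MB is taken as 1,000,000 bytes, which is the convention the later slides use when dividing the 10MB work area.

```python
RECORDS = 8_000_000
RECORD_SIZE = 100            # bytes
KEY_SIZE = 10                # bytes
MEMORY = 10_000_000          # 10MB work area, with 1MB = 1,000,000 bytes

file_size = RECORDS * RECORD_SIZE            # 800,000,000 bytes = 800MB
key_bytes = RECORDS * KEY_SIZE               # 80,000,000 bytes = 80MB
records_per_segment = MEMORY // RECORD_SIZE  # records that fit in memory
segments = file_size // MEMORY               # memory-sized sorted segments

print(file_size, key_bytes, records_per_segment, segments)
# 800000000 80000000 100000 80
```

Both the file (800MB) and the keys alone (80MB) exceed the 10MB work area, which is why neither an internal sort nor a keysort is possible.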
17
Basic idea
1. Forming segments (i.e. sorted sub-files): bring as many records as possible into main memory, sort them using heapsort, and save the sorted data to a file. Repeat this until all the records of the original file have been read. As a result, a number of sorted files are formed.
2. Do a multi-way merge on the sorted sub-files to form larger sorted files.
18
Cost of Merge Sort-1
I/O operations are performed at the following times:
1. Reading each chunk of records into main memory to sort it using heap sort.
2. Writing sorted segments to disk.
These two steps are done as follows:
– Read a chunk of 10MB, write a chunk of 10MB (repeat this 80 times for the given file of 800MB).
– In terms of basic disk operations (s = seek time, r = rotational latency, btt = transfer time, here for one 10MB chunk), we spend:
  – For reading: 80*(s+r) + 80*btt
  – The same for writing
19
Cost of Merge Sort-2
3. Reading segments into memory for merging.
– Read one chunk of each segment at a time, so 80 chunks are in memory at once.
– Since the available memory is 10MB, each chunk can have 10,000,000/80 bytes = 125,000 bytes = 1250 records.
– Total number of s and r = total number of segments * number of chunks per segment = 80*80 = 80^2 chunks, which means 6400*(s+r).
20
Cost of Merge Sort-3: For Very Large Files
How is the time for the merge phase affected if the file has 80 million records?
– Number of segments: 800
– An 800-way merge has to use the 10MB of memory, i.e. divide the memory into 800 buffers.
– Each buffer holds 1/800th of a segment.
– So, 800 segments * 800 reads/segment = 640,000 reads.
21
The cost of increasing the file size
In general, for a K-way merge of K segments, the buffer size for each segment is
– (1/K) of the size of memory space = (1/K) of the size of each segment.
So K reads are required for each segment. Since there are K segments, the merge requires K^2 seeks. Because K is directly proportional to the file size N, it follows that the sort merge is an O(N^2) operation.
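The quadratic growth can be reproduced with a few lines of Python. This is a sketch under the slides' assumptions (100-byte records, a 10MB work area, memory-sized initial segments, one merge of all segments at once); the function name `merge_reads` and its parameters are illustrative.

```python
def merge_reads(num_records, record_size=100, memory=10_000_000):
    """Reads (each costing one seek plus rotational latency) for a
    single K-way merge of all memory-sized segments."""
    k = (num_records * record_size) // memory   # number of segments, K
    return k * k                                # K reads for each of K segments


print(merge_reads(8_000_000))    # 6400
print(merge_reads(80_000_000))   # 640000
```

A file 10 times larger needs 100 times as many reads, which is the O(N^2) behaviour described above.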
22
Improvements
There are several ways to reduce the time:
1. Allocate more hardware (e.g. disk drives, memory).
2. Perform the merge in more than one step, with K less than the total number of sorted segments in each step.
3. Algorithmically increase the lengths of the initial sorted segments.
4. Find ways to overlap I/O operations.
23
Multiple-step merges
Instead of merging all segments at once, we break the original set of segments into small groups and merge the segments in these groups separately.
– More buffer space is available for each segment; hence fewer seeks are required per segment.
When all of the smaller merges are completed, a second pass merges the resulting larger segments.
24
Two-step merge of 800 segments
[Figure: 25 sets of 32 segments each]
25
Example: Cost of multi-step merge
25 sets of 32 segments, followed by a 25-way merge:
– Disadvantage: we read every record twice.
– Advantage: we can use larger buffers and avoid a large number of disk seeks.
Calculations, first merge step:
– Each read brings in 1/32 of a segment, so merging one group of 32 segments takes 32*32 = 1024 reads.
– For 25 32-way merges: 25 * 1024 = 25,600 reads.
26
Second Merge Step
– Each of the 25 segments to be read is allocated only 1/25th of the buffer space.
– So each read brings in 4000 records from a segment (= 10MB/25/100, or 1/800th of a segment).
– Hence 800 reads per segment, so we end up making 25 * 800 = 20,000 reads.
27
Second Merge Step
Total number of reads for the two steps: 25,600 + 20,000 = 45,600
Compute the total number of writes for the two steps.
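The read counts for the two-step merge can be recomputed with a short Python sketch. The function name `two_step_reads` is hypothetical, and the calculation assumes, as the slides do, memory-sized initial segments and buffers that split the work area evenly.

```python
def two_step_reads(groups, group_size):
    """Read operations for a two-step merge.

    Step 1: `groups` separate merges, each of order `group_size`.
    Step 2: one merge of order `groups` over the step-1 results.
    """
    # Step 1: every segment is read in group_size pieces.
    step1 = groups * group_size * group_size
    # Step 2: a buffer is 1/groups of memory, while each merged segment is
    # group_size times the memory size, so it takes groups*group_size reads.
    step2 = groups * (groups * group_size)
    return step1, step2


s1, s2 = two_step_reads(25, 32)
print(s1, s2, s1 + s2)   # 25600 20000 45600
print(800 * 800)         # 640000 reads for the one-step 800-way merge
```

Even though every record is read twice, the number of seek-and-read operations drops from 640,000 to 45,600.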
28
Increasing Segment Lengths
Assume the initial segments contain 200,000 records. Then instead of an 800-way merge we need a 400-way merge.
A longer initial segment means
– fewer segments in total,
– a lower-order merge,
– bigger buffers,
– fewer seeks.
How can we create initial segments that are twice as large as the number of records that we can hold in memory? => Replacement selection (discussed later)
29
Merging Sorted Segments
One disk drive
– Merging requires all the sorted segments to be combined into one sorted file.
– So, sorting an unordered file requires segmented sorting followed by a merging process to finish the job.
– The merge could be k-way, with k = 2, …, m, where m is the number of segments.
30
k-way Merge of Sorted Segments
Two-way (k=2) merge
– Normally the memory size is limited to the size of one segment, so two full segments cannot be read into memory for the merge.
– So, halves of the two segments are read in first and merged while the result is written out; when one half is exhausted, the remainder of that segment is read in, followed by the remainder of the other segment, and so on.
31
2-way Merge of 2 Sorted Segments
Assume each segment contains b' blocks. The merge time is
2*2*b'*btt + 4*(s+r) + 3*(s+r)
– The first term covers reading and writing two segments of b' blocks each.
– The second term covers 4 seeks and rotational latencies for reading the four segment halves.
– The third term covers 3 seeks and rotational latencies for writing the merged segment.
32
k-way Merge Sort
After each 2-way merging pass, the number of sorted segments is halved. So, if the total number of blocks is b and the original number of segments is m, log2(m) passes are required in total.
– For large files, where the segments become bigger and bigger, the time for one 2-way merge pass can be approximated as
2*b*btt + 2*2*m*(r+s)
i.e. two reads and two writes for each original segment, requiring 2*(s+r) for reading and 2*(s+r) for writing, repeated for m segments.
– Note that m is the original number of memory-sized segments.
33
k-way Merge Sort of entire file
Time required for each k-way merge pass:
2*b*btt + 2*k*m*(r+s)
Total time for sorting a pile file on one disk drive using k-way merge:
TT = 2*b*btt + (log_k m) * [2*k*m*(s+r) + 2*b*btt]
where the first term is for the creation of the sorted segments and the second term is the time taken by the log_k m k-way merge passes. Note that m is the number of memory-sized segments.
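To get a feel for how the total time depends on k, the formula can be evaluated numerically. The Python sketch below is illustrative only: the function name `total_sort_time` is an assumption, the disk parameters are made-up values rather than figures from the slides, and log_k m is left fractional exactly as in the formula (a real sort needs a whole number of passes).

```python
import math


def total_sort_time(k, b, m, s, r, btt):
    """TT = 2*b*btt + log_k(m) * (2*k*m*(s+r) + 2*b*btt), in seconds."""
    passes = math.log(m) / math.log(k)     # log_k m, left fractional
    return 2 * b * btt + passes * (2 * k * m * (s + r) + 2 * b * btt)


# Illustrative values only, not taken from the slides.
b, m = 200_000, 80                 # total blocks, memory-sized segments
s, r, btt = 0.016, 0.008, 0.001    # seconds

# Brute-force search over every possible merge order k = 2..m.
best_k = min(range(2, m + 1), key=lambda k: total_sort_time(k, b, m, s, r, btt))
print(best_k, round(total_sort_time(best_k, b, m, s, r, btt), 1))
```

Small k means many passes over the whole file, while large k means many seeks per pass; the search simply picks the order where the two effects balance for the chosen parameters.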
34
Which value of k is best for given m, r, and s?
Find the value of k for which the total file sort time TT = 2*b*btt + (log_k m)*[2*k*m*(s+r) + 2*b*btt] is minimum. After replacing log_k m by ln m / ln k and setting the derivative of TT with respect to k to zero, some simplification gives
k*(ln k - 1) = (b*btt) / (m*(r+s))
– Solving this equation for k, for fixed b, m, btt, r, and s, should give the optimal value. This may be a very rough estimate!
35
Two or More Disk Drives for Merging
With two disk drives and a large memory for merging, under ideal conditions input and output are overlapped, given a proper arrangement of buffering.
Can more than two disk drives help improve performance?
– Yes: seek times for different segments on different disk drives can be overlapped.
36
Heap Sort with Replacement Selection
Idea:
– Always select the key in memory that has the lowest value.
– Output the key.
– Replace it with a new key from the input list.
37
Input: 21, 67, 12, 5, 47, 16 (the front of the input is on the right). Memory size P = 3.

Remaining input | Memory     | Output segment
21, 67, 12      | 5, 47, 16  | _
21, 67          | 12, 47, 16 | 5
21              | 67, 47, 16 | 12, 5
_               | 67, 47, 21 | 16, 12, 5
_               | 67, 47, _  | 21, 16, 12, 5
_               | 67, _, _   | 47, 21, 16, 12, 5
_               | _, _, _    | 67, 47, 21, 16, 12, 5

What about a key arriving in memory too late to be output into its proper position? => Use of a second heap.
38
Trace of replacement selection
Input: 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16
Assume memory of size P = 3.
39
Replacement Selection with Two Disks
Algorithm:
1. Construct a heap (the primary heap) in memory while reading records block by block from the first disk drive.
2. As records are moved from the heap to the output buffer, replace them with records from the input buffer. New records whose keys are smaller than those already written out go into a secondary heap, created for them. The other new records are inserted into the primary heap.
3. Repeat step 2 as long as there are records left in the primary heap and there are records to be read.
4. When the primary heap is empty, make the secondary heap the primary heap and repeat steps 1-3.
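Replacement selection can be sketched compactly in Python. Note the substitution: instead of two physical heaps, this version keeps one `heapq` heap of (run number, key) pairs, where pairs tagged with the next run number play the role of the secondary heap. The function name `replacement_selection` is hypothetical, keys stand in for whole records, and the input list is given in reading order (front of the input first).

```python
import heapq

_END = object()   # sentinel marking the end of the input


def replacement_selection(records, memory_size):
    """Split `records` (in reading order) into sorted runs."""
    source = iter(records)
    heap = []
    for _ in range(memory_size):                 # fill the available memory
        key = next(source, _END)
        if key is _END:
            break
        heap.append((0, key))
    heapq.heapify(heap)

    runs = []
    while heap:
        run, key = heapq.heappop(heap)           # smallest key of the current run
        if run == len(runs):                     # primary heap is empty: new run
            runs.append([])
        runs[run].append(key)
        new = next(source, _END)
        if new is not _END:
            # A key smaller than the one just written arrives too late for
            # this run, so it is tagged for the next one.
            heapq.heappush(heap, (run if new >= key else run + 1, new))
    return runs


# The six-key example from the earlier slide, front of the input first.
print(replacement_selection([16, 47, 5, 12, 67, 21], 3))
# [[5, 12, 16, 21, 47, 67]]
```

With only 3 keys in memory, the six-key input still comes out as a single sorted run, which illustrates how replacement selection produces segments longer than memory.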