CENG 351: External Sorting. Outline: Introduction; Heapsort; Multi-way merging; Multi-step merging; Replacement selection in heapsort.



External Sort
External sorting is used when the file to be sorted is too big to fit in main memory in full. In this case the file is sorted using hard disk storage as well as main memory. External sorting can be done using one disk or more than one, depending on the size of the file and the availability of disk drives.
What sort algorithm is suitable for external sorting? One that can sort a large file in memory-sized pieces and combine the results on disk.
– Heapsort seems appropriate for external sorting, with average complexity O(n log n).
– Quicksort is appropriate for in-memory sorting, with average complexity O(n log n).

Why Heapsort
Large files are handled easily:
– The first phase sorts the large file in memory-sized chunks (file segments).
– The second phase merges the sorted segments on disk; with a two-way merge, each pass produces sorted segments twice as large.
– Processing can be overlapped with I/O in heapsort.
– With two disk drives, input and output can also be overlapped with each other, for a further advantage.

Sorting Segments
1. Producing sorted segments of memory size each time
– Heapsort is based on a priority queue.
– It can be executed by overlapping input/output with processing.
– Each sorted segment can be the size of the available memory.
2. Producing sorted segments of about twice the memory size (known as replacement selection)
– Sorted segments are, on average for randomly ordered input, twice the size of memory.
– Reading in and writing out can be overlapped.

Heapsort Algorithm
What is a heap? A heap is a binary tree with the following properties:
1. Each node has a single key, and that key is greater than or equal to the key at its parent node (so the smallest key is at the root).
2. It is a complete binary tree, i.e. all leaves are on at most two levels, and the leaves on the lowest level are in the leftmost positions.
3. It can be stored in an array: the root is at index 1, and the children of node i are at indexes 2*i and 2*i+1. Conversely, the parent of node j is stored at index ⌊j/2⌋ (very compact: no need to store pointers).
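The array layout in property 3 can be sketched in a few lines of Python. This is only an illustration: the helper names `parent`, `children` and `is_min_heap` are made up here, and slot 0 of the list is left unused so that the root sits at index 1 as on the slide.

```python
def parent(j):
    """Index of the parent of node j in the 1-based array layout."""
    return j // 2  # integer division gives floor(j/2)


def children(i):
    """Indexes of the left and right children of node i."""
    return 2 * i, 2 * i + 1


def is_min_heap(a):
    """a[0] is an unused placeholder; a[1:] holds the keys.

    Returns True when every key is >= the key at its parent node.
    """
    return all(a[parent(j)] <= a[j] for j in range(2, len(a)))
```

For example, `is_min_heap([None, 5, 12, 16, 67, 47, 21])` holds because every key is at least as large as the key stored at half its index.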

How big is a heap?
As big as the largest heap that can fit in memory, called one segment.
What is the complexity of heapsort?
– Each time a new record is added to the tree, at most log2 n comparisons are required to reestablish the heap, where n is the number of nodes in the tree at that moment. For example, for a tree of 1024 records, log2 n is 10.

How big is a heap?
Overlapping block input with in-memory heap processing is normal: if a block contains 10 records, then while the next block is being read, up to 100 (= 10*10) comparisons can be carried out to insert those 10 records into a tree of 1024 nodes. So the larger the block, the better the concurrency.
How is the second stage of heapsort, i.e. writing the segment out to disk, carried out?
– Write the root of the heap established during the first stage to the output buffer.
– Replace the root by the last record in the tree.
– Reestablish the heap, which has complexity O(log n).

[Figure: example heap shown as a binary tree (height = ⌊log n⌋) and as an array]

Heapsort Algorithm
First stage: building the heap while reading the file:
– While there is available space:
  Get the next record from the current input buffer.
  Put the new record at the end of the heap.
  Reestablish the heap by exchanging the new node with its parent if it is smaller than the parent; otherwise leave it where it is. Repeat this step as long as the heap property is violated.
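A minimal sketch of this first stage, assuming the records are plain comparable keys already available in a Python list (the slides read them from an input buffer). The names `heap_insert` and `build_heap` are illustrative, and index 0 of the list is an unused placeholder so the root is at index 1.

```python
def heap_insert(heap, key):
    """Append key at the end of the heap and move it up while it is
    smaller than its parent (heap[0] is an unused placeholder)."""
    heap.append(key)
    j = len(heap) - 1
    while j > 1 and heap[j] < heap[j // 2]:
        heap[j], heap[j // 2] = heap[j // 2], heap[j]
        j //= 2


def build_heap(records):
    """Insert the records one by one, as they would arrive from the file."""
    heap = [None]
    for key in records:
        heap_insert(heap, key)
    return heap


print(build_heap([21, 67, 12, 5, 47, 16]))
# prints [None, 5, 12, 16, 67, 47, 21]
```

Each insertion walks at most one root-to-leaf path, which is where the log n bound per record comes from.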

Heapsort Algorithm
Second stage: sorting while writing the heap out to the file:
– While there are records in the heap:
  Put the root record (the record with the smallest key) in the current output buffer.
  Replace the root by the last record in the heap.
  Restore the heap, which has complexity O(log n).
– Each of the n records costs O(log n), so the overall complexity is O(n log n).
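The second stage can be sketched as follows, again under the assumption of a 1-based list with an unused slot 0 and plain comparable keys. `pop_min` and `drain` are hypothetical names, and the output buffer is modelled as an ordinary list.

```python
def pop_min(heap):
    """Remove and return the root of a 1-based min-heap.

    The last record replaces the root and is moved down until both
    children are at least as large as it.
    """
    root = heap[1]
    last = heap.pop()
    n = len(heap) - 1  # number of keys still in the heap
    if n >= 1:
        heap[1] = last
        i = 1
        while 2 * i <= n:
            c = 2 * i  # left child
            if c + 1 <= n and heap[c + 1] < heap[c]:
                c += 1  # right child is the smaller one
            if heap[i] <= heap[c]:
                break
            heap[i], heap[c] = heap[c], heap[i]
            i = c
    return root


def drain(heap):
    """Write the whole heap out in ascending key order."""
    out = []
    while len(heap) > 1:
        out.append(pop_min(heap))
    return out


print(drain([None, 5, 12, 16, 67, 47, 21]))
# prints [5, 12, 16, 21, 47, 67]
```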

Example
Trace the algorithm with the following input: first establish the heap, then write out in descending order…

Sort of large files!
Large files can be sorted in two ways:
– Use keysort when all the keys can fit into main memory, with pointers to the actual locations of the records on disk. This is inefficient for record access and does not work when memory is not large enough even for the keys. This method is not discussed further because of its limitations.
– Use the MERGE operation. This method can work with a memory of any size.

Merge of sorted lists

Multiway Merging
K-way merge: we want to merge K sorted input lists to create a single sorted output list (K is called the order of the K-way merge).
K-way merge algorithm:
– Keep an array of lists: list[0], list[1], … list[k-1]
– Keep an array of the items currently in use from each list: item[0], item[1], … item[k-1]
– The merge repeatedly calls a function (say MinIndex) to find the index of the item with the minimum key among the k current items.
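One way to picture the algorithm is the sketch below. It assumes in-memory Python lists in place of files and uses `None` to mark an exhausted list, so keys themselves must never be `None`. `min_index` plays the role of the MinIndex function mentioned on the slide; its signature is an assumption.

```python
def min_index(items):
    """Index of the smallest current item, or -1 if every list is exhausted.

    An exhausted list is marked by None in its slot.
    """
    best = -1
    for i, item in enumerate(items):
        if item is not None and (best == -1 or item < items[best]):
            best = i
    return best


def k_way_merge(lists):
    """Merge K sorted lists into one sorted list."""
    iters = [iter(lst) for lst in lists]
    items = [next(it, None) for it in iters]  # current item of each list
    out = []
    while True:
        i = min_index(items)
        if i == -1:
            return out
        out.append(items[i])
        items[i] = next(iters[i], None)  # advance only the list just used


print(k_way_merge([[1, 4, 9], [2, 3], [5]]))
# prints [1, 2, 3, 4, 5, 9]
```

The linear scan in `min_index` costs K comparisons per output record; for large K a heap over the current items would reduce that to about log K.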

Merging as a way of Sorting Large Files
Consider the following example.
File to be sorted:
– 8,000,000 records
– Record size = 100 bytes
– Size of the key = 10 bytes
Memory available as a work area: 10 MB (not counting memory used to hold programs, the O.S., I/O buffers, etc.)
– Total file size = 800 MB
– Total number of bytes for all keys = 80 MB
So we can do neither internal sorting nor keysorting.

Basic idea
1. Forming segments (i.e. sorted sub-files): bring as many records as possible into main memory, sort them using heapsort, and save the sorted data into a file. Repeat this until all the records of the original file have been read. As a result, a number of sorted files are formed.
2. Do a multi-way merge on the sorted sub-files to form larger sorted files.
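The two phases can be sketched end to end as below. This is a toy model under stated assumptions: Python lists stand in for disk files, the built-in `sorted` stands in for the in-memory heapsort, and the standard-library `heapq.merge` performs the multi-way merge. `make_segments` and `external_sort` are illustrative names.

```python
import heapq


def make_segments(records, memory_capacity):
    """Phase 1: cut the input into memory-sized chunks and sort each one.

    sorted() is a stand-in for heapsort; each returned list models one
    sorted sub-file on disk.
    """
    segments = []
    for start in range(0, len(records), memory_capacity):
        chunk = records[start:start + memory_capacity]
        segments.append(sorted(chunk))
    return segments


def external_sort(records, memory_capacity):
    """Phase 2: multi-way merge of all sorted segments."""
    return list(heapq.merge(*make_segments(records, memory_capacity)))


data = [33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16]
print(len(make_segments(data, 3)))  # prints 5
print(external_sort(data, 3) == sorted(data))  # prints True
```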

Cost of Merge Sort-1
I/O operations are performed at the following times:
1. Reading each block of records into main memory to sort it using heapsort.
2. Writing the sorted segments to disk.
These two steps are done as follows:
– Read a chunk of 10 MB, write a chunk of 10 MB (repeat this 80 times for the given file of 800 MB).
– In terms of basic disk operations (s = seek time, r = rotational latency, btt = block transfer time, with one 10 MB chunk counted as a block here), we spend:
  For reading: 80*(s+r) + 80*btt
  The same for writing.

Cost of Merge Sort-2
Reading segments into memory for merging:
– Read one chunk of each segment at a time, so 80 chunks share memory.
– Since available memory is 10 MB, each chunk can hold 10,000,000/80 bytes = 125,000 bytes = 1250 records.
– Total number of seeks (s and r) = number of segments * number of chunks per segment = 80*80 = 80^2 = 6400, i.e. 6400*(s+r).

Cost of Merge Sort-3: For very Large Files!
How is the time for the merge phase affected if the file has 80 million records?
– Number of segments: 800.
– An 800-way merge within 10 MB of memory, i.e. the memory is divided into 800 buffers.
– Each buffer holds 1/800th of a segment.
– So 800 segments * 800 reads/segment = 640,000 reads.

The cost of increasing the file size
In general, for a K-way merge of K segments, the buffer size for each segment is
– (1/K) of the size of memory = (1/K) of the size of each segment.
So K reads are required for each segment. Since there are K segments, the merge requires K^2 seeks. Because K is directly proportional to N, the number of records, the seek cost of the sort merge is O(N^2).
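The seek counts in these slides can be reproduced with a short calculation. The function below is a sketch that assumes the sizes divide evenly, as they do in the two examples; `merge_seek_counts` is a made-up name.

```python
def merge_seek_counts(file_bytes, memory_bytes, record_bytes):
    """Single-step K-way merge of memory-sized segments.

    Returns (number of segments K, records per buffer, total reads).
    Assumes all sizes divide evenly.
    """
    segments = file_bytes // memory_bytes          # K
    buffer_bytes = memory_bytes // segments        # 1/K of memory per segment
    records_per_buffer = buffer_bytes // record_bytes
    reads_per_segment = memory_bytes // buffer_bytes  # K reads empty one segment
    return segments, records_per_buffer, segments * reads_per_segment


print(merge_seek_counts(800_000_000, 10_000_000, 100))
# prints (80, 1250, 6400)
print(merge_seek_counts(8_000_000_000, 10_000_000, 100))
# prints (800, 125, 640000)
```

Multiplying the file size by 10 multiplies the number of reads by 100, which is the quadratic growth described above.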

Improvements
There are several ways to reduce the time:
1. Allocate more hardware (e.g. disk drives, memory).
2. Perform the merge in more than one step, where K is less than the total number of sorted segments.
3. Algorithmically increase the lengths of the initial sorted segments.
4. Find ways to overlap I/O operations.

Multiple-step merges
Instead of merging all segments at once, we break the original set of segments into small groups and merge the segments in these groups separately.
– More buffer space is available for each segment; hence fewer seeks are required per segment.
When all of the smaller merges are completed, a second pass merges the larger segments.

[Figure: two-step merge of 800 segments. Example: 25 sets of 32 segments each]

Example: Cost of multi-step merge
25 sets of 32 segments, followed by a 25-way merge:
– Disadvantage: we read every record twice.
– Advantage: we can use larger buffers and avoid a large number of disk seeks.
Calculations:
First Merge Step:
– Each read brings in 1/32 of a segment, so merging one group of 32 segments takes 32*32 = 1024 reads.
– For 25 32-way merges => 25 * 1024 = 25,600 reads.

Second Merge Step
For each of the 25 segments to be read, only 1/25th of the buffer space is allocated. So each read brings in 4000 records from a segment (= 10 MB/25/100, or 1/800th of a segment). Hence 800 reads per segment,
– so we end up making 25 * 800 = 20,000 reads.

Second Merge Step
Total number of reads for the two steps: 25,600 + 20,000 = 45,600
Compute the total number of writes for the two steps.
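The read counts for the two-step merge can be checked with a small function. This is a sketch assuming memory-sized initial segments and a segment count that is an exact multiple of the group size; `two_step_reads` is an illustrative name.

```python
def two_step_reads(total_segments, group_size):
    """Reads for a two-step merge of memory-sized segments.

    Step 1: total_segments/group_size separate group_size-way merges.
    Step 2: one merge of the resulting larger segments.
    Returns (first-step reads, second-step reads, total).
    """
    groups = total_segments // group_size
    # Step 1: each buffer holds 1/group_size of a segment, so each of the
    # group_size segments in a group needs group_size reads.
    first = groups * group_size * group_size
    # Step 2: each merged segment is group_size memory loads long and is
    # read through a buffer of 1/groups of memory.
    reads_per_big_segment = group_size * groups
    second = groups * reads_per_big_segment
    return first, second, first + second


print(two_step_reads(800, 32))
# prints (25600, 20000, 45600)
```

Compared with the 640,000 reads of the single 800-way merge, the two-step merge needs far fewer seeks even though every record is read twice.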

Increasing segment Lengths
Assume the initial segments contain 200,000 records. Then instead of an 800-way merge we need a 400-way merge.
A longer initial segment means
– fewer segments in total,
– a lower-order merge,
– bigger buffers,
– fewer seeks.
How can we create initial segments that are twice as large as the number of records that we can hold in memory? => Replacement selection (discussed later)

Merging Sorted Segments
One disk drive:
– Merging requires all the sorted segments to be combined into one sorted file.
– So sorting an unordered file requires segmented sorting followed by a merging process to finish the job.
– The merge can be k-way, with k = 2, …, m, where m is the number of segments.

k-way Merge of Sorted Segments
Two-way (k=2) merge:
– Normally the memory size is limited to the size of one segment, so two full segments cannot be read into memory for the merge.
– So halves of the two segments are read in first and merged while the result is written out at the same time; then the remainder of whichever segment is exhausted is read in, followed by the remainder of the other segment, and so on.

2-way Merge of 2 Sorted Segments
Assume each segment contains b' blocks. The time is
2*2*b'*btt + 4*(s+r) + 3*(s+r)
– The first term covers reading and writing two segments of b' blocks each; the second term covers 4 seeks and rotational latencies for reading the four segment halves; the third covers 3 seeks and rotational latencies for writing the merged segment.

k-way Merge Sort
After each 2-way merging pass, the number of sorted segments is halved. So if the total number of blocks is b and the original number of segments is m, log2 m passes are required in total.
– For large files, where segments become bigger and bigger, the time for one 2-way merge pass can be approximated as
  2*b*btt + 2*2*m*(r+s)
  i.e. two reads and two writes for each original segment, requiring 2*(s+r) for reading and 2*(s+r) for writing, repeated for m segments.
– Note that m is the original number of memory-size segments.

k-way Merge Sort of entire file
The time required for each k-way merge pass is
2*b*btt + 2*k*m*(r+s)
Total time for sorting a pile file using one disk drive and a k-way merge:
TT = 2*b*btt + (log_k m)*[2*k*m*(s+r) + 2*b*btt]
where the first term is for the creation of the sorted segments and the second term is the time taken by the log_k m k-way merge passes. Note that m is the number of memory-size segments.
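The total-time formula can be turned into a small calculator. The sketch below makes one assumption that the slide does not state: the number of passes is taken as a whole number, found by repeatedly dividing the segment count by k and rounding up, instead of the real-valued log_k m. `merge_passes` and `total_sort_time` are illustrative names, and the sample values are invented.

```python
import math


def merge_passes(m, k):
    """Whole number of k-way passes needed to reduce m segments to one."""
    passes, segments = 0, m
    while segments > 1:
        segments = math.ceil(segments / k)
        passes += 1
    return passes


def total_sort_time(b, m, k, btt, s, r):
    """TT = 2*b*btt + passes * [2*k*m*(s+r) + 2*b*btt].

    b: blocks in the file, m: memory-size segments, k: merge order,
    btt: block transfer time, s: seek time, r: rotational latency.
    """
    create_segments = 2 * b * btt
    per_pass = 2 * k * m * (s + r) + 2 * b * btt
    return create_segments + merge_passes(m, k) * per_pass


print(merge_passes(800, 32))  # prints 2
print(total_sort_time(b=100, m=4, k=2, btt=1, s=1, r=1))  # prints 664
```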

Which value of k is the best for a given m, r, and s?
Find the value of k for which the above formula (total file sort time) is minimum. After replacing log_k m by ln m/ln k, taking the derivative of TT with respect to k, setting it to zero, and simplifying, we obtain
k*(ln k - 1) = (b*btt)/(m*(r+s))
– Solving this equation for k, for fixed b, m, btt, r, and s, should give the optimal value. This may be a very rough estimate!
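Instead of solving the equation analytically, the best merge order can also be found by brute force: evaluate TT for every integer k from 2 to m and keep the smallest. The sketch below does that with the real-valued pass count ln m/ln k; the function names and the disk parameters in the usage line are invented for illustration only.

```python
import math


def tt(b, m, k, btt, s, r):
    """Total sort time with a real-valued number of passes ln(m)/ln(k)."""
    passes = math.log(m) / math.log(k)
    return 2 * b * btt + passes * (2 * k * m * (s + r) + 2 * b * btt)


def best_k(b, m, btt, s, r):
    """Integer merge order in 2..m that minimises tt()."""
    return min(range(2, m + 1), key=lambda k: tt(b, m, k, btt, s, r))


# Hypothetical disk: 80,000 blocks, 80 segments, 1 ms per block transfer,
# 12 ms seek time, 8 ms rotational latency.
k = best_k(b=80_000, m=80, btt=0.001, s=0.012, r=0.008)
print(k, tt(80_000, 80, k, 0.001, 0.012, 0.008))
```

Because TT is quite flat near its minimum, neighbouring values of k give almost the same total time, which fits the remark that the result is only a rough estimate.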

Two Disk Drives or more for merge
With two disk drives and a large memory for merging, under ideal conditions input and output are overlapped, given a proper arrangement of buffering.
Can more than two disk drives help improve the performance?
– Yes! Seek times for different segments on different disk drives can be overlapped.

Heap Sort with Replacement Selection
Idea:
– always select the key in memory that has the lowest value,
– output that key,
– replace it with a new key from the input list.

Input: 21, 67, 12, 5, 47, 16 (the front of the input is at the right)

Remaining input | Memory (size P=3) | Output segment
21, 67, 12      | 5   47  16        | _
21, 67          | 12  47  16        | 5
21              | 67  47  16        | 12, 5
_               | 67  47  21        | 16, 12, 5
_               | 67  47  _         | 21, 16, 12, 5
_               | 67  _   _         | 47, 21, 16, 12, 5
_               | _   _   _         | 67, 47, 21, 16, 12, 5

What about a key that arrives in memory too late to be output in its proper position? => use a second heap.

Trace of replacement selection
Input: 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16
Assume a memory of size P=3.

Replacement Selection with two disks
Algorithm:
1. Construct a heap (the primary heap) in memory while reading records block by block from the first disk drive.
2. As records are moved from the heap to the output buffer, replace them with records from the input buffer. A new record whose key is smaller than the key just written out cannot join the current segment, so it is placed in a secondary heap. The other new records are inserted into the primary heap.
3. Repeat step 2 as long as there are records left in the primary heap and there are records to be read.
4. When the primary heap is empty, make the secondary heap the primary heap and repeat steps 1-3.
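The steps above can be sketched with Python's standard `heapq` module. This is a simplified in-memory model, not the two-disk implementation: records are plain keys taken from an iterable in arrival order, output segments are collected as lists, and the secondary heap is kept as a plain list that is heapified when it becomes the primary heap. The name `replacement_selection` and the `_END` sentinel are illustrative.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking the end of the input


def replacement_selection(records, memory_size):
    """Split records into sorted segments using replacement selection.

    records are consumed in arrival order; at most memory_size keys are
    held in the primary and secondary heaps together.
    """
    it = iter(records)
    primary = list(islice(it, memory_size))  # step 1: fill memory
    heapq.heapify(primary)
    secondary, segments, current = [], [], []
    while primary:
        smallest = heapq.heappop(primary)
        current.append(smallest)             # step 2: output the root
        nxt = next(it, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(primary, nxt)  # still fits the current segment
            else:
                secondary.append(nxt)         # arrived too late, hold it back
        if not primary:                       # step 4: start the next segment
            segments.append(current)
            current = []
            primary, secondary = secondary, []
            heapq.heapify(primary)
    return segments


# Keys of the small example, listed in arrival order (front of input first).
print(replacement_selection([16, 47, 5, 12, 67, 21], 3))
# prints [[5, 12, 16, 21, 47, 67]]
```

With a memory of three keys, all six records end up in a single segment, twice the size of memory, matching the small trace earlier in the slides.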