CENG 351 Data Management and File Structures, Fall 2004: External Sorting. Reference: Chapter 8.



Outline
– Heapsort
– Multi-way merging
– Multi-step merging
– Replacement selection

External Sorting
Problem: sort 1 GB of data with 1 MB of RAM.
When a file doesn’t fit in memory, sorting has two stages:
1. The file is divided into several segments, each of which is sorted separately.
2. The sorted segments are merged.
Each stage involves reading and writing the file at least once.

Sorting Segments
1. Heapsort: the best choice if only one disk drive is available. Input/output can be overlapped with processing. Each sorted segment is the size of the available memory.
2. Replacement selection: the best choice for two or more disk drives. Sorted segments are, on average, twice the size of memory. Reading in and writing out can be overlapped.

Heapsort
What is a heap? A heap is a binary tree with the following properties:
1. Each node has a single key, and that key is greater than or equal to the key at its parent node.
2. It is a complete binary tree, i.e. all leaves are on at most two levels, and the leaves on the lowest level are in the leftmost positions.
3. It can be stored in an array: the root is at index 1, and the children of node i are at indexes 2*i and 2*i+1. Conversely, the parent of node j is at index ⌊j/2⌋. This is very compact: there is no need to store pointers.

Example
Figure: a heap drawn as a binary tree (height = ⌊log n⌋) and the same heap stored as an array.

Heapsort Algorithm
First stage: building the heap while reading the file.
– While there is available space:
  – Get the next record from the current input buffer.
  – Put the new record at the end of the heap.
  – Reestablish the heap by exchanging the new node with its parent if it is smaller than the parent; otherwise leave it where it is. Repeat this step as long as the heap property is violated.
Second stage: sorting while writing the heap out to the file.
– While there are records in the heap:
  – Put the root record in the current output buffer.
  – Replace the root by the last record in the heap.
  – Restore the heap, which takes O(log n) time.
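The two stages above can be sketched in C++ as follows. This is a minimal in-memory illustration, not the course's implementation: the names `MinHeap`, `insert`, `removeMin` and `heapsortSegment` are hypothetical, keys are plain `int` values rather than records, and the input and output buffers are replaced by vectors.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Min-heap stored in a vector with 1-based indexing (slot 0 is unused),
// so the children of node i are at 2*i and 2*i+1 and the parent is at i/2.
struct MinHeap {
    std::vector<int> a{0};  // a[0] is a dummy slot

    std::size_t size() const { return a.size() - 1; }
    bool empty() const { return size() == 0; }

    // First stage: put the key at the end of the heap, then exchange it
    // with its parent for as long as it is smaller than the parent.
    void insert(int key) {
        a.push_back(key);
        std::size_t i = size();
        while (i > 1 && a[i] < a[i / 2]) {
            std::swap(a[i], a[i / 2]);
            i /= 2;
        }
    }

    // Second stage: take the root, replace it by the last key, then
    // exchange that key with its smaller child until order is restored.
    // Precondition: the heap is not empty.
    int removeMin() {
        int root = a[1];
        a[1] = a.back();
        a.pop_back();
        std::size_t n = size();
        std::size_t i = 1;
        while (2 * i <= n) {
            std::size_t child = 2 * i;
            if (child + 1 <= n && a[child + 1] < a[child]) ++child;
            if (a[i] <= a[child]) break;
            std::swap(a[i], a[child]);
            i = child;
        }
        return root;
    }
};

// Sorts one memory-sized segment: insert every key, then repeatedly
// remove the minimum.
std::vector<int> heapsortSegment(const std::vector<int>& input) {
    MinHeap h;
    for (int key : input) h.insert(key);
    std::vector<int> out;
    while (!h.empty()) out.push_back(h.removeMin());
    return out;
}
```

Calling `heapsortSegment` on the keys of one segment returns them in ascending order; in the external setting each `insert` would be fed from the input buffer and each `removeMin` would fill the output buffer.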

Example
Trace the algorithm with: (input values not preserved in the transcript)

Heapsort
How big is a heap?
– As big as the available memory.
How long does it take to create the sorted segments?
– Ignore the seek time, and assume that the file has b blocks and that heap processing overlaps (approximately) with I/O.
– The time for creating the initial sorted segments is then 2b * ebt, where ebt is the effective block transfer time: b blocks are read in as segments and b blocks are written out as runs.
– Note that the entire file has not been sorted yet. These are just sorted segments, and the size of each segment is limited to the size of the memory available for this purpose.

Merging Two Lists
int Merge (char* L1Name, char* L2Name, char* OLName)
{
    InitializeList (1, L1Name);
    InitializeList (2, L2Name);
    InitOutputList (OLName);
    bool More1 = NextItem (1);   // advance to the first item of each list
    bool More2 = NextItem (2);
    // Assumes Item(n) yields a key larger than any real key once list n is
    // exhausted, so the comparisons below still work at the end of a list.
    while (More1 || More2) {
        if (Item(1) < Item(2)) {          // list 1 has the smaller item
            ProcessItem(1);
            More1 = NextItem (1);
        } else if (Item(1) == Item(2)) {  // same item in both lists: output once
            ProcessItem(1);
            More1 = NextItem (1);
            More2 = NextItem (2);
        } else {                          // list 2 has the smaller item
            ProcessItem(2);
            More2 = NextItem (2);
        }
    }
    FinishUp ( );
    return 1;
}

Multiway Merging
K-way merge: we want to merge K input lists to create a single sequentially ordered output list. (K is the order of a K-way merge.)
We adapt the 2-way merge algorithm:
– Instead of two lists, keep an array of lists: list[0], list[1], … list[k-1]
– Keep an array of the items currently in use from each list: item[0], item[1], … item[k-1]
– The merge processing calls a function (say MinIndex) to find the index of the item with the minimum value.

Merge Processing
We modify the main loop of the merge as follows:
    int minItem = MinIndex(Item, k);
    ProcessItem(minItem); // next output
    for (i = 0; i < k; i++)
        if (Item(minItem) == Item(i))
            MoreItems[i] = NextItemInList(i);
If there are no duplicate items among different lists, the for-loop can be replaced by a single call that advances only list minItem.
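A self-contained sketch of this loop, under a few assumptions: the lists are in-memory vectors of `int`, each list is sorted and has no internal duplicates, and the names `minIndex` and `kWayMerge` are hypothetical stand-ins for the slide's `MinIndex` and main loop. The smallest value is saved before any list is advanced, so the duplicate check always compares against the item that was just written.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the list whose current item is smallest, or -1
// when every list is exhausted. Sequential search over K lists: O(K).
int minIndex(const std::vector<std::vector<int>>& lists,
             const std::vector<std::size_t>& pos) {
    int best = -1;
    for (std::size_t i = 0; i < lists.size(); ++i) {
        if (pos[i] >= lists[i].size()) continue;  // list i is exhausted
        if (best < 0 || lists[i][pos[i]] < lists[best][pos[best]])
            best = static_cast<int>(i);
    }
    return best;
}

// Merges K sorted lists. An item that appears in several lists is
// written once, and every list currently holding it is advanced.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& lists) {
    std::vector<std::size_t> pos(lists.size(), 0);  // current position per list
    std::vector<int> out;
    for (int m = minIndex(lists, pos); m >= 0; m = minIndex(lists, pos)) {
        int value = lists[m][pos[m]];
        out.push_back(value);  // next output item
        for (std::size_t i = 0; i < lists.size(); ++i)
            if (pos[i] < lists[i].size() && lists[i][pos[i]] == value)
                ++pos[i];
    }
    return out;
}
```

Merging the lists {1, 4, 7}, {2, 4, 8} and {3, 9} this way yields 1, 2, 3, 4, 7, 8, 9, with the shared key 4 written once.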

Finding the minimum item
When the number of lists is small (K ≤ 8), sequential search among the items works nicely (O(K)).
When the number of lists is large, we can place the items in a priority queue (an array heap):
– The minimum value is at the root (first position in the array).
– Replace the root with the next value from the associated list. This insert operation is O(log K).
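The priority-queue variant can be sketched with the standard library's `std::priority_queue`. The function name `heapMerge` and the use of in-memory vectors as runs are assumptions made for illustration. Unlike the loop on the previous slide, this sketch keeps duplicate keys that occur in different runs.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// K-way merge that keeps one current item per run in a min-heap.
// Each removal and re-insertion costs O(log K).
std::vector<int> heapMerge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::size_t>;  // (key, index of source run)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);  // current position per run

    // Seed the heap with the first item of every non-empty run.
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<int> out;
    while (!heap.empty()) {
        Entry e = heap.top();  // minimum key is at the root
        heap.pop();
        out.push_back(e.first);
        std::size_t r = e.second;
        // Replace the removed item with the next one from the same run.
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
    return out;
}
```

With K runs holding N items in total, the merge performs N removals and at most N insertions, so the comparison work is O(N log K) instead of the O(N K) of a sequential search.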

Merging as a Way of Sorting Large Files
Consider the following example.
File to be sorted:
– 8,000,000 records
– Record size R = 100 bytes
– Key size = 10 bytes
Memory available as a work area: 10 MB (not counting memory used to hold the program, O.S., I/O buffers, etc.)
=> Total file size = 800 MB
=> Total number of bytes for all keys = 80 MB
So we can do neither internal sorting nor keysorting.

Basic idea
1. Form runs (i.e. sorted subfiles): bring as many records as possible into main memory, sort them using heapsort, and save them into a small file. Repeat until all records of the original file have been read.
2. Do a multiway merge of the sorted subfiles.
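Step 1 can be sketched as below. The sketch assumes the file is an in-memory vector of `int` keys and that `capacity` (at least 1) is the number of keys that fit in the work area; `formRuns` is a hypothetical name. The in-memory sort uses `std::make_heap` followed by `std::sort_heap`, which together perform a heapsort and leave the run in ascending order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Splits the file into memory-sized chunks and heapsorts each chunk,
// producing one sorted run per chunk. Assumes capacity >= 1.
std::vector<std::vector<int>> formRuns(const std::vector<int>& file,
                                       std::size_t capacity) {
    std::vector<std::vector<int>> runs;
    for (std::size_t start = 0; start < file.size(); start += capacity) {
        std::size_t end = std::min(start + capacity, file.size());
        // "Read" as many keys as fit in memory.
        std::vector<int> run(file.begin() + start, file.begin() + end);
        std::make_heap(run.begin(), run.end());  // build a max-heap
        std::sort_heap(run.begin(), run.end());  // heapsort into ascending order
        runs.push_back(run);                     // "write" the sorted run
    }
    return runs;
}
```

The runs produced here are what step 2 consumes: any multiway merge over sorted runs then yields the fully sorted file.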

Cost of Merge Sort
I/O operations are performed at the following times:
1. Reading each record into main memory for sorting and forming the runs.
2. Writing the sorted runs to disk.
These two steps are done as follows:
– Read a chunk of 10 MB, write a chunk of 10 MB (repeat this 80 times).
– In terms of basic disk operations, we spend:
  – For reading: 80 seeks + transfer time for 800 MB
  – For writing: the same

Cost of Merge Sort (continued)
3. Reading runs into memory for merging.
– Read one chunk of each run, so 80 chunks.
– Since the available memory is 10 MB, each chunk can have 10,000,000/80 bytes = 125,000 bytes = 1250 records.
– How many chunks must be read for each run? Size of run / size of chunk = 10,000,000/125,000 = 80.
– Total number of seeks = total number of chunks (counting all runs) = 80 runs * 80 chunks/run = 80^2 chunks = 6400 seeks.
– Reading each chunk involves an average seek.

Cost of Merge Sort (continued)
4. Writing the sorted file to disk: the number of seeks depends on the size of the output buffer, i.e. bytes in file / bytes in output buffer. E.g. if the output buffer is 200 KB, the number of seeks is 800,000,000/200,000 = 4,000.
Among steps 1-4, step 3 dominates the running time.

Sorting a File that is 10 Times Larger
How is the time for the merge phase affected if the file has 80 million records?
– More runs: 800 runs
– 800-way merge in 10 MB of memory, i.e. divide the memory into 800 buffers.
– Each buffer holds 1/800th of a run.
– So, 800 runs * 800 seeks/run = 640,000 seeks

The cost of increasing the file size
In general, for a K-way merge of K runs, the buffer size for each run is
– (1/K) * size of memory space = (1/K) * size of each run
So K seeks are required to read all of the records in each run. Since there are K runs, the merge requires K^2 seeks.
Because K is directly proportional to N, the number of records in the file, it also follows that the sort merge is an O(N^2) operation when measured in seeks.
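The seek counts from the two worked examples can be reproduced with a few lines of arithmetic. The function names below are hypothetical, and the sketch assumes that every run exactly fills the work area and that the memory is divided evenly among the K input buffers.

```cpp
#include <cstdint>

// Number of initial runs when each run fills the work area.
std::int64_t runCount(std::int64_t fileBytes, std::int64_t memoryBytes) {
    return (fileBytes + memoryBytes - 1) / memoryBytes;  // ceiling division
}

// Seeks needed to read the runs in a single K-way merge: each of the
// K runs is read in K buffer loads, one seek per load.
std::int64_t singleStepMergeSeeks(std::int64_t k) {
    return k * k;
}

// Seeks needed to write the merged file through one output buffer.
std::int64_t outputSeeks(std::int64_t fileBytes, std::int64_t bufferBytes) {
    return (fileBytes + bufferBytes - 1) / bufferBytes;  // ceiling division
}
```

For the 800 MB file with 10 MB of memory, `runCount` gives 80 runs and `singleStepMergeSeeks(80)` gives 6,400 seeks; for the file ten times larger it gives 800 runs and 640,000 seeks, a hundredfold increase.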

Improvements
There are several ways to reduce the time:
1. Allocate more hardware (e.g. disk drives, memory).
2. Perform the merge in more than one step.
3. Algorithmically increase the lengths of the initial sorted runs.
4. Find ways to overlap I/O operations.

Multiple-step merges
Instead of merging all runs at once, we break the original set of runs into small groups and merge the runs in these groups separately.
– More buffer space is available for each run; hence fewer seeks are required per run.
When all of the smaller merges are completed, a second pass merges the new set of merged runs.

Figure: two-step merge of 800 runs (25 sets of 32 runs each).

Cost of multi-step merge
25 sets of 32 runs, followed by a 25-way merge:
– Disadvantage: we read every record twice.
– Advantage: we can use larger buffers and avoid a large number of disk seeks.
Calculations
First merge step:
– Buffer size = 1/32 run => 32 * 32 = 1024 seeks per 32-way merge
– For 25 32-way merges => 25 * 1024 = 25,600 seeks

Cost of multi-step merge (continued)
Second merge step:
– For each of the 25 final runs, 1/25 of the buffer space is allocated.
– So each input buffer can hold 4000 records (or 1/800 of a run).
– Hence 800 seeks per run, so we end up making 25 * 800 = 20,000 seeks.
Total number of seeks for the two steps: 25,600 + 20,000 = 45,600
What about the total time for the merge?
– We now have to transmit all of the records 4 times instead of 2.
– We also write the records twice, requiring an extra 40,000 seeks.
Still, the trade is profitable (see sections for actual times).
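The read-seek totals for the two-step merge follow a simple pattern that can be checked in code. The function names are hypothetical; the sketch assumes every initial run fills the work area and that each merge divides the memory evenly among its input buffers.

```cpp
#include <cstdint>

// Read seeks for a two-step merge of (groups * runsPerGroup) initial runs.
std::int64_t twoStepMergeSeeks(std::int64_t groups, std::int64_t runsPerGroup) {
    // First step: each group is a runsPerGroup-way merge, so each of its
    // runs is read in runsPerGroup buffer loads.
    std::int64_t first = groups * runsPerGroup * runsPerGroup;
    // Second step: each merged run is runsPerGroup memory loads long and is
    // read through a buffer of 1/groups of memory, which takes
    // runsPerGroup * groups seeks per run.
    std::int64_t second = groups * (runsPerGroup * groups);
    return first + second;
}

// Read seeks for merging the same runs in a single step.
std::int64_t oneStepMergeSeeks(std::int64_t groups, std::int64_t runsPerGroup) {
    std::int64_t k = groups * runsPerGroup;
    return k * k;
}
```

For 25 groups of 32 runs, `twoStepMergeSeeks` gives 25,600 + 20,000 = 45,600 seeks, compared with 640,000 from `oneStepMergeSeeks`. Neither function counts the seeks spent writing output.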

Increasing Run Lengths
Assume the initial runs contain 200,000 records. Then instead of an 800-way merge we need a 400-way merge.
A longer initial run means
– fewer total runs,
– a lower-order merge,
– bigger buffers,
– fewer seeks.
How can we create initial runs that are twice as large as the number of records that we can hold in memory?
=> Replacement selection

Replacement Selection
Idea:
– always select the key in memory that has the lowest value,
– output that key,
– replace it with a new key from the input list.

Replacement Selection: example
Input: 21, 67, 12, 5, 47, 16 (front of input on the right)

Remaining input   Memory (P=3)   Output run (latest key on the left)
21, 67, 12        5  47 16       _
21, 67            12 47 16       5
21                67 47 16       12, 5
_                 67 47 21       16, 12, 5
_                 67 47 _        21, 16, 12, 5
_                 67 _  _        47, 21, 16, 12, 5
_                 _  _  _        67, 47, 21, 16, 12, 5

What about a key that arrives in memory too late to be output in its proper position? => use a second heap

Trace of replacement selection
Input (P = 3): 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16

Replacement Selection with Two Disks
Algorithm:
1. Construct a heap (the primary heap) in memory while reading records block by block from the first disk drive.
2. As we move records from the heap to the output buffer, we replace them with records from the input buffer. If a new record has a key smaller than a key already written out, it goes into a secondary heap; the other new records are inserted into the primary heap.
3. Repeat step 2 as long as there are records left in the primary heap and there are records to be read. Once the input is exhausted, keep writing out the primary heap until it is empty.
4. When the primary heap is empty, the current run is complete: make the secondary heap the primary heap and repeat steps 1-3.
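The algorithm can be sketched with two standard-library priority queues standing in for the primary and secondary heaps. This is an in-memory illustration under stated assumptions: `replacementSelection` is a hypothetical name, `input` lists the keys in arrival order, `capacity` (at least 1) is the number of keys that fit in memory, and disk buffers are replaced by vectors.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Produces the sorted runs that replacement selection would write out.
std::vector<std::vector<int>> replacementSelection(const std::vector<int>& input,
                                                   std::size_t capacity) {
    using MinHeap = std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap primary, secondary;
    std::size_t next = 0;  // index of the next unread key

    // Step 1: fill memory with the first `capacity` keys.
    while (next < input.size() && primary.size() < capacity)
        primary.push(input[next++]);

    std::vector<std::vector<int>> runs;
    while (!primary.empty()) {
        std::vector<int> run;
        // Steps 2-3: output the smallest key and replace it from the input.
        while (!primary.empty()) {
            int smallest = primary.top();
            primary.pop();
            run.push_back(smallest);
            if (next < input.size()) {
                int key = input[next++];
                // A key smaller than the one just written is too late for
                // this run, so it waits in the secondary heap.
                if (key < smallest) secondary.push(key);
                else primary.push(key);
            }
        }
        runs.push_back(run);
        // Step 4: the secondary heap becomes the primary heap.
        primary.swap(secondary);
    }
    return runs;
}
```

Assuming the front of the trace input is its rightmost key, feeding 16, 47, 5, 12, 67, 21, 7, 17, 14, 58, 24, 18, 33 with a capacity of 3 produces two runs, 5, 12, 16, 21, 47, 67 and 7, 14, 17, 18, 24, 33, 58, both longer than the three keys that fit in memory.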