External Sorting Adapt fastest internal-sort methods.

Slides:



Advertisements
Similar presentations
CS 400/600 – Data Structures External Sorting.
Advertisements

Sorting Really Big Files Sorting Part 3. Using K Temporary Files Given  N records in file F  M records will fit into internal memory  Use K temp files,
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Improve Run Merging Reduce number of merge passes.  Use higher order merge.  Number of passes = ceil(log k (number of initial runs)) where k is the merge.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting “There it was, hidden in alphabetical order.” Rita Holt R&G Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Cosequential Processing Chapter 8. Cosequential processing model Two or more input files sorted the same way on the same keys set current record to first.
External Sorting R & G Chapter 13 One of the advantages of being
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
CS 4432lecture #31 CS4432: Database Systems II Lecture #3 Professor Elke A. Rundensteiner.
External Sorting Access to secondary storage is orders of magnitude slower than memory access. Minimize access to secondary storage (tape or disk).
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
Improve Run Generation Overlap input,output, and internal CPU work. Reduce the number of runs (equivalently, increase average run length). DISK MEMORY.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.
External Sorting Sort n records/elements that reside on a disk. Space needed by the n records is very large.  n is very large, and each record may be.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.
Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Indexing.
External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.
Memory Management during Run Generation in External Sorting – Larson & Graefe.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
CPSC 461 Final Review I Hessam Zakerzadeh Dina Said.
Lecture 6 : External Sorting Bong-Soo Sohn Assistant Professor School of Computer Science and Engineering Chung-Ang University.
1 B + -Trees: Search  If there are n search-key values in the file,  the path is no longer than  log  f/2  (n)  (worst case).
FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapters 13: 13.1—13.5.
CPSC Why do we need Sorting? 2.Complexities of few sorting algorithms ? 3.2-Way Sort 1.2-way external merge sort 2.Cost associated with external.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
DMBS Architecture May 15 th, Generic Architecture Query compiler/optimizer Execution engine Index/record mgr. Buffer manager Storage manager storage.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
External Sorting Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
1 Lecture 16: Data Storage Wednesday, November 6, 2006.
CENG 3511 External Sorting. CENG 3512 Outline Introduction Heapsort Multi-way Merging Multi-step merging Replacement Selection in heap-sort.
Lecture 16: Data Storage Wednesday, November 6, 2006.
External Sorting Sort n records/elements that reside on a disk.
External Sort Any sort algorithm which uses external memory, such as tape or disk, during the sort. The best algorithms for processing large amounts of.
Chapter 12: Query Processing
Lecture 11: DMBS Internals
External Sorting Adapt fastest internal-sort methods.
Database Management Systems (CS 564)
Lecture#12: External Sorting (R&G, Ch13)
Improve Run Merging Reduce number of merge passes.
Improve Run Generation
CS222P: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Lecture 2- Query Processing (continued)
CS222: Principles of Data Management Lecture #10 External Sorting
Chapter 12 Query Processing (1)
General External Merge Sort
제 7 장 Cosequential Processing and the Sorting of Large Files
External Sorting.
CS222P: Principles of Data Management Lecture #10 External Sorting
CENG 351 Data Management and File Structures
External Sorting Adapt fastest internal-sort methods.
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
External Sorting Dina Said
Presentation transcript:

External Sorting Adapt fastest internal-sort methods. Quick sort …best average run time. Merge sort … best worst-case run time. Adaptation of quicksort exposed the need for data structures for double-ended priority queues; concept of buffers (portions of memory to read into and write from); and buffer management to reduce time spent waiting for a buffer to empty or fill to/from disk.

Internal Merge Sort Review Phase 1 Create initial sorted segments Natural segments Insertion sort Phase 2 Merge pairs of sorted segments, in merge passes, until only 1 segment remains.

External Merge Sort Sort 10,000 records. Enough memory for 500 records. Block size is 100 records. tIO = time to input/output 1 block (includes seek, latency, and transmission times) tIS = time to internally sort 1 memory load tIM = time to internally merge 1 block load

External Merge Sort Two phases. Run generation. Run merging. A run is a sorted sequence of records. Run merging. Iterative internal merge sort starts with sorted segments; performs merge passes in each of which the number of sorted segments becomes half; stops when we have just one sorted segment. The sorted segments may be created using insertion sort say to get initial segments of size 15 (say).

Run Generation 10,000 records 100 blocks 500 records 5 blocks DISK MEMORY 10,000 records 100 blocks 500 records 5 blocks Input 5 blocks. Sort. Output as a run. Do 20 times. 5tIO tIS 200tIO + 20tIS

Run Merging Merge Pass. Pairwise merge the 20 runs into 10. In a merge pass all runs (except possibly one) are pairwise merged. Perform 4 more merge passes, reducing the number of runs to 1.

Merge 20 Runs S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 T1 T2 T3 T4 T5 U1 U2 U3 V1 V2 W1

Merge R1 and R2 Fill I0 (Input 0) from R1 and I1 from R2. DISK Output Input 0 Input 1 Fill I0 (Input 0) from R1 and I1 from R2. Merge from I0 and I1 to output buffer. Write whenever output buffer full. Read whenever input buffer empty.

Time To Merge R1 and R2 Each is 5 blocks long. Input time = 10tIO. Write/output time = 10tIO. Merge time = 10tIM. Total time = 20tIO + 10tIM .

Time For Pass 1 (R S) Time to merge one pair of runs = 20tIO + 10tIM . Time to merge all 10 pairs of runs = 200tIO + 100tIM .

Time To Merge S1 and S2 Each is 10 blocks long. Input time = 20tIO. Write/output time = 20tIO. Merge time = 20tIM. Total time = 40tIO + 20tIM .

Time For Pass 2 (S T) Time to merge one pair of runs = 40tIO + 20tIM . Time to merge all 5 pairs of runs = 200tIO + 100tIM .

Time For One Merge Pass Time to input all blocks = 100tIO. Time to output all blocks = 100tIO. Time to merge all blocks = 100tIM . Total time for a merge pass = 200tIO + 100tIM . Assumes all runs are paired. When merging an odd number of runs, one run doesn’t go through internal merging but is simply copied from input to output.

Total Run-Merging Time (time for one merge pass) * (number of passes) = (time for one merge pass) * ceil(log2(number of initial runs)) = (200tIO + 100tIM) * ceil(log2(20)) = (200tIO + 100tIM) * 5

Factors In Overall Run Time Run generation. 200tIO + 20tIS Internal sort time. Input and output time. Run merging. (200tIO + 100tIM) * ceil(log2(20)) Internal merge time. Number of initial runs. Merge order (number of merge passes is determined by number of runs and merge order)

Improve Run Generation Overlap input, output, and internal sorting. DISK MEMORY

Improve Run Generation Generate runs whose length (on average) exceeds memory size. Equivalent to reducing number of runs generated.

Improve Run Merging Overlap input, output, and internal merging. DISK MEMORY

Improve Run Merging Reduce number of merge passes. Use higher-order merge. Number of passes = ceil(logk(number of initial runs)) where k is the merge order.

Merge 20 Runs Using 5-Way Merging T1 Number of passes = 2

I/O Time Per Merge Pass Number of input buffers needed is linear in merge order k. Since memory size is fixed, block size decreases as k increases (after a certain k). So, number of blocks increases. So, number of seek and latency delays per pass increases.

I/O Time Per Merge Pass merge order k I/O time per pass Time = c*(seek + latency time) + d * (transmission time) = e*k + f Here, c, d, e, and f are constants.

Total I/O Time To Merge Runs (I/O time for one merge pass) * ceil(logk(number of initial runs)) Total I/O time to merge runs merge order k

Internal Merge Time R1 R2 R3 R4 R5 R6 O Naïve way => k – 1 compares to determine next record to move to the output buffer. Time to merge n records is c(k – 1)n, where c is a constant. Merge time per pass is c(k – 1)n. Total merge time is c(k – 1)nlogkr ~ cn(k/log2k) log2r.

Merge Time Using A Tournament Tree Time to merge n records is dnlog2k, where d is a constant. Merge time per pass is dnlog2k. Total merge time is (dnlog2k) logkr = dnlog2r.