Improve Run Merging Reduce number of merge passes.  Use higher order merge.  Number of passes = ceil(log k (number of initial runs)) where k is the merge.

Slides:



Advertisements
Similar presentations
CS 400/600 – Data Structures External Sorting.
Advertisements

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Join Processing in Databases Systems with Large Main Memories
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
CS 257 Database Systems Principles Assignment 2 Instructor: Student: Dr. T. Y. Lin Rajan Vyas (119)
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Cosequential Processing Chapter 8. Cosequential processing model Two or more input files sorted the same way on the same keys set current record to first.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
CPSC 231 Sorting Large Files (D.H.)1 LEARNING OBJECTIVES Sorting of large files –merge sort –performance of merge sort –multi-step merge sort.
CS 4432lecture #31 CS4432: Database Systems II Lecture #3 Professor Elke A. Rundensteiner.
6/27/20151 PSU’s CS External Sorting  Motivation  2-way External Sort: Memory, passes,cost  General External Sort: Memory, passes, cost  Optimizations.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Improve Run Generation Overlap input,output, and internal CPU work. Reduce the number of runs (equivalently, increase average run length). DISK MEMORY.
1 CSE 326: Data Structures: Sorting Lecture 17: Wednesday, Feb 19, 2003.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.
Lecture 9 of Advanced Databases Storage and File Structure (Part II) Instructor: Mr.Ahmed Al Astal.
External Sorting Sort n records/elements that reside on a disk. Space needed by the n records is very large.  n is very large, and each record may be.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Duality between Reading and Writing with Applications to Sorting Jeff Vitter Department of Computer Science Center for Geometric & Biological Computing.
6 Memory Management and Processor Management Management of Resources Measure of Effectiveness – On most modern computers, the operating system serves.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.
Database Management Systems, R. Ramakrishnan and J. Gehrke 1 External Sorting Chapter 13.
Memory Management during Run Generation in External Sorting – Larson & Graefe.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and Chapter 11 (Sec ): G-M et al. (R2) OR.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
Virtual Memory The memory space of a process is normally divided into blocks that are either pages or segments. Virtual memory management takes.
Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.
Lecture 6 : External Sorting Bong-Soo Sohn Assistant Professor School of Computer Science and Engineering Chung-Ang University.
External Sorting Adapt fastest internal-sort methods.
FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
4P13 Week 9 Talking Points
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapters 13: 13.1—13.5.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
External Sorting Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
The Teacher Computing Scheduling [CP5]. The Teacher Computing Multiprogramming A number of jobs are queued and loaded into memory to be processed. A number.
CENG 3511 External Sorting. CENG 3512 Outline Introduction Heapsort Multi-way Merging Multi-step merging Replacement Selection in heap-sort.
External Sorting Sort n records/elements that reside on a disk.
External Sort Any sort algorithm which uses external memory, such as tape or disk, during the sort. The best algorithms for processing large amounts of.
External Sorting Chapter 13
External Sorting Adapt fastest internal-sort methods.
Improve Run Merging Reduce number of merge passes.
External Sorting Chapter 13
Selected Topics: External Sorting, Join Algorithms, …
Improve Run Generation
CS222P: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Lecture 2- Query Processing (continued)
CS222: Principles of Data Management Lecture #10 External Sorting
External Sorting.
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
CS222P: Principles of Data Management Lecture #10 External Sorting
These notes were largely prepared by the text’s author
CENG 351 Data Management and File Structures
External Sorting Adapt fastest internal-sort methods.
External Sorting Chapter 13
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Presentation transcript:

Improve Run Merging Reduce number of merge passes.  Use higher order merge.  Number of passes = ceil(log k (number of initial runs)) where k is the merge order. More generally, a higher-order merge reduces the cost of the optimal merge tree.

Improve Run Merging Overlap input, output, and internal merging. DISK MEMORY DISK

Steady State Operation Read from disk Write to disk Merge

DISK MEMORY DISK Partitioning Of Memory Need exactly 2 output buffers. I1I0O0O1 Loser Tree …Ib Need at least k+1 (k is merge order) input buffers. 2k input buffers suffice.

Number Of Input Buffers When 2 input buffers are dedicated to each of the k runs being merged, 2k buffers are not enough! Input buffers must be allocated to runs on an as needed basis.

Buffer Allocation When ready to read a buffer load, determine which run will exhaust first.  Examine key of the last record read from each of the k runs.  Run with smallest last key read will exhaust first.  Use an enforceable tie breaker. Next buffer load of input is to come from run that will exhaust first, allocate an input buffer to this run.

Buffer Layout Output buffers Input buffer queues k=9 F0F1F2F3F4F5F6F7F8 R0R1R2R3R4R5R6R7R8 Pool of free input buffers

Initialize To Merge k Runs Initialize k queues of input buffers, 1 queue per run, 1 buffer per run. Input one buffer load from each of the k runs. Put k – 1 unused input buffers into pool of free buffers. Set activeOutputBuffer = 0. Initiate input of next buffer load from first run to exhaust. Use remaining unused input buffer for this input.

The Method kWayMerge k-way merge from input queues to the active output buffer. Merge stops when either the output buffer gets full or when an end-of-run key is merged into the output buffer. If merge hasn’t stopped and an input buffer gets empty, advance to next buffer in queue and free empty buffer.

Merge k Runs repeat kWayMerge; wait for input/output to complete; add new input buffer (if any) to queue for its run; determine run that will exhaust first; if (there is more input from this run) initiate read of next block for this run; initiate write of active output buffer; activeOutputBuffer = 1 – activeOutputBuffer; until end-of-run key merged;

What Can Go Wrong? k-way merge from input queues to the active output buffer. Merge stops when either the output buffer gets full or when an end-of-run key is merged into the output buffer. If merge hasn’t stopped and an input buffer gets empty, advance to next buffer in queue and free empty buffer. There may be no next buffer in the queue.

What Can Go Wrong? repeat kWayMerge; wait for input/output to complete; add new input buffer (if any) to queue for its run; determine run that will exhaust first; if (there is more input from this run) initiate read of next block for this run; initiate write of active output buffer; activeOutputBuffer = 1 – activeOutputBuffer; until end of run key merged; There may be no free input buffer to read into.

If merge hasn’t stopped and an input buffer gets empty, advance to next buffer in queue and free empty buffer. There may be no next buffer in the queue. If this type of failure were to happen, using two different and valid analyses, we will end up with inconsistent counts of the amount of data available to kWayMerge. Data available to kWayMerge is data in  Input buffer queues.  Active output buffer.  Excludes data in buffer being read or written.

No Next Buffer In Queue repeat kWayMerge; wait for input/output to complete; add new input buffer (if any) to queue for its run; determine run that will exhaust first; if (there is more input from this run) initiate read of next block for this run; initiate write of active output buffer; activeOutputBuffer = 1 – activeOutputBuffer; until end-of-run key merged; Exactly k buffer loads available to kWayMerge.

If merge hasn’t stopped and an input buffer gets empty, advance to next buffer in queue and free empty buffer. There may be no next buffer in the queue. Alternative analysis of data available to kWayMerge at time of failure.  < 1 buffer load in active output buffer  <= k – 1 buffer loads in remaining k – 1 queues  Total data available to k-way merge is < k buffer loads.

Suppose there is no free input buffer. One analysis will show there are exactly k + 1 buffer loads in memory (including newly read input buffer) at time of failure. Another analysis will show there are > k + 1 buffer loads in memory at time of failure. Note that at time of failure there is no buffer being read or written. There may be no free input buffer to read into. initiate read of next block for this run;

No Free Input Buffer repeat kWayMerge; wait for input/output to complete; add new input buffer (if any) to queue for its run; determine run that will exhaust first; if (there is more input from this run) initiate read of next block for this run; initiate write of active output buffer; activeOutputBuffer = 1 – activeOutputBuffer; until end-of-run key merged; Exactly k + 1 buffer loads in memory.

Alternative analysis of data in memory.  1 buffer load in the active output buffer.  1 input queue may have an empty first buffer.  Remaining k – 1 input queues have a nonempty first buffer.  Remaining k input buffers must be in queues and full.  Since k > 1, total data in memory is > k + 1 buffer loads. There may be no free input buffer to read into. initiate read of next block for this run;

Minimize Wait Time For I/O To Complete Time to fill an output buffer ~ time to read a buffer load ~ time to write a buffer load

Initializing For Next k-way Merge Change if (there is more input from this run) initiate read of next block for this run; to if (there is more input from this run) initiate read of next block for this run; else initiate read of a block for the next k-way merge;