Chapter 7: Cosequential Processing and the Sorting of Large Files

Cosequential Operations. Coordinated processing of two or more sequential lists to produce a single output list. Kinds of operations: merging (union), matching (intersection), and combinations of the two. 2019-04-16 K.O. Lee

Matching Operation. Output the names common to the two lists (a match, or intersection). Four steps: 1. Initializing. 2. Synchronizing. 3. Handling end-of-file conditions. 4. Recognizing errors.

Matching Operation (2). Algorithm: p261 Figure 7.2. The core is a three-way conditional statement: if NAME_1 < NAME_2, read the next name from LIST_1; else if NAME_1 > NAME_2, read the next name from LIST_2; else output the name and read the next name from both lists.

Matching Operation (3). Key to the algorithm: after every step, control always returns to the head of the main loop. End-of-file condition: the loop tests the MORE_NAMES_EXIST flag and runs until either of the two lists reaches end-of-file.
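The three-way comparison and the end-of-file test described above can be sketched in Python. This is a minimal illustration, not the code of Figure 7.2: the function name match is hypothetical, both lists are assumed to be sorted in memory with no duplicates inside a list, and the loop condition stands in for the MORE_NAMES_EXIST flag.

```python
def match(list_1, list_2):
    """Return the names that appear in both sorted lists (an intersection)."""
    out = []
    i = j = 0
    # The loop condition plays the role of MORE_NAMES_EXIST:
    # stop as soon as either list is exhausted.
    while i < len(list_1) and j < len(list_2):
        name_1, name_2 = list_1[i], list_2[j]
        if name_1 < name_2:
            i += 1                 # read the next name from LIST_1
        elif name_1 > name_2:
            j += 1                 # read the next name from LIST_2
        else:
            out.append(name_1)     # names match: output once
            i += 1                 # and advance both lists
            j += 1
    return out
```

Every branch falls through to the top of the loop, so the end-of-file test lives in exactly one place.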

Merging Two Lists. Based on the matching operation: p264 Figure 7.5. Differences: the merge must read each of the lists completely, so the behavior of MORE_NAMES_EXIST changes. At end-of-file a list's current name is set to HIGH_VALUE, a value that comes after all legal input values in the file's ordered sequence.
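A small Python sketch of the merge, under stated assumptions: the names merge and next_name are hypothetical, the inputs are sorted lists of strings, and HIGH_VALUE is represented by the highest Unicode code point, which is assumed never to occur as a real name. It is not the code of Figure 7.5.

```python
HIGH_VALUE = "\U0010FFFF"  # assumed sentinel: sorts after every legal name


def next_name(it):
    """Return the next name, or HIGH_VALUE once the list is exhausted."""
    return next(it, HIGH_VALUE)


def merge(list_1, list_2):
    """Merge two sorted lists; a name present in both is output once."""
    it_1, it_2 = iter(list_1), iter(list_2)
    name_1, name_2 = next_name(it_1), next_name(it_2)
    out = []
    # Continue until BOTH lists have reached end-of-file.
    while name_1 != HIGH_VALUE or name_2 != HIGH_VALUE:
        if name_1 < name_2:
            out.append(name_1)
            name_1 = next_name(it_1)
        elif name_1 > name_2:
            out.append(name_2)
            name_2 = next_name(it_2)
        else:
            out.append(name_1)
            name_1 = next_name(it_1)
            name_2 = next_name(it_2)
    return out
```

Because an exhausted list always compares high, the three-way comparison drains the other list without any special end-of-file branch.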

Cosequential Processing Model. Assumption: two or more input files are processed in a parallel fashion. Comment: the output may be the same file as one of the input files. Assumption: each file is sorted. Comment: it is not necessary that all files have the same record structure.

Cosequential Processing Model (2). Assumption: a high key value and a low key value must exist. Comment: this is not strictly necessary, but it decreases complexity. Assumption: records are in logical sorted order. Comment: the physical ordering can still have a large impact on processing.

Cosequential Processing Model (3). Assumption: for each file there is only one current record. Comment: this does not prohibit looking ahead or looking back, but such operations should be restricted to subprocedures. Assumption: records should be manipulated only in internal memory. Comment: a program cannot alter a record in place on secondary storage.

Cosequential Processing Model (4). Components. Initialization: read the first record from each of the files. Synchronization: loop as long as relevant records remain. Selection: performed inside the main synchronization loop. Use high values as the end-of-file condition, so that no special code is needed to deal with end-of-file.

Cosequential Processing Model (5). Components, continued. I/O and error detection are relegated to subprocesses, which hide the details. The result is simple and robust. Example: General Ledger Program, pp. 268~276.

Multiway Merging. K-way merge: merge K input lists to create a single, ordered output list (p277 Figure 7.16). Comparing all K current keys in a loop works well when K is less than 8 or so.
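The straightforward K-way merge can be sketched as a loop that scans all K current keys for the lowest one. This is an illustrative sketch, not the code of Figure 7.16: the name k_way_merge is hypothetical, keys are assumed to be numbers, infinity serves as the high value, and duplicate keys are kept rather than collapsed.

```python
HIGH_VALUE = float("inf")  # assumed sentinel for an exhausted list


def k_way_merge(lists):
    """Merge K sorted lists of numbers by scanning the K current keys."""
    iters = [iter(lst) for lst in lists]
    current = [next(it, HIGH_VALUE) for it in iters]
    out = []
    while current:
        # Linear scan: K - 1 comparisons per output key.
        k = min(range(len(current)), key=lambda i: current[i])
        if current[k] == HIGH_VALUE:
            break                      # every list is exhausted
        out.append(current[k])
        current[k] = next(iters[k], HIGH_VALUE)
    return out
```

Each output key costs a scan over all K lists, which is why this approach is attractive only for small K.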

Multiway Merging (2). Selection tree: for a large K-way merge the set of comparisons becomes expensive, so a selection tree trades space for time. It is a kind of tournament tree: each higher-level node represents the winner of the two descendant keys, and the depth of the tree is log2(K).
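One possible way to realize such a tournament tree is an array in which node i has children 2i and 2i+1 and every node stores the index of the winning list. The sketch below is an assumption-laden illustration (hypothetical name tournament_merge, numeric keys, infinity as the high value), not the book's implementation.

```python
import math

INF = math.inf  # assumed high value for an exhausted list


def tournament_merge(lists):
    """K-way merge using a winner (selection) tree stored in an array."""
    k = len(lists)
    if k == 0:
        return []
    iters = [iter(lst) for lst in lists]
    keys = [next(it, INF) for it in iters]   # current key of each list
    # tree[k .. 2k-1] are leaves, tree[1 .. k-1] are internal nodes.
    # Each entry holds the index of the list that wins (has the lowest key).
    tree = [0] * (2 * k)
    for i in range(k):
        tree[k + i] = i
    for node in range(k - 1, 0, -1):
        a, b = tree[2 * node], tree[2 * node + 1]
        tree[node] = a if keys[a] <= keys[b] else b
    out = []
    while keys[tree[1]] != INF:
        w = tree[1]                           # overall winner sits at the root
        out.append(keys[w])
        keys[w] = next(iters[w], INF)
        # Replay only the matches on the path from the changed leaf to the root.
        node = (k + w) // 2
        while node >= 1:
            a, b = tree[2 * node], tree[2 * node + 1]
            tree[node] = a if keys[a] <= keys[b] else b
            node //= 2
    return out
```

After each output only about log2(K) comparisons are replayed, instead of the K - 1 comparisons of a full scan.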

Selection Tree

Sorting in RAM. Can we improve on the time of a RAM sort? One answer is to perform some of the parts in parallel. A selection tree is good, but it cannot be used to sort an entire file. With heapsort, sorting and reading can occur in parallel, with all of the keys kept in a heap.

Heapsort. Heap: the key of each child node is greater than or equal to the key of its parent, and the children of node i are nodes 2i and 2i+1 (Fig 7.20, Fig 7.21). Overlapping processing with I/O: use more than one buffer (p284 Figure 7.22) and fill the next buffer while building the heap. Procedure for outputting: Fig 7.23.
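The heap property and the 2i / 2i+1 child layout can be sketched as follows. This is a minimal in-memory illustration with hypothetical function names; index 0 of the list is left unused so that the arithmetic matches the slide, and the overlap with buffer reads is only indicated in a comment.

```python
def heap_insert(heap, key):
    """Append key and sift it up; heap[0] is unused so node i has children 2i, 2i+1."""
    heap.append(key)
    i = len(heap) - 1
    while i > 1 and heap[i // 2] > heap[i]:
        heap[i // 2], heap[i] = heap[i], heap[i // 2]
        i //= 2


def heap_remove_min(heap):
    """Remove and return the smallest key, then restore the heap property."""
    smallest = heap[1]
    last = heap.pop()
    n = len(heap) - 1              # number of keys still in the heap
    if n >= 1:
        heap[1] = last
        i = 1
        while 2 * i <= n:
            child = 2 * i
            if child + 1 <= n and heap[child + 1] < heap[child]:
                child += 1         # pick the smaller of the two children
            if heap[i] <= heap[child]:
                break
            heap[i], heap[child] = heap[child], heap[i]
            i = child
    return smallest


def heapsort(keys):
    heap = [None]                  # slot 0 is a placeholder
    # Keys can be inserted one at a time as they arrive, which is what
    # lets heap building overlap with reading the next input buffer.
    for key in keys:
        heap_insert(heap, key)
    return [heap_remove_min(heap) for _ in range(len(keys))]
```

Insertion never needs the whole input up front, which is the property that lets the sort start before reading has finished.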

Sorting Large Files on Disk. Keysort shortcomings: the cost of seeking, and it cannot sort really large files because all key/pointer pairs must fit in RAM. Alternative: the multiway merge algorithm applied to runs, where a run is a sorted subfile.

Sorting Large Files on Disk (2)

Sorting Large Files on Disk (3). Multiway merging can be extended to files of any size. Reading during the run creation step is sequential, so there is no seeking. Reading and writing during merging are also sequential. I/O can be overlapped with sorting by using heapsort. Tape can be used.

How Much Time Does a Merge Sort Take? Merge sort vs keysort: pp. 287~290 (about 10 minutes versus 5 hours). Four steps: 1. reading records and forming runs; 2. writing sorted runs; 3. reading sorted runs for merging; 4. writing the sorted file.
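The four steps can be sketched end to end in Python. This is a simplified illustration under assumptions that are not in the slides: records are newline-terminated text lines, the function name external_merge_sort and the run_size parameter are hypothetical, runs are kept in temporary files, and the standard library's heapq.merge performs the K-way merge. It says nothing about timing.

```python
import heapq
import tempfile


def external_merge_sort(lines, run_size):
    """Sort lines using runs of at most run_size (>= 1) lines each."""
    runs = []
    buffer = []

    def write_run():
        buffer.sort()                          # step 1: form a run in memory
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(buffer)                 # step 2: write the sorted run
        run.seek(0)                            # rewind so it can be read back
        runs.append(run)
        buffer.clear()

    for line in lines:
        buffer.append(line)
        if len(buffer) == run_size:
            write_run()
    if buffer:
        write_run()

    try:
        # Step 3: read the runs back and merge them.
        # Step 4 would write to an output file; a list is returned here
        # only to keep the sketch short.
        return list(heapq.merge(*runs))
    finally:
        for run in runs:
            run.close()
```

For example, external_merge_sort(["d\n", "a\n", "c\n", "b\n"], 2) forms two runs of two lines each and then merges them.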

Sorting a Very Large File. Kinds of I/O. Sort phase: the I/O is sequential if heapsort is used, so there is no room for improvement. Merge phase: the I/O is random access, in proportion to the number of runs. Way to improve performance: cut down the number of random accesses in the merge phase.

Cost of Increasing the File Size. For a K-way merge of K runs, the buffer size for each run is 1/K * (size of RAM) = 1/K * (size of each run). Each run therefore needs K seeks, and the merge operation requires K^2 seeks in total. Measured in seeks, merge sort is an O(K^2) operation.
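The K^2 growth can be checked with a little arithmetic. The function below is a hypothetical sketch that counts only the seeks needed to read the runs during a one-step merge, and it assumes every run exactly fills RAM and that there are no more runs than records that fit in RAM; the record counts used afterwards are made-up illustration values.

```python
import math


def one_step_merge_seeks(file_records, ram_records):
    """Seeks needed to read all runs in a single K-way merge."""
    k = math.ceil(file_records / ram_records)      # number of runs
    buffer_records = ram_records // k              # each run gets 1/K of RAM
    seeks_per_run = math.ceil(ram_records / buffer_records)
    return k * seeks_per_run


small = one_step_merge_seeks(500_000, 10_000)      # K = 50
large = one_step_merge_seeks(1_000_000, 10_000)    # K = 100
print(small, large, large / small)
```

Doubling the file doubles K, and the seek count goes from 2,500 to 10,000, a factor of four.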

Cost of Increasing the File Size (2). Ways to reduce time: 1. use more hardware; 2. merge in more than one step, reducing the order of each merge and increasing the buffer size for each run; 3. increase the length of the initial sorted runs; 4. overlap I/O operations.

Hardware-based Improvements. Possible configuration changes: increasing the amount of RAM, increasing the number of disk drives, and increasing the number of I/O channels.

Multiple-Step Merging. Break the original set of runs into small groups and merge the runs in these groups separately. This gives fewer seeks but adds transmission time for the second pass: every record is read twice, once to form the intermediate runs and once to form the final sorted file.

Multiple-Step Merging (2). Essence of multiple-step merging: increase the buffer space available to each run. Trade-off: an extra pass over the data versus fewer random accesses. More than two steps? The trade-off is reduced seek and rotational time versus increased transmission time.
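The trade-off can be made concrete by counting read seeks for a two-step merge. This is a hypothetical back-of-the-envelope model, not a measurement: it assumes K equal runs that each fill RAM, groups of equal size, and it ignores output seeks and transmission time. The values K = 800 and groups of 32 are illustration values.

```python
def one_step_seeks(k):
    """Single K-way merge: each of K runs is read in K pieces."""
    return k * k


def two_step_seeks(k, group):
    """Merge K runs in groups of `group`, then merge the intermediate runs."""
    assert k % group == 0, "sketch assumes equal-sized groups"
    # Step 1: within a group each run gets 1/group of RAM,
    # so every one of the K runs is read in `group` pieces.
    first = k * group
    # Step 2: there are k // group intermediate runs, each `group` times
    # the size of RAM, read through a buffer of RAM / (k // group);
    # that works out to k pieces per intermediate run.
    second = (k // group) * k
    return first + second


print(one_step_seeks(800), two_step_seeks(800, 32))
```

Under this model the one-step merge needs 640,000 read seeks and the two-step merge 45,600, at the price of reading every record a second time.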

Increasing Run Lengths. Longer initial runs mean fewer total runs, which means bigger buffers for each run and therefore fewer seeks. Technique: replacement selection.

Replacement Selection. Idea: always select the key in memory that has the lowest value, output that key, and replace it with a new key from the input list. Implementation: p299, p300 Figure 7.27.

Replacement Selection (2). What about a key that arrives in memory too late to be output into its proper position? Use a second heap to hold it for the next run (p301 Figure 7.28).
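Replacement selection with a holding area for late keys can be sketched in Python. This is an illustration under assumptions: the name replacement_selection is hypothetical, the standard heapq module stands in for the primary heap, and a plain list that is heapified when the run ends stands in for the second heap. It is not the code of Figures 7.27 and 7.28.

```python
import heapq
from itertools import islice

_END = object()  # marks the end of the input


def replacement_selection(records, p):
    """Split records into sorted runs using p memory locations."""
    it = iter(records)
    heap = list(islice(it, p))         # fill the p locations
    heapq.heapify(heap)
    runs, run, pending = [], [], []
    while heap:
        key = heapq.heappop(heap)      # lowest key currently in memory
        run.append(key)                # output it to the current run
        nxt = next(it, _END)
        if nxt is not _END:
            if nxt >= key:
                heapq.heappush(heap, nxt)   # still fits in this run
            else:
                pending.append(nxt)         # too late: hold for the next run
        if not heap:
            runs.append(run)           # current run is complete
            run = []
            heap, pending = pending, []
            heapq.heapify(heap)        # held keys seed the next run
    return runs
```

With 3 memory locations and the input 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16, the sketch produces three runs of lengths 4, 5, and 4, each longer than the memory that formed it.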

Replacement Selection (4). Two questions. 1. Given P locations in memory, how long a run can we expect replacement selection to produce, on average? See pp. 301~302. 2. What are the costs of using replacement selection? See pp. 303~304. The result: less than 1/3 as many seeks as RAM sorting.