Lecture#12: External Sorting (R&G, Ch13)

Slides:



Advertisements
Similar presentations
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
Advertisements

CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications C. Faloutsos – A. Pavlo Lecture#12: External Sorting.
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting “There it was, hidden in alphabetical order.” Rita Holt R&G Chapter 13.
External Sorting CS634 Lecture 10, Mar 5, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
External Sorting R & G Chapter 13 One of the advantages of being
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
CS 4432lecture #31 CS4432: Database Systems II Lecture #3 Professor Elke A. Rundensteiner.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
BTrees & Sorting 11/3. Announcements I hope you had a great Halloween. Regrade requests were due a few minutes ago…
Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke 1 External Sorting Chapter 13.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and Chapter 11 (Sec ): G-M et al. (R2) OR.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
1 Database Systems ( 資料庫系統 ) December 7, 2011 Lecture #11.
ZA classic problem in computer science! zData requested in sorted order ye.g., find students in increasing cap order zSorting is used in many applications.
Chapter 15 A External Methods. © 2004 Pearson Addison-Wesley. All rights reserved 15 A-2 A Look At External Storage External storage –Exists beyond the.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapters 13: 13.1—13.5.
CPSC Why do we need Sorting? 2.Complexities of few sorting algorithms ? 3.2-Way Sort 1.2-way external merge sort 2.Cost associated with external.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
External Sorting Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.
Database System Architecture and Implementation
Database Applications (15-415) DBMS Internals- Part V Lecture 14, Oct 18, 2016 Mohammad Hammoud.
CS 440 Database Management Systems
Database Systems (資料庫系統)
CS522 Advanced database Systems
External Sorting Chapter 13
External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and
Evaluation of Relational Operations
Lecture 11: DMBS Internals
Database Applications (15-415) DBMS Internals- Part V Lecture 17, March 20, 2018 Mohammad Hammoud.
Database Applications (15-415) DBMS Internals- Part VI Lecture 18, Mar 25, 2018 Mohammad Hammoud.
Implementation of Relational Operations (Part 2)
Database Management Systems (CS 564)
Database Applications (15-415) DBMS Internals- Part VI Lecture 15, Oct 23, 2016 Mohammad Hammoud.
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
Faloutsos/Pavlo C. Faloutsos – A. Pavlo Lecture#13: Query Evaluation
External Sorting Chapter 13
Selected Topics: External Sorting, Join Algorithms, …
CS222P: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Lecture 2- Query Processing (continued)
Lecture 28: Index 3 B+ Trees
CS222: Principles of Data Management Lecture #10 External Sorting
External Sorting.
CS222P: Principles of Data Management Lecture #10 External Sorting
CENG 351 Data Management and File Structures
Database Systems (資料庫系統)
Lecture 20: Indexes Monday, February 27, 2006.
External Sorting Chapter 13
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
External Sorting Dina Said
CSE 190D Database System Implementation
Presentation transcript:

Lecture#12: External Sorting (R&G, Ch13) Faloutsos/Pavlo CMU 15-415/615 Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos – A. Pavlo Lecture#12: External Sorting (R&G, Ch13)

Last Class Static Hashing Extendible Hashing Linear Hashing Faloutsos/Pavlo CMU - 15-415/615 Last Class Static Hashing Extendible Hashing Linear Hashing Hashing vs. B-trees Faloutsos/Pavlo CMU SCS 15-415/615

Today's Class Sorting Overview Two-way Merge Sort External Merge Sort Faloutsos/Pavlo CMU - 15-415/615 Today's Class Sorting Overview Two-way Merge Sort External Merge Sort Optimizations B+trees for Sorting Faloutsos/Pavlo CMU SCS 15-415/615

Why Do We Need to Sort? Bulk loading B+ tree index. SELECT...ORDER BY Bulk loading B+ tree index. Duplicate elimination (DISTINCT) SELECT...GROUP BY Sort-merge join algorithm. Faloutsos/Pavlo CMU SCS 15-415/615

Why Do We Need to Sort? What do we do if the data that we want to sort is larger than the amount of memory that is available to the DBMS? What if multiple queries are running at the same time and they all want to sort data? Why not just use virtual memory? Faloutsos/Pavlo CMU SCS 15-415/615

Overview Files are broken up into N pages. The DBMS has a finite number of B fixed-size buffers. Let’s start with a simple example… Faloutsos/Pavlo CMU SCS 15-415/615

Two-way External Merge Sort Pass 0: Read a page, sort it, write it. Only one buffer page is used Pass 1,2,3,…: requires 3 buffer pages Merge pairs of runs into runs twice as long Three buffer pages used. Memory Memory Memory Disk Faloutsos/Pavlo

Two-way External Merge Sort Faloutsos 15-415/615 Two-way External Merge Sort NULL Input file 3,4 6,2 9,4 8,7 5,6 3,1 2 Each pass we read + write each page in file. # of passes = ⌈ log2 N ⌉ + 1 So total I/O cost is: = 2N ∙ (⌈ log2 N ⌉ + 1) Divide and conquer: Sort subfiles and merge PASS #0 1-Page Runs 3,4 2,6 4,9 7,8 5,6 1,3 2 PASS #1 2-Page Runs 2,3 4,6 4,7 8,9 1,3 5,6 2 PASS #2 4-Page Runs 4,4 6,7 8,9 2,3 1,2 3,5 6 PASS #3 8-Page Runs 1,2 2,3 3,4 4,5 6,6 7,8 9 Faloutsos/Pavlo 6

Two-way External Merge Sort This only requires three buffer pages. Even if we have more buffer space available, this algorithm does not utilize it effectively. Let’s look at the general algorithm… Faloutsos/Pavlo 15-415/615

General External Merge Sort Faloutsos 15-415/615 General External Merge Sort B>3 buffer pages. How to sort a file with N pages? Faloutsos/Pavlo 15-415/615 7

General External Merge Sort Faloutsos 15-415/615 General External Merge Sort Pass 0: Use B buffer pages. Produce ⌈N / B⌉ sorted runs of size B. Pass 1,2,3,…: Merge B-1 runs. Number of passes: 1+⌈logB-1⌈N / B⌉⌉ Cost = 2N∙(# of passes) B Memory Pages Disk Disk ⋮ ⋮ Page 2 Page 3 Page B-1 ⋮ Page 1 Output Faloutsos/Pavlo 15-415/615 7

✔ Example Sort 108 page file with 5 buffer pages: Faloutsos 15-415/615 Example Sort 108 page file with 5 buffer pages: Pass 0: ⌈108 / 5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages) Pass 1: ⌈22 / 4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages ✔ 1+⌈logB-1⌈N / B⌉⌉ = 1+⌈log4 22⌉ = 1+⌈2.229...⌉  4 passes Faloutsos/Pavlo 15-415/615 8

Today's Class Sorting Overview Two-way Merge Sort External Merge Sort Faloutsos/Pavlo CMU - 15-415/615 Today's Class Sorting Overview Two-way Merge Sort External Merge Sort Optimizations B+trees for Sorting Faloutsos/Pavlo CMU SCS 15-415/615

Blocked I/O & Double-buffering So far, we assumed random disk access. The cost changes if we consider that runs are written (and read) sequentially. What could we do to exploit it? Blocked I/O: Exchange a few random reads for several sequential ones using bigger pages. Double-buffering: Mask I/O delays with prefetching. Faloutsos/Pavlo 15-415/615

Blocked I/O Normally, B buffers of size (say) 4K Faloutsos 15-415/615 Blocked I/O Normally, B buffers of size (say) 4K INSTEAD: B/b buffers, of size ‘b’ kilobytes INPUT 1 OUTPUT Disk ⋮ ⋮ INPUT 2 Disk Faloutsos/Pavlo 7

Blocked I/O Normally, B buffers of size (say) 4K Faloutsos 15-415/615 Blocked I/O Normally, B buffers of size (say) 4K INSTEAD: B/b buffers, of size ‘b’ kilobytes Advantages? Disadvantages? Fewer random disk accesses because some of them are sequential. Smaller fanout may cause more passes. Less parallelization. Faloutsos/Pavlo 15-415/615 7

Blocked I/O & Double-buffering So far, we assumed random disk access Cost changes, if we consider that runs are written (and read) sequentially What could we do to exploit it? Blocked I/O: Exchange a few random reads for several sequential ones using bigger pages. Double-buffering: Mask I/O delays with prefetching. Faloutsos/Pavlo 15-415/615

Double-buffering Normally, when, say ‘INPUT1’ is exhausted Faloutsos 15-415/615 Double-buffering Normally, when, say ‘INPUT1’ is exhausted We issue a “read” request and then we wait… INPUT 1 OUTPUT Disk ⋮ ⋮ INPUT 2 Disk 6 Main memory buffers Faloutsos/Pavlo 7

Faloutsos 15-415/615 Double-buffering We can prefetch the next block into a spare buffer using a asynchronous thread. INPUT 1 OUTPUT Disk Prefetch ⋮ ⋮ INPUT 2 Disk Prefetch 6 Main memory buffers Faloutsos/Pavlo 7

Double-buffering This potentially requires more passes. Faloutsos 15-415/615 Double-buffering This potentially requires more passes. But in practice, most files still sorted in 2-3 passes. INPUT 1 OUTPUT Disk Prefetch ⋮ ⋮ INPUT 2 Disk Prefetch 6 Main memory buffers Faloutsos/Pavlo 7

Today's Class Sorting Overview Two-way Merge Sort External Merge Sort Faloutsos/Pavlo CMU - 15-415/615 Today's Class Sorting Overview Two-way Merge Sort External Merge Sort Optimizations B+trees for Sorting Faloutsos/Pavlo CMU SCS 15-415/615

Using B+ Trees for Sorting Faloutsos 15-415/615 Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree index on sorting column(s). Idea: Can retrieve records in order by traversing leaf pages. Is this a good idea? Cases to consider: B+ tree is clustered B+ tree is not clustered Good Idea! Could be Bad! Faloutsos/Pavlo 15-415/615 15

Clustered B+Tree for Sorting Faloutsos 15-415/615 Clustered B+Tree for Sorting Traverse to the left-most leaf, then retrieve all leaf pages. Index (Directs search) Data Entries ("Sequence set") Data Records Always better than external sorting! Faloutsos/Pavlo 15-415/615 16

Unclustered B+Tree for Sorting Faloutsos 15-415/615 Unclustered B+Tree for Sorting Chase each pointer to the page that contains the data. Index (Directs search) Data Entries ("Sequence set") Data Records In general, one I/O per data record! Faloutsos/Pavlo 15-415/615 16

Summary External sorting is important. Faloutsos 15-415/615 Summary External sorting is important. External merge sort minimizes disk I/O: Pass 0: Produces sorted runs of size B (# buffer pages). Later Passes: Merge runs. Clustered B+ tree is good for sorting; unclustered tree is usually bad. Faloutsos/Pavlo 15-415/615 19