1 Sort-Merge Join Implementation Details for Minibase by Demetris Zeinalipour University of California – Riverside Department of Computer Science & Engineering cs179G – Database Project Phase #4
2 Sort-Merge Join Review Query with JOIN SELECT S.name FROM Sailors S, Reserves R WHERE S.sid = R.id SIDNAMEAGE 1Carl23 2Tom26 3Peter23 SIDBID Sailors SReserves R How to implement the JOIN operator? 1)Using Nested Loop Joins. tuple-at-a-time: For each tuple in S check with every tuple in R variation: page-at-a-time (every page contains several records) 2)Using Block Nested Loop Join. Idea: Load in memory smaller relation e.g. S and then scan relation S on a page-at-a-time basis. Notice: The idea can be generalized even if the smaller relation doesn’t fit in memory 3)Using Sort-Merge Join Idea: Sort both relations using an external sort algorithm and then merge the relations. Name Carl Tom
3 External-Sorting Review 1/3 1.When? If the data to be sorted is too big to fit in main memory then we need an external sort algorithm. 2.Simple approach: 2-way Merge-Sort HeapFile 1 Memory inpage1 outpage inpage2 Use quicksort internally Inpage1:[ 11,7, 2,1] Inpage2:[19, 7,5,4 ] } Steps 1)Fetch a page to memory 2)Sort Page in memory 3)Write page back to disk 4)Merge pages levelwise (see next page)
4 External-Sorting Review 2/3 2-way mergesort => Passes: log 2 N+1=4 Cost: 2N(log 2 N+1)=64 I/O Expensive In project we use External Sort (sort.C) which utilizes all available buffer pages (10) and reduces the number of Passes and the I/O cost
5 External-Sorting Review 3/3 Idea similar with 2-way Mergesort with the difference that we utilize B-1 buffer pages (B>3) Heapf Memory 3 outpage Use quicksort internally In this project you should call: Sort() from within the sortMerge constructor before proceeding to the merge phase Don’t worry about this implementation as it is already implemented in sort.C 4
6 The Merging Phase of SMJoin Now that the two relations R and S are sorted we must merge them. Merge using 2 iterators to move from page to page and from record to record Works fine ONLY if both R and S have NO duplicates. (e.g if sid is a foreign key in a 1:1 relation Sailor, Address) R.sid S.sid TrTr GsGs Forward the iterator (Tr or Gs) that has the smallest value until Tr=Gs then output value Page Heapfile
7 The Merging Phase of SMJoin What if relations have duplicates? (check example) (either both of them or just one of them) Therefore we need to use 3 iterators (1 for Marking) R.sid S.sid TrTr GsGs TsTs The one extra Ts iterator will be used as soon Tr=Gs at which point we will move Ts to Gs position and use Ts to iterate S The full version of the algorithm is shown in 2 slides
8 HeapFiles (heapfile.h, scan.C) 1.Database File: Organization of various pages into a logical. 2.In a heapfile pages are unordered within the file 3.In order to Scan the pages (records) of a heapfile we will use scan.C Example: // Sorting a heapfile (after : heapR stored in Catalog) Sort(“unsortheapR”, “heapR”,..)rest of params from SortMerge() // Creating a scan on a heapfile HeapFile heapR(“heapR”, status); Scan * Rscan = heapR.openScan(Rstatus); // Scan until DONE Rstatus = Rscan->getNext(RID rid,char* RecR, int lenR); // Inserting results in Out Heapfile HeapFile heapOut(“heapOut”, status); memmove(char *RecO, char* RecR, int lenR);//do same for S heapOut.insertRecord(char *recptr, lenR+lenS, RID&outRID); All the work of Pinning/Unpinning pages is done from within Scan since it locates the directoryheader Page from the catalog and proceeds from there on with getNext()
9 The Merging Phase of SMJoin The full algorithm (this is all you need to implement) } Fast forward R } Fast forward S } Sort R & S } Init Iterators Consider Scan::Position()
10 } //sortMerge Implementation of the SMJoin Algorithm The Big Picture main.C (or smjoin_main.C same thing) => SMJTester.C (runTests()) => test1() (this runs all 6 tests actually) => createFiles(); // creates 5 Heapfiles using // the data of same constant integer arrays (data0,…data4) => test(i) { // inside sortMerge Constructor heapfileRNo_of_cols joinColumn Out_heapfile heapfileS minirel.hNo_of_pages available sortMerge sm(“file0”, 2, attrType, attrsize, 0, “file1”, 2, attrType, attrsize, 0, “test1”, 10, Ascending, s); { // inside sortMerge Constructor => Sort( (infile) “file0”, (outfile) “BRfile1”, (no_of_cols)2, attrType, attrsize, (joincolumn)0, (no_of_pages_available)10, (status)s )
11 Where to start from? 1.Start out by reading Sort-Merge Join from book (Chap nd edition, rd edition) 2.Study SMJTester.C which contains the tests and understand the test scenarios. 3.Start Implementing sortMerge.C by invoking the Sort() constructor etc. 4.Have a closer look at the new classes that you have: Sort.C, Heapfile.C and Scan.C