Equality Join: R ⋈ (R.A = S.B) S. Relation R: M pages, P_R records per page. Relation S: N pages, P_S records per page.

Presentation transcript:


Simple Nested Loops Join
for each tuple r in R do
    for each tuple s in S do
        if r.A == s.B then add <r, s> to result
– Scan the outer relation R. For each tuple r in R, scan the entire inner relation S.
– Ignore the CPU cost and the cost of writing the result to disk.
– R has M pages with P_R tuples per page. S has N pages with P_S tuples per page.
– Nested loops join cost = M + M * P_R * N, where the first term M is the cost to scan R.
– Memory (figure): 1 page for R, 1 page for S, and 1 output page.
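The loop and its cost formula can be sketched in Python as follows. This is a minimal in-memory illustration, not a disk-based implementation: relations are modelled as lists of dicts, and the function and parameter names (`simple_nested_loops_join`, `snlj_io_cost`, `p_R`) are made up for this example.

```python
def simple_nested_loops_join(R, S, a, b):
    """Join lists of dict tuples R and S on r[a] == s[b]."""
    result = []
    for r in R:          # scan the outer relation once
        for s in S:      # rescan the whole inner relation for every r
            if r[a] == s[b]:
                result.append((r, s))
    return result


def snlj_io_cost(M, p_R, N):
    """I/O cost M + M * p_R * N: scan R once, scan S once per tuple of R."""
    return M + M * p_R * N


R = [{"A": 1}, {"A": 2}]
S = [{"B": 2}, {"B": 2}, {"B": 3}]
print(simple_nested_loops_join(R, S, "A", "B"))
print(snlj_io_cost(M=1000, p_R=100, N=500))  # 50001000
```

The cost is dominated by the M * p_R * N term, which is why the choice of outer relation matters so little here compared with the block-based variant.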


Exercise
What is the I/O cost for using a simple nested loops join?
– Sailors: 1000 pages, 100 records/page, 50 bytes/record
– Reserves: 500 pages, 80 records/page, 40 bytes/record
Ignore the cost of writing the results. Join selectivity factor: 0.1.
Answer
If Sailors is the outer relation,
– I/O cost = 1000 + 100,000 * 500 = 50,001,000 I/Os
If Reserves is the outer relation,
– I/O cost = 500 + 40,000 * 1000 = 40,000,500 I/Os

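Block Nested Loops Join
for each block of B-2 pages of R do
    for each page of S do
        for each tuple r_i in the R block and each tuple s_j in the S page do
            if r_i.a == s_j.a then add <r_i, s_j> to result
– B: available memory in pages. B-2 pages hold the current block of R, 1 page holds the current page of S, and 1 page is the output buffer.
– Bring in the bigger relation S one page at a time.
– Cost = M + ceil(M/(B-2)) * N, where ceil(M/(B-2)) is the number of blocks of R, i.e., the number of times S is scanned.
– Cost = M + N if one of the relations fits in memory (M <= B-2); this is optimal.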
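A small Python sketch of the block-based loop and its cost, under the assumption that pages are modelled as in-memory lists of dict tuples. The names `bnlj_io_cost` and `block_nested_loops_join` are illustrative only.

```python
import math


def bnlj_io_cost(M, N, B):
    """I/O cost M + ceil(M / (B - 2)) * N for outer R (M pages), inner S (N pages)."""
    blocks_of_R = math.ceil(M / (B - 2))  # each block of R triggers one scan of S
    return M + blocks_of_R * N


def block_nested_loops_join(R_pages, S_pages, B, a, b):
    """In-memory model: pages are lists of dict tuples; B is the number of buffers."""
    result = []
    block_size = B - 2  # pages of R held in memory at once
    for start in range(0, len(R_pages), block_size):
        block = [r for page in R_pages[start:start + block_size] for r in page]
        for s_page in S_pages:  # one full scan of S per block of R
            for s in s_page:
                for r in block:
                    if r[a] == s[b]:
                        result.append((r, s))
    return result


print(bnlj_io_cost(M=500, N=1000, B=102))  # 5500
print(bnlj_io_cost(M=1000, N=500, B=102))  # 6000
```

With these numbers, putting the smaller relation on the outside gives the lower cost, because it is split into fewer blocks.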


Exercise
What is the I/O cost for using a block nested loops join?
– Sailors: 1000 pages, 100 records/page, 50 bytes/record
– Reserves: 500 pages, 80 records/page, 40 bytes/record
Ignore the cost of writing the results. Memory available: 102 blocks.
Answer
Reserves, the smaller relation, is the outer relation.
– Number of times the entire Sailors relation is brought in: ceil(500/(102-2)) = 5
– I/O cost = 500 + 5 * 1000 = 5500 I/Os

Indexed Nested Loops Join
for each tuple r in R do
    for each tuple s in S where r.A == s.B do    (found through the index on S.B)
        add <r, s> to result
– Use an index on the joining attribute of S.
– Cost = M + M * P_R * (cost of retrieving the matching tuples in S), where the first term M is the cost to scan R.
– The retrieval cost depends on the type of index and the number of matching tuples.
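As a rough illustration, the sketch below uses a Python dict as a stand-in for a hash index on the join attribute of S. The helper names and the per-probe cost parameter `probe_cost` are assumptions made for this example; a real index lives on disk and its probe cost depends on the index type.

```python
from collections import defaultdict


def build_hash_index(S, b):
    """Map each join value to the S tuples carrying it (stand-in for a hash index)."""
    index = defaultdict(list)
    for s in S:
        index[s[b]].append(s)
    return index


def index_nested_loops_join(R, S_index, a):
    """Scan R once and probe the index for each tuple instead of scanning S."""
    result = []
    for r in R:
        for s in S_index.get(r[a], []):
            result.append((r, s))
    return result


def inlj_io_cost(M, p_R, probe_cost):
    """I/O cost M + M * p_R * probe_cost."""
    return M + M * p_R * probe_cost


S = [{"B": 1, "name": "x"}, {"B": 2, "name": "y"}, {"B": 2, "name": "z"}]
R = [{"A": 2}, {"A": 5}]
print(index_nested_loops_join(R, build_hash_index(S, "B"), "A"))
print(round(inlj_io_cost(M=500, p_R=80, probe_cost=1.2 + 1)))  # 88500
```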

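Exercise
What is the I/O cost for using an indexed nested loops join?
Ignore the cost of writing the results. Memory available: 102 blocks.
There is only a hash-based, dense index on sid using Alternative 2.
Answer
Reserves is the outer relation; each of its 40,000 tuples probes the index on Sailors.
– Each probe costs about 1.2 I/Os to reach the index entry plus 1 I/O to fetch the matching Sailors tuple.
– I/O cost = 500 + 40,000 * (1.2 + 1) = 88,500 I/Os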

Grace Hash-Join
Partition phase
– Partition both relations using hash function h: R tuples in partition i will only match S tuples in partition i.
– (Figure: the original relation is read from disk through 1 input buffer, hashed by h into B-1 output buffers, and written back to disk as B-1 partitions, using B main memory buffers.)
Join (probing) phase
– Read in a partition of R and hash it using h2 (<> h!).
– Scan the matching partition of S to search for matching tuples.
– (Figure: partitions of R and S are read from disk; the hash table for partition Ri takes k <= B-2 pages, with one input buffer for Si and one output buffer for the join result, using B main memory buffers.)
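The two phases can be sketched in Python like this. It is an in-memory model only: the partitions are Python lists rather than disk files, `h` is built from Python's `hash`, and a dict plays the role of the second hash function h2. All names are chosen for illustration.

```python
def grace_hash_join(R, S, a, b, num_partitions):
    """Join lists of dict tuples R and S on r[a] == s[b] in two phases."""
    # Partition phase: the same hash function h is applied to both relations,
    # so matching tuples always land in partitions with the same number.
    def h(value):
        return hash(value) % num_partitions

    R_parts = [[] for _ in range(num_partitions)]
    S_parts = [[] for _ in range(num_partitions)]
    for r in R:
        R_parts[h(r[a])].append(r)
    for s in S:
        S_parts[h(s[b])].append(s)

    # Probing phase: build a hash table on partition R_i, then scan S_i.
    result = []
    for R_i, S_i in zip(R_parts, S_parts):
        table = {}
        for r in R_i:
            table.setdefault(r[a], []).append(r)
        for s in S_i:
            for r in table.get(s[b], []):
                result.append((r, s))
    return result


R = [{"A": i % 5} for i in range(20)]
S = [{"B": i % 3} for i in range(9)]
print(len(grace_hash_join(R, S, "A", "B", num_partitions=4)))  # 36
```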

Cost of Grace Join
Assumptions:
– Each partition fits in the B-2 pages
– The I/O cost for a read and a write is the same
– Ignore the cost of writing the join results
Disk I/O cost:
– Partitioning phase: 2*M + 2*N
– Probing phase: M + N
– Total cost: 3*(M+N)
Which one to use, block nested loops or Grace join?
– Block nested loops: Cost = M + ceil(M/(B-2)) * N
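One way to explore the "which one to use" question is to evaluate both cost formulas for a given buffer size. The sketch below does that; the function names are illustrative, and the Grace formula is only valid under the slide's assumption that every partition fits in memory.

```python
import math


def bnlj_io_cost(M, N, B):
    """Block nested loops: M + ceil(M / (B - 2)) * N."""
    return M + math.ceil(M / (B - 2)) * N


def grace_io_cost(M, N):
    """Grace hash join: 2*(M+N) to partition plus (M+N) to probe."""
    return 3 * (M + N)


def cheaper_join(M, N, B):
    """Return the name and cost of the cheaper of the two algorithms."""
    bnl = bnlj_io_cost(M, N, B)
    grace = grace_io_cost(M, N)
    if bnl <= grace:
        return ("block nested loops", bnl)
    return ("Grace hash join", grace)


print(cheaper_join(500, 1000, 102))  # ('Grace hash join', 4500)
print(cheaper_join(500, 1000, 502))  # ('block nested loops', 1500)
```

Block nested loops wins once the outer relation fits in few enough blocks that the inner relation is scanned at most twice; otherwise the fixed three passes of Grace join are cheaper.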

Exercise
What is the I/O cost for using Grace hash join?
Ignore the cost of writing the results. Assume that each partition fits in memory.
Answer
3 * (1000 + 500) = 4500 I/Os

Memory Requirement for Hash Join
Ideally, each partition fits in memory.
– To increase this chance, we need to minimize the partition size, which means maximizing the number of partitions.
Questions
– What limits the number of partitions?
– What is the minimum memory requirement?
Both the partition phase and the probing phase have to be considered.

Partition Phase
– To partition R into K partitions, we need at least K output buffers and one input buffer.
– Given B buffer pages, the maximum number of partitions is B-1.
– Assuming a uniform distribution, the size of each R partition is M/(B-1) pages.
– (Figure: the original relation is read from disk through 1 input buffer and hashed by h into B-1 output buffers, producing B-1 partitions on disk.)

Probing Phase
– The number of pages for the in-memory hash table built during the probing phase is f*M/(B-1), where f is a fudge factor that captures the increase in the hash table size relative to the partition stored in it.
– Total number of pages needed: f*M/(B-1) + 2, since one page is needed for the input buffer for S and another page for the output buffer. This total must not exceed B.
– (Figure: partitions of R and S are read from disk; the hash table for partition Ri, built with hash function h2, takes k <= B-2 pages, next to the input buffer for Si and the output buffer for the join result.)
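The memory requirement can be checked numerically. The sketch below assumes a hash table of f*M/(B-1) pages plus one input page and one output page, and searches for the smallest B that satisfies it. The default fudge factor f = 1.2 is a placeholder value chosen for this example, not a figure from the slides.

```python
def probing_phase_pages(M, B, f=1.2):
    """Pages needed while probing: hash table for one R partition + input + output."""
    return f * M / (B - 1) + 2


def min_buffer_pages(M, f=1.2):
    """Smallest B such that one partition's hash table fits during probing."""
    B = 3  # need at least one input, one output and one hash table page
    while probing_phase_pages(M, B, f) > B:
        B += 1
    return B


print(min_buffer_pages(1000))  # 37
```

The result is close to the square root of f*M, which is the usual rule of thumb for the memory a Grace hash join needs.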

If memory is not enough to hold the smaller partition:
– Divide the partition of R into sub-partitions using another hash function h3.
– Divide the matching partition of S into sub-partitions using the same hash function h3.
– Sub-partition j of partition i in R only matches sub-partition j of partition i in S.

Sort-Merge Join (R ⋈ S on R.i = S.j)
Sort R and S on the join attribute, then scan them to do a "merge" (on the join column), and output result tuples.
– Advance the scan of R until current R.i >= current S.j, then advance the scan of S until current S.j >= current R.i; repeat until current R.i = current S.j.
– At this point, all R tuples with the same value in R.i (current R partition) and all S tuples with the same value in S.j (current S partition) match; output <r, s> for all pairs of such tuples.
– Then resume scanning R and S.
– (Figure: R and S, both sorted, are scanned in parallel and matched where i = j.)
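The merge logic described above can be sketched as follows, assuming both relations fit in memory as lists of dicts. The function name and the use of Python's built-in `sorted` for the sort step are choices made for this illustration.

```python
def sort_merge_join(R, S, a, b):
    """Join lists of dict tuples R and S on r[a] == s[b] by sorting and merging."""
    R = sorted(R, key=lambda r: r[a])
    S = sorted(S, key=lambda s: s[b])
    result = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1  # advance the scan of R
        elif R[i][a] > S[j][b]:
            j += 1  # advance the scan of S
        else:
            value = R[i][a]
            # find the current R partition and S partition (same join value)
            i_end = i
            while i_end < len(R) and R[i_end][a] == value:
                i_end += 1
            j_end = j
            while j_end < len(S) and S[j_end][b] == value:
                j_end += 1
            # output all pairs from the two partitions
            for r in R[i:i_end]:
                for s in S[j:j_end]:
                    result.append((r, s))
            i, j = i_end, j_end  # resume scanning after both partitions
    return result


R = [{"i": 1}, {"i": 2}, {"i": 2}, {"i": 4}]
S = [{"j": 2}, {"j": 2}, {"j": 3}, {"j": 4}]
print(len(sort_merge_join(R, S, "i", "j")))  # 5
```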

I/O Cost for Sort-Merge Join
– Cost of sorting: roughly O(|R| log |R|) + O(|S| log |S|); the exact I/O cost depends on the external sorting algorithm.
– Cost of merging: M + N. It could be up to O(M*N) if the inner relation has to be scanned multiple times (very unlikely).
– (Figure: example of a sort-merge join on the Sailors and Reserves tables.)

Sort-Merge Join
– Attractive if one relation is already sorted on the join attribute or has a clustered index on the join attribute.

Exercise
What is the I/O cost for using a sort-merge join?
Ignore the cost of writing the results. Buffer pool: 100 pages.
Answer
– Sort Sailors in 2 passes: 2 * (1000 + 1000) = 4000
– Sort Reserves in 2 passes: 2 * (500 + 500) = 2000
– Merge: (1000 + 500) = 1500
– Total cost = 4000 + 2000 + 1500 = 7500

Cost Comparison for the Exercise
– Block nested loops join: 5500
– Index nested loops join using a hash index: 88,500
– Grace hash join: 4500
– Sort-merge join: 7500

Why Sort?
– The sort-merge join algorithm involves sorting.
– Problem: sort 1 GB of data with 1 MB of RAM.
– Why not virtual memory?

2-Way Sort: Requires 3 Buffers
– Pass 1: read a page, sort it, write it. Only one buffer page is used.
– Pass 2, 3, ..., etc.: three buffer pages are used (INPUT 1, INPUT 2, and OUTPUT in main memory, reading from and writing to disk).

Example: Sorting 4 pages
– P1 and P2 are sorted individually, then merged through INPUT 1, INPUT 2, and OUTPUT into one sorted run P1P2.
– P3 and P4 are sorted individually, then merged in the same way into one sorted run P3P4.
– Finally, the runs P1P2 and P3P4 are merged; the data in the 4 pages are then sorted.

Two-Way External Merge Sort
– In each pass we read and write each page in the file.
– N pages in the file => the number of passes is ceil(log2 N) + 1.
– So the total cost is 2N * (ceil(log2 N) + 1).
– Idea: divide and conquer: sort subfiles and merge.
Example (7-page file, two records per page):
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs):  1,2 2,3 3,4 4,5 6,6 7,8 9
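The pass structure can be simulated in memory. In the sketch below each page is a Python list, pass 0 sorts pages individually, and every later pass merges neighbouring runs pairwise. The pass count uses the standard result that two-way merging of N pages takes ceil(log2 N) + 1 passes. Function names are illustrative, and no real disk I/O takes place.

```python
import heapq


def two_way_passes(N):
    """ceil(log2(N)) + 1 passes for a file of N >= 1 pages (integer arithmetic)."""
    return (N - 1).bit_length() + 1


def two_way_io_cost(N):
    """Each pass reads and writes all N pages."""
    return 2 * N * two_way_passes(N)


def two_way_external_sort(pages):
    """Simulate the passes in memory; pages is a non-empty list of record lists."""
    runs = [sorted(page) for page in pages]  # pass 0: sort each page on its own
    passes = 1
    while len(runs) > 1:
        # merge neighbouring runs pairwise; an odd run at the end is carried over
        runs = [list(heapq.merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
sorted_records, passes = two_way_external_sort(pages)
print(sorted_records)  # [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]
print(passes, two_way_passes(7), two_way_io_cost(7))  # 4 4 56
```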

Two-Phase Multi-Way Merge-Sort
More than 3 buffer pages: how can we utilize them?
Phase 1:
1. Fill all available main memory with blocks from the original relation to be sorted.
2. Sort the records in main memory using main memory sorting techniques.
3. Write the sorted records from main memory onto new blocks of secondary memory, forming one sorted sublist. (There may be any number of these sorted sublists, which we merge in the next phase.)
(Figure: the file to be sorted passes through main memory (MM) and is written out as sorted sublists f1, f2, ..., fn, where n = N/M, N is the file size in pages, and M is the main memory size in pages.)
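Phase 1 can be sketched in a few lines of Python. Here `memory_capacity` is a hypothetical parameter counting how many records fit in main memory at once, and the returned lists stand in for the sorted sublists written to disk.

```python
def make_sorted_runs(records, memory_capacity):
    """Phase 1: cut the input into memory-sized chunks and sort each one."""
    runs = []
    for start in range(0, len(records), memory_capacity):
        chunk = records[start:start + memory_capacity]  # step 1: fill memory
        chunk.sort()                                    # step 2: in-memory sort
        runs.append(chunk)                              # step 3: one sorted sublist
    return runs


print(make_sorted_runs([9, 3, 7, 1, 8, 2, 5], memory_capacity=3))
# [[3, 7, 9], [1, 2, 8], [5]]
```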

Phase 2: Multiway Merge-Sort
– Merge all the sorted sublists into a single sorted list.
– Partition main memory (MM) into n input blocks, one input buffer for each sorted list, plus an output buffer; load fi into block i.
– Keep pointers to the first unchosen record of each list, and select the smallest unchosen key among them for output. This is done in main memory.

Phase 2: Multiway Merge-Sort
– Find the smallest key among the first remaining elements of all the lists. (This comparison is done in main memory, and a linear search is sufficient; a better technique can be used.)
– Move the smallest element to the first available position of the output block.
– If the output block is full, write it to disk and reinitialize the same buffer in main memory to hold the next output block.
– If the block from which the smallest element was taken is now exhausted, read the next block from the same sorted sublist into the same buffer. If no block remains, leave its buffer empty and do not consider elements from that list.
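The selection loop can be sketched with the linear search the slide mentions. This is an in-memory model in which each sorted sublist is a Python list and block reads and writes are not simulated; the function name is made up for the example.

```python
def multiway_merge(sorted_lists):
    """Merge any number of sorted lists by repeatedly picking the smallest head."""
    positions = [0] * len(sorted_lists)  # pointer to the first unchosen record per list
    output = []
    while True:
        best = None  # index of the list whose current head is smallest
        for k, lst in enumerate(sorted_lists):
            if positions[k] < len(lst):  # skip exhausted lists
                if best is None or lst[positions[k]] < sorted_lists[best][positions[best]]:
                    best = k
        if best is None:  # every list is exhausted
            break
        output.append(sorted_lists[best][positions[best]])
        positions[best] += 1
    return output


print(multiway_merge([[1, 4, 7], [2, 5], [3, 6, 8, 9]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The "better technique" is typically a priority queue over the list heads; Python's `heapq.merge` does this and produces the same output.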

Memory Requirement for Multi-Way Merge-Sort
Partitioning phase:
– Given M pages of main memory and a file of N pages of data, we can partition the file into N/M small sorted files (f1, f2, ..., fn).
Merge phase:
– With M pages of main memory, at most M-1 sorted lists can be merged at once (one input buffer per list plus one output buffer), so we need M-1 >= N/M, i.e., M(M-1) >= N.
– (Figure: main memory buffers INPUT 1, INPUT 2, ..., INPUT n and OUTPUT, reading from and writing to disk.)
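The condition M(M-1) >= N is easy to check numerically. The sketch below applies it to the earlier "1 GB of data, 1 MB of RAM" problem, assuming 4 KB pages; the page size and the function names are assumptions made for this example.

```python
import math


def can_sort_in_two_phases(N, M):
    """True if a file of N pages can be sorted in two phases with M memory pages."""
    return M * (M - 1) >= N


def min_memory_pages(N):
    """Smallest M with M * (M - 1) >= N, for N >= 1."""
    M = max(2, math.isqrt(N))  # start just below the answer and step up
    while not can_sort_in_two_phases(N, M):
        M += 1
    return M


PAGE_BYTES = 4096                # assumed page size
N = 2**30 // PAGE_BYTES          # 1 GB of data -> 262144 pages
M = 2**20 // PAGE_BYTES          # 1 MB of RAM  -> 256 pages
print(can_sort_in_two_phases(N, M))  # False
print(min_memory_pages(N))           # 513
```

Under these assumptions 1 MB is not enough for a two-phase sort of 1 GB; about 2 MB (513 pages) would be.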

In-memory Sort: Quick Sort
1. Pick one element in the array, which will be the pivot.
2. Make one pass through the array, called a partition step, rearranging the entries so that:
   a) the pivot is in its proper place,
   b) entries to the left of the pivot are smaller than the pivot,
   c) entries to its right are larger than the pivot.
   Detailed steps:
   1. Starting from the left, find an item that is larger than the pivot.
   2. Starting from the right, find an item that is smaller than the pivot.
   3. Swap the left and right items.
3. Recursively apply quicksort to the part of the array that is to the left of the pivot, and to the part on its right.
O(n log n) on average; the worst case is O(n^2).
(Figure: left and right pointers during the 1st pass, pivot = 18.)
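A compact Python sketch of these steps follows. It uses a common two-pointer variant of the partition step and picks the middle element as the pivot; both choices are assumptions for this example, since the slides do not fix a pivot rule. In this variant the pivot is not necessarily in its final position after one partition step, but each recursive call still works on a strictly smaller range.

```python
def quicksort(a, lo=0, hi=None):
    """Sort list a in place between indexes lo and hi (inclusive)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[(lo + hi) // 2]  # assumed pivot rule: middle element
    left, right = lo, hi
    while left <= right:
        while a[left] < pivot:   # from the left, find an item not smaller than the pivot
            left += 1
        while a[right] > pivot:  # from the right, find an item not larger than the pivot
            right -= 1
        if left <= right:
            a[left], a[right] = a[right], a[left]  # swap left and right
            left += 1
            right -= 1
    quicksort(a, lo, right)  # part to the left of the split
    quicksort(a, left, hi)   # part to the right of the split


data = [18, 5, 27, 3, 18, 42, 1]
quicksort(data)
print(data)  # [1, 3, 5, 18, 18, 27, 42]
```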

Query Cost Estimation
Select:
– No index
  – unsorted data
  – sorted data
– Index
  – tree index
  – hash-based index
Join: R ⋈ S
– Simple nested loop
– Block nested loop
– Grace hash
– Sort-merge