Implementation of Relational Operations (Part 2) R&G - Chapters 12 and 14.

Slides:



Advertisements
Similar presentations
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Advertisements

Implementation of relational operations
Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.
1 Lecture 23: Query Execution Friday, March 4, 2005.
Join Processing in Databases Systems with Large Main Memories
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
CMSC424: Database Design Instructor: Amol Deshpande
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Implementation of Other Relational Algebra Operators, R. Ramakrishnan and J. Gehrke1 Implementation of other Relational Algebra Operators Chapter 12.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Bhargav Vadher (208) APRIL 9 th, 2008 Submittetd To: Dr. T Y Lin Computer Science Department San Jose State University.
CMU SCS /615Faloutsos/Pavlo1 Carnegie Mellon Univ. Dept. of Computer Science /615 – DB Applications C. Faloutsos & A. Pavlo Lecture #13: Query.
SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
Midterm Review Spring Overview Sorting Hashing Selections Joins.
Lecture 24: Query Execution Monday, November 20, 2000.
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna.
Unary Query Processing Operators CS 186, Spring 2006 Background for Homework 2.
15.5 Two-Pass Algorithms Based on Hashing 115 ChenKuang Yang.
Unary Query Processing Operators Not in the Textbook!
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
1 Evaluation of Relational Operations Yanlei Diao UMass Amherst March 01, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
CS4432: Database Systems II Query Processing- Part 3 1.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 13 – Query Evaluation.
Lecture 24 Query Execution Monday, November 28, 2005.
Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.
CS4432: Database Systems II Query Processing- Part 2.
CPSC 404, Laks V.S. Lakshmanan1 Overview of Query Evaluation Chapter 12 Ramakrishnan & Gehrke (Sections )
CSCE Database Systems Chapter 15: Query Execution 1.
Query Processing CS 405G Introduction to Database Systems.
Lecture 17: Query Execution Tuesday, February 28, 2001.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
More Optimization Exercises. Block Nested Loops Join Suppose there are B buffer pages Cost: M + ceil (M/(B-2))*N where –M is the number of pages of R.
Database Management Systems 1 Raghu Ramakrishnan Evaluation of Relational Operations Chpt 14.
Chapter 12 Query Processing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
SQL and Query Execution for Aggregation. Example Instances Reserves Sailors Boats.
Fan Qi Database Lab 1, com1 #01-08 CS3223 Tutorial 6.
1 Lecture 23: Query Execution Monday, November 26, 2001.
Fan Qi Database Lab 1, com1 #01-08 CS3223 Tutorial 5.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Database Systems (資料庫系統)
Evaluation of Relational Operations: Other Operations
Introduction to Database Systems
Implementation of Relational Operations (Part 2)
Relational Operations
Faloutsos/Pavlo C. Faloutsos – A. Pavlo Lecture#13: Query Evaluation
(Two-Pass Algorithms)
Highlights of Query Processing And Optimization
Overview of Query Evaluation
Overview of Query Evaluation
Lecture 24: Query Execution
Evaluation of Relational Operations: Other Techniques
Lecture 22: Query Execution
CPSC-608 Database Systems
Lecture 22: Friday, November 22, 2002.
Evaluation of Relational Operations: Other Techniques
Lecture 24: Query Execution
Lecture 20: Query Execution
Presentation transcript:

Implementation of Relational Operations (Part 2) R&G - Chapters 12 and 14

An Alternative to Sorting: Hashing! Idea: –Many of the things we use sort for don’t exploit the order of the sorted data –e.g.: removing duplicates in DISTINCT –e.g.: finding matches in JOIN Often good enough to match all tuples with equal values Hashing does this! –And may be cheaper than sorting! (Hmmm…!) –But how to do it for data sets bigger than memory??

General Idea Two phases: –Partition: use a hash function h to split tuples into partitions on disk. Key property: all matches live in the same partition. –ReHash: for each partition on disk, build a main- memory hash table using a hash function h2

Two Phases Partition: Rehash: Partitions Hash table for partition R i (<= B pages) B main memory buffers Disk Result hash fn h2 B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h B-1 Partitions 1 2 B-1...

Duplicate Elimination using Hashing read one bucket at a time for each group of identical tuples, output one Partitions Hash table for partition R i (<= B pages) B main memory buffers Disk Result hash fn h2 B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h B-1 Partitions 1 2 B-1...

Cost of External Hashing cost = 4*|R| IO’s

Memory Requirement How big of a table can we hash in two passes? –B-1 “partitions” result from Phase 0 –Each should be no more than B pages in size –Answer: B(B-1). Said differently: We can hash a table of size N pages in about space –Note: assumes hash function distributes records evenly! Have a bigger table? Recursive partitioning!

How does this compare with external sorting?

Cost of External Sorting cost = 4*|R| IO’s

Memory Requirement for External Sorting How big of a table can we sort in two passes? –Each “sorted run” after Phase 0 is of size B –Can merge up to B-1 sorted runs in Phase 1 –Answer: B(B-1). Said differently: We can sort a table of size N pages in about space Have a bigger table? Additional merge passes!

So which is better ?? Based on our simple analysis: –Same memory requirement for 2 passes –Same IO cost Digging deeper … Sorting pros: –Great if input already sorted (or almost sorted) –Great if need output to be sorted anyway –Not sensitive to “data skew” or “bad” hash functions Hashing pros: –Highly parallelizable (will discuss later in semester) –Can exploit extra memory to reduce # IOs (stay tuned…)

Q: Can we use hashing for JOIN ? before we optimize hashing further …

Hash Join Partitions of R & S Input buffer for Si Hash table for partition Ri (B-2 pages) B main memory buffers Disk Output buffer Disk Join Result hash fn h2 B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h B-1 Partitions 1 2 B-1...

Cost of Hash Join Partitioning phase: read+write both relations  2(|R|+|S|) I/Os Matching phase: read+write both relations  |R|+|S| I/Os Total cost of 2-pass hash join = 3(|R|+|S|) Q: what is cost of 2-pass sort join? Q: how much memory needed for 2-pass sort join? Q: how much memory needed for 2-pass hash join?

Have B memory buffers Want to hash relation of size N An important optimization to hashing N # passes B B2B2 1 2 If B < N < B2, will have unused memory … cost N 3N

1 Hybrid Hashing Idea: keep one of the hash buckets in memory! B main memory buffers Disk Original Relation OUTPUT 3 INPUT 2 h B-k Partitions 2 3 B-k... h3 k-buffer hashtable Q: how do we choose the value of k?

Cost reduction due to hybrid hashing Now: N # passes B B2B2 1 2 cost N 3N

Summary: Hashing vs. Sorting Sorting pros: –Good if input already sorted, or need output sorted –Not sensitive to data skew or bad hash functions Hashing pros: –Often cheaper due to hybrid hashing –For join: # passes depends on size of smaller relation –Highly parallelizable