CSE 444: Lecture 25 Query Execution


CSE 444: Lecture 25 Query Execution Monday, November 29, 2004

Outline: external sorting; sort-based algorithms; an example.

The I/O Model of Computation. In main memory, cost is CPU time, measured in big-O notation. In databases, time is dominated by I/O cost: big-O too, but counting I/Os, and often the big-O becomes a constant. Consequence: certain algorithms need to be redesigned. See sorting next.

Sorting. Problem: sort 1 GB of data with 1 MB of RAM. Where we need this: data requested in sorted order (ORDER BY); grouping operations; the first step in the sort-merge join algorithm; duplicate removal; bulk loading of B+-tree indexes.

2-Way Merge-sort: Requires 3 Buffers in RAM. Pass 1: read a page, sort it, write it. Pass 2, 3, …, etc.: merge two runs of length L and write one run of length 2L. [Figure: main memory buffers INPUT 1, INPUT 2 and OUTPUT, reading runs from disk and writing merged runs back to disk.]

Two-Way External Merge Sort. Assume the block size is B = 4 KB. Step 1 → runs of length L = 4 KB; Step 2 → runs of length L = 8 KB; Step 3 → runs of length L = 16 KB; ...; Step 9 → runs of length L = 1 MB; ...; Step 19 → runs of length L = 1 GB (why?). We need 19 iterations over the disk data to sort 1 GB.
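The step count on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `two_way_passes` is made up for illustration, and it assumes binary units (1 KB = 1024 bytes) and that pass 1 produces runs of exactly one block.

```python
def two_way_passes(data_bytes: int, block_bytes: int) -> int:
    """Count passes of 2-way external merge sort.

    Pass 1 produces sorted runs of one block each; every later pass
    merges pairs of runs, so the run length doubles per pass.
    """
    passes = 1
    run_len = block_bytes
    while run_len < data_bytes:
        run_len *= 2
        passes += 1
    return passes


KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3
print(two_way_passes(1 * MB, 4 * KB))  # runs reach 1 MB at step 9
print(two_way_passes(1 * GB, 4 * KB))  # runs reach 1 GB at step 19
```

The answer to "why?" falls out of the loop: 1 GB / 4 KB = 2^18, so 18 doublings are needed after the first pass.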

Can We Do Better? Hint: we have 1 MB of main memory, but only used 12 KB (three 4 KB buffers).

Cost Model for Our Analysis. B: block size (= 4 KB). M: size of main memory (= 1 MB). N: number of records in the file. R: size of one record.

External Merge-Sort. Phase one: load M bytes into memory and sort them. Result: runs of length M bytes (1 MB), i.e. M/R records each. [Figure: data read from disk into M bytes of main memory and written back to disk as sorted runs.]

Phase Two. Merge M/B – 1 runs into a new run (about 250 runs at a time). Result: runs of length M (M/B – 1) bytes (about 250 MB). [Figure: input buffers Input 1, Input 2, …, Input M/B and one Output buffer in M bytes of main memory, reading from disk and writing to disk.]
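The two phases can be simulated in memory to see how runs grow. The sketch below is not the lecture's implementation: it uses Python lists in place of disk files, `heapq.merge` for the multi-way merge, and the hypothetical parameters `run_size` (records that fit in memory, M/R) and `fan_in` (runs merged at once, M/B – 1, assumed to be at least 2).

```python
import heapq
from typing import List, Tuple


def make_runs(records: List[int], run_size: int) -> List[List[int]]:
    """Phase one: sort memory-sized chunks; each sorted chunk is one run."""
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]


def merge_pass(runs: List[List[int]], fan_in: int) -> List[List[int]]:
    """One merge pass: merge groups of up to fan_in runs into longer runs."""
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]


def external_sort(records: List[int], run_size: int,
                  fan_in: int) -> Tuple[List[int], int]:
    """Return the sorted records and the number of passes over the data."""
    runs = make_runs(records, run_size)
    passes = 1
    while len(runs) > 1:
        runs = merge_pass(runs, fan_in)
        passes += 1
    return (runs[0] if runs else []), passes


data = list(range(100, 0, -1))
result, passes = external_sort(data, run_size=4, fan_in=3)
print(result[:5], passes)
```

With 100 records, runs of 4 and a fan-in of 3, the run count goes 25, 9, 3, 1, so the sort takes four passes in total.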

Phase Three. Merge M/B – 1 runs into a new run. Result: runs of length M (M/B – 1)^2 bytes (about 62.5 GB, since 1 MB × 250 × 250 = 62,500 MB). [Figure: input buffers Input 1, Input 2, …, Input M/B and one Output buffer in M bytes of main memory, reading from disk and writing to disk.]

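Cost of External Merge Sort. Number of passes: 1 + ⌈log_{M/B – 1} ⌈NR/M⌉⌉ (phase one leaves ⌈NR/M⌉ runs, and each later pass reduces the number of runs by a factor of M/B – 1). How much data can we sort with 10 MB of RAM? 1 pass → 10 MB of data; 2 passes → 25 GB of data (M/B = 2500). Can sort everything in 2 or 3 passes!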
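The capacity figures on this slide can be reproduced with a short calculation. The sketch below assumes decimal units (1 KB = 1000 bytes, which is what makes M/B come out to 2500) and a fan-in of M/B – 1; the function names are illustrative only.

```python
import math

KB, MB, GB = 1000, 1000 ** 2, 1000 ** 3


def max_sortable_bytes(m: int, b: int, passes: int) -> int:
    """Largest input sortable in `passes` passes: M * (M/B - 1)^(passes - 1)."""
    return m * (m // b - 1) ** (passes - 1)


def passes_needed(data_bytes: int, m: int, b: int) -> int:
    """Phase one leaves ceil(data/M) runs; each merge pass divides that by M/B - 1."""
    runs = math.ceil(data_bytes / m)
    fan_in = m // b - 1
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes


print(max_sortable_bytes(10 * MB, 4 * KB, 1) / MB)  # 10 MB in one pass
print(max_sortable_bytes(10 * MB, 4 * KB, 2) / GB)  # just under 25 GB in two
print(passes_needed(1 * GB, 1 * MB, 4 * KB))        # the 1 GB / 1 MB problem
```

With only 1 MB of memory, the 1 GB problem from the earlier slides needs three passes, because two passes cover only about 250 MB.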

External Merge Sort. The xsort tool in the XML toolkit sorts using this algorithm. It can sort 1 GB of XML data in about 8 minutes.

Two-Pass Algorithms Based on Sorting. From here on, B(R) is the number of blocks of R and M is the memory size in blocks. Assumption: multi-way merge sort needs only two passes, i.e. B(R) <= M^2. Cost for sorting: 3B(R).

Two-Pass Algorithms Based on Sorting. Duplicate elimination δ(R). Trivial idea: sort first, then eliminate duplicates. Step 1: sort chunks of size M and write them out; cost 2B(R). Step 2: merge M – 1 runs, but include each tuple only once; cost B(R). Total cost: 3B(R). Assumption: B(R) <= M^2.
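Step 2 can be sketched as a merge that remembers the last tuple it emitted. This is an illustration under simplifying assumptions, not the lecture's code: runs are in-memory sorted lists rather than disk files, and `distinct_merge` is a made-up name.

```python
import heapq
from typing import Iterable, Iterator, List


def distinct_merge(runs: Iterable[List[int]]) -> Iterator[int]:
    """Merge sorted runs, emitting each distinct value only once."""
    last = object()  # sentinel that compares unequal to every tuple
    for t in heapq.merge(*runs):
        if t != last:
            yield t
            last = t


runs = [[1, 2, 2, 5], [2, 3, 5], [1, 7]]
print(list(distinct_merge(runs)))
```

Because the merged stream is sorted, equal tuples are adjacent, so comparing against the previous output is enough and no extra memory is needed.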

Two-Pass Algorithms Based on Sorting. Grouping: γ_{a, sum(b)}(R). Same as before: sort, then compute sum(b) for each group of a's. Total cost: 3B(R). Assumption: B(R) <= M^2.
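Grouping follows the same pattern: merge the runs on the grouping attribute, then aggregate each block of adjacent equal keys. The sketch below assumes tuples of the form (a, b) held in in-memory runs sorted on a; `group_sum` is a hypothetical helper name.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple


def group_sum(runs: Iterable[List[Tuple[int, int]]]) -> Iterator[Tuple[int, int]]:
    """Merge runs sorted on a, then emit (a, sum(b)) for each group of a's."""
    merged = heapq.merge(*runs, key=itemgetter(0))
    for a, group in groupby(merged, key=itemgetter(0)):
        yield a, sum(b for _, b in group)


runs = [[(1, 10), (3, 5)], [(1, 2), (2, 7), (3, 1)]]
print(list(group_sum(runs)))
```

Only one running sum is kept at a time, which is why the merge phase needs no more memory than plain sorting.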

Two-Pass Algorithms Based on Sorting: R ∪ S. Complete the program in class:

x = first(R)
y = first(S)
while (_______________) do {
    case x < y: output(x)
                x = next(R)
    case x = y:
    case x > y:
}
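One possible way to complete the union exercise is sketched below. It is not the in-class solution: it assumes set semantics (each input is sorted and free of duplicates) and uses Python lists with index cursors in place of first/next on disk-resident relations.

```python
from typing import List


def sorted_union(r: List[int], s: List[int]) -> List[int]:
    """Union of two sorted, duplicate-free lists in one merge pass."""
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        x, y = r[i], s[j]
        if x < y:        # x occurs only in R so far
            out.append(x)
            i += 1
        elif x == y:     # in both: output once, advance both
            out.append(x)
            i += 1
            j += 1
        else:            # y occurs only in S so far
            out.append(y)
            j += 1
    out.extend(r[i:])    # whatever remains in either input
    out.extend(s[j:])
    return out


print(sorted_union([1, 3, 5, 7], [2, 3, 6, 7, 9]))
```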

Two-Pass Algorithms Based on Sorting: R ∩ S. Complete the program in class:

x = first(R)
y = first(S)
while (_______________) do {
    case x < y:
    case x = y:
    case x > y:
}
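A possible completion of the intersection exercise, again only a sketch under the assumption of set semantics and sorted, duplicate-free in-memory inputs (the function name is illustrative):

```python
from typing import List


def sorted_intersection(r: List[int], s: List[int]) -> List[int]:
    """Intersection of two sorted, duplicate-free lists in one merge pass."""
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i] < s[j]:      # r[i] cannot appear later in S
            i += 1
        elif r[i] > s[j]:    # s[j] cannot appear later in R
            j += 1
        else:                # in both inputs: output it
            out.append(r[i])
            i += 1
            j += 1
    return out


print(sorted_intersection([1, 3, 5, 7], [2, 3, 6, 7, 9]))
```

Unlike union, the loop can stop as soon as either input is exhausted, since nothing left over can be in both.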

Two-Pass Algorithms Based on Sorting: R - S. Complete the program in class:

x = first(R)
y = first(S)
while (_______________) do {
    case x < y:
    case x = y:
    case x > y:
}
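A possible completion of the difference exercise, sketched under the same assumptions as before (set semantics, sorted duplicate-free lists standing in for the sorted relations; the name `sorted_difference` is made up):

```python
from typing import List


def sorted_difference(r: List[int], s: List[int]) -> List[int]:
    """R - S for two sorted, duplicate-free lists in one merge pass."""
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i] < s[j]:      # r[i] is not in S: keep it
            out.append(r[i])
            i += 1
        elif r[i] == s[j]:   # r[i] is in S: drop it
            i += 1
            j += 1
        else:                # s[j] is not in R: skip it
            j += 1
    out.extend(r[i:])        # S is exhausted, so the rest of R survives
    return out


print(sorted_difference([1, 3, 5, 7], [2, 3, 6, 7, 9]))
```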

Two-Pass Algorithms Based on Sorting. Binary operations: R ∪ S, R ∩ S, R – S. Idea: sort R, sort S, then do the right thing. A closer look: Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S). Step 2: merge M/2 runs from R; merge M/2 runs from S; output a tuple on a case-by-case basis. Total cost: 3B(R) + 3B(S). Assumption: B(R) + B(S) <= M^2.

Two-Pass Algorithms Based on Sorting: R ⋈_{R.A = S.B} S, with R(A,C) sorted on A and S(B,D) sorted on B. Complete the program in class:

x = first(R)
y = first(S)
while (_______________) do {
    case x.A < y.B:
    case x.A = y.B:
    case x.A > y.B:
}
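One way to complete the join exercise is sketched below. It is not the in-class answer: relations are in-memory lists of (A, C) and (B, D) pairs already sorted on the join attribute, and the group of matching S tuples is assumed to fit in memory, which is the easy case.

```python
from typing import List, Tuple


def merge_join(r: List[Tuple[int, str]],
               s: List[Tuple[int, str]]) -> List[Tuple[int, str, str]]:
    """Join R(A, C) with S(B, D) on A = B; both inputs sorted on the key."""
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        a, b = r[i][0], s[j][0]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Find the end of the block of S tuples with this key.
            j_end = j
            while j_end < len(s) and s[j_end][0] == a:
                j_end += 1
            # Pair every R tuple with this key against that whole block.
            while i < len(r) and r[i][0] == a:
                for k in range(j, j_end):
                    out.append((a, r[i][1], s[k][1]))
                i += 1
            j = j_end
    return out


r = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
s = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(merge_join(r, s))
```

The `x.A = y.B` case is the only one that needs care: with duplicate keys on both sides, each matching R tuple has to see the whole block of matching S tuples.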

Two-Pass Algorithms Based on Sorting. Join R ⋈ S. Start by sorting both R and S on the join attribute. Cost: 4B(R) + 4B(S) (because we need to write to disk). Read both relations in sorted order and match tuples. Cost: B(R) + B(S). Difficulty: many tuples in R may match many in S. If at least one set of tuples fits in M, we are OK; otherwise we need a nested loop, at higher cost. Total cost: 5B(R) + 5B(S). Assumption: B(R) <= M^2, B(S) <= M^2.

Two-Pass Algorithms Based on Sorting. Join R ⋈ S. If the number of tuples in R matching those in S is small (or vice versa), we can compute the join during the merge phase. Total cost: 3B(R) + 3B(S). Assumption: B(R) + B(S) <= M^2.

Summary of External Join Algorithms. Block Nested Loop Join: B(S) + B(R)*B(S)/M. Partitioned Hash Join: 3B(R) + 3B(S), assuming min(B(R), B(S)) <= M^2. Merge Join: 3B(R) + 3B(S), assuming B(R) + B(S) <= M^2. Index Join: B(R) + T(R)B(S)/V(S,a), assuming…
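To compare these formulas on concrete numbers, they can be written as small functions. This is a sketch only: the function names are invented, the merge join figure of 3B(R) + 3B(S) is the one from the merge-phase join slide, and the sample sizes (1,000 and 500 blocks, 100 memory blocks) are arbitrary inputs chosen for illustration.

```python
def block_nested_loop_cost(b_r: float, b_s: float, m: float) -> float:
    """B(S) + B(R) * B(S) / M."""
    return b_s + b_r * b_s / m


def hash_join_cost(b_r: float, b_s: float) -> float:
    """3B(R) + 3B(S); assumes min(B(R), B(S)) <= M^2."""
    return 3 * b_r + 3 * b_s


def merge_join_cost(b_r: float, b_s: float) -> float:
    """3B(R) + 3B(S); assumes B(R) + B(S) <= M^2."""
    return 3 * b_r + 3 * b_s


def index_join_cost(b_r: float, t_r: float, b_s: float, v_s_a: float) -> float:
    """B(R) + T(R) * B(S) / V(S, a)."""
    return b_r + t_r * b_s / v_s_a


b_r, b_s, m = 1000, 500, 100
print(block_nested_loop_cost(b_r, b_s, m))
print(hash_join_cost(b_r, b_s))
print(merge_join_cost(b_r, b_s))
```

For these inputs the block nested loop join costs 5,500 I/Os against 4,500 for the two-pass algorithms, and the gap widens as the relations grow relative to M.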

Example. Product(pname, maker), Company(cname, city). How do we execute this query?

Select Product.pname
From Product, Company
Where Product.maker = Company.cname and Company.city = “Seattle”

Example. Product(pname, maker), Company(cname, city). Assume: clustered index on Product.pname and Company.cname; unclustered index on Product.maker and Company.city.

Logical Plan: Product(pname, maker) ⋈_{maker=cname} σ_{city=“Seattle”}(Company(cname, city)).

Physical plan 1: index-based selection σ_{city=“Seattle”} on Company(cname, city), followed by an index-based join on cname = maker with Product(pname, maker).

Physical plans 2a and 2b: merge-join on maker = cname. Company(cname, city): index scan, then σ_{city=“Seattle”}. Product(pname, maker): scan and sort (2a) or index scan (2b). Which one is better?

Physical plan 1, costs. Index-based selection σ_{city=“Seattle”} on Company(cname, city): T(Company) / V(Company, city). Index-based join on cname = maker with Product(pname, maker): × T(Product) / V(Product, maker). Total cost: T(Company) / V(Company, city) × T(Product) / V(Product, maker).

Physical plans 2a and 2b, costs. Company(cname, city): table scan, B(Company). Product(pname, maker): scan and sort (2a), 3B(Product); index scan (2b), T(Product). Merge-join: no extra cost (why?). Total cost: (2a) 3B(Product) + B(Company); (2b) T(Product) + B(Company).

Which one is better? It depends on the data! Plan 1: T(Company)/V(Company, city) × T(Product)/V(Product, maker). Plan 2a: B(Company) + 3B(Product). Plan 2b: B(Company) + T(Product).

Example. T(Company) = 5,000; B(Company) = 500; M = 100; T(Product) = 100,000; B(Product) = 1,000. We may assume V(Product, maker) ≈ T(Company) (why?). Case 1: V(Company, city) ≈ T(Company), with V(Company, city) = 2,000. Case 2: V(Company, city) << T(Company), with V(Company, city) = 20.

Which Plan is Best? Evaluate each plan for Case 1 and Case 2. Plan 1: T(Company)/V(Company, city) × T(Product)/V(Product, maker). Plan 2a: B(Company) + 3B(Product). Plan 2b: B(Company) + T(Product).
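The blanks for the two cases can be filled by plugging the example statistics into the three cost formulas. The sketch below does that; `plan_costs` is a hypothetical helper, and V(Product, maker) is set to T(Company) = 5,000 following the assumption stated in the example.

```python
from typing import Dict


def plan_costs(t_company: float, b_company: float,
               t_product: float, b_product: float,
               v_company_city: float, v_product_maker: float) -> Dict[str, float]:
    """Estimated I/O cost of each physical plan from the T, B and V statistics."""
    return {
        "plan 1": (t_company / v_company_city) * (t_product / v_product_maker),
        "plan 2a": b_company + 3 * b_product,
        "plan 2b": b_company + t_product,
    }


stats = dict(t_company=5_000, b_company=500,
             t_product=100_000, b_product=1_000,
             v_product_maker=5_000)  # assumes V(Product, maker) = T(Company)

case1 = plan_costs(v_company_city=2_000, **stats)
case2 = plan_costs(v_company_city=20, **stats)
print(case1)
print(case2)
print(min(case1, key=case1.get), min(case2, key=case2.get))
```

Under these numbers plan 1 costs 50 I/Os in case 1 but 5,000 in case 2, while plan 2a stays at 3,500 and plan 2b at 100,500, so plan 1 wins when the selection is very selective and plan 2a wins when it is not.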

Lessons. We need to consider several physical plans even for one simple logical plan. There is no magic “best” plan: it depends on the data. To make the right choice, we need statistics over the data: the B's, the T's, the V's.