Cse 344 April 25th – Disk i/o.

Slides:

Advertisements

Similar presentations

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.

Advertisements

Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.

1 Lecture 23: Query Execution Friday, March 4, 2005.

CS CS4432: Database Systems II Operator Algorithms Chapter 15.

Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.

Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.

1 Implementation of Relational Operations Module 5, Lecture 1.

Lecture 24: Query Execution Monday, November 20, 2000.

SPRING 2004CENG 3521 Join Algorithms Chapter 14. SPRING 2004CENG 3522 Schema for Examples Similar to old schema; rname added for variations. Reserves:

1 Lecture 22: Query Execution Wednesday, March 2, 2005.

1 Optimization - Selection. 2 The Selection Operation Table: Reserves(sid, bid, day, agent) A page (block) can hold 100 Reserves tuples There are 1,000.

Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.

Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.

Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.

1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.

1 Implementation of Relational Operations: Joins.

ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.

CSCE Database Systems Chapter 15: Query Execution 1.

Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.

Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.

CS4432: Database Systems II Query Processing- Part 3 1.

CS411 Database Systems Kazuhiro Minami 11: Query Execution.

CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)

Lecture 17: Query Execution Tuesday, February 28, 2001.

Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.

CS 440 Database Management Systems Lecture 5: Query Processing 1.

Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.

Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)

1 Lecture 23: Query Execution Monday, November 26, 2001.

CS 540 Database Management Systems

Record Storage, File Organization, and Indexes

CS 440 Database Management Systems

Database Management System

Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.

Chapter 12: Query Processing

Evaluation of Relational Operations

Cse 344 February 14th – Indexing.

Chapter 15 QUERY EXECUTION.

Database Management Systems (CS 564)

Evaluation of Relational Operations: Other Operations

February 12th – RDBMS Internals

File Processing : Query Processing

File Processing : Query Processing

Relational Operations

Dynamic Hashing Good for database that grows and shrinks in size

Query Optimization Overview

Yan Huang - CSCI5330 Database Implementation – Access Methods

Cse 344 APRIL 23RD – Indexing.

Query processing and optimization

February 16th – Disk i/o and estimation

April 27th – Cost Estimation

Lecture 2- Query Processing (continued)

Chapter 12 Query Processing (1)

Implementation of Relational Operations

Lecture 24: Query Execution

Lecture 13: Query Execution

Query Execution Index Based Algorithms (15.6)

Lecture 23: Query Execution

Evaluation of Relational Operations: Other Techniques

Overview of Query Evaluation: JOINS

Lecture 22: Query Execution

Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.

Lecture 22: Query Execution

External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.

Lecture 11: B+ Trees and Query Execution

Lecture 22: Friday, November 22, 2002.

Evaluation of Relational Operations: Other Techniques

Lecture 24: Query Execution

Lecture 20: Query Execution

Presentation transcript:

Cse 344 April 25th – Disk i/o

Administrivia HW4 Due Tonight OQ5 Due Tonight HW5 Out Tonight SQL++ Due next Wednesday, 11:30

Which Indexes? The index selection problem Student Which Indexes? ID fName lName 10 Tom Hanks 20 Amy … The index selection problem Given a table, and a “workload” (big Java application with lots of SQL queries), decide which indexes to create (and which ones NOT to create!) Who does index selection: The database administrator DBA Semi-automatically, using a database administration tool

Index Selection: Which Search Key Make some attribute K a search key if the WHERE clause contains: An exact match on K A range predicate on K A join on K

The Index Selection Problem 1 V(M, N, P); Your workload is this 100000 queries: 100 queries: SELECT * FROM V WHERE N=? SELECT * FROM V WHERE P=?

The Index Selection Problem 1 V(M, N, P); Your workload is this 100000 queries: 100 queries: SELECT * FROM V WHERE N=? SELECT * FROM V WHERE P=? What indexes ?

The Index Selection Problem 1 V(M, N, P); Your workload is this 100000 queries: 100 queries: SELECT * FROM V WHERE N=? SELECT * FROM V WHERE P=? A: V(N) and V(P) (hash tables or B-trees)

The Index Selection Problem 2 V(M, N, P); Your workload is this 100000 queries: 100 queries: 100000 queries: SELECT * FROM V WHERE N>? and N<? SELECT * FROM V WHERE P=? INSERT INTO V VALUES (?, ?, ?) What indexes ?

The Index Selection Problem 2 V(M, N, P); Your workload is this 100000 queries: 100 queries: 100000 queries: SELECT * FROM V WHERE N>? and N<? SELECT * FROM V WHERE P=? INSERT INTO V VALUES (?, ?, ?) Stopped here fall 16 A: definitely V(N) (must B-tree); unsure about V(P)

The Index Selection Problem 3 V(M, N, P); Your workload is this 100000 queries: 1000000 queries: 100000 queries: SELECT * FROM V WHERE N=? SELECT * FROM V WHERE N=? and P>? INSERT INTO V VALUES (?, ?, ?) What indexes ?

The Index Selection Problem 3 V(M, N, P); Your workload is this 100000 queries: 1000000 queries: 100000 queries: SELECT * FROM V WHERE N=? SELECT * FROM V WHERE N=? and P>? INSERT INTO V VALUES (?, ?, ?) A: V(N, P) How does this index differ from: Two indexes V(N) and V(P)? An index V(P, N)?

The Index Selection Problem 4 V(M, N, P); Your workload is this 1000 queries: 100000 queries: SELECT * FROM V WHERE N>? and N<? SELECT * FROM V WHERE P>? and P<? What indexes ? CSE 344 - 2017au

The Index Selection Problem 4 V(M, N, P); Your workload is this 1000 queries: 100000 queries: SELECT * FROM V WHERE N>? and N<? SELECT * FROM V WHERE P>? and P<? A: V(N) unclustered, V(P) clustered index

Two typical kinds of queries Point queries What data structure should be used for index? SELECT * FROM Movie WHERE year = ? Range queries What data structure should be used for index? SELECT * FROM Movie WHERE year >= ? AND year <= ?

Basic Index Selection Guidelines Consider queries in workload in order of importance Consider relations accessed by query No point indexing other relations Look at WHERE clause for possible search key Try to choose indexes that speed-up multiple queries

To Cluster or Not Range queries benefit mostly from clustering Point indexes do not need to be clustered: they work equally well unclustered Because we are not looking up the rest of the record

SELECT * FROM R WHERE R.K>? and R.K<? Cost 100 Percentage tuples retrieved

SELECT * FROM R WHERE R.K>? and R.K<? Cost Sequential scan 100 Percentage tuples retrieved

SELECT * FROM R WHERE R.K>? and R.K<? Cost Sequential scan Clustered index 100 Percentage tuples retrieved

SELECT * FROM R WHERE R.K>? and R.K<? Unclustered index Cost Sequential scan Clustered index 100 Percentage tuples retrieved

Choosing Index is Not Enough To estimate the cost of a query plan, we still need to consider other factors: How each operator is implemented The cost of each operator Let’s start with the basics

Cost Parameters Cost = I/O + CPU + Network BW We will focus on I/O in this class Parameters (a.k.a. statistics): B(R) = # of blocks (i.e., pages) for relation R T(R) = # of tuples in relation R V(R, a) = # of distinct values of attribute a

Cost Parameters Cost = I/O + CPU + Network BW We will focus on I/O in this class Parameters (a.k.a. statistics): B(R) = # of blocks (i.e., pages) for relation R T(R) = # of tuples in relation R V(R, a) = # of distinct values of attribute a When a is a key, V(R,a) = T(R) When a is not a key, V(R,a) can be anything <= T(R)

Cost Parameters Cost = I/O + CPU + Network BW We will focus on I/O in this class Parameters (a.k.a. statistics): B(R) = # of blocks (i.e., pages) for relation R T(R) = # of tuples in relation R V(R, a) = # of distinct values of attribute a DBMS collects statistics about base tables must infer them for intermediate results When a is a key, V(R,a) = T(R) When a is not a key, V(R,a) can be anything <= T(R)

Selectivity Factors for Conditions A = c /* σA=c(R) */ Selectivity = 1/V(R,A) A < c /* σA<c(R)*/ Selectivity = (c - min(R, A))/(max(R,A) - min(R,A)) c1 < A < c2 /* σc1<A<c2(R)*/ Selectivity = (c2 – c1)/(max(R,A) - min(R,A))

Cost of Reading Data From Disk Sequential scan for relation R costs B(R) Index-based selection Estimate selectivity factor f (see previous slide) Clustered index: f*B(R) Unclustered index f*T(R) Note: we ignore I/O cost for index pages

Index Based Selection Example: Table scan: Index based selection: B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan: Index based selection: cost of sa=v(R) = ?

Index Based Selection B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan: B(R) = 2,000 I/Os Index based selection: cost of sa=v(R) = ?

Index Based Selection Example: Table scan: B(R) = 2,000 I/Os B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan: B(R) = 2,000 I/Os Index based selection: If index is clustered: If index is unclustered: cost of sa=v(R) = ?

Index Based Selection Example: Table scan: B(R) = 2,000 I/Os B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan: B(R) = 2,000 I/Os Index based selection: If index is clustered: B(R) * 1/V(R,a) = 100 I/Os If index is unclustered: cost of sa=v(R) = ?

Index Based Selection Example: Table scan: B(R) = 2,000 I/Os B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan: B(R) = 2,000 I/Os Index based selection: If index is clustered: B(R) * 1/V(R,a) = 100 I/Os If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os cost of sa=v(R) = ?

Index Based Selection Example: Table scan: B(R) = 2,000 I/Os B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan: B(R) = 2,000 I/Os Index based selection: If index is clustered: B(R) * 1/V(R,a) = 100 I/Os If index is unclustered: T(R) * 1/V(R,a) = 5,000 I/Os cost of sa=v(R) = ? Lesson: Don’t build unclustered indexes when V(R,a) is small !

Outline Join operator algorithms One-pass algorithms (Sec. 15.2 and 15.3) Index-based algorithms (Sec 15.6) Note about readings: In class, we discuss only algorithms for joins Other operators are easier: read the book

Join Algorithms Hash join Nested loop join Sort-merge join

Hash Join Hash join: R ⋈ S Scan R, build buckets in main memory Then scan S and join Cost: B(R) + B(S) Which relation to build the hash table on?

Hash Join Hash join: R ⋈ S Scan R, build buckets in main memory Then scan S and join Cost: B(R) + B(S) Which relation to build the hash table on? One-pass algorithm when B(R) ≤ M M = number of memory pages available

Hash Join Example Patient(pid, name, address) Insurance(pid, provider, policy_nb) Patient Insurance Two tuples per page Patient Insurance 1 ‘Bob’ ‘Seattle’ 2 ‘Blue’ 123 2 ‘Ela’ ‘Everett’ 4 ‘Prem’ 432 3 ‘Jill’ ‘Kent’ 4 ‘Prem’ 343 4 ‘Joe’ ‘Seattle’ 3 ‘GrpH’ 554

Hash Join Example Patient Insurance Patient Insurance Some large-enough # Patient Insurance Memory M = 21 pages Showing pid only Disk Patient Insurance 1 2 2 4 6 6 3 4 4 3 1 3 9 6 2 8 This is one page with two tuples 8 5 8 9

Hash Join Example Step 1: Scan Patient and build hash table in memory Can be done in method open() Memory M = 21 pages Hash h: pid % 5 4 3 9 6 8 5 1 2 Disk Patient Insurance 1 2 1 2 2 4 6 6 Input buffer 3 4 4 3 1 3 9 6 2 8 8 5 8 9

Write to disk or pass to next operator Hash Join Example Step 2: Scan Insurance and probe into hash table Done during calls to next() Memory M = 21 pages Hash h: pid % 5 4 3 9 6 8 5 1 2 Disk Patient Insurance 2 4 1 2 2 1 2 2 4 6 6 Input buffer Output buffer 3 4 4 3 1 3 Write to disk or pass to next operator 9 6 2 8 8 5 8 9

Hash Join Example Step 2: Scan Insurance and probe into hash table Done during calls to next() Memory M = 21 pages Hash h: pid % 5 4 3 9 6 8 5 1 2 Disk Patient Insurance 2 4 1 2 4 1 2 2 4 6 6 Input buffer Output buffer 3 4 4 3 1 3 9 6 2 8 8 5 8 9

Hash Join Example Step 2: Scan Insurance and probe into hash table Done during calls to next() Memory M = 21 pages Hash h: pid % 5 4 3 9 6 8 5 1 2 Disk Patient Insurance 4 3 1 2 4 Say “close()” deletes the hash table 1 2 2 4 6 6 Input buffer Output buffer 3 4 4 3 1 3 Keep going until read all of Insurance 9 6 2 8 Cost: B(R) + B(S) 8 5 8 9

Nested Loop Joins Tuple-based nested loop R ⋈ S R is the outer relation, S is the inner relation for each tuple t1 in R do for each tuple t2 in S do if t1 and t2 join then output (t1,t2) What is the Cost?

Nested Loop Joins Tuple-based nested loop R ⋈ S R is the outer relation, S is the inner relation Cost: B(R) + T(R) B(S) Multiple-pass since S is read many times for each tuple t1 in R do for each tuple t2 in S do if t1 and t2 join then output (t1,t2) What is the Cost?

Page-at-a-time Refinement for each page of tuples r in R do for each page of tuples s in S do for all pairs of tuples t1 in r, t2 in s if t1 and t2 join then output (t1,t2) Cost: B(R) + B(R)B(S) What is the Cost?

Page-at-a-time Refinement 1 2 Input buffer for Patient 2 4 Input buffer for Insurance Disk Patient Insurance 2 1 2 2 4 6 6 Output buffer 3 4 4 3 1 3 9 6 2 8 8 5 8 9

Page-at-a-time Refinement 1 2 1 2 Input buffer for Patient 4 3 Input buffer for Insurance Disk Patient Insurance 1 2 2 4 6 6 Output buffer 3 4 4 3 1 3 9 6 2 8 8 5 8 9

Page-at-a-time Refinement 1 2 1 2 Input buffer for Patient 2 8 Input buffer for Insurance Disk Patient Insurance Keep going until read all of Insurance 2 1 2 2 4 6 6 Output buffer Then repeat for next page of Patient… until end of Patient 3 4 4 3 1 3 9 6 2 8 Cost: B(R) + B(R)B(S) 8 5 8 9

Block-Nested-Loop Refinement for each group of M-1 pages r in R do for each page of tuples s in S do for all pairs of tuples t1 in r, t2 in s if t1 and t2 join then output (t1,t2) Stopped here Winter 2017 Cost: B(R) + B(R)B(S)/(M-1) What is the Cost?