Query Optimization, Concluded and Transactions and Concurrency Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems December.


Query Optimization, Concluded and Transactions and Concurrency Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems December 2, 2007 Some slide content derived from Ramakrishnan & Gehrke

Reminders
• Project demos will be on the 10th - 12th
• Also due on the 14th by 2PM: a 5-10 page report describing:
  - What your project goals were
  - What you implemented
  - Basic architecture and design
  - Division of labor
• Take-home final available Dec. 10th, with 24 hours to complete; due no later than Dec. 14th, 12PM

Overview of Query Optimization
• A query plan: algebraic tree of operators, with a choice of algorithm for each operator
• Two main issues in optimization:
  - For a given query, which possible plans are considered? (The algorithm that searches the plan space for the cheapest estimated plan.)
  - How is the cost of a plan estimated?
• Ideally: want to find the best plan
• Practically: avoid the worst plans!

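Relational Algebra Equivalences
• Allow us to choose different join orders and to "push" selections and projections ahead of joins.
• Selections:
  - $\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots \sigma_{c_n}(R))$ (Cascade)
  - $\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$ (Commute)
• Projections:
  - $\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R)))$ (Cascade), where each attribute list $a_i$ is contained in $a_{i+1}$
• Joins:
  - $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$ (Associative)
  - $(R \bowtie S) \equiv (S \bowtie R)$ (Commute)
• Show that: $R \bowtie (S \bowtie T) \equiv (T \bowtie R) \bowtie S$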
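The selection and projection identities can be spot-checked on a toy relation. The following is only an illustrative sketch: the relation `R`, the helper names `select` and `project`, and the predicates `c1` and `c2` are all made up for this example, and checking one instance is of course not a proof.

```python
# Toy relation as a list of dicts. All names here are hypothetical helpers,
# not part of any DBMS API.
R = [
    {"sid": 1, "rating": 8, "age": 30.0},
    {"sid": 2, "rating": 4, "age": 45.5},
    {"sid": 3, "rating": 9, "age": 52.0},
]

def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]

def project(attrs, rel):
    """Projection (bag semantics, no duplicate elimination)."""
    return [{a: t[a] for a in attrs} for t in rel]

c1 = lambda t: t["rating"] > 5
c2 = lambda t: t["age"] > 40

# Cascade: one selection on (c1 AND c2) vs. two nested selections.
cascade_lhs = select(lambda t: c1(t) and c2(t), R)
cascade_rhs = select(c1, select(c2, R))

# Commute: the nesting order of the two selections does not matter.
commute_rhs = select(c2, select(c1, R))

# Projection cascade: projecting on a wider list first changes nothing,
# as long as the outer list is contained in the inner one.
proj_lhs = project(["sid"], R)
proj_rhs = project(["sid"], project(["sid", "rating"], R))
```

All three selection results contain only the tuple for sid 3, and both projections yield the same list of sid values.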

More Equivalences
• A projection commutes with a selection that only uses attributes retained by the projection
• A selection between attributes of the two arguments of a cross-product converts the cross-product to a join
• A selection on ONLY attributes of R commutes with $R \bowtie S$:
  - $\sigma(R \bowtie S) \equiv \sigma(R) \bowtie S$
• If a projection follows a join $R \bowtie S$, we can "push" it by retaining only the attributes of R (and S) that are needed for the join or are kept by the projection

The System-R Optimizer: Establishing the Basic Model
• Most widely used model; works well for < 10 joins
• Cost estimation: approximate art at best
  - Statistics, maintained in system catalogs, are used to estimate the cost of operations and result sizes
  - Considers a combination of CPU and I/O costs
• Plan space: too large, must be pruned
  - Only the space of left-deep plans is considered
  - Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation
  - Cartesian products avoided

Schema for Examples
  Sailors (sid: integer, sname: string, rating: integer, age: real)
  Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Reserves:
  - Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
  - Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
  Outer block:
    SELECT S.sname FROM Sailors S WHERE S.age IN ( nested block )
  Nested block:
    SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating
• For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

Enumeration of Alternative Plans
• There are two main cases:
  - Single-relation plans
  - Multiple-relation plans
• For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops:
  - Each available access path (file scan / index) is considered, and the one with the least estimated cost is chosen.
  - The different operations are essentially carried out together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation).

Cost Estimates for Single-Relation Plans
• Index I on primary key matches selection:
  - Cost is Height(I)+1 for a B+ tree, about 1.2 for a hash index.
• Clustered index I matching one or more selects:
  - (NPages(I)+NPages(R)) * product of RFs of matching selects.
• Non-clustered index I matching one or more selects:
  - (NPages(I)+NTuples(R)) * product of RFs of matching selects.
• Sequential scan of file:
  - NPages(R).

Example
  SELECT S.sid FROM Sailors S WHERE S.rating=8
• Given an index on rating:
  - (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved
  - Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(R)) = (1/10) * (50+500) pages are retrieved
  - Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R)) = (1/10) * (50+40000) pages are retrieved
• Given an index on sid:
  - Would have to retrieve all tuples/pages. With a clustered index, the cost is 50+500; with an unclustered index, 50+40000
• A simple sequential scan:
  - We retrieve all file pages (500)
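The arithmetic for the rating=8 query can be worked through in a few lines. This is a small sketch of the formulas from the cost-estimate slide, not an optimizer: the function names are invented here, NPages(I)=50 and NKeys(I)=10 come from the example, and NTuples(Sailors) is derived from the schema as 80 tuples/page * 500 pages.

```python
from fractions import Fraction

# Hypothetical helper names; the formulas are the ones from the slides.
def clustered_index_cost(npages_index, npages_rel, rf):
    return (npages_index + npages_rel) * rf

def unclustered_index_cost(npages_index, ntuples_rel, rf):
    return (npages_index + ntuples_rel) * rf

def sequential_scan_cost(npages_rel):
    return npages_rel

# Sailors statistics from the schema slide.
NPAGES_R = 500
NTUPLES_R = 80 * 500          # 40000 tuples
NPAGES_I = 50                 # index size used in the example
RF_RATING = Fraction(1, 10)   # 1/NKeys(I): 10 distinct rating values

costs = {
    "clustered index on rating": clustered_index_cost(NPAGES_I, NPAGES_R, RF_RATING),
    "unclustered index on rating": unclustered_index_cost(NPAGES_I, NTUPLES_R, RF_RATING),
    "sequential scan": sequential_scan_cost(NPAGES_R),
}
cheapest = min(costs, key=costs.get)

for plan, cost in costs.items():
    print(f"{plan}: {int(cost)} pages")
print("cheapest:", cheapest)
```

With these numbers the unclustered index (4005 pages) is far worse than a plain scan (500 pages), which is why the optimizer cannot assume that a matching index always helps.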

Queries Over Multiple Relations
• Fundamental decision in System R: only left-deep join trees are considered
  - As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space
  - Left-deep trees allow us to generate all fully pipelined plans
    - Intermediate results are not written to temporary files
    - Not all left-deep trees are fully pipelined (e.g., SM join)
[Figure: alternative join trees over relations A, B, C, D]

Enumeration of Left-Deep Plans
• Left-deep plans differ only in the order of relations, the access method for each relation, and the join method
• Enumerated using N passes (if N relations joined):
  - Pass 1: Find the best 1-relation plan for each relation
  - Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation (all 2-relation plans)
  - Pass N: Find the best way to join the result of an (N-1)-relation plan (as outer) to the N'th relation (all N-relation plans)
• For each subset of relations, retain only:
  - Cheapest plan overall, plus
  - Cheapest plan for each interesting order of the tuples

Enumeration of Plans (Contd.)
• ORDER BY, GROUP BY, aggregates etc. are handled as a final step, using either an "interestingly ordered" plan or an additional sorting operator
• An (n-1)-way plan is only combined with an additional relation if there is a join condition between them, or all predicates in WHERE have been used up
  - i.e., avoid Cartesian products
• This approach is still exponential in the number of tables
  - Approximately $2^n$ cost enumerations
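The pass-by-pass enumeration can be sketched as a dynamic program over subsets of relations. The code below is a deliberately simplified illustration, not the System R algorithm itself: it ignores access methods, join methods and interesting orders, and it uses a made-up cost metric (the sum of intermediate result sizes). The function name, the relation names and the selectivities are all assumptions for the example.

```python
from fractions import Fraction
from itertools import combinations

def best_left_deep_plan(cards, join_sel):
    """cards: {relation: cardinality}.
    join_sel: {frozenset({a, b}): selectivity of the join predicate on a, b}.
    Returns (cost, result_cardinality, join_order) for the cheapest
    left-deep order, where cost = sum of intermediate result sizes
    (a toy metric chosen for this sketch)."""
    rels = sorted(cards)
    # Pass 1: one entry per single relation.
    best = {frozenset([r]): (0, cards[r], (r,)) for r in rels}
    # Pass k: extend a retained (k-1)-relation plan (as outer) by one inner.
    for size in range(2, len(rels) + 1):
        for combo in combinations(rels, size):
            subset = frozenset(combo)
            for inner in combo:
                outer = subset - {inner}
                if outer not in best:
                    continue
                sels = [join_sel[frozenset((o, inner))]
                        for o in outer
                        if frozenset((o, inner)) in join_sel]
                if not sels:
                    continue  # no join condition: skip the Cartesian product
                cost, card, order = best[outer]
                out_card = card * cards[inner]
                for s in sels:
                    out_card *= s
                candidate = (cost + out_card, out_card, order + (inner,))
                # Retain only the cheapest plan per subset of relations.
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best.get(frozenset(rels))

# Hypothetical three-relation chain query A - B - C.
cards = {"A": 1000, "B": 100, "C": 10}
join_sel = {
    frozenset(("A", "B")): Fraction(1, 100),
    frozenset(("B", "C")): Fraction(1, 10),
}
cost, card, order = best_left_deep_plan(cards, join_sel)
print(order, int(cost))
```

Because only one entry is retained per subset, the table holds at most one plan for each of the $2^n$ subsets, which is where the exponential bound comes from. A fuller version would keep one extra entry per interesting order.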

Cost Estimation for Multirelation Plans
• Consider a query block:
    SELECT attribute list FROM relation list WHERE term 1 AND ... AND term k
• The maximum number of tuples in the result is the product of the cardinalities of the relations in the FROM clause.
• The reduction factor (RF) associated with each term reflects the impact of the term in reducing result size
  - Result cardinality = max # tuples * product of all RFs
• Join one new relation at a time
  - The cost of the join method, plus the estimate of join cardinality, gives us both a cost estimate and a result size estimate
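As a worked instance of the result-cardinality formula, the sketch below applies it to a Reserves/Sailors join. The cardinalities follow from the schema slide (100 * 1000 and 80 * 500 tuples); the three reduction factors are assumptions picked for illustration (40000 distinct sid values, 100 distinct boats, ratings 1 to 10), not statistics given in the slides.

```python
from fractions import Fraction
from math import prod

def estimate_result_cardinality(cardinalities, reduction_factors):
    """Max # tuples (product of cardinalities) times the product of all RFs."""
    return prod(cardinalities) * prod(reduction_factors)

n_reserves = 100 * 1000   # 100 tuples/page * 1000 pages
n_sailors = 80 * 500      # 80 tuples/page * 500 pages

# Assumed reduction factors, one per WHERE term (hypothetical statistics).
rf_join = Fraction(1, 40000)   # R.sid = S.sid
rf_bid = Fraction(1, 100)      # R.bid = 100
rf_rating = Fraction(1, 2)     # S.rating > 5

estimate = estimate_result_cardinality(
    [n_reserves, n_sailors], [rf_join, rf_bid, rf_rating]
)
print(int(estimate))
```

Under these assumed statistics the estimate is 500 tuples, out of a maximum of 4 billion for the cross product.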

Example
  Indexes: Sailors: B+ tree on rating, hash on sid. Reserves: B+ tree on bid.
  Query plan: project sname over the selections bid=100 and rating > 5, applied to Reserves joined with Sailors on sid=sid.
1. Pass 1:
  • Sailors: B+ tree matches rating>5, and is probably cheapest
    - However, if this selection retrieves many tuples and the index is unclustered, a sequential scan may be better
    - Still, the B+ tree plan is kept (tuples are in rating order)
  • Reserves: B+ tree on bid matches bid=100; cheapest
2. Pass 2:
  • Retrieve each plan retained from Pass 1
  • Consider how to join it as the outer relation with the (only) other relation
    - e.g., Reserves as outer: the hash index can be used to get Sailors tuples that satisfy sid = outer tuple's sid value

Query Optimization Recapped
• Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries)
• Two parts to optimizing a query:
  - Consider a set of alternative plans
    - Must prune the search space; typically, left-deep plans only
  - Must estimate the cost of each plan that is considered
    - Must estimate the size of the result and the cost for each plan node
    - Key issues: statistics, indexes, operator implementations

Plan Enumeration
• All single-relation plans are first enumerated.
  - All access paths considered, cheapest is chosen
  - Selections/projections considered as early as possible.
• Next, for each 1-relation plan, all ways of joining another relation (as inner) are considered.
• Next, for each 2-relation plan that is "retained," all ways of joining another relation (as inner) are considered, etc.
• At each level, for each subset of relations, only the best plan for each interesting order of tuples is "retained"

The Bigger Picture: Tuning
• We saw that indexes and optimization decisions were critical to performance
  - Many DBAs and consultants have made a living off understanding query workloads, data, and estimated intermediate result sizes
• Recent development: self-tuning DBMSs
  - SQL Server and DB2 "Index Wizards" take a query workload and try to find an optimal set of indices for it
  - "Adaptive query processing" tries to figure out where the optimizer's estimates "went wrong" and compensate for it

From Queries to Updates
• We've spent a lot of time talking about querying data
  - Yet updates are a really major part of many DBMS applications
• Particularly important: ensuring ACID properties
  - Atomicity: each operation looks atomic to the user
  - Consistency: each operation in isolation keeps the database in a consistent state (this is the responsibility of the user)
  - Isolation: should be able to understand what's going on by considering each separate transaction independently
  - Durability: updates stay in the DBMS!!!

What is a Transaction?
A transaction is a sequence of read and write operations on data items that logically functions as one unit of work:
• it should either be done entirely or not at all
• if it succeeds, the effects of write operations persist (commit); if it fails, no effects of write operations persist (abort)
• these guarantees are made despite concurrent activity in the system, and despite failures that may occur

How Things Can Go Awry
• Suppose we have a table of bank accounts which contains the balance of the account
• An ATM deposit of $50 to account # 1234 would be written as:
    update Accounts set balance = balance + $50 where account#= '1234';
• This reads and writes the account's balance
• What if two accountholders make deposits simultaneously from two ATMs?

Concurrent Deposits
This SQL update code is represented as a sequence of read and write operations on "data items" (which for now should be thought of as individual accounts):

  Deposit 1                  Deposit 2
  read(X.bal)                read(X.bal)
  X.bal := X.bal + $50       X.bal := X.bal + $10
  write(X.bal)               write(X.bal)

where X is the data item representing the account.

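A “Bad” Concurrent Execution
Only one "action" (e.g. a read or a write) can actually happen at a time, and we can interleave deposit operations in many ways:

  time   Deposit 1                  Deposit 2
   |     read(X.bal)
   |                                read(X.bal)
   |     X.bal := X.bal + $50
   |                                X.bal := X.bal + $10
   |     write(X.bal)
   v                                write(X.bal)

BAD! Both deposits read the same starting balance, so the second write overwrites the first and one deposit is lost.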

A “Good” Execution
• The previous execution would have been fine if the accounts were different (i.e. one were X and one were Y), i.e., the transactions were independent
• The following execution is a serial execution, and executes one transaction after the other:

  time   Deposit 1                  Deposit 2
   |     read(X.bal)
   |     X.bal := X.bal + $50
   |     write(X.bal)
   |                                read(X.bal)
   |                                X.bal := X.bal + $10
   v                                write(X.bal)

GOOD!
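The difference between the two executions can be replayed with a tiny schedule interpreter. This is a sketch under simple assumptions: one data item X, each transaction works on a private copy between its read and its write, and the step encoding `(txn, op, amount)` plus the starting balance of 100 are invented for the example. The interleaved schedule is the lost-update case in which both deposits read before either writes.

```python
def run_schedule(schedule, initial_balance):
    """Replay (txn, op, amount) steps against one data item X.
    'read' copies X into the transaction's private workspace,
    'add' changes the private copy, 'write' stores it back into X."""
    x_bal = initial_balance
    local = {}
    for txn, op, amount in schedule:
        if op == "read":
            local[txn] = x_bal
        elif op == "add":
            local[txn] += amount
        elif op == "write":
            x_bal = local[txn]
        else:
            raise ValueError(f"unknown operation: {op}")
    return x_bal

# Both deposits read before either one writes.
interleaved = [
    ("D1", "read", None), ("D2", "read", None),
    ("D1", "add", 50), ("D2", "add", 10),
    ("D1", "write", None), ("D2", "write", None),
]

# Serial: Deposit 1 runs to completion, then Deposit 2.
serial = [
    ("D1", "read", None), ("D1", "add", 50), ("D1", "write", None),
    ("D2", "read", None), ("D2", "add", 10), ("D2", "write", None),
]

bad_result = run_schedule(interleaved, 100)
good_result = run_schedule(serial, 100)
print(bad_result, good_result)
```

Starting from 100, the serial schedule ends at 160, while the interleaved one ends at 110: the $50 deposit has been overwritten.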

Good Executions
An execution is "good" if it is serial (transactions are executed atomically and consecutively) or serializable (i.e. equivalent to some serial execution):

  Deposit 1                  Deposit 3
  read(X.bal)
                             read(Y.bal)
  X.bal := X.bal + $50
                             Y.bal := Y.bal + $10
  write(X.bal)
                             write(Y.bal)

Equivalent to executing Deposit 1 then 3, or vice versa
• Why would we want to do this instead?

Atomicity
Problems can also occur if a crash occurs in the middle of executing a transaction:

  Transfer
  read(X.bal)
  read(Y.bal)
  X.bal = X.bal - $100
  Y.bal = Y.bal + $100
  CRASH

Need to guarantee that the write to X does not persist (ABORT)
• Default assumption if a transaction doesn't commit

Transactions in SQL
• A transaction begins when any SQL statement that queries the db begins.
• To end a transaction, the user issues a COMMIT or ROLLBACK statement.

  Transfer:
    UPDATE Accounts SET balance = balance - $100 WHERE account#= '1234';
    UPDATE Accounts SET balance = balance + $100 WHERE account#= '5678';
    COMMIT;
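The COMMIT/ROLLBACK behaviour of the transfer can be tried out with Python's built-in `sqlite3` module. This is a sketch, not the course's setup: the column is named `acct` because `account#` is not a plain SQLite identifier, the starting balances are made up, and the `fail_midway` flag is a hypothetical stand-in for a crash between the two updates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (acct TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)",
                 [("1234", 150), ("5678", 0)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move amount from src to dst as one unit of work."""
    try:
        # sqlite3 opens a transaction implicitly before the first UPDATE.
        conn.execute("UPDATE Accounts SET balance = balance - ? WHERE acct = ?",
                     (amount, src))
        if fail_midway:
            # Hypothetical failure injected between the two updates.
            raise RuntimeError("simulated failure between the two updates")
        conn.execute("UPDATE Accounts SET balance = balance + ? WHERE acct = ?",
                     (amount, dst))
        conn.commit()      # both writes persist together
        return True
    except Exception:
        conn.rollback()    # neither write persists
        return False

def balances(conn):
    return dict(conn.execute("SELECT acct, balance FROM Accounts ORDER BY acct"))

failed = transfer(conn, "1234", "5678", 100, fail_midway=True)
after_failure = balances(conn)
succeeded = transfer(conn, "1234", "5678", 100)
after_success = balances(conn)
print(after_failure, after_success)
```

After the failed attempt the debit from account 1234 is undone, so the balances are unchanged; after the successful attempt both updates are visible.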

Read-Only Transactions
• When a transaction only reads information, we have more freedom to let the transaction execute in parallel with other transactions.
• We signal this to the system by stating:
    SET TRANSACTION READ ONLY;
    SELECT * FROM Accounts WHERE account#='1234'; ...

Read-Write Transactions
• If we state "read-only", then the transaction cannot perform any updates:
    SET TRANSACTION READ ONLY;
    UPDATE Accounts SET balance = balance - $100 WHERE account#= '1234'; ...
  ILLEGAL!
• Instead, we must specify that the transaction may update (the default):
    SET TRANSACTION READ WRITE;
    update Accounts set balance = balance - $100 where account#= '1234'; ...

Dirty Reads
• Dirty data is data written by an uncommitted transaction; a dirty read is a read of dirty data (WR conflict)
• Sometimes we can tolerate dirty reads; other times we cannot:
  - e.g., if we wished to ensure balances never went negative in the transfer example, we should test that there is enough money first!

“Bad” Dirty Read

  EXEC SQL select balance into :bal from Accounts where account#='1234';
  if (bal > 100) {
    EXEC SQL update Accounts set balance = balance - $100 where account#= '1234';
    EXEC SQL update Accounts set balance = balance + $100 where account#= '5678';
  }
  EXEC SQL COMMIT;

If the initial read (the select into :bal) were dirty, the balance could become negative!

Acceptable Dirty Read
If we are just checking availability of an airline seat, a dirty read might be fine! (Why is that?)

  Reservation transaction:
    EXEC SQL select occupied into :occ from Flights where Num= '123' and date= and seat='23f';
    if (!occ) {
      EXEC SQL update Flights set occupied=true where Num= '123' and date= and seat='23f';
    } else {
      notify user that seat is unavailable
    }

Other Undesirable Phenomena
• Unrepeatable read: a transaction reads the same data item twice and gets different values (RW conflict)
• Phantom problem: a transaction retrieves a collection of tuples twice and sees different results

Phantom Problem Example
• T1: "find the students with best grades who take either cis550-f03 or cis570-f02"
• T2: "insert new entries for student #1234 in the Takes relation, with grade A for cis570-f02 and cis550-f03"
• Suppose that T1 consults all students in the Takes relation and finds the best grades for cis550-f03
• Then T2 executes, inserting the new student at the end of the relation, perhaps on a page not seen by T1
• T1 then completes, finding the students with best grades for cis570-f02 and now seeing student #1234

Isolation
• The problems we've seen are all related to isolation
• General rules of thumb w.r.t. isolation:
  - Fully serializable isolation is more expensive than "no isolation"
    - We can't do as many things concurrently (or we have to undo them frequently)
  - For performance, we generally want to specify the most relaxed isolation level that's acceptable
    - Note that we're "slightly" violating a correctness constraint to get performance!

Specifying Acceptable Isolation Levels
• The default isolation level is SERIALIZABLE (as for the transfer example).
• To signal to the system that a dirty read is acceptable:
    SET TRANSACTION READ WRITE ISOLATION LEVEL READ UNCOMMITTED;
• In addition, there are:
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

READ COMMITTED
• Forbids the reading of dirty (uncommitted) data, but allows a transaction T to issue the same query several times and get different answers
  - No value written by T can be modified until T completes
• For example, the Reservation example could also be READ COMMITTED; the transaction could repeatedly poll to see if the seat was available, hoping for a cancellation

REPEATABLE READ
• What it is NOT: a guarantee that the same query will get the same answer!
• However, if a tuple is retrieved once, it will be retrieved again if the query is repeated
  - For example, suppose Reservation were modified to retrieve all available seats
  - If a tuple were retrieved once, it would be retrieved again (but additional seats may also become available)

Implementing Isolation Levels
• One approach: use locking at some level (tuple, page, table, etc.):
  - each data item is either locked (in some mode, e.g. shared or exclusive) or is available (no lock)
  - an action on a data item can be executed if the transaction holds an appropriate lock
  - consider the granularity of locks: how big an item to lock
    - Larger granularity = fewer locking operations but more contention!
• Appropriate locks:
  - Before a read, a shared lock must be acquired
  - Before a write, an exclusive lock must be acquired

Lock Compatibility Matrix
Locks on a data item are granted based on a lock compatibility matrix:

                     Mode of data item
  Request mode    None    Shared    Exclusive
  Shared          Y       Y         N
  Exclusive       Y       N         N

When a transaction requests a lock, it must wait (block) until the lock is granted
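The matrix translates directly into a lookup table. The sketch below is only an illustration of the grant decision: the names `COMPATIBLE` and `can_grant` and the one-letter mode codes are invented here, and a real lock manager would also queue blocked requests and track which transaction holds what.

```python
# (requested mode, mode currently held on the item) -> grantable?
# "S" = shared, "X" = exclusive, None = item is not locked.
COMPATIBLE = {
    ("S", None): True, ("S", "S"): True,  ("S", "X"): False,
    ("X", None): True, ("X", "S"): False, ("X", "X"): False,
}

def can_grant(requested, held_modes):
    """held_modes: lock modes held on the item by OTHER transactions.
    The request is grantable only if it is compatible with all of them."""
    if not held_modes:
        return COMPATIBLE[(requested, None)]
    return all(COMPATIBLE[(requested, held)] for held in held_modes)

print(can_grant("S", ["S", "S"]))  # several readers can share an item
print(can_grant("X", ["S"]))       # a writer must wait for readers
```

A request that is not grantable is the case where the transaction waits (blocks) until the conflicting locks are released.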

Locks Prevent “Bad” Execution
If the system used locking, the first "bad" execution could have been avoided:

  Deposit 1                  Deposit 2
  xlock(X)
  read(X.bal)
                             {xlock(X) is not granted}
  X.bal := X.bal + $50
  write(X.bal)
  release(X)
                             xlock(X)
                             read(X.bal)
                             X.bal := X.bal + $10
                             write(X.bal)
                             release(X)

Lock Types and Read/Write Modes
• When we specify "read-only", the system only uses shared-mode locks
  - Any transaction that attempts to update will be illegal
• When we specify "read-write", the system may also acquire locks in exclusive mode
  - Obviously, we can still query in this mode

Isolation Levels and Locking
• READ UNCOMMITTED allows queries in the transaction to read data without acquiring any lock
  - For updates, exclusive locks must be obtained and held to the end of the transaction
• READ COMMITTED requires a read-lock to be obtained for all tuples touched by queries, but it releases the locks immediately after the read
  - Exclusive locks must be obtained for updates and held to the end of the transaction

Isolation Levels and Locking, cont.
• REPEATABLE READ places shared locks on tuples retrieved by queries and holds them until the end of the transaction
  - Exclusive locks must be obtained for updates and held to the end of the transaction
• SERIALIZABLE places shared locks on tuples retrieved by queries as well as on the index, and holds them until the end of the transaction
  - Exclusive locks must be obtained for updates and held to the end of the transaction
• Holding locks to the end of a transaction is called "strict" locking

Summary of Isolation Levels

  Level               Dirty Read    Unrepeatable Read    Phantoms
  READ UNCOMMITTED    Maybe         Maybe                Maybe
  READ COMMITTED      No            Maybe                Maybe
  REPEATABLE READ     No            No                   Maybe
  SERIALIZABLE        No            No                   No