Database Administration and Performance Tuning
Text/Reference Books:
Dennis Shasha and Philippe Bonnet: Database Tuning: Principles, Experiments, and Troubleshooting Techniques. Morgan Kaufmann Publishers, 2002 (released in June 2002). TEXT.
Dennis Shasha: Database Tuning: A Principled Approach. Prentice Hall, 1992. REFERENCE (a good reference if you cannot get the textbook).
Raghu Ramakrishnan and Johannes Gehrke: Database Management Systems, 3rd edition. McGraw-Hill, 2002.
Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom: Database Systems: The Complete Book. Prentice Hall, 2001.
G. J. Vaidyanatha, K. Deshpande and J. Kostelac: Oracle Performance Tuning 101. Osborne/McGraw-Hill, 2001. REFERENCE.
Jim Gray (ed.): The Benchmark Handbook for Database and Transaction Processing Systems. Morgan Kaufmann Publishers, 1991. REFERENCE.
Database Tuning Database Tuning is the activity of making a database application run more quickly. “More quickly” usually means higher throughput, though it may mean lower response time for time-critical applications.
[Figure: the layers of a database system and the people who work on them. Roles: Application Programmer (e.g., business analyst, data architect); Sophisticated Application Programmer (e.g., SAP admin); DBA, Tuner. Layers, top to bottom: Application; Query Processor; Indexes; Storage Subsystem; Concurrency Control; Recovery; Operating System; Hardware [Processor(s), Disk(s), Memory].]
Goals of the Course Appreciation of DBMS architecture Study the effect of various components on the performance of the systems Tuning principles Troubleshooting techniques for chasing down performance problems Hands-on experience in Tuning
Contents Basic Principles Tuning the guts Indexes Relational Systems Application Interface E-commerce Applications Data warehouse Applications Distributed Applications Troubleshooting
Tuning Principles Think globally, fix locally. Localizing the problems. Partitioning breaks bottlenecks (temporal and spatial). ONE part of the system limits the overall performance. Two approaches: Fix locally. Partition the LOAD, e.g., free list, lock contention due to long transactions. Partitioning in space/logical resources/time.
Tuning Principles Start-up costs are high; running costs are low Start-up costs include Disk access Data transfer Query processing System calls Reduce the number of start-ups
An example: Time = Seek Time + Rotational Delay + Transfer Time + Other. Rule of Thumb: Random I/O: Expensive. Sequential I/O: Much less. Ex: 1 KB block. Random I/O: 20 ms. Sequential I/O: 1 ms.
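As a rough illustration of the rule of thumb, the sketch below compares reading the same number of 1 KB blocks with random and with sequential I/O, using the 20 ms and 1 ms figures from the example. The function name and the block count are assumptions made for illustration only.

```python
RANDOM_IO_MS = 20.0      # per 1 KB block, figure from the example
SEQUENTIAL_IO_MS = 1.0   # per 1 KB block, figure from the example

def read_time_seconds(num_blocks, ms_per_block):
    """Total time to read num_blocks blocks at a fixed per-block cost."""
    return num_blocks * ms_per_block / 1000.0

blocks = 10_000  # hypothetical workload: 10,000 blocks of 1 KB each
random_s = read_time_seconds(blocks, RANDOM_IO_MS)
sequential_s = read_time_seconds(blocks, SEQUENTIAL_IO_MS)
print(f"random: {random_s:.0f} s, sequential: {sequential_s:.0f} s")
```

Under these assumed numbers the gap is a factor of 20, which is why reducing the number of random accesses is usually the first thing to look at.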
Tuning Principles Render onto server what is due onto Server Task allocation between the server and the application programs Factors: Relative computing resources of client, application servers and data server Should checking be done in the middle tier? Location of information The nature of tasks: interaction with screen?
Tuning Principles Be prepared for trade-offs Ex. Indices
Tuning Mindset
1. Set reasonable performance tuning goals
2. Measure and document current performance
3. Identify current system performance bottleneck
4. Identify current OS bottleneck
5. Tune the required components, e.g., application, DB, I/O, contention, OS, etc.
6. Track and exercise change-control procedures
7. Repeat steps 3 through 6 until the goal is met
Schema Refinement, Normalization, and Tuning
Design Steps The design steps: 1. Real-World 2. ER model 3. Relational Schema 4. Better relational Schema 5. Relational DBMS Step (3) to step (4) is based on a “design theory” for relations and is called “normalization”. It is important for two reasons: Automatic mappings from ER to relations may not produce the best relational design possible. Database designers may go directly from (1) to (3), in which case, the relational design can be really bad.
The Evils of Redundancy Redundancy is the root of several problems associated with relational schemas: redundant storage, insert/delete/update anomalies Consider relation obtained from Hourly_Emps: Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked) Notation: We will denote this relation schema by listing the attributes: SNLRWH This is really the set of attributes {S,N,L,R,W,H}. Sometimes, we will refer to all attributes of a relation by using the relation name. (e.g., Hourly_Emps for SNLRWH)
Example Problems due to R → W: Update anomaly: Can we change W in just the 1st tuple of SNLRWH? Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his rating? Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!
Refinements Integrity constraints, in particular functional dependencies, can be used to identify schemas with such problems and to suggest refinements. Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD). Decomposition should be used judiciously: Is there reason to decompose a relation? What problems (if any) does the decomposition cause?
Functional Dependencies (FDs) A functional dependency X → Y holds over relation R if, for every allowable instance r of R: t1 ∈ r, t2 ∈ r, π_X(t1) = π_X(t2) implies π_Y(t1) = π_Y(t2), i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.) K is a key for relation R if: 1. K determines all attributes of R. 2. For no proper subset of K is (1) true. If K satisfies only (1), then K is a superkey. K is a candidate key for R means that K → R. However, K → R does not require K to be minimal!
Example Consider relation Hourly_Emps: Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked). A key is an FD: ssn is the key, so S → SNLRWH. FDs give more detail than the mere assertion of a key: rating determines hrly_wages, so R → W.
Who Determines Keys/FDs? An FD is a statement about all allowable relations. Must be identified based on semantics of application. Given some allowable instance r1 of R, we can check if it violates some FD f, but we cannot tell if f holds over R! We can define a relation schema with a single key K. Then the only FDs asserted are K → A for every attribute A. Or, we can assert some FDs and deduce one or more keys or other FDs.
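The check on a single instance can be sketched as follows. This is a minimal illustration, assuming tuples are represented as Python dictionaries; the function name `violates_fd` and the sample rows are hypothetical. A result of False only says that this instance does not violate the FD, not that the FD holds over R.

```python
def violates_fd(rows, lhs, rhs):
    """Return True if two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y for the first row with this x value
        if seen.setdefault(x, y) != y:
            return True
    return False

# Hypothetical fragment of Hourly_Emps restricted to S, R, W.
r1 = [
    {"S": "s1", "R": 8, "W": 10},
    {"S": "s2", "R": 8, "W": 10},
    {"S": "s3", "R": 5, "W": 7},
]
r2 = r1 + [{"S": "s4", "R": 5, "W": 9}]

print(violates_fd(r1, ["R"], ["W"]))  # False: r1 does not violate R -> W
print(violates_fd(r2, ["R"], ["W"]))  # True: rating 5 maps to wages 7 and 9
```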
Reasoning About FDs Given some FDs, we can usually infer additional FDs: ssn → did, did → lot implies ssn → lot. An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold. F+ = closure of F is the set of all FDs that are implied by F. Armstrong’s Axioms (X, Y, Z are sets of attributes): Reflexivity: If Y ⊆ X, then X → Y. Augmentation: If X → Y, then XZ → YZ for any Z. Transitivity: If X → Y and Y → Z, then X → Z. These are sound and complete inference rules for FDs!
Reasoning About FDs (Cont.) Couple of additional rules (that follow from AA): Union: If X → Y and X → Z, then X → YZ. Decomposition: If X → YZ, then X → Y and X → Z. Proof of Union: X → Y (given); X → XY (augmentation using X); X → Z (given); XY → YZ (augmentation); X → YZ (transitivity).
Reasoning About FDs (Cont.) Example: Contracts(cid,sid,jid,did,pid,qty,value). C is the key: C → CSJDPQV. Project purchases each part using single contract: JP → C. Dept purchases at most one part from a supplier: SD → P. JP → C, C → CSJDPQV imply JP → CSJDPQV. SD → P implies SDJ → JP. SDJ → JP, JP → CSJDPQV imply SDJ → CSJDPQV.
Reasoning About FDs (Cont.) Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!) Typically, we just want to check if a given FD X → Y is in the closure of a set of FDs F. An efficient check: Compute attribute closure of X (denoted X+) wrt F: the set of all attributes A such that X → A is in F+. There is a linear time algorithm to compute this. Check if Y is in X+. Does F = {A → B, B → C, CD → E} imply A → E? i.e., is A → E in the closure F+? Equivalently, is E in A+?
Algorithm to Compute Attribute Closure Define Y+ = closure of Y. Basis: Y+ = Y. Induction: If X ⊆ Y+, and X → A is a given FD, then add A to Y+. End when Y+ cannot be changed. Then Y functionally determines all members of Y+, and no other attributes.
Example A → B, BC → D. A+ = AB. C+ = C. (AC)+ = ABCD. Thus, AC is a key.
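The basis and induction steps can be sketched in a few lines. This is an illustrative version that assumes single-letter attribute names, so that a string such as "BC" stands for the set {B, C}; the function name is hypothetical. It uses a simple repeat-until-stable loop rather than the linear time algorithm.

```python
def attribute_closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) pairs."""
    closure = set(attrs)                      # basis: Y+ = Y
    changed = True
    while changed:                            # end when Y+ cannot be changed
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)           # induction step
                changed = True
    return closure

fds = [("A", "B"), ("BC", "D")]
print(sorted(attribute_closure("A", fds)))    # ['A', 'B']
print(sorted(attribute_closure("C", fds)))    # ['C']
print(sorted(attribute_closure("AC", fds)))   # ['A', 'B', 'C', 'D']
```

Since (AC)+ contains every attribute while neither A+ nor C+ does, AC is a key for ABCD under these FDs.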
Finding All Implied FDs Motivation: Suppose we have a relation ABCD with some FDs F. If we decide to decompose ABCD into ABC and AD, what are the FDs for ABC, AD? Example: F = AB → C, C → D, D → A. It looks like just AB → C holds in ABC, but in fact C → A follows from F and applies to relation ABC. Problem is exponential in worst case. Algorithm to find F+: For each set of attributes X of R, compute X+.
Example F = AB → C, C → D, D → A. What FDs follow? A+ = A; B+ = B (nothing). C+ = ACD (add C → A). D+ = AD (nothing new). (AB)+ = ABCD (add AB → D; skip all supersets of AB). (BC)+ = ABCD (nothing new; skip all supersets of BC). (BD)+ = ABCD (add BD → C; skip all supersets of BD). (AC)+ = ACD; (AD)+ = AD; (CD)+ = ACD (nothing new). (ACD)+ = ACD (nothing new). All other sets contain AB, BC, or BD, so skip. Thus, the only interesting FDs that follow from F are: C → A, AB → D, BD → C.
Projection of set of FDs If R is decomposed into X, ..., the projection of F onto X (denoted F_X) is the set of FDs U → V in F+ (closure of F) such that U, V are in X. Using the same example, R1(ABC): AB → C, C → A. R2(AD): D → A.
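One way to compute a projection is to take the closure of every subset of the target attributes and keep only what falls inside the target. The sketch below does this and keeps only FDs whose left side is not a superset of an already emitted left side for the same right-hand attribute. The function names are hypothetical, attributes are assumed to be single letters, and the enumeration is exponential in the number of attributes.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def project_fds(target, fds):
    """FDs (lhs, attribute) implied by fds that mention only target attributes."""
    target = set(target)
    projected = []
    for size in range(1, len(target) + 1):          # small left sides first
        for combo in combinations(sorted(target), size):
            x = set(combo)
            for a in sorted((attribute_closure(x, fds) & target) - x):
                # skip X -> a when a proper subset of X already gives a
                if not any(lhs < x and rhs == a for lhs, rhs in projected):
                    projected.append((frozenset(x), a))
    return projected

F = [("AB", "C"), ("C", "D"), ("D", "A")]
print(project_fds("ABC", F))   # C -> A and AB -> C
print(project_fds("AD", F))    # D -> A
```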
[Figures omitted: "A BAD Relational Schema" and "An Improved Schema"]
What’s a Good Design? Three properties: No anomalies. Can reconstruct all original information. Ability to check all FDs within a single relation. Role of FDs in detecting redundancy: Consider a relation R with 3 attributes, ABC. No FDs hold: There is no redundancy here. Given A → B: Several tuples could have the same A value, and if so, they’ll all have the same B value!
Decomposition of a Relation Scheme Suppose that relation R contains attributes A1 ... An. A decomposition of R consists of replacing R by two or more relations such that: Each new relation scheme contains a subset of the attributes of R (and no attributes that do not appear in R), and Every attribute of R appears as an attribute of one of the new relations. Intuitively, decomposing R means we will store instances of the relation schemes produced by the decomposition, instead of instances of R. E.g., Can decompose SNLRWH into SNLRH and RW.
Example Decomposition Decompositions should be used only when needed. SNLRWH has FDs S → SNLRWH and R → W. W values are repeatedly associated with R values. Easiest way to fix this is to create a relation RW to store these associations, and to remove W from the main schema: i.e., we decompose SNLRWH into SNLRH and RW. The information to be stored consists of SNLRWH tuples. If we just store the projections of these tuples onto SNLRH and RW, are there any potential problems that we should be aware of?
Problems with Decompositions There are three potential problems to consider:
(1) Some queries become more expensive. e.g., How much did sailor Joe earn? (salary = W*H)
(2) Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation! Fortunately, not in the SNLRWH example.
(3) Checking some dependencies may require joining the instances of the decomposed relations.
Tradeoff: Must consider these issues vs. redundancy.
Lossless Join Decompositions Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F, “reassembling” X and Y will give R and nothing else. It is always true that reassembling X and Y gives exactly R or a superset of R. Definition extended to decomposition into 3 or more relations in a straightforward way. It is essential that all decompositions used to deal with redundancy be lossless! (Avoids Problem (2).)
More on Lossless Join The decomposition of R into X and Y is lossless-join wrt F if and only if the closure of F contains: X ∩ Y → X, or X ∩ Y → Y. In particular, the decomposition of R into UV and R - V is lossless-join if U → V holds over R.
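The test for a two-way decomposition reduces to one attribute closure: compute the closure of the common attributes and see whether it covers either side. A minimal sketch, assuming single-letter attribute names and hypothetical function names:

```python
def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def is_lossless(x, y, fds):
    """True if (X intersect Y) -> X or (X intersect Y) -> Y follows from fds."""
    x, y = set(x), set(y)
    common = attribute_closure(x & y, fds)
    return x <= common or y <= common

hourly_fds = [("S", "SNLRWH"), ("R", "W")]
print(is_lossless("SNLRH", "RW", hourly_fds))   # True: R -> RW
print(is_lossless("AB", "BC", [("A", "B")]))    # False: B determines neither side
```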
Dependency Preserving Decomposition Consider CSJDPQV, C is key, JP → C and SD → P. BCNF decomposition: CSJDQV and SDP. Problem: Checking JP → C requires a join! Dependency preserving decomposition (Intuitive): If R is decomposed into X, Y and Z, and we enforce the FDs that hold on X, on Y and on Z, then all FDs that were given to hold on R must also hold. (Avoids Problem (3).)
Dependency Preserving Decompositions (Cont.) Decomposition of R into X and Y is dependency preserving if (F_X ∪ F_Y)+ = F+, i.e., if we consider only dependencies in the closure F+ that can be checked in X without considering Y, and in Y without considering X, these imply all dependencies in F+. Important to consider F+, not F, in this definition: ABC, A → B, B → C, C → A, decomposed into AB and BC. Is this dependency preserving? Is C → A preserved? Dependency preserving does not imply lossless join: ABC, A → B, decomposed into AB and BC. And vice-versa! (Example?)
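Whether a single FD is preserved can be tested without materializing the projections F_X and F_Y. The sketch below uses a standard textbook procedure, which is not stated on the slide: grow a set from the left side, each time taking the closure of the part that lies inside one schema and keeping only the attributes of that schema. Function names are hypothetical and attributes are assumed to be single letters.

```python
def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def is_preserved(lhs, rhs, schemas, fds):
    """True if lhs -> rhs follows from the FDs checkable inside single schemas."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for schema in schemas:
            schema = set(schema)
            gained = attribute_closure(result & schema, fds) & schema
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result

cyc = [("A", "B"), ("B", "C"), ("C", "A")]
print(is_preserved("C", "A", ["AB", "BC"], cyc))             # True

contracts = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]
print(is_preserved("JP", "C", ["CSJDQV", "SDP"], contracts)) # False
```

The first result answers the question on the slide: C → A is preserved because C → B can be checked in BC and B → A in AB, both of which are in F+.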
Normal Forms Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed! If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.
Normal forms [Figure: nested sets within the universe of relations: 1NF ⊃ 2NF ⊃ 3NF ⊃ BCNF ⊃ 4NF ⊃ 5NF.]
Boyce-Codd Normal Form (BCNF) Reln R with FDs F is in BCNF if, for all X → A in F+: A ∈ X (called a trivial FD), or X contains a key for R. In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints. Why? Guarantees no redundancy due to FDs. Guarantees no insert/update/delete anomalies. Guarantees no loss of information. But … May destroy the ability to check FDs within a single relation.
Example Consider relation Beers(name, manf, manfAddr). FDs = name → manf, manf → manfAddr. Only key is name. manf → manfAddr violates BCNF with a left side unrelated to any key. Redundancy (every manf has the same manfAddr). Update anomalies (if manf moves, manfAddr must be changed in ALL its tuples). Deletion anomalies (deleting all beers produced by a particular manf will lose info on manf and manfAddr). Not in BCNF.
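For the relation as given (before any decomposition), it is enough to test the FDs in F rather than all of F+: each non-trivial FD must have a left side whose closure is the whole relation. A minimal sketch with hypothetical function names, using lists of attribute names so that multi-letter names work:

```python
def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def bcnf_violations(attrs, fds):
    """Non-trivial FDs in fds whose left side is not a superkey of attrs."""
    attrs = set(attrs)
    bad = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        superkey = attrs <= attribute_closure(lhs, fds)
        if not trivial and not superkey:
            bad.append((lhs, rhs))
    return bad

beers = ["name", "manf", "manfAddr"]
beer_fds = [(["name"], ["manf"]), (["manf"], ["manfAddr"])]
print(bcnf_violations(beers, beer_fds))   # [(['manf'], ['manfAddr'])]
```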
Third Normal Form (3NF) Reln R with FDs F is in 3NF if, for all X → A in F+: A ∈ X (called a trivial FD), or X contains a key for R, or A is part of some minimal key for R. If R is in BCNF, obviously in 3NF. If R is in 3NF, some redundancy is possible. It is a compromise, used when BCNF not achievable (e.g., no “good” decomp, or performance considerations).
What Does 3NF Achieve? If 3NF violated by X → A, one of the following holds: X is a subset of some key K: we store (X, A) pairs redundantly. X is not a proper subset of any key: there is a chain of FDs K → X → A, which means that we cannot associate an X value with a K value unless we also associate an A value with an X value. But: even if reln is in 3NF, these problems could arise. e.g., Reserves SBDC, S → C, C → S is in 3NF, but for each reservation of sailor S, same (S, C) pair is stored. Thus, 3NF is indeed a compromise relative to BCNF.
Decomposition into BCNF Consider relation R with FDs F. If X → Y violates BCNF: compute X+ (i.e., expand the FD to X → X+). Decompose R into (R - X+) ∪ X and X+. Find the FDs for the decomposed relations. Repeated application of this idea will give us a collection of relations that are in BCNF; lossless join decomposition, and guaranteed to terminate. In general, several dependencies may cause violation of BCNF. The order in which we “deal with” them could lead to very different sets of relations!
Example R(A, C, B, D, E), F = A → B, A → E, C → D. Since AC is the key, R is not in BCNF. Pick A → B for decomposition. Expand to the closure: A → BE (A+ = ABE). Decomposed relations: R1(A,B,E) and R2(A,C,D). Projected FDs (skipping a lot of work …): R1: A → B, A → E. R2: C → D.
Example (Cont) BCNF violations? For R1, A is key and all left sides are superkeys. For R2, AC is key, and C → D violates BCNF. Decompose R2 into R3(C,D) and R4(A,C). Resulting relations are all in BCNF: R1(A,B,E), R3(C,D), R4(A,C).
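The repeated split can be sketched recursively. This illustrative version finds a violation by trying every proper subset X of the current relation and comparing X+ (restricted to the relation) with X and with the whole relation, which avoids projecting FDs explicitly but is exponential in the number of attributes. The function names are hypothetical, attributes are assumed to be single letters, and a different search order could yield a different BCNF decomposition.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def find_violation(attrs, fds):
    """Return (X, X+ within attrs) for some X that is neither closed nor a superkey."""
    for size in range(1, len(attrs)):
        for combo in combinations(sorted(attrs), size):
            x = set(combo)
            closure = attribute_closure(x, fds) & attrs
            if closure != x and closure != attrs:
                return x, closure
    return None

def bcnf_decompose(attrs, fds):
    attrs = set(attrs)
    violation = find_violation(attrs, fds)
    if violation is None:
        return [attrs]                         # already in BCNF
    x, closure = violation
    # split into X+ and (R - X+) union X, then recurse on both parts
    return bcnf_decompose(closure, fds) + bcnf_decompose((attrs - closure) | x, fds)

F = [("A", "B"), ("A", "E"), ("C", "D")]
for relation in bcnf_decompose("ABCDE", F):
    print("".join(sorted(relation)))           # ABE, then CD, then AC
```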
BCNF and Dependency Preservation The example decomposition is dependency preserving! In general, there may not be a dependency preserving decomposition into BCNF. e.g., CSZ, CS → Z, Z → C. Can’t decompose while preserving 1st FD; not in BCNF.
Decomposition into 3NF Obviously, the algorithm for lossless join decomp into BCNF can be used to obtain a lossless join decomp (not necessarily dependency preserving) into 3NF (typically, can stop earlier). There exists an algorithm that guarantees a lossless-join and dependency preserving decomp in 3NF. No such algorithm for BCNF!
Decomposition into 3NF (Cont) To ensure dependency preservation, one idea: If X → Y is not preserved, add relation XY. Problem is that XY may violate 3NF! e.g., consider the addition of CJP to “preserve” JP → C. What if we also have J → C? Refinement: Instead of the given set of FDs F, use a minimal cover for F.
Minimal Cover for a Set of FDs Minimal cover G for a set of FDs F: Closure of F = closure of G. Right hand side of each FD in G is a single attribute. If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes. Intuitively, every FD in G is needed, and “as small as possible” in order to get the same closure as F. e.g., A → B, ABCD → E, EF → GH, ACDF → EG has the following minimal cover: A → B, ACD → E, EF → G and EF → H. M.C. → Lossless-Join, Dep. Pres. Decomp!!!
Determining a minimal cover of F
1. Obtain a collection G of equivalent FDs with a single attribute on the right side (decomposition axiom).
2. For each FD in G, check each attribute in the LHS to see if it can be deleted while preserving equivalence to F+.
3. Check each remaining FD in G to see if it can be deleted while preserving equivalence to F+.
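The three steps can be sketched as follows, using attribute closure for both equivalence checks. Function names are hypothetical and attributes are assumed to be single letters. A minimal cover is not unique in general, so the result can depend on the order in which attributes and FDs are tried.

```python
def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def minimal_cover(fds):
    # Step 1: one attribute on each right side.
    g = [(frozenset(lhs), a) for lhs, rhs in fds for a in rhs]
    # Step 2: drop LHS attributes that are not needed.
    for i, (lhs, a) in enumerate(g):
        for attr in sorted(lhs):
            reduced = lhs - {attr}
            if a in attribute_closure(reduced, g):
                lhs = reduced
                g[i] = (lhs, a)
    g = list(dict.fromkeys(g))                 # remove duplicates, keep order
    # Step 3: drop FDs implied by the remaining ones.
    for fd in list(g):
        rest = [other for other in g if other != fd]
        if fd[1] in attribute_closure(fd[0], rest):
            g = rest
    return g

F = [("A", "B"), ("ABCD", "E"), ("EF", "GH"), ("ACDF", "EG")]
for lhs, a in minimal_cover(F):
    print("".join(sorted(lhs)), "->", a)       # A -> B, ACD -> E, EF -> G, EF -> H
```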
Dependency Preserving Decomp into 3NF Let R be a relation, F a set of FDs that is a minimal cover, and R1, …, Rn a lossless-join decomp of R. Suppose each Ri is in 3NF, and let Fi denote the projection of F onto the attributes of Ri. Let N be the dependencies in F that are not preserved. For each FD X → A in N, create a relation XA and add it to the decomposition of R.
Summary of Schema Refinement If a relation is in BCNF, it is free of redundancies that can be detected using FDs. Thus, trying to ensure that all relations are in BCNF is a good heuristic. If a relation is not in BCNF, we can try to decompose it into a collection of BCNF relations. Must consider whether all FDs are preserved. If a lossless-join, dependency preserving decomposition into BCNF is not possible (or unsuitable, given typical queries), should consider decomposition into 3NF. Decompositions should be carried out and/or re-examined while keeping performance requirements in mind.
Tuning Relational Systems
Denormalizing -- data Settings:
lineitem (L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT);
region (R_REGIONKEY, R_NAME, R_COMMENT);
nation (N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT);
supplier (S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT);
600000 rows in lineitem, 25 nations, 5 regions, 500 suppliers
Denormalizing -- transactions lineitemdenormalized ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT, L_REGIONNAME); 600000 rows in lineitemdenormalized Cold Buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Queries on Normalized vs. Denormalized Schemas
select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, R_NAME
from LINEITEM, REGION, SUPPLIER, NATION
where L_SUPPKEY = S_SUPPKEY and S_NATIONKEY = N_NATIONKEY and N_REGIONKEY = R_REGIONKEY and R_NAME = 'EUROPE';
select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME
from LINEITEMDENORMALIZED
where L_REGIONNAME = 'EUROPE';
Denormalization TPC-H schema. Query: find all lineitems whose supplier is in Europe. With a normalized schema this query is a 4-way join. If we denormalize lineitem and add the name of the region for each lineitem (foreign key denormalization), throughput improves 30%.
Schema Tuning Rule of Thumb: If ABC is normalized, and AB and AC are also normalized, then use ABC, unless: Queries very rarely access ABC together, but access AB or AC on their own (80% of the time). Attribute B or C values are large.
Example Schema 1: R1(bond_ID, issue_date, maturity, …) R2(bond_ID, date, price). Schema 2: R1(bond_ID, issue_date, maturity, today_price, yesterday_price, …, 10dayago_price).
CS5226: Indexing and Index Tuning
When Change is the Only Constant CPU Memory Speed and Size Harddisk Speed and Size Bandwidth
Moore’s Law being proved... In fact, Moore’s law will operate for many more years to come: CPUs will get faster, disks will get bigger, and communication speeds will increase. Those who started their careers early will probably recall some of these numbers from the time when the magnetic disk was an immature, just emerging technology. The 2nd column here provides numbers for an entry-level workstation. One can easily get a server with 4 CPUs, 4 gigabytes of memory, and plenty of hard disk space for less than 50K. The increase in speed is a few orders of magnitude. Two interesting points to note are the disk/memory ratio and the time taken to scan the full disk. The numbers here simply tell us that the disk is getting bigger, and sequential scanning can still be very time consuming. The ratio tells us that while it may be possible to have terabytes of memory in the near future, it may still NOT be big enough to store the whole database! Hence a GOOD indexing structure is required!
Improvement in Performance [Figure: relative performance from 1980 to 2000 on a log scale (1 to 10000): CPU (60%/yr), DRAM (10%/yr), Disk (5%/yr).] Here, we see the different rates of increase in speed. Hence indexing
Indexing Single dimensional Indexing Multi-dimensional Indexing High-dimensional Indexing Indexing for advanced applications
Single Record and Range Searches Single record retrievals: “Find student name whose matric# = 921000Y13”. Range queries: “Find all students with cap > 3.0”. Sequentially scanning the file is costly. If data is in a sorted file, do binary search to find the first such student, then scan to find others. Cost of binary search can still be quite high.
Indexes An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for an index on the relation. Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). e.g., consider Student(matric#, name, addr, cap), the key is matric#, but the search key can be matric#, name, addr, cap or any combination of them.
Simple Index File (Data File Sorted) [Figure: dense index over a sequential file. The file blocks hold keys (10, 20), (30, 40), (50, 60), (70, 80), (90, 100); the index has one entry per key: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120.]
Simple Index File (Cont) [Figure: sparse index over the same sequential file, with one entry per file block: 10, 30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230.]
Simple Index File (Cont) [Figure: sparse 2nd-level index (10, 90, 170, 250, 330, 410, 490, 570) built over the sparse first-level index (10, 30, 50, 70, 90, 110, ...), which in turn points into the sequential file.]
Secondary indexes [Figure: the file is ordered on its sequence field, so the secondary key is unordered across blocks: (30, 50), (20, 70), (80, 40), (100, 10), (90, 60). A sparse index with one entry per block (30, 20, 80, 100, 90, ...) does not make sense!]
Secondary indexes [Figure: the same file, blocks (30, 50), (20, 70), (80, 40), (100, 10), (90, 60). The first index level is a dense index with one entry per key in sorted order (10, 20, 30, 40, 50, 60, 70, ...); above it sits a sparse high level (10, 50, 90, ...).]
Conventional indexes Advantages: - Simple - Index is sequential file good for scans Disadvantages: - Inserts expensive, and/or - Lose sequentiality & balance
Example [Figure: a sequential index with entries 10, 20, 30, 40, 50, 60, 70, 80, 90 stored contiguously, followed by free space. Later inserts (39, 31, 35, 36, 32, 38, 34, 33) go to an overflow area that is not sequential.]
Tree-Structured Indexing Tree-structured indexing techniques support both range searches and equality searches. The index file may still be quite large. But we can apply the idea repeatedly! [Figure: index levels above the data pages.]
B+ Tree: The Most Widely Used Index Height-balanced. Insert/delete at log_F N cost (F = fanout, N = # leaf pages). Grow and shrink dynamically. Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. “next-leaf-pointer” to chain up the leaf nodes. Data entries at leaf are sorted.
Example B+ Tree Each node can hold 4 entries (order = 2). [Figure: root [17]; index nodes [5, 13] and [24, 30]; leaves [2, 3] [5, 7, 8] [14, 16] [19, 20, 22] [24, 27, 29] [33, 34, 38, 39].]
Node structure Non-leaf nodes: P1 K1 P2 K2 ... Pm Km P (pointers P alternate with keys K; each key and pointer pair is an index entry). Leaf nodes: P1 K1 P2 K2 ... Pm Km, followed by a pointer to the next leaf node.
Searching in B+ Tree Search begins at root, and key comparisons direct it to a leaf (as in ISAM). Search for 5, 15, all data entries >= 24 ... [Figure: root [13, 17, 24, 30] over leaves [2, 3, 5, 7] [14, 16] [19, 20, 22] [24, 27, 29] [33, 34, 38, 39].] Based on the search for 15*, we know it is not in the tree!
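The three searches on the slide can be traced with a small sketch. The classes and function names here are hypothetical, and the leaf contents are assumed to be those of the example tree before 8 is inserted. A key equal to a separator is routed to the right child, which matches how 24 sits in the leaf to the right of separator 24.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None          # next-leaf pointer

class Index:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, key):
    """Follow key comparisons from the root down to a leaf."""
    while isinstance(node, Index):
        node = node.children[bisect_right(node.keys, key)]
    return node

def contains(root, key):
    return key in find_leaf(root, key).keys

def range_from(root, low):
    """All data entries >= low, found by one descent plus a leaf scan."""
    leaf, out = find_leaf(root, low), []
    while leaf is not None:
        out.extend(k for k in leaf.keys if k >= low)
        leaf = leaf.next
    return out

leaves = [Leaf([2, 3, 5, 7]), Leaf([14, 16]), Leaf([19, 20, 22]),
          Leaf([24, 27, 29]), Leaf([33, 34, 38, 39])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Index([13, 17, 24, 30], leaves)

print(contains(root, 5))      # True
print(contains(root, 15))     # False: the leaf [14, 16] has no 15
print(range_from(root, 24))   # [24, 27, 29, 33, 34, 38, 39]
```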
B+-Tree Scalability Typical order: 100. Typical fill-factor: 67%. Average fanout = 133. Typical capacities (root at Level 1, and has 133 entries): Level 5: 133^4 = 312,900,721 records. Level 4: 133^3 = 2,352,637 records. Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes. Level 2 = 133 pages = 1 Mbyte. Level 3 = 17,689 pages = 133 MBytes.
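The capacity figures are plain powers of the fanout and can be recomputed directly. The sketch below takes the fanout of 133 and the 8 KB page size from the slide; the function name is hypothetical. The sizes it prints are exact products, so they differ slightly from the rounded MByte figures quoted on the slide.

```python
FANOUT = 133    # average fanout quoted on the slide
PAGE_KB = 8     # page size quoted on the slide

def pages_at_level(level):
    """Number of index pages at a level, with the root at level 1."""
    return FANOUT ** (level - 1)

for level in (1, 2, 3):
    pages = pages_at_level(level)
    print(f"Level {level}: {pages:,} pages = {pages * PAGE_KB:,} KB")

print(f"133^3 = {FANOUT ** 3:,} records")
print(f"133^4 = {FANOUT ** 4:,} records")
```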
A Note on “Order” Order (d) concept replaced by physical space criterion in practice (“at least half-full”). Index pages can typically hold many more entries than leaf pages. Variable sized records and search keys mean different nodes will contain different numbers of entries. Even with fixed length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries.
Inserting a Data Entry into a B+ Tree Find correct leaf L. Put data entry onto L. If L has enough space, done! Else, must split L (into L and a new node L2): redistribute entries evenly, copy up middle key, and insert index entry pointing to L2 into parent of L. This can happen recursively. To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height. Tree growth: gets wider or one level taller at top.
Inserting 7 & 8 into Example B+ Tree [Figure: before the insert, the root holds 13, 17, 24, 30 over leaves [2, 3, 5, 7] [14, 16] [19, 20, 22] [24, 27, 29] [33, 34, 38, 39]. Inserting 8 splits the first leaf into [2, 3] and [5, 7, 8]; the entry to be inserted in the parent is 5, giving 5, 13, 17, 24, 30.] Note that 5 is copied up and continues to appear in the leaf. Observe how minimum occupancy is guaranteed in both leaf and index page splits.
Insertion (Cont) [Figure: the overfull index node 5, 13, 17, 24, 30 splits into [5, 13] and [24, 30], and 17 is the entry to be inserted in the parent.] Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split. Note the difference between copy-up and push-up; be sure you understand the reasons for this.
Example B+ Tree After Inserting 8 [Figure: root [17]; index nodes [5, 13] and [24, 30]; leaves [2, 3] [5, 7, 8] [14, 16] [19, 20, 22] [24, 27, 29] [33, 34, 38, 39].] Notice that root was split, leading to increase in height. In this example, we can avoid splitting by re-distributing entries; however, this is usually not done in practice. Why?
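The difference between copy-up and push-up is easiest to see in code. The following is a minimal sketch of insertion only, assuming order d = 2, keys without associated records, and no redistribution with siblings; the class and function names are hypothetical. It starts from the example tree and inserts 8.

```python
from bisect import bisect_right, insort

ORDER = 2  # d: a node holds at most 2d keys

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children      # None for a leaf

def _insert(node, key):
    """Insert key below node; return (separator, new right node) if node split."""
    if node.children is None:                       # leaf
        insort(node.keys, key)
        if len(node.keys) <= 2 * ORDER:
            return None
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])
        node.keys = node.keys[:mid]
        return right.keys[0], right                 # copy up: key stays in the leaf
    idx = bisect_right(node.keys, key)
    split = _insert(node.children[idx], key)
    if split is None:
        return None
    sep, new_child = split
    node.keys.insert(idx, sep)
    node.children.insert(idx + 1, new_child)
    if len(node.keys) <= 2 * ORDER:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]                             # push up: key leaves this level
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return up, right

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])               # root split: tree grows taller

root = Node([13, 17, 24, 30], [Node([2, 3, 5, 7]), Node([14, 16]),
            Node([19, 20, 22]), Node([24, 27, 29]), Node([33, 34, 38, 39])])
root = insert(root, 8)
print(root.keys, [child.keys for child in root.children])
print([leaf.keys for leaf in root.children[0].children])
```

After the insert, 5 appears both in the index node and in the leaf [5, 7, 8], while 17 appears only in the new root.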
Deleting a Data Entry from a B+ Tree Start at root, find leaf L where entry belongs. Remove the entry. If L is at least half-full, done! If L has only d-1 entries: try to re-distribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height.
Example Tree After (Inserting 8, Then) Deleting 19 [Figure: root [17]; index nodes [5, 13] and [24, 30]; leaves [2, 3] [5, 7, 8] [14, 16] [20, 22] [24, 27, 29] [33, 34, 38, 39].] Deleting 19 is easy.
Example Tree After Deleting 20 ... [Figure: root [17]; index nodes [5, 13] and [27, 30]; leaves [2, 3] [5, 7, 8] [14, 16] [22, 24] [27, 29] [33, 34, 38, 39].] Deleting 20 is done with re-distribution. Notice how middle key is copied up.
... And Then Deleting 24 Must merge. Observe “toss” of index entry (on right), and “pull down” of index entry (below). [Figure, right: index node [30] over the merged leaf [22, 27, 29] and the leaf [33, 34, 38, 39]. Figure, below: root [5, 13, 17, 30] over leaves [2, 3] [5, 7, 8] [14, 16] [22, 27, 29] [33, 34, 38, 39].]
Example of Non-leaf Re-distribution (Delete 24) [Figure: tree during deletion of 24: root [22]; left child [5, 13, 17, 20]; right child [30]; leaves [2, 3] [5, 7, 8] [14, 16] [17, 18] [20, 21] [22, 27, 29] [33, 34, 38, 39].] In contrast to previous example, can re-distribute entry from left child of root to right child.
After Re-distribution Intuitively, entries are re-distributed by “pushing through” the splitting entry in the parent node. It suffices to re-distribute index entry with key 20; we’ve re-distributed 17 as well for illustration. [Figure: root [17]; index nodes [5, 13] and [20, 22, 30]; leaves [2, 3] [5, 7, 8] [14, 16] [17, 18] [20, 21] [22, 27, 29] [33, 34, 38, 39].]
Index Classification Primary vs. secondary: If search key contains primary key, then called primary index. Unique index: Search key contains a candidate key. Clustered vs. unclustered: If order of data records is the same as, or `close to’, order of data entries, then called clustered index. A file can be clustered on at most one search key. Cost of retrieving data records through index varies greatly based on whether index is clustered or not!
Clustered vs. Unclustered Index Suppose the data file is unsorted. To build clustered index, first sort the data file (with some free space on each page for future inserts). Overflow pages may be needed for inserts. (Thus, order of data recs is “close to”, but not identical to, the sort order.) [Figure: CLUSTERED and UNCLUSTERED cases side by side. In each, index entries direct the search for data entries (index file), and data entries point to data records (data file).]
Index Classification (Cont.) Dense vs. Sparse: If there is at least one data entry per search key value (in some data record), then dense. Every sparse index is clustered! Sparse indexes are smaller.
Data file (name, age, sal): (Ashby, 25, 3000), (Basu, 33, 4003), (Bristow, 30, 2007), (Cass, 50, 5004), (Daniels, 22, 6003), (Jones, 40, 6003), (Smith, 44, 3000), (Tracy, 44, 5004)
Sparse index on Name: Ashby; Cass; Smith
Dense index on Age: 22; 25; 30; 33; 40; 44; 44; 50
Index Classification (Cont.) Composite Search Keys: Search on a combination of fields. Equality query: Every field value is equal to a constant value. E.g., wrt <sal,age> index: age=20 & sal=75. Range query: Some field value is not a constant. E.g.: age=20; or age=20 & sal > 10. Data entries in index sorted by search key to support range queries: lexicographic order, or spatial order. Examples of composite key indexes using lexicographic order:
Data records sorted by name (name, age, sal): (bob, 12, 10), (cal, 11, 80), (joe, 12, 20), (sue, 13, 75)
<age, sal> data entries: 11,80; 12,10; 12,20; 13,75
<age> data entries: 11; 12; 12; 13
<sal, age> data entries: 10,12; 20,12; 75,13; 80,11
<sal> data entries: 10; 20; 75; 80
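Lexicographic order on a composite key behaves like tuple ordering, so the example indexes can be imitated with sorted lists of tuples and binary search. This is only a sketch of the ordering idea, not of a real index structure; the function names are hypothetical and the records are the four from the example.

```python
from bisect import bisect_left, bisect_right

records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries sorted lexicographically by the composite search key.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)

def equality(index, key):
    """Entries whose every field equals the given constants."""
    return index[bisect_left(index, key):bisect_right(index, key)]

def prefix_range(index, first, low_second):
    """Entries with first field == first and second field > low_second."""
    lo = bisect_right(index, (first, low_second))
    hi = bisect_right(index, (first, float("inf")))
    return index[lo:hi]

print(age_sal)                        # [(11, 80), (12, 10), (12, 20), (13, 75)]
print(sal_age)                        # [(10, 12), (20, 12), (75, 13), (80, 11)]
print(equality(sal_age, (75, 13)))    # [(75, 13)]
print(prefix_range(age_sal, 12, 10))  # age = 12 and sal > 10 -> [(12, 20)]
```

A query such as age = 12 and sal > 10 maps to one contiguous run of the <age, sal> list, whereas in the <sal, age> list the matching entries would not be adjacent in general.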
Summary Tree-structured indexes are ideal for range-searches, also good for equality searches. B+ tree is a dynamic structure. Inserts/deletes leave tree height-balanced; log_F N cost. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file.
Summary (Cont.) Typically, 67% occupancy on average. Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully. If data entries are data records, splits can change rids! Indexes can be classified as clustered vs. unclustered, primary vs. secondary, dense vs. sparse, and simple vs. composite.
New Database Challenges More Complex applications. Eg. GIS, OLAP, Mobile Hardware Advances: Big and Small Web and Internet. Eg XML Not only have the hardware advanced, database systems have too evolved to handle more complex applications such as GIS and OLAP applications. They have to scale up for megeservers and scale down for appliances such as embedded systems and PDAs. The web and internet present another dimension of problems – the web itself is ONE huge database with semistructured data. All these have invalidated some assumptions and design decisions in existing DBMS technology. And whenever, a new application emerges, many issues have to be re-examined. One of which is the design of indexing structure.
New Index? The most effective mechanism to prune the search. Order of magnitude of difference between I/O and CPU cost. Increasing data size. Increasing complexity of data and search. Why new indexes? An index is the most effective mechanism to prune the search, and I/O (apart from main-memory databases) is still the dominant cost factor. As mentioned, the data size is forever increasing, and so is the complexity of the search. However, with all these changes, one thing remains unchanged!
Something that Transcends Time…B+-tree The B+-tree was the indexing structure proposed to solve the disk I/O and memory problem in the 70s, and it is still the index used today to reduce I/O cost and to avoid bringing in unnecessary pages of unwanted data! It has NOT changed, but the data types and applications built on top of B+-trees have increased over the years. In fact, if the B+-tree were a real tree, it would have grown many other fruits without changing the shape of the tree! Why B-trees? Two reasons: 1) The B-tree is widely deployed in commercial database systems, and most concurrency control and buffer management strategies can be reused. 2) Solutions based on B-trees are easy to implement and understand.
Success Factors Robustness Concurrency Performance Scalability Fundamentals of Building DBMS The success of the B+-tree is NOT accidental! The B+-tree has superb properties with respect to the 4 criteria in constructing database systems. The index is simple in design and robust; it grows and shrinks dynamically. With appropriate concurrency control, it supports a high degree of concurrency. It is efficient for both exact match and range queries, and it is not expensive to maintain! Most importantly, it is scalable: it can support a huge amount of data without choking the system to death.
B-tree B+-trees forever? Can the B+-tree being a single-dimensional index be used for emerging applications such as: Spatial databases High-dimensional databases Temporal databases Main memory databases String databases Genomic/sequence databases ...
Multi-Dimensional Indexing (CS5226)
What is a Spatial Database? A Spatial DBMS is a DBMS It offers spatial data types/data models/ query language Support spatial properties/operations It supports spatial data types in its implementation Support spatial indexing, algorithms for spatial selection and join based on spatial relationships
Applications Geographical Information Systems (GIS): dealing extensively with spatial data. Eg. Map system, resource management systems Computer-aided design and manufacturing (CAD/CAM): dealing mainly with surface data. Eg. design systems. Multimedia databases: storing and manipulating characteristics of MM objects.
Spatial Data Examples of non-spatial data: Names, zip-codes … Examples of spatial data: Census data; NASA satellite imagery; Weather and climate data; Rivers, farms, ecological impact; Medical imaging
Spatial Databases Spatial Objects: Data Types: Points: spatial location: eg. feature vectors Lines: set of points: eg. roads, coastal line Polygons: set of points: eg. Buildings, lakes Data Types: Point: a spatial data object with no extension no size or volume Region:a spatial object with a location and a boundary that defines the extension
Spatial Relationships Topological relationships: adjacent, within/contain, intersect, disjoint, etc Direction relationships: Above, below, north-of, south-of,etc Metric relationships: “distance < 100 km” And operations to express the relationships
Spatial Queries Range queries: “Find all cities within 50 km of Madras?” Nearest neighbor queries: “Find the 5 cities that are nearest to Madras?” “Find the 10 images most similar to this image?” Spatial join queries: “Find pairs of cities within 200 km of each other?’
More Examples Window Range Query: “Find me data points that satisfy the conditions x1 <A1 < y1, x2 <A2 <y2…?” Spatial Query: “Find me buildings that are adjacent to the Railway Stations?” Nearest Neighbour Query: “Find me the nearest fire station to Clementi Ave. 3?”
Spatial Representation Raster model: Vector model:
Representation of Spatial Objects Testing on real objects is expensive Minimum Bounding Box/Rectangle How to test if 2-d rectangles intersect? y2 y1 x1 x2 representation testing
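The rectangle test asked about above can be sketched as follows. This is a minimal sketch, assuming each mbr is stored as its lower-left corner (x1, y1) and upper-right corner (x2, y2), and that rectangles that merely touch count as intersecting; the `MBR` type and `intersects` name are illustrative, not from any particular system.

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    x1: float  # left
    y1: float  # bottom
    x2: float  # right
    y2: float  # top


def intersects(a: MBR, b: MBR) -> bool:
    """Two 2-d rectangles intersect iff their extents overlap on both axes."""
    overlap_x = a.x1 <= b.x2 and b.x1 <= a.x2
    overlap_y = a.y1 <= b.y2 and b.y1 <= a.y2
    return overlap_x and overlap_y
```

The test costs four comparisons regardless of how complex the real objects are, which is why it is applied to the mbrs before any exact geometry test.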
Query Operation & Spatial Index Filter Step: Select the objects whose mbr satisfies the spatial predicate. Traverse the index and apply the spatial test on the mbrs indexed by the index. Output: set of candidate oids (may include false positives). Refinement Step: Spatial test is done on the actual geometries of the objects whose mbr passed the filter step. Costly operation. Executed only on a limited number of objects.
Why spatial index methods (SAMs)? B-tree & hash tables Guarantee the number of I/O operations is respectively logarithmic and constant with respect to the collection’s size Index a collection on a key Rely on a total order on the key domain, the order of natural numbers, or the lexicographic order on strings There is no such total order for geometric objects with spatial extent SAMs were designed to try as much as possible to preserve spatial object proximity
Approaches to the Design of SAMs Space-Based structures: Partition the embedding space into rectangular cells. Independent from the distribution of the objects. Objects are mapped to the cells based on some geometric criterion. Eg: Grid file, Buddy-tree, KDB-tree. Data-Based structures: Organize by partitioning the set of objects (as opposed to the embedding space) based on spatial proximity, such that each group can fit into a page. Adapt to the objects' distribution in the embedding space. Eg. R-tree, R* tree, R+ tree. Mapping
The R-tree A leaf entry is a pair (mbr, oid). A non-leaf node contains an array of node entries. The number of entries is between m (fill-factor) and M. For each entry (mbr, nodeid) in a non-leaf node N, mbr is the directory rectangle of a child node of N, whose page address is nodeid. All leaves are at the same level. An object appears in one, and only one, of the tree leaves
R-trees Height-balanced tree. Problem: overlap of covering rectangles. The R-tree is the most well-known index for handling spatial data. Although it is a multi-dimensional index, the design borrows many fundamental principles from the B-tree. It is a height-balanced tree. An entry in an internal node contains a bounding box that bounds all objects in the subtree, and a pointer to the subtree. The bounding boxes are allowed to overlap, and minimizing this overlap is one of the optimization problems of the R-tree.
Insertion in the R-Tree Algorithm ChooseSubtree CS1 [Initialize] Set N to be the root node CS2 [Leaf check] If N is a leaf, return N else [Choose subtree] Choose the entry in N whose rectangle needs least area enlargement to include the new data. Resolve ties by choosing the entry with the rectangle of smallest area end CS3 [Descend until a leaf is reached] Set N to be the childnode pointed to by the childpointer of the chosen entry. Repeat from CS2
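Steps CS1 to CS3 can be sketched as below. This is a minimal in-memory sketch, assuming rectangles are (x1, y1, x2, y2) tuples and that a non-leaf entry is a (rectangle, child node) pair; the `Node` class and the helper names are illustrative assumptions, not a real R-tree library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(r: Rect, new: Rect) -> float:
    """Extra area r needs in order to include new."""
    return area(union(r, new)) - area(r)


@dataclass
class Node:
    is_leaf: bool
    # Leaf entries: (rect, oid). Non-leaf entries: (rect, child Node).
    entries: List[tuple] = field(default_factory=list)


def choose_subtree(root: Node, new_rect: Rect) -> Node:
    n = root                                   # CS1: start at the root
    while not n.is_leaf:                       # CS2: stop at a leaf
        # Least area enlargement first; ties broken by smallest area.
        _, child = min(
            n.entries,
            key=lambda e: (enlargement(e[0], new_rect), area(e[0])),
        )
        n = child                              # CS3: descend
    return n
```

Sorting on the pair (enlargement, area) is one compact way to express the tie-breaking rule; adjusting the chosen entry's rectangle after the insert is a separate step not shown here.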
Splitting Strategies in the R-Tree Three versions all are designed to minimize area covered by two covering rectangles resulting from split Exponential find the area with global minimum CPU cost is too high Quadratic and Linear find approximation Quadratic performs much better than linear
Splitting Strategies in the R-Tree Algorithm QuadraticSplit [Divide a set of M+1 index entries into two groups] QS1 [Pick first entry for each group] Invoke PickSeeds to choose two entries, each to be the first entry of a group. QS2 [Check if done] Repeat DistributeEntry until all entries are distributed or one of the two groups has M-m+1 entries (so that the other group has m entries). QS3 [Select entry to assign] If entries remain, assign them to the other group so that it has the minimum number m required
Splitting in the R-Tree Algorithm PickSeeds [Choose two entries to be the first entries of the groups] PS1 [Calculate inefficiency of grouping entries together] For each pair of entries E1 and E2, compose a rectangle R including E1 rectangle and E2 rectangle Calculate d = area(R) - area(E1 rectangle) - area(E2 rectangle) PS2 [Choose the most wasteful pair ] Choose the pair with the largest d [the seeds will tend to be small, if the rectangles are of very different size (and) or the overlap between them is high]
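PickSeeds can be sketched as a search over all pairs of entries. This is a minimal sketch, assuming entries are given as a list of (x1, y1, x2, y2) rectangles; the function names are illustrative.

```python
from itertools import combinations
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def pick_seeds(rects: List[Rect]) -> Tuple[int, int]:
    """Return the indices of the most wasteful pair (largest d)."""
    def waste(pair: Tuple[int, int]) -> float:
        i, j = pair
        # PS1: d = area(R) - area(E1 rectangle) - area(E2 rectangle)
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])

    # PS2: choose the pair with the largest d
    return max(combinations(range(len(rects)), 2), key=waste)
```

Examining every pair is what makes the split quadratic in the number of entries per node.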
Splitting in the R-Tree Algorithm DistributeEntry [Assign the remaining entries by the criterion of minimum area] DE1 Invoke PickNext to choose the next entry to be assigned. DE2 Add it to the group whose covering rectangle will have to be enlarged least to accommodate it. Resolve ties by adding the entry to the group with the smallest area, then to the one with the fewer entries, then to either. Algorithm PickNext [Chooses the entry with the best area-goodness-value in every situation] PN1 For each entry E not yet in a group, calculate d1 = the area increase required in the covering rectangle of Group 1 to include E's rectangle. Calculate d2 analogously for Group 2. PN2 Choose the entry with the maximum difference between d1 and d2
Node Splitting
R-trees Range Query, Insert, Delete. Node splitting. Optimization: Coverage, Overlap. Variants: R+-tree, R*-tree, buddy-tree
The R*-Tree A variant of the R-Tree. Several improvements to the insertion algorithm. Aims at optimizing: node overlapping; area covered by a node; perimeter of a node's directory rectangle (given a fixed area, the shape that minimizes the rectangle's perimeter is the square). Two changes that bring the most significant improvement: Split Algorithm; Forced Reinsertion Strategy
The R+ Tree The directory rectangles at a given level do not overlap. For a point query, a single path is followed from the root to a leaf, so the I/O cost is bounded by the depth of the tree. For a region query, every subtree whose covering mbr intersects the query region is traversed. Dead space problem
Index Tuning Index issues Indexes may be better or worse than scans Multi-table joins that run on for hours, because the wrong indexes are defined Concurrency control bottlenecks Indexes that are maintained and never used 3 - Index Tuning
Information about indexes... Application codes V$SQLAREA -- look for the one with high # of executions INDEX_STATS: meta information about indexes HASH_AREA_SIZE HASH_MULTIBLOCK_IO_COUNT …home work
Clustered / Non clustered index Clustered index (primary index) A clustered index on attribute X co-locates records whose X values are near to one another. Non-clustered index (secondary index) A non clustered index does not constrain table organization. There might be several non-clustered indexes per table. Records Records
Dense / Sparse Index Sparse index: pointers are associated to pages. Dense index: pointers are associated to records. Non clustered indexes are dense.
Index Implementations in some major DBMS SQL Server: B+-Tree data structure; clustered indexes are sparse; indexes maintained as updates/insertions/deletes are performed. DB2: B+-Tree data structure, spatial extender for R-tree; clustered indexes are dense; explicit command for index reorganization. Oracle: B+-tree, hash, bitmap, spatial extender for R-Tree; clustered index: index organized table (unique/clustered), clusters used when creating tables. TimesTen (Main-memory DBMS): T-tree
Types of Queries Point Query: SELECT balance FROM accounts WHERE number = 1023; Multipoint Query: SELECT balance FROM accounts WHERE branchnum = 100; Range Query: SELECT number FROM accounts WHERE balance > 10000 and balance <= 20000; Prefix Match Query: SELECT * FROM employees WHERE name LIKE 'J%';
More Types of Queries Extremal Query: SELECT * FROM accounts WHERE balance = (SELECT max(balance) FROM accounts); Ordering Query: SELECT * FROM accounts ORDER BY balance; Grouping Query: SELECT branchnum, avg(balance) FROM accounts GROUP BY branchnum; Join Query: SELECT distinct branch.adresse FROM accounts, branch WHERE accounts.branchnum = branch.number and accounts.balance > 10000;
Index Tuning -- data Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); clustered index c on employees(hundreds1) with fillfactor = 100; nonclustered index nc on employees (hundreds2); index nc3 on employees (ssnum, name, hundreds2); index nc4 on employees (lat, ssnum, name); 1000000 rows ; Cold buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Index Tuning -- operations Update: update employees set name = ‘XXX’ where ssnum = ?; Insert: insert into employees values (1003505,'polo94064',97.48,84.03,4700.55,3987.2); Multipoint query: select * from employees where hundreds1= ?; select * from employees where hundreds2= ?; Covered query: select ssnum, name, lat from employees; Range Query: select * from employees where long between ? and ?; Point Query: select * from employees where ssnum = ?
Clustered Index Multipoint query that returns 100 records out of 1000000. Cold buffer. Clustered index is twice as fast as non-clustered index and orders of magnitude faster than a scan.
Positive Points of Clustering indexes If the index is sparse, it has fewer pointers, hence fewer I/Os. Good for multipoint queries, e.g. looking up names in a telephone directory. Good for equijoin. Why? Good for range, prefix match, and ordering queries
Index “Face Lifts” Index is created with fillfactor = 100. Insertions cause page splits and extra I/O for each query. Maintenance consists in dropping and recreating the index. With maintenance, performance is constant, while performance degrades significantly if no maintenance is performed.
Index Maintenance In Oracle, a clustered index is approximated by an index defined on a clustered table. No automatic physical reorganization. Index defined with pctfree = 0. Overflow pages cause performance degradation
Covering Index - defined Select name from employee where department = 'marketing'. A good covering index would be on (department, name). Index on (name, department) less useful. Index on department alone moderately useful.
Covering Index - impact Covering index performs better than clustering index when first attributes of index are in the where clause and last attributes in the select. When attributes are not in order then performance is much worse.
Positive/negative points of non-clustering indexes Eliminate the need to access the underlying table eg. Index on (A, B, C) Select B,C From R Where A=5. Good if each query retrieves significantly fewer records than there are pages in DB May not be good for multipoint queries
Examples: Table T has 50-bytes records and attribute A has 20 different values which are uniformly distributed. Page size=4K. Is a nonclustering index on A any good? Now the record size is 2Kbytes.
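One way to reason about this example is to estimate what fraction of the table's pages holds at least one matching record. The sketch below is a back-of-the-envelope model, assuming a 4K page is 4096 bytes with no page overhead and that the 20 values are spread uniformly and independently across pages; the function name is illustrative.

```python
def fraction_of_pages_touched(record_bytes: int, page_bytes: int,
                              distinct_values: int) -> tuple:
    """Return (records per page, probability a page holds >= 1 match)."""
    records_per_page = page_bytes // record_bytes
    # A single record misses the searched value with probability 1 - 1/values.
    p_page_has_no_match = (1 - 1 / distinct_values) ** records_per_page
    return records_per_page, 1 - p_page_has_no_match


# 50-byte records: 81 records per page, so about 98% of pages are touched.
small = fraction_of_pages_touched(50, 4096, 20)
# 2 Kbyte records: 2 records per page, so under 10% of pages are touched.
large = fraction_of_pages_touched(2048, 4096, 20)
```

Under these assumptions the nonclustering index is of little help with 50-byte records, since the query reads nearly every page anyway (and does so with random I/O), while with 2 Kbyte records it avoids about 90% of the pages.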
Scan Can Sometimes Win IBM DB2 v7.1 on Windows 2000. Range Query. If a query retrieves 10% of the records or more, scanning is often better than using a non-clustering non-covering index. Crossover > 10% when records are large or table is fragmented on disk (scan cost increases).
Index on Small Tables Small table: 100 records, i.e., a few pages. Two concurrent processes perform updates (each process works for 10ms before it commits). No index: the table is scanned for each update. No concurrent updates. A clustered index allows the updates to take advantage of row locking.
Bitmap vs. Hash vs. B+-Tree Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); create cluster c_hundreds (hundreds2 number(8)) PCTFREE 0; create cluster c_ssnum(ssnum integer) PCTFREE 0 size 60; create cluster c_hundreds(hundreds2 number(8)) PCTFREE 0 HASHKEYS 1000 size 600; create cluster c_ssnum(ssnum integer) PCTFREE 0 HASHKEYS 1000000 SIZE 60; create bitmap index b on employees (hundreds2); create bitmap index b2 on employees (ssnum); 1000000 rows ; Cold buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Multipoint query: B-Tree, Hash Tree, Bitmap There is an overflow chain in a hash index. In a clustered B-Tree index records are on contiguous pages. Bitmap is proportional to size of table and non-clustered for record access.
B-Tree, Hash Tree, Bitmap Hash indexes don’t help when evaluating range queries. Hash index outperforms B-tree on point queries.
Summary Primary means to reduce search costs (I/O and CPU) Properties: robust, concurrent, scalable, efficient Most supported indexes: Hash, B+-trees, bitmap index, and R-trees Tuning: Usage, Maintenance, Drop/Rebuild, index locking in buffer...
Principles of Query Processing
Application Programmer (e.g., business analyst, Data architect) Application Sophisticated Application Programmer (e.g., SAP admin) Query Processor Indexes Storage Subsystem Concurrency Control Recovery DBA, Tuner Operating System Hardware [Processor(s), Disk(s), Memory]
Overview of Query Processing High Level Query → Parser → Parsed Query → Query Optimizer (uses Database Statistics and Cost Model) → QEP → Query Evaluator → Query Result
Outline Processing relational operators Query optimization Performance tuning
Projection Operator π_{R.attrib, ..}(R) SELECT bid FROM Reserves R WHERE R.rname < 'C%' Implementation is straightforward
Selection Operator σ_{R.attr op value}(R) SELECT * FROM Reserves R WHERE R.rname < 'C%' Size of result = ||R|| * selectivity, where ||R|| is the number of tuples in R. Access paths: Scan. Clustered index: Good. Non-clustered index: Good for low selectivity (few matching tuples); worse than scan for high selectivity
Example of Join SELECT * FROM Sailors R, Reserve S WHERE R.sid=S.sid
Notations |R| = number of pages in outer table R ||R|| = number of tuples in outer table R |S| = number of pages in inner table S ||S|| = number of tuples in inner table S M = number of main memory pages allocated
Simple Nested Loop Join R S 1 scan per R tuple |S| pages per scan Tuple ||R|| tuples
Simple Nested Loop Join Scan inner table S per R tuple: ||R|| * |S| Each scan costs |S| pages For ||R|| tuples |R| pages for outer table R Total cost = |R| + ||R|| * |S| pages Not optimal!
Block Nested Loop Join R S 1 scan per R block |S| pages per scan M – 2 pages |R| / (M – 2) blocks
Block Nested Loop Join Scan inner table S per block of (M – 2) pages of R tuples Each scan costs |S| pages |R| / (M – 2) blocks of R tuples |R| pages for outer table R Total cost = |R| + |R| / (M – 2) * |S| pages R should be the smaller table
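The two nested loop cost formulas can be compared directly. This is a minimal sketch using the notation of these slides; the table sizes in the comments (Reserves: 1000 pages and 100,000 tuples, Sailors: 500 pages) come from the later examples in this deck, while M = 102 is an assumed value chosen for illustration.

```python
import math


def simple_nl_cost(pages_r: int, tuples_r: int, pages_s: int) -> int:
    """|R| + ||R|| * |S|: one scan of S per tuple of the outer table R."""
    return pages_r + tuples_r * pages_s


def block_nl_cost(pages_r: int, pages_s: int, m: int) -> int:
    """|R| + ceil(|R| / (M - 2)) * |S|: one scan of S per block of R."""
    blocks = math.ceil(pages_r / (m - 2))
    return pages_r + blocks * pages_s


# Reserves as outer table, tuple at a time: 50,001,000 page I/Os.
simple = simple_nl_cost(1000, 100_000, 500)
# Reserves as outer table, block at a time with M = 102: 6,000 page I/Os.
block_big_outer = block_nl_cost(1000, 500, 102)
# Sailors (the smaller table) as outer table: 5,500 page I/Os.
block_small_outer = block_nl_cost(500, 1000, 102)
```

The last two numbers illustrate why the smaller table should be the outer one: it needs fewer blocks, so the larger table is scanned fewer times.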
Index Nested Loop Join R Index S 1 probe per R tuple Tuple ||R|| tuples
Index Nested Loop Join Probe S index for matching S tuples per R tuple Probe hash index: 1.2 I/Os Probe B+ tree: 2-4 I/Os, plus retrieve matching S tuples: 1 I/O For ||R|| tuples |R| pages for outer table R Total cost = |R| + ||R|| * index retrieval Better than Block NL join only for small number of R tuples
Sort Merge Join External sort R External sort S Merge sorted R and sorted S
External Sort Split pass: R is split into sorted runs R0,1, R0,2, …, each of size M pages; number of runs = |R|/M. Merge passes: each pass is an (M-1)-way merge (pass 1 produces runs R1,1, R1,2, …; pass 2 produces R2,1, …). # merge passes = log_{M-1}(|R|/M). Cost per pass = |R| input + |R| output = 2 |R|. Total cost = 2 |R| (log_{M-1}(|R|/M) + 1), including the split pass
External Sorting A classic problem in computer science! Data requested in sorted order, e.g., find students in increasing cap order. Sorting is used in many applications: First step in bulk loading operations. Sorting useful for eliminating duplicate copies in a collection of records (How?) Sort-merge join algorithm involves sorting. Problem: sort 1Gb of data with 1Mb of RAM.
2-Way Sort: Requires 3 Buffers Pass 1: Read a page, sort it, write it. only one buffer page is used Pass 2, 3, …, etc.: three buffer pages used. INPUT 1 OUTPUT INPUT 2 Main memory buffers Disk Disk
Two-Way External Merge Sort Idea: Divide and conquer: sort subfiles and merge. Each pass we read + write each page in file. N pages in the file => the number of passes = ⌈log2 N⌉ + 1. So total cost is: 2N (⌈log2 N⌉ + 1).
Example (7-page input file, 2 records per page; | separates runs):
Input file: 3,4 6,2 9,4 8,7 5,6 3,1 2
PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs): 2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs): 2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs): 1,2 2,3 3,4 4,5 6,6 7,8 9
General External Merge Sort More than 3 buffer pages. How can we utilize them? To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each. Pass 1, 2, …, etc.: merge B-1 runs at a time, using B-1 input buffers and 1 output buffer.
Cost of External Merge Sort Number of passes: 1 + ⌈log_{B-1} ⌈N/B⌉⌉. Cost = 2N * (# of passes). E.g., with 5 buffer pages, to sort 108 page file: Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages). Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages). Pass 2: 2 sorted runs, 80 pages and 28 pages. Pass 3: Sorted file of 108 pages
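The pass count for the 108-page example can be reproduced by simulating the number of runs after each pass. This is a minimal sketch; the function names are illustrative, and the loop uses integer ceilings rather than a floating-point logarithm to avoid rounding surprises.

```python
import math
from typing import List, Tuple


def external_sort_passes(n_pages: int, b_buffers: int) -> Tuple[int, List[int]]:
    """Return (number of passes, number of runs left after each pass)."""
    runs = math.ceil(n_pages / b_buffers)      # pass 0: runs of B pages
    runs_after_pass = [runs]
    while runs > 1:                            # each later pass merges B-1 runs
        runs = math.ceil(runs / (b_buffers - 1))
        runs_after_pass.append(runs)
    return len(runs_after_pass), runs_after_pass


def external_sort_cost(n_pages: int, b_buffers: int) -> int:
    """Every pass reads and writes each page once: 2N * (# of passes)."""
    passes, _ = external_sort_passes(n_pages, b_buffers)
    return 2 * n_pages * passes


passes, runs = external_sort_passes(108, 5)    # 4 passes, runs 22 -> 6 -> 2 -> 1
cost = external_sort_cost(108, 5)              # 2 * 108 * 4 = 864 page I/Os
```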
Number of Passes of External Sort
Sequential vs Random I/Os Transfer rate increases 40% per year; seek time and latency time decrease by only 8% per year. Is minimizing passes optimal? Would merging as many runs as possible be the best solution? Suppose we have 80 runs, each 80 pages long, and we have 81 pages of buffer space. We can merge all 80 runs in a single pass: each page requires a seek to access (Why?); there are 80 pages per run, so 80 seeks per run; total cost = 80 runs × 80 seeks = 6,400 seeks
Sequential vs Random I/Os (Cont) We can merge all 80 runs in two steps. Step 1: 5 sets of 16 runs each; read 80/16 = 5 pages of one run at a time; each set of 16 runs results in a sorted run of 1280 pages; each merge requires (80/5) × 16 = 256 seeks; for 5 sets, we have 5 × 256 = 1280 seeks. Step 2: merge the 5 runs of 1280 pages; read 80/5 = 16 pages of one run at a time => 1280/16 = 80 seeks per run; 5 runs => 5 × 80 = 400 seeks. Total: 1280 + 400 = 1680 seeks!!! Number of passes increases, but number of seeks decreases!
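The seek counts in this example follow from one small formula. The sketch below assumes, as the slides do, that only read seeks are counted, that 80 of the 81 buffer pages are input buffers shared equally among the runs being merged, and that each refill of a run's buffers costs one seek; the function name is illustrative.

```python
import math


def merge_read_seeks(num_runs: int, run_pages: int, input_buffers: int) -> int:
    """Seeks needed to read num_runs runs of run_pages pages during one merge."""
    pages_per_read = input_buffers // num_runs   # contiguous pages fetched per seek
    return num_runs * math.ceil(run_pages / pages_per_read)


# One step: merge all 80 runs at once, 1 page per run per read.
one_step = merge_read_seeks(80, 80, 80)                      # 6400 seeks

# Two steps: 5 merges of 16 runs, then one merge of the 5 resulting runs.
step1 = 5 * merge_read_seeks(16, 80, 80)                     # 5 * 256 = 1280
step2 = merge_read_seeks(5, 16 * 80, 80)                     # 5 * 80 = 400
two_step = step1 + step2                                     # 1680 seeks
```

The two-step plan transfers every page twice, so whether it wins overall depends on how expensive a seek is relative to a page transfer.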
Sort Merge Join External-sort R: 2 |R| * (log_{M-1}(|R|/M) + 1). Split R into |R|/M sorted runs each of size M: 2 |R|. Merge up to (M - 1) runs repeatedly: log_{M-1}(|R|/M) passes, each costing 2 |R|. External-sort S: 2 |S| * (log_{M-1}(|S|/M) + 1). Merge matching tuples from sorted R and S: |R| + |S|. Total cost = 2 |R| * (log_{M-1}(|R|/M) + 1) + 2 |S| * (log_{M-1}(|S|/M) + 1) + |R| + |S|. If |R| < M*(M-1) (and likewise for S, so that one merge pass per table suffices), cost = 5 * (|R| + |S|)
GRACE Hash Join Join on R.X = S.X. Both R and S are partitioned into buckets 0 to 3 with bucketID = X mod 4. R ⋈ S = R0 ⋈ S0 + R1 ⋈ S1 + R2 ⋈ S2 + R3 ⋈ S3
GRACE Hash Join – Partition Phase M main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h1 M-1 Partitions . . . R (M – 1) partitions, each of size |R| / (M – 1)
GRACE Hash Join – Join Phase Partitions of R & S Input buffer for Si Hash table for partition Ri (< M-1 pages) B main memory buffers Disk Output buffer Join Result hash fn h2 Partition must fit in memory: |R| / (M – 1) < M -1
GRACE Hash Join Algorithm Partition phase: 2 (|R| + |S|). Partition table R using hash function h1: 2 |R|. Partition table S using hash function h1: 2 |S|. R tuples in partition i will match only S tuples in partition i. R is split into (M - 1) partitions, each of size |R| / (M - 1). Join phase: |R| + |S|. Read in a partition of R (|R| / (M - 1) < M - 1). Hash it using function h2 (<> h1!). Scan corresponding S partition, search for matches. Total cost = 3 (|R| + |S|) pages. Condition: M > √(f |R|), f ≈ 1.2 to account for hash table
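The two phases can be sketched in memory as follows. This is an illustrative simulation only: Python lists stand in for the partitions written to disk, `hash(key) % num_partitions` plays the role of h1, and a dictionary plays the role of the h2 hash table; tuples are assumed to be plain Python tuples addressed by column position.

```python
from collections import defaultdict
from typing import List, Tuple


def grace_hash_join(r: List[tuple], s: List[tuple], key_r: int, key_s: int,
                    num_partitions: int = 4) -> List[Tuple[tuple, tuple]]:
    # Partition phase: the same function h1 is applied to both tables, so
    # R tuples in partition i can only match S tuples in partition i.
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for t in r:
        r_parts[hash(t[key_r]) % num_partitions].append(t)
    for u in s:
        s_parts[hash(u[key_s]) % num_partitions].append(u)

    result = []
    for i in range(num_partitions):
        # Join phase: build an in-memory hash table on partition Ri ...
        table = defaultdict(list)
        for t in r_parts[i]:
            table[t[key_r]].append(t)
        # ... then scan the corresponding partition Si and probe it.
        for u in s_parts[i]:
            for t in table.get(u[key_s], []):
                result.append((t, u))
    return result


sailors = [(1, "a"), (2, "b"), (5, "c")]
reserves = [(1, "x"), (5, "y"), (5, "z"), (7, "w")]
pairs = grace_hash_join(sailors, reserves, key_r=0, key_s=0)
```

In a real system each partition is a file on disk and the join phase reads one R partition at a time, which is where the condition on M comes from.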
Summary of Join Operator Simple nested loop: |R| + ||R|| * |S|. Block nested loop: |R| + |R| / (M - 2) * |S|. Index nested loop: |R| + ||R|| * index retrieval. Sort-merge: 2 |R| * (log_{M-1}(|R|/M) + 1) + 2 |S| * (log_{M-1}(|S|/M) + 1) + |R| + |S|. GRACE hash: 3 * (|R| + |S|). Condition: M > √(f |R|)
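The sort-merge, GRACE hash, and index nested loop formulas can be evaluated side by side. This is a minimal sketch under assumed inputs: |R| = 1000 pages and ||R|| = 100,000 tuples, |S| = 500 pages, M = 102 buffer pages, and 2.2 I/Os per index probe (1.2 for a hash probe plus 1 to fetch the matching tuple). Merge passes are counted with integer ceilings, so the results are upper bounds on the logarithmic formulas.

```python
import math


def merge_passes(runs: int, fan_in: int) -> int:
    """Number of (fan_in)-way merge passes needed to reduce runs to one."""
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes


def external_sort_cost(pages: int, m: int) -> int:
    """2 |R| * (merge passes + 1), the +1 being the split pass."""
    return 2 * pages * (merge_passes(math.ceil(pages / m), m - 1) + 1)


def sort_merge_cost(pages_r: int, pages_s: int, m: int) -> int:
    return external_sort_cost(pages_r, m) + external_sort_cost(pages_s, m) + pages_r + pages_s


def grace_hash_cost(pages_r: int, pages_s: int) -> int:
    return 3 * (pages_r + pages_s)


def grace_hash_feasible(pages_build: int, m: int, f: float = 1.2) -> bool:
    """Condition M > sqrt(f * |R|) for the table whose partitions are hashed."""
    return m > math.sqrt(f * pages_build)


def index_nl_cost(pages_r: int, tuples_r: int, ios_per_probe: float) -> float:
    return pages_r + tuples_r * ios_per_probe


sort_merge = sort_merge_cost(1000, 500, 102)     # 7500 = 5 * (|R| + |S|)
grace = grace_hash_cost(1000, 500)               # 4500 = 3 * (|R| + |S|)
index_nl = index_nl_cost(1000, 100_000, 2.2)     # about 221,000
```

With these inputs the index nested loop join is far more expensive than the others because every one of the 100,000 outer tuples triggers a probe, which matches the remark that it pays off only for a small number of outer tuples.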
Query Rewriting A query can be expressed in many forms, with some being more efficient than others. Example: S, P, SP relations
Select Distinct S.sname From S Where S.s# IN (Select SP.s# From SP Where SP.p# = 'P2')
Select Distinct S.sname From S, SP Where S.s# = SP.s# AND SP.p# = 'P2'
Select Distinct S.sname From S Where 'P2' IN (Select SP.p# From SP Where SP.s# = S.s#)
Select Distinct S.sname From S Where S.s# = ANY (Select SP.s# From SP Where SP.p# = 'P2')
Select Distinct S.sname From S Where EXISTS (Select * From SP Where SP.s# = S.s# And SP.p# = 'P2')
Select Distinct S.sname From S Where 0 < (Select Count(*) From SP Where SP.s# = S.s# And SP.p# = 'P2')
Select S.sname From S, SP Where S.s# = SP.s# And SP.p# = 'P2' Group by S.sname
Query Optimization SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 Plan tree: π_{sname}(σ_{bid=100 ∧ rating>5}(Reserves ⋈_{sid=sid} Sailors)). Given: An SQL query joining n tables. Dream: Map to most efficient plan. Reality: Avoid rotten plans. State of the art: Most optimizers follow System R's technique. Works fine up to about 10 joins
Complexity of Query Optimization Many degrees of freedom Selection: scan versus (clustered, non-clustered) index Join: block nested loop, sort-merge, hash Relative order of the operators Exponential search space! Heuristics Push the selections down Push the projections down Delay Cartesian products System R: Only left-deep trees B A C D
Equivalences in Relational Algebra Selection: cascade, commutative. Projection: cascade. Join: associative, R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T; commutative, (R ⋈ S) ≡ (S ⋈ R)
Equivalences in Relational Algebra A projection commutes with a selection that only uses attributes retained by the projection. Selection between attributes of the two arguments of a cross-product converts cross-product to a join. A selection on just attributes of R commutes with join R ⋈ S (i.e., σ(R ⋈ S) ≡ σ(R) ⋈ S). Similarly, if a projection follows a join R ⋈ S, we can 'push' it by retaining only attributes of R (and S) that are needed for the join or are kept by the projection
System R Optimizer Find all plans for accessing each base table For each table Save cheapest unordered plan Save cheapest plan for each interesting order Discard all others Try all ways of joining pairs of 1-table plans; save cheapest unordered + interesting ordered plans Try all ways of joining 2-table with 1-table Combine k-table with 1-table till you have full plan tree At the top, to satisfy GROUP BY and ORDER BY Use interesting ordered plan Add a sort node to unordered plan
Source: Selinger et al, “Access Path Selection in a Relational Database Management System”
Search Strategies for Single Relations
Note: Only branches for NL join are shown here. Additional branches for other join methods (e.g. sort-merge) are not shown. Source: Selinger et al, “Access Path Selection in a Relational Database Management System”
What is “Cheapest”? Need information about the relations and indexes involved Catalogs typically contain at least: # tuples (NTuples) and # pages (NPages) for each relation. # distinct key values (NKeys) and NPages for each index. Index height, low/high key values (Low/High) for each tree index. Catalogs updated periodically. Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok. More detailed information (e.g., histograms of the values in some field) are sometimes stored.
Estimating Result Size Consider a query block: SELECT attribute list FROM relation list WHERE term1 AND ... AND termk. Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause. Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Term col=value has RF 1/NKeys(I). Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2)). Term col>value has RF (High(I)-value)/(High(I)-Low(I)). Result cardinality = Max # tuples * product of all RF's. Implicit assumption that terms are independent!
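The reduction-factor rules can be applied mechanically. The sketch below estimates the result size of the Reserves/Sailors query used in this deck; the catalog numbers are assumptions for illustration (100,000 Reserves tuples, 40,000 Sailors tuples, 100 distinct bid values, ratings ranging from 1 to 10, sid unique in Sailors), and the function names are hypothetical.

```python
from math import prod
from typing import Sequence


def rf_col_equals_value(nkeys: int) -> float:
    return 1 / nkeys


def rf_col_equals_col(nkeys1: int, nkeys2: int) -> float:
    return 1 / max(nkeys1, nkeys2)


def rf_col_greater_than(value: float, low: float, high: float) -> float:
    return (high - value) / (high - low)


def estimate_cardinality(relation_sizes: Sequence[int],
                         reduction_factors: Sequence[float]) -> float:
    """Max # tuples (product of cardinalities) times the product of all RFs."""
    return prod(relation_sizes) * prod(reduction_factors)


estimate = estimate_cardinality(
    [100_000, 40_000],                    # Reserves, Sailors (assumed sizes)
    [
        rf_col_equals_value(100),         # R.bid = 100
        rf_col_greater_than(5, 1, 10),    # S.rating > 5
        rf_col_equals_col(40_000, 40_000) # R.sid = S.sid
    ],
)                                         # roughly 556 tuples
```

Multiplying the factors together is exactly where the independence assumption enters: correlated predicates would make this estimate too low or too high.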
Cost Estimates for Single-Table Plans Index I on primary key matches selection: Cost is Height(I)+1 for a B+ tree, about 1.2 for hash index. Clustered index I matching one or more selects: (NPages(I)+NPages(R)) * product of RF’s of matching selects. Non-clustered index I matching one or more selects: (NPages(I)+NTuples(R)) * product of RF’s of matching selects. Sequential scan of file: NPages(R). Note: Typically, no duplicate elimination on projections! (Exception: Done on answers if user says DISTINCT.)
Counting the Costs SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 Plan: scan Reserves with bid=100 and write to temp T1; scan Sailors with rating > 5 and write to temp T2; sort-merge join on sid=sid; project sname on the fly. With 5 buffers, cost of plan: Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution). Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings). Sort T1 (2*10*2), sort T2 (2*250*4), merge (10+250), join cost = 2300. Total: 4060 page I/Os. If we used BNL join, join cost = 10+4*250, total cost = 2770. If we 'push' projections, T1 has only sid, T2 only sid and sname: T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000
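The totals above can be re-derived step by step. This is a minimal sketch that plugs the page counts from this slide (Reserves 1000 pages, Sailors 500 pages, T1 10 pages, T2 250 pages, 5 buffers) into the external sort and block nested loop formulas; the function names are illustrative.

```python
import math


def sort_passes(pages: int, buffers: int) -> int:
    """Passes of external merge sort: pass 0 plus (buffers - 1)-way merges."""
    runs = math.ceil(pages / buffers)
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffers - 1))
        passes += 1
    return passes


def plan_costs(buffers: int = 5) -> dict:
    reserves, sailors = 1000, 500          # pages scanned
    t1, t2 = 10, 250                       # pages written to temp tables
    selections = (reserves + t1) + (sailors + t2)

    sort_merge_join = (2 * t1 * sort_passes(t1, buffers)      # 2 * 10 * 2
                       + 2 * t2 * sort_passes(t2, buffers)    # 2 * 250 * 4
                       + t1 + t2)                             # final merge
    # Block nested loop with T1 as outer table, blocks of (buffers - 2) pages.
    bnl_join = t1 + math.ceil(t1 / (buffers - 2)) * t2

    return {
        "sort_merge_join": sort_merge_join,
        "sort_merge_total": selections + sort_merge_join,
        "bnl_join": bnl_join,
        "bnl_total": selections + bnl_join,
    }


costs = plan_costs()
```

With 5 buffers this yields a sort-merge join cost of 2300 and a plan total of 4060, and a block nested loop join cost of 1010 for a plan total of 2770, in line with the figures on the slide.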
Exercise Reserves: 100,000 tuples, 100 tuples per page. Plan: select bid=100 on Reserves (use clustered index on bid); index nested loops join on sid=sid with pipelining (use hash index on sid of Sailors); apply rating > 5 on the fly; project sname on the fly. With clustered index on bid of Reserves and 100 boats, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages. Join column sid is a key for Sailors: at most one matching tuple. Decision not to push rating>5 before the join is based on availability of sid index on Sailors. Cost: Selection of Reserves tuples (10 I/Os); for each tuple, must get matching Sailors tuple (1000*1.2); total 1210 I/Os
Query Tuning
Avoid Redundant DISTINCT SELECT DISTINCT ssnum FROM Employee WHERE dept = ‘information systems’ DISTINCT usually entails a sort operation Slow down query optimization because one more “interesting” order to consider Remove if you know the result has no duplicates
Change Nested Queries to Join SELECT ssnum FROM Employee WHERE dept IN (SELECT dept FROM Techdept) Might not use index on Employee.dept Need DISTINCT if an employee might belong to multiple departments SELECT ssnum FROM Employee, Techdept WHERE Employee.dept = Techdept.dept
Avoid Unnecessary Temp Tables SELECT * INTO Temp FROM Employee WHERE salary > 40000 SELECT ssnum FROM Temp WHERE Temp.dept = ‘information systems’ Creating temp table causes update to catalog Cannot use any index on original table SELECT ssnum FROM Employee WHERE Employee.dept = ‘information systems’ AND salary > 40000
Avoid Complicated Correlation Subqueries SELECT ssnum FROM Employee e1 WHERE salary = (SELECT MAX(salary) FROM Employee e2 WHERE e2.dept = e1.dept) Search all of e2 for each e1 record! Rewrite: SELECT MAX(salary) as bigsalary, dept INTO Temp FROM Employee GROUP BY dept; SELECT ssnum FROM Employee, Temp WHERE salary = bigsalary AND Employee.dept = Temp.dept
Avoid Complicated Correlation Subqueries SQL Server 2000 does a good job at handling the correlated subqueries (a hash join is used as opposed to a nested loop between query blocks). The techniques implemented in SQL Server 2000 are described in “Orthogonal Optimization of Subqueries and Aggregates” by C.Galindo-Legaria and M.Joshi, SIGMOD 2001.
Join on Clustering and Integer Attributes
    SELECT Employee.ssnum FROM Employee, Student WHERE Employee.name = Student.name
Employee is clustered on ssnum, and ssnum is an integer, so join on ssnum instead:
    SELECT Employee.ssnum FROM Employee, Student WHERE Employee.ssnum = Student.ssnum
Avoid HAVING when WHERE is enough
    SELECT AVG(salary) as avgsalary, dept FROM Employee GROUP BY dept HAVING dept = 'information systems'
This may first perform grouping for all departments! Filter before grouping:
    SELECT AVG(salary) as avgsalary FROM Employee WHERE dept = 'information systems' GROUP BY dept
Avoid Views with unnecessary Joins
    CREATE VIEW Techlocation AS SELECT ssnum, Techdept.dept, location FROM Employee, Techdept WHERE Employee.dept = Techdept.dept
    SELECT dept FROM Techlocation WHERE ssnum = 4444
The query through the view joins with Techdept unnecessarily. If the employee is known to be in a technical department, query Employee directly:
    SELECT dept FROM Employee WHERE ssnum = 4444
Aggregate Maintenance: Materialize an aggregate if it is needed “frequently”, and use a trigger to keep it up to date:
    create trigger updateVendorOutstanding on orders for insert as
    update vendorOutstanding
    set amount = (select vendorOutstanding.amount + sum(inserted.quantity * item.price)
                  from inserted, item
                  where inserted.itemnum = item.itemnum)
    where vendor = (select vendor from inserted);
Avoid External Loops
No loop:
    sqlStmt = "select * from lineitem where l_partkey <= 200;"
    odbc->prepareStmt(sqlStmt);
    odbc->execPrepared(sqlStmt);
Loop:
    sqlStmt = "select * from lineitem where l_partkey = ?;"
    odbc->prepareStmt(sqlStmt);
    for (int i = 1; i <= 200; i++) {
        odbc->bindParameter(1, SQL_INTEGER, i);
        odbc->execPrepared(sqlStmt);   // one round trip to the DBMS per key
    }
Avoid External Loops: Let the DBMS optimize set operations. Crossing the application interface has a significant impact on performance (measured with SQL Server 2000 on Windows 2000).
Avoid Cursors
No cursor:
    select * from employees;
Cursor:
    DECLARE d_cursor CURSOR FOR select * from employees;
    OPEN d_cursor
    FETCH NEXT from d_cursor
    while (@@FETCH_STATUS = 0)
    BEGIN
        FETCH NEXT from d_cursor
    END
    CLOSE d_cursor
    DEALLOCATE d_cursor
    go
Avoid Cursors: With SQL Server 2000 on Windows 2000, response time is a few seconds with a SQL query and more than an hour when iterating over a cursor.
Retrieve Needed Columns Only
All columns:
    select * from lineitem;
Covered subset:
    select l_orderkey, l_partkey, l_suppkey, l_shipdate, l_commitdate from lineitem;
Retrieving only the needed columns avoids transferring unnecessary data and may enable the use of a covering index.
Use Direct Path for Bulk Loading
    sqlldr direct=true control=load_lineitem.ctl data=E:\Data\lineitem.tbl
Control file load_lineitem.ctl:
    load data
    infile "lineitem.tbl"
    into table LINEITEM append
    fields terminated by '|'
    (
      L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY,
      L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS,
      L_SHIPDATE DATE "YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD",
      L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT
    )
Use Direct Path for Bulk Loading: Direct path loading bypasses the query engine and the storage manager. It is orders of magnitude faster than a conventional bulk load (commit every 100 records) or individual inserts (commit for each record).
Some Idiosyncrasies: OR may stop the index from being used; break the query into parts and combine them with UNION. The order of tables may affect the join implementation.
Query Tuning – Thou Shalt … Avoid redundant DISTINCT Change nested queries to join Avoid unnecessary temp tables Avoid complicated correlation subqueries Join on clustering and integer attributes Avoid HAVING when WHERE is enough Avoid views with unnecessary joins Maintain frequently used aggregates Avoid external loops
Query Tuning – Thou Shalt … Avoid cursors Retrieve needed columns only Use direct path for bulk loading
Principles of Query Processing. CS5226 Week 5. Pang Hwee Hwa, School of Computing, NUS (H. Pang / NUS)
[Figure: layers of a database system and who tunes them. Layers, from top to bottom: Application; Query Processor; Indexes; Storage Subsystem; Concurrency Control; Recovery; Operating System; Hardware (processor(s), disk(s), memory). Roles: application programmer (e.g., business analyst, data architect); sophisticated application programmer (e.g., SAP admin); DBA, tuner.]
Overview of Query Processing: high-level query → Parser → parsed query → Query Optimizer (consults the database statistics and the cost model) → QEP → Query Evaluator → query result.
Outline: Processing relational operators; Query optimization; Performance tuning
Projection Operator: $\pi_{R.attrib, \ldots}(R)$. Example: SELECT bid FROM Reserves R WHERE R.rname < 'C%' (the projection keeps only bid). Implementation is straightforward.
Selection Operator: $\sigma_{R.attr\ op\ value}(R)$. Example: SELECT * FROM Reserves R WHERE R.rname < 'C%'. Size of result = ||R|| * selectivity. Access paths: scan; clustered index (good); non-clustered index (good for low selectivity, worse than a scan for high selectivity).
Example of Join: SELECT * FROM Sailors R, Reserves S WHERE R.sid=S.sid
Notations: |R| = number of pages in outer table R; ||R|| = number of tuples in outer table R; |S| = number of pages in inner table S; ||S|| = number of tuples in inner table S; M = number of main memory pages allocated.
Simple Nested Loop Join: Scan the inner table S once per R tuple. Each scan costs |S| pages and there are ||R|| tuples, so the scans cost ||R|| * |S| pages; reading the outer table R costs |R| pages. Total cost = |R| + ||R|| * |S| pages. Not optimal!
Block Nested Loop Join: Scan the inner table S once per block of (M – 2) pages of R tuples (the remaining two pages buffer the input from S and the output). Each scan costs |S| pages and there are |R| / (M – 2) blocks of R tuples, rounded up; reading the outer table R costs |R| pages. Total cost = |R| + |R| / (M – 2) * |S| pages. R should be the smaller table.
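The block nested loop join can be sketched in a few lines. This is a minimal in-memory illustration, not a real storage engine: a table is modeled as a list of pages, a page as a list of tuples, and the function name, the key-extractor parameters and the sample data are assumptions made for the example. It counts page reads so that the result can be compared with the formula |R| + |R| / (M – 2) * |S|.

```python
def block_nested_loop_join(outer_pages, inner_pages, m, key_outer, key_inner):
    """Join two tables given as lists of pages; return (rows, page_reads)."""
    block_size = m - 2  # one frame is kept for the inner input, one for the output
    if block_size < 1:
        raise ValueError("need at least 3 buffer pages")
    rows, page_reads = [], 0
    for start in range(0, len(outer_pages), block_size):
        block = outer_pages[start:start + block_size]
        page_reads += len(block)  # read one block of the outer table
        # Index the tuples of the current block by join key for fast matching.
        lookup = {}
        for page in block:
            for r in page:
                lookup.setdefault(key_outer(r), []).append(r)
        for page in inner_pages:  # one full scan of the inner table per block
            page_reads += 1
            for s in page:
                for r in lookup.get(key_inner(s), []):
                    rows.append(r + s)
    return rows, page_reads


# Sample data (assumed): Sailors(sid, name) as outer, Reserves(sid, bid) as inner.
sailors = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")],
           [(5, "e"), (6, "f")], [(7, "g"), (8, "h")]]
reserves = [[(1, 101), (3, 102)], [(3, 103), (8, 101)], [(9, 104), (2, 105)]]

rows, reads = block_nested_loop_join(
    sailors, reserves, m=4, key_outer=lambda r: r[0], key_inner=lambda s: s[0])
print(len(rows), reads)
```

With |R| = 4, |S| = 3 and M = 4, the outer table is read in 2 blocks, so the sketch performs 4 + 2 * 3 = 10 page reads, as the formula predicts.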
Index Nested Loop Join: For each R tuple, probe the index on S for matching S tuples. Probing a hash index costs 1.2 I/Os and probing a B+ tree 2-4 I/Os, plus 1 I/O to retrieve the matching S tuples. This is repeated for ||R|| tuples; reading the outer table R costs |R| pages. Total cost = |R| + ||R|| * index retrieval. Better than block nested loop join only for a small number of R tuples.
Sort Merge Join: External sort R; external sort S; merge sorted R and sorted S.
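External Sort: Split pass: read R in chunks of M pages, sort each chunk in memory, and write it out as a sorted run $R_{0,i}$. Size of each $R_{0,i}$ = M pages; number of runs = |R|/M. Merge passes: each pass performs (M – 1)-way merges, e.g., $R_{0,1}, \ldots, R_{0,M-1}$ are merged into $R_{1,1}$ in merge pass 1, and the runs $R_{1,1}, \ldots, R_{1,M-1}$ into $R_{2,1}$ in merge pass 2. Number of merge passes = $\log_{M-1}(|R|/M)$ (rounded up to a whole number of passes). Cost per pass = |R| input + |R| output = 2|R|. Total cost = $2|R|\,(\log_{M-1}(|R|/M) + 1)$, including the split pass.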
Sort Merge Join: External-sort R: $2|R|\,(\log_{M-1}(|R|/M) + 1)$. This splits R into |R|/M sorted runs, each of size M (cost 2|R|), then repeatedly merges up to (M – 1) runs at a time, for $\log_{M-1}(|R|/M)$ passes each costing 2|R|. External-sort S: $2|S|\,(\log_{M-1}(|S|/M) + 1)$. Merge matching tuples from sorted R and S: |R| + |S|. Total cost = $2|R|\,(\log_{M-1}(|R|/M) + 1) + 2|S|\,(\log_{M-1}(|S|/M) + 1) + |R| + |S|$. If |R| < M*(M-1), and likewise for S, a single merge pass suffices and cost = 5 * (|R| + |S|).
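The sort-merge cost formula can be evaluated with a short helper. This is a sketch under the slide's cost model only; the function names are made up for the example, and the number of merge passes is counted with a loop (rounding up at every pass) rather than with a floating-point logarithm.

```python
import math


def sort_passes(pages, m):
    """Passes needed to sort `pages` pages with m buffers: split pass + merge passes."""
    if m < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = math.ceil(pages / m)  # sorted runs produced by the split pass
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (m - 1))  # (M - 1)-way merge
        passes += 1
    return passes


def external_sort_cost(pages, m):
    """Each pass reads and writes every page once: 2 * pages per pass."""
    return 2 * pages * sort_passes(pages, m)


def sort_merge_join_cost(pages_r, pages_s, m):
    """Sort both inputs, then read each sorted input once for the merge."""
    return (external_sort_cost(pages_r, m) + external_sort_cost(pages_s, m)
            + pages_r + pages_s)


print(sort_merge_join_cost(1000, 500, 35))
```

With 35 buffers both 1000 pages and 500 pages are below M*(M-1) = 1190, so each input is sorted in two passes and the join costs 5 * (1000 + 500) = 7500 page I/Os.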
GRACE Hash Join: Join on R.X = S.X. Partition both tables with bucketID = X mod 4 into R0, ..., R3 and S0, ..., S3. Then $R \bowtie S = (R_0 \bowtie S_0) \cup (R_1 \bowtie S_1) \cup (R_2 \bowtie S_2) \cup (R_3 \bowtie S_3)$.
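GRACE Hash Join – Partition Phase: With M main memory buffers, read the original relation from disk through 1 input buffer, apply hash function h1 to each tuple, and write it through one of M – 1 output buffers to the corresponding partition on disk. R is split into (M – 1) partitions, each of size |R| / (M – 1).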
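GRACE Hash Join – Join Phase: For each pair of partitions of R and S on disk, build an in-memory hash table for partition Ri (fewer than M – 1 pages) using hash function h2, stream partition Si through an input buffer, probe the hash table, and write matches to the join result through an output buffer. The partition must fit in memory: |R| / (M – 1) < M – 1.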
GRACE Hash Join Algorithm: Partition phase, cost 2 (|R| + |S|): partition table R using hash function h1 (2|R|); partition table S using hash function h1 (2|S|). R tuples in partition i will match only S tuples in partition i. R is split into (M – 1) partitions, each of size |R| / (M – 1). Join phase, cost |R| + |S|: read in a partition of R (this requires |R| / (M – 1) < M – 1); hash it using function h2 (<> h1!); scan the corresponding S partition and search for matches. Total cost = 3 (|R| + |S|) pages. Condition: $M > \sqrt{f|R|}$, with f ≈ 1.2 to account for the hash table.
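The two phases can be sketched on in-memory lists. This is an illustration of the control flow only, under stated assumptions: partitions are Python lists rather than disk files, h1 is taken to be hash(key) mod the number of partitions, and a Python dict plays the role of the hash table built with h2. The function name and sample data are made up for the example.

```python
def grace_hash_join(r, s, key_r, key_s, num_partitions):
    """Equi-join two lists of tuples by partitioning, then joining partition pairs."""
    # Partition phase: h1 sends tuples with equal join keys to the same partition.
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for t in r:
        r_parts[hash(key_r(t)) % num_partitions].append(t)
    for u in s:
        s_parts[hash(key_s(u)) % num_partitions].append(u)

    # Join phase: R tuples in partition i can only match S tuples in partition i.
    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = {}  # in-memory hash table on the R partition (stands in for h2)
        for t in r_part:
            table.setdefault(key_r(t), []).append(t)
        for u in s_part:  # scan the corresponding S partition and probe
            for t in table.get(key_s(u), []):
                result.append(t + u)
    return result


# Sample data (assumed): join on the first field of each tuple.
r = [(i, f"r{i}") for i in range(10)]
s = [(i % 7, f"s{i}") for i in range(20)]
joined = grace_hash_join(r, s, lambda t: t[0], lambda u: u[0], num_partitions=4)
expected = [t + u for t in r for u in s if t[0] == u[0]]
print(len(joined), sorted(joined) == sorted(expected))
```

For integer keys Python's hash is the integer itself, so with four partitions h1 coincides with the slide's bucketID = X mod 4.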
Summary of Join Operator: Simple nested loop: |R| + ||R|| * |S|. Block nested loop: |R| + |R| / (M – 2) * |S|. Index nested loop: |R| + ||R|| * index retrieval. Sort-merge: $2|R|\,(\log_{M-1}(|R|/M) + 1) + 2|S|\,(\log_{M-1}(|S|/M) + 1) + |R| + |S|$. GRACE hash: 3 * (|R| + |S|), under the condition $M > \sqrt{f|R|}$.
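To get a feel for how far apart these costs are, the nested-loop and hash formulas can be evaluated side by side. The sketch below uses assumed inputs: Sailors as outer table with 500 pages and 40,000 tuples, Reserves as inner table with 1000 pages, M = 102 buffers, and an index retrieval cost of 2.2 I/Os (1.2 for the hash probe plus 1 to fetch the tuple). The function names are made up for the example.

```python
import math


def simple_nl(pages_r, tuples_r, pages_s):
    return pages_r + tuples_r * pages_s


def block_nl(pages_r, pages_s, m):
    return pages_r + math.ceil(pages_r / (m - 2)) * pages_s


def index_nl(pages_r, tuples_r, index_retrieval):
    return pages_r + tuples_r * index_retrieval


def grace_hash(pages_r, pages_s, m, f=1.2):
    # The formula only applies when every R partition fits in memory.
    if m <= math.sqrt(f * pages_r):
        raise ValueError("not enough memory for a two-phase GRACE hash join")
    return 3 * (pages_r + pages_s)


PAGES_R, TUPLES_R, PAGES_S, M = 500, 40_000, 1000, 102  # assumed sample values

for name, cost in [
    ("simple nested loop", simple_nl(PAGES_R, TUPLES_R, PAGES_S)),
    ("block nested loop", block_nl(PAGES_R, PAGES_S, M)),
    ("index nested loop", index_nl(PAGES_R, TUPLES_R, 2.2)),
    ("GRACE hash", grace_hash(PAGES_R, PAGES_S, M)),
]:
    print(f"{name:20s} {cost:>14,.0f} page I/Os")
```

Under these assumptions the simple nested loop costs about 40 million I/Os, while the block nested loop and GRACE hash joins stay in the thousands.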
Query Optimization: Example: SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5. [Query tree: projection on sname, over the selection bid=100 and rating > 5, over the join sid=sid of Reserves and Sailors.] Given: an SQL query joining n tables. Dream: map to the most efficient plan. Reality: avoid rotten plans. State of the art: most optimizers follow System R’s technique, which works fine up to about 10 joins.
Complexity of Query Optimization: Many degrees of freedom: selection (scan versus clustered or non-clustered index); join (block nested loop, sort-merge, hash); relative order of the operators. Exponential search space! Heuristics: push the selections down; push the projections down; delay Cartesian products. System R: only left-deep trees. [Figure: a left-deep join tree over tables A, B, C, D.]
Equivalences in Relational Algebra: Selection: cascade, commutative. Projection: cascade. Join: associative, $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$, and commutative, $(R \bowtie S) \equiv (S \bowtie R)$.
Equivalences in Relational Algebra: A projection commutes with a selection that only uses attributes retained by the projection. A selection between attributes of the two arguments of a cross-product converts the cross-product to a join. A selection on just attributes of R commutes with the join of R and S, i.e., $\sigma(R \bowtie S) \equiv \sigma(R) \bowtie S$. Similarly, if a projection follows a join of R and S, we can ‘push’ it by retaining only the attributes of R (and S) that are needed for the join or are kept by the projection.
System R Optimizer: (1) Find all plans for accessing each base table. For each table, save the cheapest unordered plan and the cheapest plan for each interesting order; discard all others. (2) Try all ways of joining pairs of 1-table plans; save the cheapest unordered plan and the cheapest plan for each interesting order. (3) Try all ways of joining 2-table plans with 1-table plans. (4) Keep combining k-table plans with 1-table plans until you have a full plan tree. (5) At the top, to satisfy GROUP BY and ORDER BY, either use an interesting ordered plan or add a sort node to an unordered plan.
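The bottom-up enumeration can be sketched as dynamic programming over sets of tables. This is a heavily simplified illustration, not System R itself: it keeps a single cheapest left-deep plan per set of tables, ignores interesting orders and access paths, and uses an assumed cost model (the sum of the intermediate result sizes). The function names, cardinalities and selectivities are made up for the example.

```python
from itertools import combinations


def result_size(tables, card, sel):
    """Estimated tuples produced by joining `tables` (independent predicates)."""
    size = 1.0
    for t in tables:
        size *= card[t]
    for a, b in combinations(sorted(tables), 2):
        size *= sel.get(frozenset((a, b)), 1.0)  # 1.0 = no predicate (cross product)
    return size


def best_left_deep_plan(card, sel):
    """Return (cost, join order) of the cheapest left-deep plan."""
    tables = sorted(card)
    # best[S] = (cost, order) of the cheapest plan joining exactly the tables in S.
    best = {frozenset([t]): (0.0, (t,)) for t in tables}
    for k in range(2, len(tables) + 1):
        for subset in combinations(tables, k):
            s = frozenset(subset)
            out = result_size(s, card, sel)
            # Extend the best (k-1)-table plan with one more table, in every way.
            best[s] = min(
                (best[s - {t}][0] + out, best[s - {t}][1] + (t,)) for t in subset
            )
    return best[frozenset(tables)]


card = {"A": 1000, "B": 100, "C": 10}                       # assumed cardinalities
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}  # assumed selectivities
cost, order = best_left_deep_plan(card, sel)
print(cost, order)
```

On this input the sketch joins B and C first, because that pair has the smallest intermediate result, and adds A last; starting with A and C would require a Cartesian product.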
[Figure omitted.] Source: Selinger et al., “Access Path Selection in a Relational Database Management System”
[Figure omitted.] Note: Only branches for NL join are shown here. Additional branches for other join methods (e.g. sort-merge) are not shown. Source: Selinger et al., “Access Path Selection in a Relational Database Management System”
What is “Cheapest”? Need information about the relations and indexes involved. Catalogs typically contain at least: # tuples (NTuples) and # pages (NPages) for each relation; # distinct key values (NKeys) and NPages for each index; index height and low/high key values (Low/High) for each tree index. Catalogs are updated periodically. Updating whenever data changes is too expensive; there is a lot of approximation anyway, so slight inconsistency is OK. More detailed information (e.g., histograms of the values in some field) is sometimes stored.
Estimating Result Size: Consider a query block: SELECT attribute list FROM relation list WHERE term1 AND ... AND termk. The maximum # tuples in the result is the product of the cardinalities of the relations in the FROM clause. The reduction factor (RF) associated with each termi reflects the impact of the term in reducing the result size, given an index I on the column: term col=value has RF 1/NKeys(I); term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2)); term col>value has RF (High(I)-value)/(High(I)-Low(I)). Result cardinality = max # tuples * product of all RFs. Implicit assumption that terms are independent!
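The reduction-factor rules translate directly into code. The sketch below applies them to the running Reserves/Sailors query under assumed statistics: 100,000 Reserves tuples, 40,000 Sailors tuples, 100 distinct bid values, ratings ranging from 1 to 10, and 40,000 distinct sid values. The function names are made up for the example.

```python
from math import prod


def rf_col_equals_value(nkeys):
    return 1.0 / nkeys


def rf_col_equals_col(nkeys1, nkeys2):
    return 1.0 / max(nkeys1, nkeys2)


def rf_col_greater_than(value, low, high):
    return (high - value) / (high - low)


def estimate_cardinality(relation_sizes, reduction_factors):
    """Max # tuples times the product of all RFs (terms assumed independent)."""
    return prod(relation_sizes) * prod(reduction_factors)


estimate = estimate_cardinality(
    relation_sizes=[100_000, 40_000],          # Reserves, Sailors (assumed)
    reduction_factors=[
        rf_col_equals_value(100),              # R.bid = 100
        rf_col_greater_than(5, low=1, high=10),  # S.rating > 5
        rf_col_equals_col(40_000, 40_000),     # R.sid = S.sid
    ],
)
print(round(estimate))
```

With these statistics the estimate is about 556 tuples: 1000 reservations for boat 100, of which 5/9 are expected to belong to sailors with a rating above 5.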
Cost Estimates for Single-Table Plans: Index I on primary key matches selection: cost is Height(I)+1 for a B+ tree, about 1.2 for a hash index. Clustered index I matching one or more selects: (NPages(I)+NPages(R)) * product of RFs of matching selects. Non-clustered index I matching one or more selects: (NPages(I)+NTuples(R)) * product of RFs of matching selects. Sequential scan of file: NPages(R). Note: typically, no duplicate elimination on projections! (Exception: done on answers if the user says DISTINCT.)
Counting the Costs: SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5. Plan: selection bid=100 on Reserves (scan; write to temp T1); selection rating > 5 on Sailors (scan; write to temp T2); sort-merge join on sid=sid; projection on sname (on-the-fly). With 5 buffers, cost of plan: scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution); scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings). Sort T1 (2*10*2), sort T2 (2*250*4), merge (10+250), total = 2300. Total: 4060 page I/Os. If we used BNL join, join cost = 10+4*250, total cost = 2770. If we ‘push’ projections, T1 has only sid, T2 only sid and sname: T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.
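The totals on this slide can be re-derived step by step. The sketch below simply replays the slide's numbers; the variable names are made up, and the pass counts (2 for T1, 4 for T2) are the ones the slide uses for 5 buffers.

```python
import math

BUFFERS = 5
scan_reserves, t1_pages = 1000, 10   # Reserves scan, temp T1 after bid=100
scan_sailors, t2_pages = 500, 250    # Sailors scan, temp T2 after rating>5

# Both selections: scan the base table and write the temp table.
selection_cost = scan_reserves + t1_pages + scan_sailors + t2_pages

# Sort-merge join: 2 * pages * passes for each sort, then one merging read.
sort_t1 = 2 * t1_pages * 2
sort_t2 = 2 * t2_pages * 4
merge = t1_pages + t2_pages
sort_merge_total = selection_cost + sort_t1 + sort_t2 + merge

# Block nested loop join with T1 as outer, read in blocks of BUFFERS - 2 pages.
bnl_join = t1_pages + math.ceil(t1_pages / (BUFFERS - 2)) * t2_pages
bnl_total = selection_cost + bnl_join

print(selection_cost, sort_merge_total, bnl_total)
```

The selections cost 1760 I/Os in both plans; the difference comes entirely from the join, 2300 I/Os for sort-merge against 1010 for block nested loop.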
Buffer Management & Tuning. CS5226 Week 6.
Outline Buffer management concepts & algorithms Buffer tuning
Moore’s Law is being borne out, and it will continue to operate for many more years: CPUs will get faster, disks will get bigger, and communication speeds will rise. Those who started their careers early will probably recall some of these numbers from the time when the magnetic disk was an immature, just-emerging technology. The second column gives numbers for an entry-level workstation. One can easily get a server with 4 CPUs, 4 gigabytes of memory, and plenty of hard disk space for less than 50K. The increase in speed is a few orders of magnitude. Two interesting points to note are the disk/memory ratio and the time taken to scan the full disk. The numbers tell us that disks are getting bigger and that a sequential scan can still be very time-consuming. The ratio tells us that, while it may be possible to have terabytes of memory in the near future, that may still NOT be big enough to store the whole database. Hence a GOOD indexing structure is required!
Memory System CPU Die CPU Registers L1 Cache L2 Cache Main Memory Harddisk
Memory Hierarchy
Time = Seek Time + Rotational Delay + Transfer Time + Other
Rule of Thumb: Random I/O: expensive. Sequential I/O: much less. Example, for a 1 KB block: random I/O: 20 ms; sequential I/O: 1 ms.
Improving Access Time of Secondary Storage: organization of data on disk; disk scheduling algorithms, e.g., the elevator algorithm; multiple disks; mirrored disks; prefetching and large-scale buffering.
DB Buffer vs OS Virtual Memory: The DBMS has more semantics for pages (pages are not all equal) and more semantics for access patterns (queries are not all equal), which facilitates prefetching. There is more concurrency on pages (sharing and correctness), and the DBMS needs pinning and forced writes. Typical OS replacement policies (LRU, MRU, FIFO, LIFO, clock) are not uniformly good.
Basic Concepts [Figure: page requests from higher levels arrive at the buffer pool in main memory, whose frames either hold a disk page or are free; the database resides on disk; the choice of frame is dictated by the replacement policy.] Data must be in RAM for the DBMS to operate on it! To understand the role of the buffer manager, consider the following example. Suppose the database contains 1,000,000 pages, but only 1000 pages of main memory are available for holding data. Consider a query that requires a scan of the entire file. Since all the data cannot reside in main memory at one time, the DBMS must bring pages into main memory as they are needed and, in the process, decide which existing page in main memory to replace to make space for the new page. The policy used to decide which page to replace is called the replacement policy. Buffer pool: the collection of available main memory pages, or frames. The buffer manager maintains a table of <frame#, pageId> pairs.
Basic Concepts: Two variables are maintained for each frame in the buffer pool. Pin count: the number of times the page in the frame has been requested but not released, i.e., the number of current users of the page; set to 0 initially. Dirty bit: indicates whether the page has been modified since it was brought into the buffer pool from disk; turned off initially.
Why Pin a Page? Page is in use by a query/transaction Log/recovery protocol enforced ordering Page is hot and will be needed soon, e.g., root of index trees.
When a Page is Requested ...: If the requested page is not in the pool: (1) choose a frame for replacement; if a free frame is not available, choose a frame with pin count = 0, meaning that all requestors of the page in that frame have unpinned or released it; (2) if the dirty bit is on, write the page to disk; (3) read the requested page into the chosen frame. In all cases, pin the page (increment the pin count; a page in the pool may be requested many times) and return the address of the page to the requestor.
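The request protocol can be sketched as a small class. This is a minimal illustration under stated assumptions: the disk is modeled as a Python dict from page id to page contents, a frame is a dict holding the data, pin count and dirty bit, and the victim is simply the first unpinned frame found (a placeholder for a real replacement policy). The class and method names are made up for the example.

```python
class BufferPool:
    def __init__(self, num_frames, disk):
        self.num_frames = num_frames
        self.disk = disk      # stand-in for the disk: page_id -> contents
        self.frames = {}      # page_id -> {"data", "pin_count", "dirty"}

    def request(self, page_id):
        """Return the frame holding page_id, reading it in if necessary, and pin it."""
        frame = self.frames.get(page_id)
        if frame is None:
            if len(self.frames) >= self.num_frames:  # no free frame left
                self._evict_one()
            frame = {"data": self.disk[page_id], "pin_count": 0, "dirty": False}
            self.frames[page_id] = frame
        frame["pin_count"] += 1
        return frame

    def release(self, page_id, dirty=False):
        """Unpin the page; the caller reports whether it modified the page."""
        frame = self.frames[page_id]
        if frame["pin_count"] == 0:
            raise ValueError("page is not pinned")
        frame["pin_count"] -= 1
        frame["dirty"] = frame["dirty"] or dirty

    def _evict_one(self):
        # Placeholder policy: first frame with pin count 0.
        victim = next(
            (pid for pid, f in self.frames.items() if f["pin_count"] == 0), None)
        if victim is None:
            raise RuntimeError("all frames are pinned")
        frame = self.frames.pop(victim)
        if frame["dirty"]:                 # write back before reusing the frame
            self.disk[victim] = frame["data"]


disk = {1: "a", 2: "b", 3: "c"}
pool = BufferPool(num_frames=2, disk=disk)
f1 = pool.request(1)
pool.request(2)
f1["data"] = "A"                 # modify page 1 while it is pinned
pool.release(1, dirty=True)
pool.request(3)                  # page 1 is the only unpinned frame: it is evicted
print(disk[1], sorted(pool.frames))
```

Page 1 is written back on eviction because its dirty bit is on; pages 2 and 3 remain pinned, so a further request for a non-resident page would fail until one of them is released.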
Buffer Management: Parameters What are the design parameters that distinguish one BM from another? Buffer allocation: subdividing the pool Who owns a subdivision? Global? Per query? Per relation? How many pages to allocate? (working set) Replacement policy Which page to kick out when out of space? Load control Determine how much load to handle
Buffer Replacement Policies: A frame is chosen for replacement by a replacement policy: least-recently-used (LRU), most-recently-used (MRU), first-in-first-out (FIFO), clock / circular order. The policy can have a big impact on the number of I/Os, depending on the access pattern. The critical choice that the buffer manager must make is what block to throw out of the buffer pool when a buffer is needed for a newly requested block. The buffer-replacement strategies commonly used may be familiar to you from other applications of scheduling policies, such as in operating systems.
Buffer Replacement Policies: Least-recently-used (LRU): buffers not used for a long time are less likely to be accessed. Rule: throw out the block that has not been read or written for the longest time. Maintain a table indicating the last time the block in each buffer was accessed; each database access makes an entry in the table. Expensive?
Buffer Replacement Policies: First-In-First-Out (FIFO): Rule: empty the buffer that has been occupied the longest by the same block. Maintain a table indicating the time each block was loaded into its buffer; make an entry in the table each time a block is read from disk. Less maintenance than LRU: there is no need to modify the table when a block is accessed.
Buffer Replacement Policies: Clock Algorithm: The clock algorithm is a commonly implemented, efficient approximation to LRU. Think of the buffers as arranged in a circle. Each buffer has an associated flag, 0 or 1. Buffers with a 0 flag are vulnerable to having their contents sent back to disk; buffers with a 1 are not. When a block is read into a buffer, its flag is set to 1. Likewise, when the contents of a buffer are accessed, its flag is set to 1. A “hand” points to one of the buffers and rotates clockwise when the buffer manager needs a buffer for a new block: it looks for the first 0 it can find, and if it passes a 1, it sets it to 0. Thus, a block is thrown out of its buffer only if it remains unaccessed for the time it takes the hand to make a complete rotation to set its flag to 0, and then make another complete rotation to find the buffer with its 0 unchanged. E.g., the hand will set to 0 the 1 in the buffer to its left, and then move clockwise to find a buffer with 0, whose block it will replace and whose flag it will set to 1.
Buffer Replacement Policies: Clock Algorithm (cont’d): Rule: throw out a block from its buffer if it remains unaccessed while the hand makes a complete rotation to set its flag to 0, and another complete rotation to find the buffer with its 0 unchanged.
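The clock rule fits in a few lines of code. This is a minimal sketch that ignores pinning and dirty pages; the class name and the reference sequence are made up for the example, and empty buffers simply start with flag 0 so that they are filled first.

```python
class ClockBuffer:
    def __init__(self, num_buffers):
        self.blocks = [None] * num_buffers  # block held by each buffer
        self.flags = [0] * num_buffers      # 1 = recently loaded or accessed
        self.hand = 0

    def access(self, block):
        """Access a block; return True on a buffer hit, False on a miss."""
        if block in self.blocks:
            self.flags[self.blocks.index(block)] = 1
            return True
        # Miss: rotate the hand, clearing 1-flags, until a 0-flag is found.
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.blocks)
        self.blocks[self.hand] = block      # replace the victim
        self.flags[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.blocks)
        return False


clock = ClockBuffer(3)
hits = [clock.access(b) for b in ["A", "B", "C", "A", "D", "B", "E"]]
print(hits, clock.blocks)
```

In this trace D replaces A after the hand has cleared every flag in one rotation; B is then accessed again, so its flag protects it and E replaces C instead.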
LRU-K: Self-tuning: approaches the behavior of buffering algorithms in which page sets with known access frequencies are manually assigned to different buffer pools of specifically tuned sizes; does not rely on external hints about workload characteristics; adapts in real time to changing patterns of access. Provably optimal. Reference: E. O’Neil, P. O’Neil, G. Weikum: The LRU-K Page Replacement Algorithm for Database Disk Buffering, SIGMOD 1993, pp. 297-306.
Motivation: GUESS when the page will be referenced again. Problems with LRU? It makes its decision based on too little information: from the time of the last reference alone, it cannot tell frequently referenced pages from infrequently referenced ones. As a result, the system spends resources to keep useless stuff around.
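Example 1: CUSTOMER has 20k tuples, with a clustered B+-tree on CUST_ID at 20 bytes/key. With 4K pages and 4000 bytes of useful space per page, the index has 20,000 * 20 / 4000 = 100 leaf pages. Many users access the data at random, so references alternate between leaf pages (Li) and record pages (Ri): L1, R1, L2, R2, L3, R3, … The probability of referencing a given Li is .005; that of a given Ri is .00005.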
LRU-K Basic Concepts: Idea: take into account history, i.e., the last K references (classic LRU is K = 1, or LRU-1). The algorithm keeps track of history and tries to predict future references.
Basic concepts: Parameters: pages N = {1, 2, …, n}; reference string $r_1, r_2, \ldots, r_t, \ldots$, where $r_t = p$ means page p is referenced at time t; $b_p$ = probability that $r_{t+1} = p$; time between references of p: $I_p = 1/b_p$.
Algorithm: Backward K-distance $b_t(p,K)$: the number of references from time t back to the Kth most recent reference to p; $b_t(p,K) = \infty$ if the Kth reference does not exist. Algorithm: drop the page p with the maximum Backward K-distance $b_t(p,K)$. The choice is ambiguous when several distances are infinite (use a subsidiary policy, e.g., LRU). LRU-2 is better than LRU-1. Why?
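The definition can be tried out on a short reference string. The sketch below is the plain rule only (no correlated reference period, no retained information period): it computes the Backward K-distance from a list of past references and picks the victim with the maximum distance, breaking ties by classic LRU. The function names and the reference string are made up for the example.

```python
import math


def backward_k_distance(refs, page, k):
    """Distance from the end of `refs` back to the k-th most recent reference
    to `page`; infinity if the page has fewer than k references."""
    seen = 0
    for distance, ref in enumerate(reversed(refs)):
        if ref == page:
            seen += 1
            if seen == k:
                return distance
    return math.inf


def choose_victim(refs, resident, k):
    """Drop the resident page with the maximum Backward K-distance.
    Ties (e.g., several infinite distances) fall back to LRU."""
    return max(
        resident,
        key=lambda p: (backward_k_distance(refs, p, k),
                       backward_k_distance(refs, p, 1)),
    )


refs = list("ABABC")        # A and B are hot; C was touched once, just now
resident = ["A", "B", "C"]
print(choose_victim(refs, resident, k=1), choose_victim(refs, resident, k=2))
```

LRU-1 evicts the hot page A because it was touched least recently, whereas LRU-2 evicts C, which has only one reference and therefore an infinite Backward 2-distance. This is one way to see why LRU-2 separates frequently from infrequently referenced pages better than LRU-1.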
Problem 1: Early page replacement. A page's $b_t(p,K)$ is infinite, so it is dropped. What if it is a rare but “bursty” case? What if there are correlated references? Intra-transaction, e.g., a tuple is read and then updated. Transaction retry. Intra-process, i.e., a process references a page via 2 transactions, e.g., update RIDs 1-10, commit, update 11-20, commit, … Inter-process, i.e., two processes reference the same page independently.
Example: Assume the intra-transaction case, a read followed by an update. The algorithm sees p (read) and drops it, because $b_t(p,K)$ is infinite (wrong decision). It then sees p again (update) and keeps it around (wrong decision again).
Addressing Correlation: Correlated Reference Period (CRP): no penalty or credit for references within the CRP. $I_p$: the interval from the end of one CRP to the beginning of the next CRP.
Problem 2: Retained reference information. The algorithm needs to keep information for pages that may no longer be resident. E.g., with LRU-2: p is referenced and comes in for the first time; $b_t(p,2) = \infty$, so p is dropped; p is referenced again; if no information on p is retained, p may be dropped again.
Solution to Problem 2 Retained Information Period (RIP) Period after which we drop information about page p Upper bound: max Backward K-distance of all pages we want to ensure to be memory resident
Data Structures for LRU-K: HIST(p): history control block of page p, holding the times of the K most recent references to p, not counting correlated references. LAST(p): time of the most recent reference to page p, correlated references included. Both are maintained for all pages p with $b_t(p,K) < RIP$.
LRU-K Algorithm (on a reference to page p at time t)
    if p is in the buffer {                     // update history of p
        if (t – LAST(p)) > CRP {                // uncorrelated reference
            // close the correlated period and start a new one
            for i = K-1 downto 1
                move HIST(p,i) into slot HIST(p,i+1)
            HIST(p,1) = t
        }
        LAST(p) = t
    } else {                                    // select replacement victim
        min = t
        for all pages q in buffer {
            if (t – LAST(q)) > CRP              // eligible for replacement
               and HIST(q,K) < min {            // max Backward K-distance
                victim = q
                min = HIST(q,K)
            }
        }
        if victim is dirty, write it back before dropping it
        fetch p into the victim’s buffer
        if no HIST(p) exists {
            allocate HIST(p)
            for i = 2 to K
                HIST(p,i) = 0
        } else {
            for i = K downto 2                  // shift from the oldest slot first
                HIST(p,i) = HIST(p,i-1)
        }
        HIST(p,1) = t                           // last non-correlated reference
        LAST(p) = t                             // last reference
    }
Example 2 R has 1M tuples A bunch of processes ref 5000 (0.5%) tuples A few batch processes do sequential scans
Stochastic OS Replacement Policies Least recently used (LRU) Most recently used (MRU) First in first out (FIFO) Last in first out (LIFO) … None takes into account DBMS access patterns
Domain Separation [Figure: the buffer pool is divided into domains (hash index, B-tree, data), with an arrow labeled STEAL.]
Domain Separation (2): Pages are classified into domains, with LRU within each domain. Limitations: domains are static; pages belong to domains regardless of usage (e.g., sequential scan versus nested loop); there is no differentiation in the importance of domains (e.g., an index page is more useful than a data page); it does not prevent over-utilization of memory by multiple users, since there is no notion of users/queries, so orthogonal load control is needed.
Group LRU [Figure: the buffer pool is divided into domains (B-tree, hash index, data) plus a free list, with an arrow labeled STEAL.]
Group LRU (2): Like Domain Separation, but the domains are prioritized and buffers are stolen in order of priority. No convincing evidence that this is better than LRU!
“New” Algorithm in INGRES: Each relation needs a working set. The buffer pool is subdivided and allocated on a per-relation basis. Each active relation is assigned a resident set, which is initially empty. The resident sets are linked in a priority list; relations unlikely to be reused are near the top. The ordering of relations is pre-determined and may be adjusted subsequently. To find a buffer to replace, search from the top of the list; within each relation, use MRU.
“New” Algorithm: Pros: a new approach that tracks the locality of a query through relations. Cons: MRU is not always good; how should priority be determined (especially in a multi-user context)?; costly search of the list under high loads.
Hot-Set Model: Hot set: a set of pages over which there is a looping behavior. A hot set in memory implies efficient query processing. In the plot of # page faults against the size of the buffer, the points of discontinuity are called hot points.
Hot Set (2) [Figure: two plots of # page faults against # buffers. Under LRU the curve has a hot point (a discontinuity in the curve); under MRU there is no hot point.]
Hot Set Model: Key ideas: give each query |hot set| pages; allow <=1 deficient query to execute; the hot set size is computed by the query optimizer (which provides a more accurate reference pattern); use LRU within each partition. Problems: LRU is not always best, and it leads to allocating more memory; the model over-allocates pages for some phases of a query.
DBMIN Based on “Query Locality Set Model” (QLSM): DBMS supports a limited set of operations; Reference patterns exhibited are regular and predictable; Complex patterns can be decomposed into simple ones Reference pattern classification: Sequential, Random, Hierarchical Reference: Hong-Tai Chou, David J. DeWitt: An Evaluation of Buffer Management Strategies for Relational Database Systems. VLDB 1985: 127-141
DBMS Reference Patterns Straight sequential (SS) Clustered sequential (CS) Looping sequential (LS) Independent random (IR) Clustered random (CR) Straight hierarchical (SH) Hierarchical with straight sequential (H/SS) Hierarchical with clustered sequential (H/CS) Looping hierarchical (LH)
Sequential Patterns: Straight sequential (SS). File scan without repetition, e.g., selection on an unordered relation. #pages? 1. Replacement algorithm? Replace the current page with the next one. [Figure: Table R with pages R1 to R6]
Sequential Patterns (Cont): Clustered sequential (CS). Like inner S for merge-join (sequential with backup); local rescan in SS. Join condition: R.a = S.a. #pages? Number of pages in the largest cluster. Replacement algorithm? FIFO/LRU. [Figure: join-column values of R and S]
Sequential Patterns (Cont): Looping sequential (LS). A sequential reference is repeated several times, e.g., inner S for nested-loop join. #pages? As many as possible. Replacement algorithm? MRU. [Figure: join-column values of R and S]
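Why MRU suits a looping scan can be checked with a short simulation. This is a minimal sketch under assumed parameters (an inner relation of 5 pages scanned 4 times with 4 buffers); the function name `count_faults` is hypothetical and not from the slides.

```python
from collections import OrderedDict

def count_faults(refs, n_buffers, policy):
    """Count page faults for a reference string under 'LRU' or 'MRU'."""
    if policy not in ("LRU", "MRU"):
        raise ValueError(f"unknown policy: {policy}")
    frames = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in refs:
        if page in frames:
            frames.move_to_end(page)  # hit: mark as most recently used
            continue
        faults += 1
        if len(frames) >= n_buffers:
            # LRU evicts the oldest entry (last=False), MRU the newest (last=True)
            frames.popitem(last=(policy == "MRU"))
        frames[page] = True
    return faults

# Assumed workload: inner relation of 5 pages, scanned 4 times, 4 buffers.
loop = list(range(5)) * 4
lru = count_faults(loop, 4, "LRU")
mru = count_faults(loop, 4, "MRU")
print(lru, mru)
```

Under LRU the page needed next is always the one just evicted, so every reference faults; MRU keeps most of the loop resident and faults only once per pass after the first.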
Random Patterns: Independent random (IR). Genuinely random accesses, e.g., non-clustered index scan. Locality set: one page (assuming a low probability of re-access). Any replacement algorithm! [Figure: pages R1 to R6]
Random Patterns: Clustered random (CR). Random accesses which demonstrate locality, e.g., join with an inner, non-clustered, non-unique index on the join column. Locality set: number of records in the largest cluster. Replacement algorithm: as in CS. [Figure: pages R1 to R6 and S1 to S6]
Hierarchical Patterns: Straight hierarchical (SH). Access index pages ONCE (retrieve a single tuple): like SS. Followed by straight sequential scan (H/SS): like SS. Followed by clustered scan (H/CS): like CS. [Figure: pages R1 to R6]
Hierarchical Patterns: Looping hierarchical (LH). Repeatedly traverse an index, e.g., when the inner index in a join is repeatedly accessed. Locality set size is the height of the tree. LIFO; need to keep the root.
DBMIN A buffer management algorithm based on QLSM Buffers allocated on a per-file instance basis Active instances of same file have different BPs Those are independently managed May share a same buffered page through global table Each file instance has its locality set (lset) of pages Manage each lset by the access pattern for that file Each page in buffer belongs to at most 1 lset Global, shared table of buffers too
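The per-pattern choices listed on the reference-pattern slides can be collected into a lookup table, which is roughly what "manage each lset by the access pattern" amounts to. The sketch below is an illustrative data structure only; the names `QLSM_POLICY` and `policy_for` are hypothetical and this is not DBMIN's actual implementation.

```python
# pattern -> (locality set size, replacement policy), as stated on the slides
QLSM_POLICY = {
    "SS":   ("1 page", "replace with next page"),
    "CS":   ("pages in largest cluster", "FIFO/LRU"),
    "LS":   ("as many pages as possible", "MRU"),
    "IR":   ("1 page", "any"),
    "CR":   ("records in largest cluster", "FIFO/LRU"),   # "as in CS"
    "SH":   ("1 page", "replace with next page"),         # "like SS"
    "H/SS": ("1 page", "replace with next page"),         # "like SS"
    "H/CS": ("pages in largest cluster", "FIFO/LRU"),     # "like CS"
    "LH":   ("height of the index tree", "LIFO"),         # keep the root
}

def policy_for(pattern):
    """Return the replacement policy the slides give for a reference pattern."""
    size, policy = QLSM_POLICY[pattern]
    return policy

for name, (size, policy) in QLSM_POLICY.items():
    print(f"{name:5} size: {size:28} policy: {policy}")
```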
What’s Implemented in DBMS? DB2 & Sybase ASE Named pools to be bound to tables or indexes Each pool can be configured to use clock replacement or LRU (ASE) Client can indicate pages to replace Oracle A table can be bound to 1 to 2 pools, one with higher priority to be kept Others Global pools with simple policies
Summary Algorithms LRU, MRU, LIFO, … Domain separation (assign pages to domain) Group LRU (prioritize domains) NEW (resident set per relation) Hot set (per query) DBMIN (locality set per file instance) DBMS reference patterns Sequential Straight Sequential Clustered Sequential Looping Sequential Random Independent Random Clustered Random Hierarchical Straight Hierarchical With Straight Sequential With Clustered Sequential Looping Hierarchical
Buffer Tuning DBMS buffer tuning (Oracle 9i)
Database Buffers: Application buffers, DBMS buffers, OS buffers. An application can have its own in-memory buffers (e.g., variables in the program; cursors). A logical read/write is issued to the DBMS if the data needs to be read from or written to the DBMS. A physical read/write is issued by the DBMS using its own page replacement algorithm, and the request is passed to the OS. The OS may initiate I/O operations to support the virtual memory the DBMS buffer is built on.
Database Buffer Size [Figure labels: LOG, DATA, RAM, Paging Disk, DATABASE PROCESSES, DATABASE BUFFER] Buffer too small: hit ratio too small. hit ratio = (logical accesses - physical accesses) / (logical accesses). Buffer too large: wasteful at the expense of others. Recommended strategy: monitor the hit ratio and increase the buffer size until the hit ratio flattens out. If there is still paging, then buy memory.
Overall Cache Hit Ratio
Cache hit ratio = (# logical reads - # physical reads) / # logical reads
Ideally, hit ratio > 80%
Overall buffer cache hit ratio for the entire instance:
SELECT (P1.value + P2.value - P3.value) / (P1.value + P2.value)
FROM v$sysstat P1, v$sysstat P2, v$sysstat P3
WHERE P1.name = 'db block gets'
AND P2.name = 'consistent gets'
AND P3.name = 'physical reads'
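The same arithmetic can be written out as a tiny worked example. This is a sketch with made-up counter values; the function name `cache_hit_ratio` is hypothetical, and the three arguments stand for the three statistics read from v$sysstat.

```python
def cache_hit_ratio(db_block_gets, consistent_gets, physical_reads):
    """(logical reads - physical reads) / logical reads."""
    logical_reads = db_block_gets + consistent_gets
    if logical_reads == 0:
        raise ValueError("no logical reads recorded yet")
    return (logical_reads - physical_reads) / logical_reads

# Hypothetical counter values, for illustration only.
ratio = cache_hit_ratio(db_block_gets=120_000,
                        consistent_gets=880_000,
                        physical_reads=150_000)
print(f"hit ratio = {ratio:.2%}")
meets_target = ratio > 0.80   # the slide's rule of thumb
```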
Session Cache Hit Ratio
Buffer cache hit ratio for one specific session:
SELECT (P1.value + P2.value - P3.value) / (P1.value + P2.value)
FROM v$sesstat P1, v$statname N1, v$sesstat P2, v$statname N2, v$sesstat P3, v$statname N3
WHERE N1.name = 'db block gets'
AND P1.statistic# = N1.statistic#
AND P1.sid = <enter SID of session here>
AND N2.name = 'consistent gets'
AND P2.statistic# = N2.statistic#
AND P2.sid = P1.sid
AND N3.name = 'physical reads'
AND P3.statistic# = N3.statistic#
AND P3.sid = P1.sid
Adjust Buffer Cache Size Buffer size = db_block_buffers * db_block_size db_block_size is set at database creation; cannot tune Change the db_block_buffers parameter
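As a worked example of the sizing formula, the sketch below multiplies the two parameters. The values 20000 buffers and 8192-byte blocks are assumptions picked for illustration, not recommendations.

```python
def buffer_cache_bytes(db_block_buffers, db_block_size):
    """Buffer size = db_block_buffers * db_block_size."""
    return db_block_buffers * db_block_size

# Hypothetical settings: 20000 buffers of 8 KB each.
size_bytes = buffer_cache_bytes(20_000, 8_192)
size_mib = size_bytes / 2**20
print(size_bytes, size_mib)
```

Since db_block_size is fixed at database creation, only the first factor is available for tuning.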
Should Buffer Cache Be Larger?
Set db_block_lru_extended_statistics to 1000. Incurs overhead! Set back to 0 when done.
SELECT 250 * TRUNC (rownum / 250) + 1 || ' to ' || 250 * (TRUNC (rownum / 250) + 1) "Interval", SUM (count) "Buffer Cache Hits"
FROM v$recent_bucket
GROUP BY TRUNC (rownum / 250)

Interval       Buffer Cache Hits
-----------    -----------------
1 to 250       16083
251 to 500     11422
501 to 750     683
751 to 1000    177
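One way to read such a table is to accumulate the extra hits each additional block of 250 buffers would have produced and stop where the gain tails off. The sketch below uses the numbers from the sample output; the cut-off of 5000 hits per interval and the helper names are assumptions for illustration.

```python
recent_bucket = [
    ("1 to 250", 16083),
    ("251 to 500", 11422),
    ("501 to 750", 683),
    ("751 to 1000", 177),
]

def cumulative_hits(buckets):
    """Running total of additional cache hits as buffers are added."""
    totals, running = [], 0
    for _, hits in buckets:
        running += hits
        totals.append(running)
    return totals

def worthwhile_intervals(buckets, min_hits):
    """Number of leading intervals that each add at least min_hits hits."""
    n = 0
    for _, hits in buckets:
        if hits < min_hits:
            break
        n += 1
    return n

totals = cumulative_hits(recent_bucket)
extra_buffers = 250 * worthwhile_intervals(recent_bucket, min_hits=5000)
print(totals, extra_buffers)
```

With this sample, the first 500 extra buffers account for almost all of the potential gain, so growing the cache beyond that would buy little.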
Should Buffer Cache Be Smaller?
Set db_block_lru_statistics to true.
SELECT 1000 * TRUNC (rownum / 1000) + 1 || ' to ' || 1000 * (TRUNC (rownum / 1000) + 1) "Interval", SUM (count) "Buffer Cache Hits"
FROM v$current_bucket
WHERE rownum > 0
GROUP BY TRUNC (rownum / 1000)

Interval        Buffer Cache Hits
------------    -----------------
1 to 1000       668415
1001 to 2000    281760
2001 to 3000    166940
3001 to 4000    14770
4001 to 5000    7030
5001 to 6000    959
I/O Intensive SQL Statements
v$sqlarea contains one row for each SQL statement currently in the system global area (SGA)
Executions: # times the statement has been executed since entering the SGA
Buffer_gets: total # logical reads by all executions of the statement
Disk_reads: total # physical reads by all executions of the statement
SELECT executions, buffer_gets, disk_reads, first_load_time, sql_text
FROM v$sqlarea
ORDER BY disk_reads
Swapping of Data Pages Monitoring tools: sar or vmstat If system is swapping Remove unnecessary system daemons and applications Decrease number of database buffers Decrease number of UNIX file buffers
Paging of Program Blocks Monitoring tools: sar or vmstat To reduce paging Install more memory Move some programs to another machine Configure SGA to use less memory Compare paging activities during fast versus slow response
SAR – Monitoring Tool
vmstat -S 5 8
procs   memory     page                 disk        faults      cpu
r b w   swap  free si so pi po fr de sr f0 s0 s1 s3  in  sy  cs us sy  id
0 0 0   1892  5864  0  0  0  0  0  0  0  0  0  0  0  90  74  24  0  0  99
0 0 0  85356  8372  0  0  0  0  0  0  0  0  0  0  0  46  25  21  0  0 100
0 0 0  85356  8372  0  0  0  0  0  0  0  0  0  0  0  47  20  18  0  0 100
0 0 0  85356  8372  0  0  0  0  0  0  0  0  0  0  2  53  22  20  0  0 100
0 0 0  85356  8372  0  0  0  0  0  0  0  0  0  0  0  87  23  21  0  0 100
0 0 0  85356  8372  0  0  0  0  0  0  0  0  0  0  0  48  41  23  0  0 100
0 0 0  85356  8372  0  0  0  0  0  0  0  0  0  0  0  44  20  18  0  0 100
0 0 0  85356  8372  0  0  0  0  0  0  0  0  0  0  0  51  71  24  0  0 100
Annotations: si, so = # swap-ins, swap-outs per sec; pi, po = # page-ins, page-outs per sec; w = swapped-out processes
Buffer Size - Data Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); clustered index c on employees(lat); (unused) 10 distinct values of lat and long, 100 distinct values of hundreds1 and hundreds2 20000000 rows (630 Mb); Warm Buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000 RPM), Windows 2000.
Buffer Size - Queries Scan Query select sum(long) from employees; Multipoint query select * from employees where lat = ?;
Database Buffer Size SQL Server 7 on Windows 2000 Scan query: LRU (least recently used) does badly when table spills to disk as Stonebraker observed 20 years ago. Multipoint query: Throughput increases with buffer size until all data is accessed from RAM.
Summary Monitor cache hit ratio Increase/reduce buffer cache size Pay attention to I/O intensive SQL statements Avoid swapping Check for excessive paging
CS5226 Hardware Tuning
Application Programmer (e.g., business analyst, Data architect) Application Sophisticated Application Programmer (e.g., SAP admin) Query Processor Indexes Storage Subsystem Concurrency Control Recovery DBA, Tuner Operating System Hardware [Processor(s), Disk(s), Memory]
Outline Part 1: Tuning the storage subsystem RAID storage system Choosing a proper RAID level Part 2: Enhancing the hardware configuration
Magnetic Disks
[Figure labels: controller, read/write head, disk arm, tracks, platter, spindle, actuator, disk interface]
1956: IBM (RAMAC), first disk drive. 5 Mb, 0.002 Mb/in2, 35000$/year, 9 Kb/sec
1980: SEAGATE, first 5.25’’ disk drive. 5 Mb, 1.96 Mb/in2, 625 Kb/sec
1999: IBM MICRODRIVE, first 1’’ disk drive. 340Mb, 6.1 MB/sec
Discussion: See disks downstairs: form factor is an element in the evolution.
Areal density is the main challenge for capacity: how many bits per track (or per sector, a fraction of a track)? Coating of the platter, bit encoding, process for encoding information.
Rotation speed is the main challenge for throughput: how fast can the heads read and decode without making too many mistakes? Currently around 10000 RPM.
Actuator and control are key for the access time: controller with cache and processor.
Magnetic Disks: Access Time (2001)
Controller overhead (0.2 ms)
Seek time (4 to 9 ms)
Rotational delay (2 to 6 ms)
Read/write time (10 to 500 KB/ms)
Disk interface: IDE (16 bits, Ultra DMA - 25 MHz); SCSI: width (narrow 8 bits vs. wide 16 bits), frequency (Ultra3 - 80 MHz). http://www.pcguide.com/ref/hdd/
Question: is data stored on both sides of a platter?
Discussion, implications for performance: minimize seek time (sequential access, prefetching/large pages); shared disk (wait most of the time; minimize disk arm movement)
Question: given disk characteristics, what are the principles for data layout? 1. Hot data in RAM: cache (memory hierarchy). 2. Sequential access vs. random access: sequential access should be favored (locality for reads and writes should be sought and preserved). 3. Role of controller.
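The components above add up to the time for one page access, which shows why sequential access is favored. The sketch below plugs in the ranges quoted on the slide; the 8 KB page size, the 100-page workload and the function names are assumptions for illustration.

```python
def page_access_ms(seek_ms, rotation_ms, transfer_kb_per_ms,
                   page_kb=8.0, controller_ms=0.2):
    """Controller overhead + seek + rotational delay + transfer time."""
    return controller_ms + seek_ms + rotation_ms + page_kb / transfer_kb_per_ms

def sequential_read_ms(n_pages, seek_ms, rotation_ms, transfer_kb_per_ms,
                       page_kb=8.0, controller_ms=0.2):
    """Position the arm once, then transfer n_pages back to back."""
    return (controller_ms + seek_ms + rotation_ms
            + n_pages * page_kb / transfer_kb_per_ms)

best = page_access_ms(seek_ms=4, rotation_ms=2, transfer_kb_per_ms=500)
worst = page_access_ms(seek_ms=9, rotation_ms=6, transfer_kb_per_ms=10)

# 100 pages at the slow end of each range: random vs. sequential layout.
random_100 = 100 * worst
sequential_100 = sequential_read_ms(100, 9, 6, 10)
print(best, worst, random_100, sequential_100)
```

In this model the random reads pay the seek and rotational delay 100 times, while the sequential read pays them once.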
Storage Metrics
                      DRAM     Disk              Tape Robot
Unit Capacity         2GB      18GB              14x70Gb
Unit Price            1600$    467$              20900$
$/Gb                  800      26                21
Latency (sec)         1.E-8    2.E-3 (15k RPM)   3.E+1
Bandwidth (Mbps)      1000     40 (up to 160)    40 (up to 100)
Kaps                  1.E+6    470               3.E-2
Maps                  1.E+3    23
Scan time (sec/Tb)    2        450               24500
Hardware Bandwidth: System Bandwidth Yesterday, in megabytes per second (not to scale!) The familiar bandwidth pyramid: the farther from the CPU, the less the bandwidth. [Figure: Hard Disk: 15 per disk | SCSI: 40 | PCI: 133 | Memory: 422 | Processor] Slide courtesy of J. Gray/L.Chung
Hardware Bandwidth: System Bandwidth Today, in megabytes per second (not to scale!) The familiar pyramid is gone! PCI is now the bottleneck! In practice, 3 disks can reach saturation using sequential IO. [Figure: Hard Disk: 26 | SCSI: 160 | PCI: 133 | Memory: 1,600 | Processor] Slide courtesy of J. Gray/L.Chung
RAID Storage System Redundant Array of Inexpensive Disks Combine multiple small, inexpensive disk drives into a group to yield performance exceeding that of one large, more expensive drive Appear to the computer as a single virtual drive Support fault-tolerance by redundantly storing information in various ways
RAID 0 - Striping No redundancy: no fault tolerance High I/O performance: parallel I/O
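Striping can be pictured as a mapping from a logical block number to a disk and a position on that disk. The sketch below assumes plain round-robin placement with one block per stripe unit; real controllers use configurable stripe sizes, and the name `locate_block` is hypothetical.

```python
def locate_block(logical_block, n_disks):
    """Round-robin striping: return (disk index, stripe index on that disk)."""
    if n_disks < 1 or logical_block < 0:
        raise ValueError("need n_disks >= 1 and logical_block >= 0")
    return logical_block % n_disks, logical_block // n_disks

# Eight consecutive logical blocks spread over a 4-disk array.
layout = [locate_block(b, 4) for b in range(8)]
print(layout)
```

Consecutive blocks land on different disks, so a large sequential read can keep all four disks busy at once; losing any one disk loses part of every file.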
RAID 1 – Mirroring Provide good fault tolerance Works ok if one disk in a pair is down One write = a physical write on each disk One read = either read both or read the less busy one Could double the read rate
RAID 3 - Parallel Array with Parity Fast read/write All disk arms are synchronized Speed is limited by the slowest disk
Parity Check - Classical An extra bit added to a byte to detect errors in storage or transmission Even (odd) parity means that the parity bit is set so that there is an even (odd) number of one bits in the word, including the parity bit A single parity bit reliably detects only single-bit errors (more generally, an odd number of flipped bits): if an even number of bits are wrong, the parity does not change It is not possible to tell which bit is wrong
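A few lines of code make both properties concrete: a single flipped bit is caught, while two flipped bits go unnoticed. This is a minimal sketch of even parity over one byte; the function names and the sample byte are illustrative choices.

```python
def even_parity_bit(byte):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(byte).count("1") % 2

def parity_ok(byte, parity_bit):
    """True if byte plus parity bit contain an even number of 1 bits."""
    return (bin(byte).count("1") + parity_bit) % 2 == 0

data = 0b10110010                 # four 1 bits, so the parity bit is 0
p = even_parity_bit(data)

one_flip = data ^ 0b00000001      # single-bit error
two_flips = data ^ 0b00000011     # double-bit error

print(p, parity_ok(data, p), parity_ok(one_flip, p), parity_ok(two_flips, p))
```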
RAID 5 – Parity Checking Parity is used for error detection and for rebuilding the data of a single failed disk, rather than full redundancy Each stripe unit has an extra parity stripe Parity stripes are distributed across the disks
RAID 5 Read/Write Read: parallel stripes read from multiple disks Good performance Write: 2 reads + 2 writes Read old data stripe; read parity stripe (2 reads) XOR old data stripe with new data stripe. XOR result into parity stripe. Write new data stripe and new parity stripe (2 writes).
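The write procedure works because XOR is its own inverse: removing the old data from the parity and adding the new data gives the same result as recomputing the parity over the whole stripe. The sketch below checks this on a toy 3-disk stripe; the 4-byte blocks and the helper name `xor_blocks` are assumptions for illustration.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy stripe: three data stripe units plus one parity unit.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*stripe)

# Small write to unit 1: read old data and old parity (2 reads),
# XOR old data, new data and old parity, then write both back (2 writes).
new_data = b"ZZZZ"
new_parity = xor_blocks(parity, stripe[1], new_data)
stripe[1] = new_data

full_recompute = xor_blocks(*stripe)

# If the disk holding unit 1 fails, its content follows from the survivors.
recovered = xor_blocks(stripe[0], stripe[2], new_parity)
print(new_parity == full_recompute, recovered)
```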
RAID 10 – Striped Mirroring RAID 10 = Striping + mirroring A striped array of RAID 1 arrays High performance of RAID 0, and high tolerance of RAID 1 (at the cost of doubling disks) More information about RAID disks at http://www.acnc.com/04_01_05.html
Hardware vs. Software RAID Software RAID: runs on the server’s CPU Directly dependent on server CPU performance and load Occupies host system memory and CPU cycles, degrading server performance Hardware RAID: runs on the RAID controller’s CPU Does not occupy any host system memory Is not operating system dependent Host CPU can execute applications while the array adapter's processor simultaneously executes array functions: true hardware multi-tasking
RAID Levels - Data Settings: accounts(number, branchnum, balance); create clustered index c on accounts(number); 100000 rows; Cold Buffer. Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
RAID Levels - Transactions No Concurrent Transactions: Read Intensive: select avg(balance) from accounts; Write Intensive, e.g. typical insert: insert into accounts values (690466,6840,2272.76); Writes are uniformly distributed.
RAID Levels SQL Server7 on Windows 2000 (SoftRAID means striping/parity at host) Read-Intensive: Using multiple disks (RAID0, RAID 10, RAID5) increases throughput significantly. Write-Intensive: Without cache, RAID 5 suffers. With cache, it is ok.
Comparing RAID Levels (RAID 0, RAID 1, RAID 5, RAID 10)
Read: High, 2X
Write: Medium
Fault tolerance: No, Yes
Disk utilization: Low
Key problems: RAID 0: data lost when any disk fails; RAID 1: uses double the disk space; RAID 5: lower throughput with disk failure; RAID 10: very expensive, not scalable
Key advantages: RAID 0: high I/O performance; RAID 1: very high I/O performance; RAID 5: a good overall balance; RAID 10: high reliability with good performance
Controller Pre-fetching No, Write-back Yes Read-ahead: Prefetching at the disk controller level. No information on access pattern. Better to let database management system do it. Write-back vs. write through: Write back: transfer terminated as soon as data is written to cache. Batteries to guarantee write back in case of power failure Write through: transfer terminated as soon as data is written to disk.
SCSI Controller Cache - Data Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); create clustered index c on employees(hundreds2); Employees table partitioned over two disks; Log on a separate disk; same controller (same channel). 200 000 rows per table Database buffer size limited to 400 Mb. Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
SCSI (not disk) Controller Cache - Transactions No Concurrent Transactions: update employees set lat = long, long = lat where hundreds2 = ?; cache friendly: update of 20,000 rows (~90Mb) cache unfriendly: update of 200,000 rows (~900Mb)
SCSI Controller Cache SQL Server 7 on Windows 2000. Adaptec ServerRaid controller: 80 Mb RAM Write-back mode Updates Controller cache increases throughput whether operation is cache friendly or not. Efficient replacement policy!
Which RAID Level to Use? Data and Index Files: RAID 5 is best suited for read-intensive apps or if the RAID controller cache is effective enough. RAID 10 is best suited for write-intensive apps. Log File: RAID 1 is appropriate. Fault tolerance with high write throughput. Writes are synchronous and sequential. No benefits in striping. Temporary Files: RAID 0 is appropriate. No fault tolerance. High throughput.
What RAID Provides Fault tolerance: it does not prevent disk drive failures; it enables real-time data recovery High I/O performance Mass data capacity Configuration flexibility Lower protected storage costs Easy maintenance
Enhancing Hardware Config. Add memory: Cheapest option to get better performance Can be used to enlarge the DB buffer pool: better hit ratio If used to enlarge the OS buffer (as disk cache), it benefits the database, but other apps as well Add disks Add processors
Add Disks Larger disk ≠ better performance: the bottleneck is disk bandwidth Add disks for: A dedicated disk for the log; Switching RAID 5 to RAID 10 for update-intensive apps; Moving secondary indexes to another disk for write-intensive apps; Partitioning read-intensive tables across many disks Consider intelligent disk systems: automatic replication and load balancing
Add Processors Function parallelism: Use different processors for different tasks (GUI, Query Optimisation, TT&CC, different types of apps, different users) Operation pipelines, e.g., scan, sort, select, join… Easy for RO apps, hard for update apps Data partition parallelism: Partition data, thus the operation on the data
Parallelism Some tasks are easier to parallelize E.g., join phase of GRACE hash join E.g., scan, join, sum, min Some tasks are not so easy E.g., sorting, avg, nested-queries
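One reason avg is harder than sum or min under data partitioning is that the partial results cannot simply be averaged again; each partition has to return a (sum, count) pair that is combined at the end. The sketch below illustrates this on two unevenly sized partitions; the data and function names are made up, and the per-partition step runs serially here although it could run on separate processors.

```python
def partial_aggregate(partition):
    """Work done independently on one partition: return (sum, count)."""
    return sum(partition), len(partition)

def partitioned_avg(partitions):
    """Combine per-partition (sum, count) pairs into the global average."""
    partials = [partial_aggregate(p) for p in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partitions = [[1, 2, 3], [10]]          # unevenly sized partitions
correct = partitioned_avg(partitions)   # (1 + 2 + 3 + 10) / 4

# Averaging the per-partition averages gives a different, wrong answer.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
print(correct, naive)
```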
Summary We have covered: The storage subsystem RAID: what are they and which one to use? Memory, disks and processors When to add what?
Database Tuning Database Tuning is the activity of making a database application run more quickly. “More quickly” usually means higher throughput, though it may mean lower response time for time-critical applications.
Tuning Principles Think globally, fix locally Partitioning breaks bottlenecks (temporal and spatial) Start-up costs are high; running costs are low Render unto server what is due unto server Be prepared for trade-offs (indexes and inserts)
Tuning Mindset Set reasonable performance tuning goals Measure and document current performance Identify current system performance bottleneck Identify current OS bottleneck Tune the required components eg: application, DB, I/O, contention, OS etc Track and exercise change-control procedures Repeat step 3 through 7 until the goal is met
Goals Met? Appreciation of DBMS architecture Study the effect of various components on the performance of the systems Tuning principle Troubleshooting techniques for chasing down performance problems Hands-on experience in Tuning