Database Tuning
Prerequisites
– Clustered indexes
– B+ tree indexing
– Hash indexing
– ISAM (Indexed Sequential Access Method)
Physical Database Design and Tuning
– Physical database design: creating the physical schema once the conceptual schema and integrity constraints have been defined
– Tuning: refining the conceptual schema, and hence modifying the physical schema and indexing schemes, to improve the performance of the underlying database
– Workload: an estimate of the frequency and selectivity of a list of expected queries and update operations, together with user performance preferences, if any
Workload
– List of queries and their frequencies
– List of updates and their frequencies
– Performance goals (as expected by users) for the listed queries and updates
Analysis of the workload
– For each query, identify: the relations accessed by the operation; the attributes retained (SELECT clause); the attributes with selection or join conditions on them (WHERE clause); the selectivity of the conditions
– For each update operation, identify: the relations accessed by the operation; the kind of update (DELETE, INSERT or UPDATE); the fields modified by an UPDATE
Important decisions made during tuning
– Creation of indexes: what type of index to use; whether the index should be clustered or not
– Alterations to the conceptual schema: choosing alternative normalization schemes; denormalization; vertical partitioning; views
– Rewriting frequently run queries and transactions so that they run more efficiently
Guidelines with example
Guideline 1: Decide whether to index a particular query attribute or not
Issues
– With no index, searching for a tuple takes O(n) time, where n is the total number of tuples
– Indexing an attribute improves query performance, since the searched tuple can be located without a sequential scan and therefore in less time
Caution
– When the attribute is updated more often than it is queried, adding an index may increase the overhead of update operations
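The caution above can be turned into a rough screening step over the workload. The sketch below is only an illustration: the workload entries, their frequencies and the helper names `attribute_usage` and `index_candidates` are all hypothetical, and the rule "searched on more often than modified" is a deliberately crude stand-in for a real cost estimate.

```python
from collections import Counter

# Hypothetical workload: (kind, frequency, attributes in conditions, attributes modified).
workload = [
    ("SELECT", 50, ["Employee.ename"], []),
    ("SELECT", 5, ["Employee.age"], []),
    ("UPDATE", 40, ["Employee.eid"], ["Employee.age"]),
    ("UPDATE", 2, ["Employee.eid"], ["Employee.ename"]),
]

def attribute_usage(ops):
    """Sum, per attribute, how often it is searched on and how often it is modified."""
    searched, modified = Counter(), Counter()
    for kind, freq, cond_attrs, mod_attrs in ops:
        for attr in cond_attrs:
            searched[attr] += freq
        for attr in mod_attrs:
            modified[attr] += freq
    return searched, modified

def index_candidates(ops):
    """Keep attributes that are searched on more often than they are modified."""
    searched, modified = attribute_usage(ops)
    return sorted(a for a in searched if searched[a] > modified[a])

print(index_candidates(workload))
```

With these made-up numbers, Employee.age is dropped because it is modified far more often than it is searched on, while Employee.eid stays because it only appears in WHERE conditions.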
Guideline 2: Selection conditions decide the indexing technique
– An equality selection would be faster with a hash index
    SELECT E.Dno FROM Employee E WHERE E.ename = 'Roy'
– A range selection would be faster with a B+ tree index
    SELECT E.Dno FROM Employee E WHERE E.Age < 40
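The difference between the two cases in Guideline 2 can be sketched in memory. This is an analogy, not a DBMS implementation: a Python dict stands in for a hash index, and a sorted list searched with `bisect` stands in for the ordered leaf level of a B+ tree. The sample rows and function names are assumptions made for the illustration.

```python
import bisect
from collections import defaultdict

# Hypothetical Employee rows: (ename, age, dno). The list position acts as the record id.
employees = [
    ("Roy", 35, 1), ("Ann", 52, 2), ("Raj", 28, 1), ("Lee", 41, 3), ("Roy", 45, 2),
]

def build_hash_index(rows, col):
    """Map each key value to the record ids holding it (stand-in for a hash index)."""
    index = defaultdict(list)
    for rid, row in enumerate(rows):
        index[row[col]].append(rid)
    return index

def build_sorted_index(rows, col):
    """Keep (key, rid) pairs in key order (stand-in for B+ tree leaf entries)."""
    return sorted((row[col], rid) for rid, row in enumerate(rows))

def equality_lookup(hash_index, key):
    """One probe finds all matches, but the hash index gives no help with ranges."""
    return hash_index.get(key, [])

def range_lookup_less_than(sorted_index, bound):
    """Binary search for the bound, then read the ordered entries before it."""
    pos = bisect.bisect_left(sorted_index, (bound,))
    return [rid for _, rid in sorted_index[:pos]]

name_index = build_hash_index(employees, 0)
age_index = build_sorted_index(employees, 1)

print(equality_lookup(name_index, "Roy"))      # record ids with ename = 'Roy'
print(range_lookup_less_than(age_index, 40))   # record ids with age < 40
```

The hash structure cannot answer `age < 40` without visiting every key, which is why the ordered structure is preferred for range conditions.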
Guideline 3: Join conditions decide the indexing scheme
– When there is a join condition on the attributes to be indexed, a hash index on the join attribute of the inner relation would be better than a B+ tree
    SELECT E.ename, D.mgr FROM Employee E, Department D WHERE E.dno = D.dno
– Here a hash index on D.dno may be faster than a B+ tree
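Why a hash index on the inner relation suits an equality join can be seen in a small index nested loops sketch. The rows, the manager names and the function name below are hypothetical, and the dict again only stands in for a hash index on D.dno; dno is assumed to be unique in Department.

```python
# Hypothetical rows: Employee as (ename, dno), Department as (dno, mgr).
employees = [("Roy", 1), ("Ann", 2), ("Raj", 1), ("Lee", 3)]
departments = [(1, "Maya"), (2, "Omar"), (3, "Ines")]

def index_nested_loops_join(outer, inner):
    """Scan the outer relation once and probe a hash index on the inner join attribute."""
    hash_index = {dno: mgr for dno, mgr in inner}   # assumes dno is unique in Department
    result = []
    for ename, dno in outer:
        mgr = hash_index.get(dno)                   # one equality probe per outer tuple
        if mgr is not None:
            result.append((ename, mgr))
    return result

print(index_nested_loops_join(employees, departments))
```

Each outer tuple needs only an equality probe on the inner relation, so the ordering that a B+ tree would maintain is not used.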
Guideline 4: Choose carefully which attribute to cluster on
– At most one index per relation can be clustered, since clustering fixes the physical ordering of the records on disk
– Only sequential (range) access on the attribute benefits from clustering, since the matching tuples then sit on fewer disk pages
– If the selectivity of a query is low (it retrieves few tuples) or the query is run infrequently, clustering on the corresponding attribute may not be beneficial
– If there is no sequential access but only random access on the attribute, a hash index without clustering may be enough
– To choose the best attribute for clustering, consider all the queries and updates, their frequencies and their selectivity, and choose the attribute that appears in frequent queries with range conditions of high selectivity (many tuples retrieved)
Guideline 5: Balance the cost of index maintenance
– Indexes may need to be dropped if they slow down any of the frequent update operations
– Indexes may also speed up update operations that have a WHERE condition
    UPDATE Employee E SET E.Income = ... WHERE E.eid = 123
Additional Guideline 1: Analyse the query evaluation plan of the underlying database engine
E.g.,
    SELECT E.ename, D.mgr FROM Employee E, Department D WHERE D.dname = 'Toy' AND E.dno = D.dno
Here both D.dname and D.dno may be indexed
Query evaluation plan guides index choices
Plan: select the Department tuples with D.dname = 'Toy' (σ), join the result with Employee E on D.dno = E.dno, then project E.ename (∏)
– All tuples with D.dname = 'Toy' are selected from the database
– The join is done on the resulting tuples and not on the stored Department relation, so an index on D.dno is unnecessary
– The attributes used at the lowest point of the plan are the ones to index; the others need not be indexed
Clustering Indices
[Figure: a clustered B+ tree and an unclustered B+ tree, each shown with the physical storage of the records on the secondary storage device]
– Clustered B+ tree: if three records fit in a page, a range operation (< 60) makes 2 accesses to the disk
– Unclustered B+ tree: if three records fit in a page, the same range operation makes 6 accesses to the disk, one for every tuple
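The 2-versus-6 count on this slide can be reproduced with a small simulation. The key values and the two storage layouts below are made up for the illustration (the original figure's data is not recoverable), and the model assumes that only the most recently fetched page stays in memory.

```python
def page_fetches(heap_order, bound, per_page):
    """Count page fetches when records with key < bound are visited in key order.

    heap_order lists the keys in their physical storage order. Consecutive visits
    to the same page are counted once (assumes a one-page buffer).
    """
    page_of = {key: pos // per_page for pos, key in enumerate(heap_order)}
    fetches, last_page = 0, None
    for key in sorted(k for k in heap_order if k < bound):
        if page_of[key] != last_page:
            fetches += 1
            last_page = page_of[key]
    return fetches

# Hypothetical keys, three records per page.
clustered_layout = [10, 20, 30, 40, 50, 55, 60, 70, 80]     # stored in key order
unclustered_layout = [10, 40, 60, 20, 50, 70, 30, 55, 80]   # stored in arrival order

print(page_fetches(clustered_layout, 60, 3))
print(page_fetches(unclustered_layout, 60, 3))
```

With the clustered layout the six matching records occupy two adjacent pages; with the unclustered layout every index entry points to a different page than the previous one, so each tuple costs its own fetch.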
Selectivity decides between clustered and unclustered index schemes
When more than one query plan is available, the selectivity of a condition decides the best plan and also the attribute to be clustered
    SELECT E.ename, D.Mgr FROM Employee E, Department D WHERE E.hobby = 'stamps' AND E.salary BETWEEN 10000 AND 30000 AND E.Dno = D.Dno
Query Plan 1
If almost all employees collect stamps and earn a salary in the given range (high selectivity), indexes on those two attributes need not be present
Plan: join Employee E and Department D on E.dno = D.dno, apply σ E.hobby = 'stamps' and σ E.salary BETWEEN 10000 AND 30000 to the result, then project E.ename (∏)
– The selections are done on the resulting tuples, so indexes on E.hobby and E.salary are unnecessary
– The join uses Employee as the outer relation, accessed through E.dno, so a clustered index on E.dno is necessary
– A hash index is desirable on D.dno, the join attribute of the inner relation; it need not be clustered because it is not accessed sequentially
Query Plan 2
If only a few employees collect stamps (low selectivity)
Plan: apply σ E.hobby = 'stamps' to Employee E first, then apply the join with Department D on D.dno = E.dno and σ E.salary BETWEEN 10000 AND 30000 to the resulting tuples, then project E.ename (∏)
– The selection on E.hobby is done on the stored relation, so an index on E.hobby is necessary, and it should be a clustered index
– The join and the salary selection are done on the resulting tuples, so indexes on E.dno and E.salary are not necessary
Impact of clustering on cost of operation
[Figure: cost plotted against percentage of tuples retrieved (selectivity) for an unclustered index scheme, a clustered index scheme and a sequential scan; a combined plot of all schemes marks the range for which the unclustered index is better than the sequential scan]
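The shape of these curves follows from a simple page-count model. The sketch below is an assumption-laden approximation rather than the model behind the original figure: it ignores the cost of traversing the index itself, charges the unclustered index one page fetch per matching tuple (worst case), and uses hypothetical table sizes.

```python
import math

def sequential_scan_cost(num_tuples, per_page):
    """Read every data page once, regardless of how many tuples match."""
    return math.ceil(num_tuples / per_page)

def clustered_index_cost(matches, per_page):
    """Matching tuples are contiguous, so only the pages holding them are read."""
    return math.ceil(matches / per_page)

def unclustered_index_cost(matches):
    """Worst case: each matching tuple lives on a different page."""
    return matches

def unclustered_beats_scan(matches, num_tuples, per_page):
    """True while the unclustered index touches fewer pages than a full scan."""
    return unclustered_index_cost(matches) < sequential_scan_cost(num_tuples, per_page)

# Hypothetical table: 10,000 tuples, 100 tuples per page, so 100 data pages.
for matches in (50, 100, 500, 10000):
    print(matches,
          sequential_scan_cost(10000, 100),
          clustered_index_cost(matches, 100),
          unclustered_index_cost(matches))
```

Under these assumptions the unclustered index only wins while the number of matching tuples stays below the number of data pages, that is, for less than 1% of the tuples in this example, whereas the clustered index never costs more than the scan.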
Co-clustered indexing
– Co-clustered index on manager income in the Department tuples and employee income in the Employee tuples
– Each Department record is followed by its employees' records on the secondary storage device
– Clustered B+ tree index on manager income, with incomes shown in thousands (22 => 22000)
    SELECT ALL E.ename FROM Employee E, Department D WHERE D.MgrIncome > 53 AND D.Dno = E.Dno
– The query selects all employees who work under managers with income > 53
– It only looks through the index on manager's income and directly accesses the employees' records on the disk
Indexes on multiple-attribute search keys
The following query returns all employees aged between 20 and 30 who have a salary from 3000 to 5000
    SELECT E.eid FROM Employee E WHERE E.age BETWEEN 20 AND 30 AND E.sal BETWEEN 3000 AND 5000
– A composite clustered B+ tree index on (age, sal) would be a better choice than one on (sal, age), since more employees share the same age than the same salary
– A composite clustered index on (age, sal) first sorts tuples by age and then, within each age group, sorts them by salary
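The ordering of a composite (age, sal) key can be sketched with a sorted list of tuples, where `bisect` plays the role of the B+ tree search. The employee rows and the function name are hypothetical, and the sketch covers only the ordering aspect, not clustering of the data records.

```python
import bisect

# Hypothetical Employee rows: (eid, age, sal).
employees = [
    (1, 25, 3500), (2, 25, 6000), (3, 30, 4000),
    (4, 41, 4500), (5, 19, 3000), (6, 28, 2000),
]

# Composite index entries ordered by age first, then by sal within each age.
age_sal_index = sorted((age, sal, eid) for eid, age, sal in employees)

def range_query(index, age_lo, age_hi, sal_lo, sal_hi):
    """Locate the age range by binary search, then filter the entries on sal."""
    start = bisect.bisect_left(index, (age_lo,))
    end = bisect.bisect_right(index, (age_hi, float("inf"), float("inf")))
    examined = index[start:end]                      # entries with age in range
    eids = [eid for _, sal, eid in examined if sal_lo <= sal <= sal_hi]
    return eids, len(examined)

eids, entries_examined = range_query(age_sal_index, 20, 30, 3000, 5000)
print(eids, entries_examined)
```

Only the entries inside the age range are examined; the salary condition is then checked on those entries alone.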
Index-only plans
– If only the index files are used to answer a query, without accessing the actual data in the table, the plan is called an index-only plan
– If a composite index has all the attributes relevant to a query (including those in the SELECT clause) in its search key, no access to the stored table is needed
    SELECT E.eid FROM Employee E WHERE E.age BETWEEN 20 AND 30 AND E.sal BETWEEN 3000 AND 5000
– If a composite index on (age, sal, eid) is defined, no access to the stored table is needed: all eids in the index file that satisfy the condition can be returned as the result
– In an index-only access like this one, clustering of the physical records is not necessary
Index-only plans (continued)
    SELECT E.dno, COUNT(*) FROM Employee E GROUP BY E.dno
    SELECT E.dno, MIN(E.sal) FROM Employee E GROUP BY E.dno
– In the first query, if there is an index on E.dno we can simply count the number of index entries for each dno and return that as the answer for COUNT(*); the tuples on secondary storage need not be accessed. This is also an example of an index-only plan
– Since index-only plans are faster, composite indexes may be defined to include attributes that are only projected. In the second query, sal is not used in any condition, but a composite index on (dno, sal) saves us from accessing the secondary storage device, since the index files alone answer the query
– For more examples, see pages 472 and 473 in Gehrke
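The two GROUP BY queries above can be tried against SQLite through Python's `sqlite3` module to see whether the engine answers them from the index alone. The table contents, the index name `idx_dno_sal` and the added ORDER BY (used only to make the output order deterministic) are assumptions of this sketch. The wording of `EXPLAIN QUERY PLAN` output varies between SQLite versions, so the plan is printed rather than relied upon; in SQLite, an index-only plan is reported as a covering index.

```python
import sqlite3

def build_db():
    """Create a small in-memory Employee table with a composite (dno, sal) index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Employee ("
        "eid INTEGER PRIMARY KEY, ename TEXT, age INTEGER, sal INTEGER, dno INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
        [  # hypothetical rows
            (1, "Roy", 35, 4000, 101), (2, "Ann", 52, 6000, 102),
            (3, "Raj", 28, 3000, 101), (4, "Lee", 41, 5000, 102),
            (5, "Kim", 25, 3500, 103),
        ],
    )
    conn.execute("CREATE INDEX idx_dno_sal ON Employee (dno, sal)")
    return conn

COUNT_QUERY = "SELECT E.dno, COUNT(*) FROM Employee E GROUP BY E.dno ORDER BY E.dno"
MIN_QUERY = "SELECT E.dno, MIN(E.sal) FROM Employee E GROUP BY E.dno ORDER BY E.dno"

def plan_details(conn, query):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

conn = build_db()
for query in (COUNT_QUERY, MIN_QUERY):
    print(conn.execute(query).fetchall())
    print(plan_details(conn, query))
```

Both queries reference only dno and sal, which are exactly the columns stored in the composite index, so the engine has no reason to read the table rows.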
Database Tuning: Query Rewriting
– Eliminating the GROUP BY operation avoids expensive sorting
    Original:  SELECT MIN (E.age) FROM Employee E GROUP BY E.Dno HAVING E.Dno = 102
    Rewritten: SELECT MIN (E.age) FROM Employee E WHERE E.Dno = 102
– Combining steps to form a single query avoids creating an unnecessary table Temp
    Original:  SELECT * INTO TEMP FROM Employee E, Department D WHERE E.Dno = D.dno AND D.mgrname = 'robby'
               SELECT T.Dno, Avg(T.Sal) FROM TEMP T GROUP BY T.Dno
    Rewritten: SELECT E.Dno, Avg(E.Sal) FROM Employee E, Department D WHERE E.Dno = D.dno AND D.mgrname = 'robby' GROUP BY E.Dno
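Before relying on a rewrite it is worth checking that both forms return the same rows. The sketch below does this in SQLite via Python's `sqlite3` module, with made-up data. Two adaptations are assumptions of the sketch: SQLite has no `SELECT ... INTO`, so the intermediate table is built with `CREATE TEMP TABLE ... AS` under the hypothetical name `TempResult`, and an ORDER BY is added so the results compare deterministically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (eid INTEGER PRIMARY KEY, age INTEGER, sal INTEGER, dno INTEGER);
    CREATE TABLE Department (dno INTEGER PRIMARY KEY, mgrname TEXT);
    INSERT INTO Employee VALUES (1, 35, 4000, 101), (2, 52, 6000, 102),
                                (3, 28, 3000, 101), (4, 41, 5000, 102),
                                (5, 25, 3500, 103);
    INSERT INTO Department VALUES (101, 'robby'), (102, 'robby'), (103, 'maria');
""")

# Rewrite 1: filter with WHERE instead of grouping and then discarding groups.
with_having = conn.execute(
    "SELECT MIN(E.age) FROM Employee E GROUP BY E.dno HAVING E.dno = 102").fetchall()
with_where = conn.execute(
    "SELECT MIN(E.age) FROM Employee E WHERE E.dno = 102").fetchall()

# Rewrite 2: one query instead of materialising an intermediate table.
conn.execute("""
    CREATE TEMP TABLE TempResult AS
    SELECT E.dno AS dno, E.sal AS sal
    FROM Employee E, Department D
    WHERE E.dno = D.dno AND D.mgrname = 'robby'
""")
two_steps = conn.execute(
    "SELECT T.dno, AVG(T.sal) FROM TempResult T GROUP BY T.dno ORDER BY T.dno").fetchall()
one_step = conn.execute("""
    SELECT E.dno, AVG(E.sal)
    FROM Employee E, Department D
    WHERE E.dno = D.dno AND D.mgrname = 'robby'
    GROUP BY E.dno ORDER BY E.dno
""").fetchall()

print(with_having, with_where)
print(two_steps, one_step)
```

One caveat: when no employee has Dno = 102, the HAVING form returns no rows while the WHERE form returns a single NULL, so the two forms agree only when the department has at least one employee.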
Impact of concurrency
– The duration for which transactions hold a lock can affect performance significantly
– Tuning transactions by writing to local variables and deferring database access can improve performance
– Replacing a transaction with several smaller transactions can improve performance
– A careful partitioning of the tuples in a relation and its associated indexes across a collection of disks can improve concurrent access
– If the DBMS uses specialized locking protocols for tree indexes and sets fine-granularity locks, concurrency improves
Performance benchmarks
– To assist users in choosing a DBMS that suits their needs, several performance benchmarks have been developed
– Benchmarks should be portable, easy to understand, and scale naturally to larger problem instances
– They should measure peak performance as well as price/performance ratios
– The Transaction Processing Performance Council (TPC) was created to define benchmarks for transaction processing and database systems
– E.g., TPC-A and TPC-B benchmarks (online transaction processing benchmarks); Wisconsin benchmark (query benchmark); OO1 and OO7 benchmarks (object database benchmarks)
Read Section 16.8 in Gehrke for tuning the conceptual schema