Download presentation
Presentation is loading. Please wait.
Published byAbel Strickland Modified over 9 years ago
1
H.Lu/HKUST L03: Physical Database Design (I) Introduction Index Selection Partitioning & denormalization
2
L03: Physical Database Design -- 2 H.Lu/HKUST Overview After logical design, we have a “good” logical schema that describes the data to be stored and related constraints; The next step is physical database design, a process of producing a physical schema, i.e., a description of the implementation of the database on secondary storage using the target DBMS The main tasks including: choosing indexes, make clustering decisions, and to refine the conceptual and external schemas (if necessary) to meet performance goals. We must begin by understanding the workload: The most important queries and how often they arise. The most important updates and how often they arise. The desired performance for these queries and updates.
3
L03: Physical Database Design -- 3 H.Lu/HKUST Understanding the Workload For each query in the workload: Which relations does it access? Which attributes are retrieved? Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? For each update in the workload: Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? The type of update ( INSERT/DELETE/UPDATE ), and the attributes that are affected.
4
L03: Physical Database Design -- 4 H.Lu/HKUST T-R Cross-Reference A. Enter the details of a new project and the employees assigned to it. B. Enter the details of a new employee and his/her assignment C. Change the salary for a particular job title. D. List the project number, name and the average salary of employees working on the project E. List the name of employees and the name of projects they have been assigned to. Emplyee(eno, enam, title) Pay (title, salary) Project(pno, pname, budget, loc) Assignment (eno, pno, duration)
5
L03: Physical Database Design -- 5 H.Lu/HKUST Types of Queries (I) Point query: returns at most one record (or part of a record) based on an equality selection. SELECT name FROM Employee WHERE eid=8788 Multipoint query: may return several records based on an equality selection SELECT name FROM Employee WHERE dept=Sales Range query: returns a set of records whose values lie in an interval of half interval for the selected attribute SELECT name FROM Employee WHERE salary > 15000 Prefix match query on an attribute of sequence of attributes X specifies only a prefix of X Lname = ‘Lu’ Fname LIKE ‘Ho’
6
L03: Physical Database Design -- 6 H.Lu/HKUST Types of Queries Extremal query: returns a set of records whose value on some attribute is minimum or maximum. Ordering query: Display a set of records in the order of the value of some attribute (s) Grouping query: partition the results of a query into groups Join query: multi-table queries Equality join Non-equality join
7
L03: Physical Database Design -- 7 H.Lu/HKUST Cost Estimation Each query is associated a cost for its execution. The main concern of physical database design is performance, which involves cost estimation. Cost estimation: Must estimate cost of each operation in a query execution plan tree. Depends on input cardinalities and processing method. Must estimate size of result for each operation in tree! Use information about the input relations. For selections and joins, assume independence of predicates.
8
L03: Physical Database Design -- 8 H.Lu/HKUST Statistics and Catalogs Need information about the relations and indexes involved. Catalogs typically contain at least: # tuples (NTuples) and # pages (NPages) for each relation. # distinct key values (NKeys) and NPages for each index. Index height, low/high key values (Low/High) for each tree index. Catalogs updated periodically. Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok. More detailed information (e.g., histograms of the values in some field) are sometimes stored.
9
L03: Physical Database Design -- 9 H.Lu/HKUST Selectivity of Predicates Consider a query block: Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause. Selectivity, s, associated with each term reflects the impact of the term in reducing result size. Term col=value s = 1 / number of distinct values Term col > value s = (High -value)/(High - Low) Mutiple terms: product of each term Implicit assumption that terms are independent! SELECT attribute list FROM relation list WHERE term1 AND... AND termk
10
L03: Physical Database Design -- 10 H.Lu/HKUST Joins Theta-join : Result schema same as that of cross-product. Fewer tuples than cross-product, might be able to compute more efficiently Equi-Join: A special case of condition join where the condition c contains only equalities. Result schema similar to cross-product, but only one copy of fields for which equality is specified. Natural Join: Equijoin on all common fields.
11
L03: Physical Database Design -- 11 H.Lu/HKUST Join Selectivity Join selectivity of Estimate join selectivity R.A is a key R.A is a key, and S.B is a foreign key reference to R Neither the above two is true
12
L03: Physical Database Design -- 12 H.Lu/HKUST Decisions to Make Index selection What indexes should we create? Which relations should have indexes? What field(s) should be the search key? Should we build several indexes? For each index, what kind of an index should it be? Clustered? Hash/tree? Dynamic/static? Dense/sparse? Should we make changes to the conceptual schema? Consider alternative normalized schemas? (Remember, there are many choices in decomposing into BCNF, etc.) Should we ``undo’’ some decomposition steps and settle for a lower normal form? (Denormalization.) Vertical/Horizontal partitioning, replication, views...
13
H.Lu/HKUST L03: Physical Database Design (I) Introduction Index Selection Partitioning & denormalization
14
L03: Physical Database Design -- 14 H.Lu/HKUST Indexes A Heap file allows us to retrieve records: by specifying the rid, or by scanning all records sequentially Sometimes, we want to retrieve records by specifying the values in one or more fields, e.g., Find all students in the “ CS ” department Find all students with a gpa > 3 Indexes are file structures that enable us to answer such value-based queries efficiently.
15
L03: Physical Database Design -- 15 H.Lu/HKUST Indexes An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for an index on the relation. Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
16
L03: Physical Database Design -- 16 H.Lu/HKUST Alternatives for Data Entry k* in Index Three alternatives: Data record with key value k Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k. Examples of indexing techniques: B+ trees, hash-based structures Typically, index contains auxiliary information that directs searches to the desired data entries
17
L03: Physical Database Design -- 17 H.Lu/HKUST Alternatives for Data Entries (Contd.) Alternative 1: If this is used, index structure is a file organization for data records (like Heap files or sorted files). At most one index on a given collection of data records can use Alternative 1. (Otherwise, data records duplicated, leading to redundant storage and potential inconsistency.) If data records very large, # of pages containing data entries is high. Implies size of auxiliary information in the index is also large, typically.
18
L03: Physical Database Design -- 18 H.Lu/HKUST Alternatives for Data Entries (Contd.) Alternatives 2 and 3: Data entries typically much smaller than data records. So, better than Alternative 1 with large data records, especially if search keys are small. (Portion of index structure used to direct search is much smaller than with Alternative 1.) If more than one index is required on a given file, at most one index can use Alternative 1; rest must use Alternatives 2 or 3. Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
19
L03: Physical Database Design -- 19 H.Lu/HKUST Example B+-Tree Index Balanced tree Leaf nodes contain keys and pointers pointing to data pages Internal nodes contain keys and pointers: Each pointer Pi, associated to a key Ki, points to a subtree in which all key values k lie between Ki and K i+1, P 0 (P m ) points to a subtree in which all key values are less then K 0 (large than K m ), Each node must at least be half full Root
20
L03: Physical Database Design -- 20 H.Lu/HKUST B+ Trees in Practice Typical order: 100. Typical fill-factor: 67%. average fanout = 133 Typical capacities: Height 4: 133 4 = 312,900,700 records Height 3: 133 3 = 2,352,637 records Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes Level 2 = 133 pages = 1 Mbyte Level 3 = 17,689 pages = 133 MBytes
21
L03: Physical Database Design -- 21 H.Lu/HKUST Key Length and Fanout The leaf pages contain 40 million key-pointer pairs Page size = 4K, length of pointer = 6 bytes, length of key = 4 bytes Tree level: 3 levels How about the key length is 94 bytes?
22
L03: Physical Database Design -- 22 H.Lu/HKUST Static Hashing Index # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. h(k) mod M = bucket to which data entry with key k belongs. (M = # of buckets) h(key) mod N h key Primary bucket pages Overflow pages 2 0 N-1
23
L03: Physical Database Design -- 23 H.Lu/HKUST Extensible Hashing 20* 00 01 10 11 2 2 2 2 LOCAL DEPTH 2 2 DIRECTORY GLOBAL DEPTH Bucket A Bucket B Bucket C Bucket D Bucket A2 (`split image' of Bucket A) 1* 5*21*13* 32* 16* 10* 15*7*19* 4*12* 19* 2 2 2 000 001 010 011 100 101 110 111 3 3 3 DIRECTORY Bucket A Bucket B Bucket C Bucket D Bucket A2 (`split image' of Bucket A) 32* 1*5*21*13* 16* 10* 15* 7* 4* 20* 12* LOCAL DEPTH GLOBAL DEPTH
24
L03: Physical Database Design -- 24 H.Lu/HKUST Index Classification Primary vs. secondary: If search key contains primary key, then called primary index. Unique index: Search key contains a candidate key. Clustered vs. unclustered: If order of data records is the same as, or `close to’, order of data entries, then called clustered index. Alternative 1 implies clustered, but not vice-versa. A file can be clustered on at most one search key. Cost of retrieving data records through index varies greatly based on whether index is clustered or not!
25
L03: Physical Database Design -- 25 H.Lu/HKUST Clustered vs. Unclustered Index Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file. To build clustered index, first sort the Heap file (with some free space on each page for future inserts). Overflow pages may be needed for inserts. (Thus, order of data recs is `close to’, but not identical to, the sort order.) direct search for (Index File) (Data file) Data Records Data entries Index entries Data entries data entries CLUSTERED Data Records UNCLUSTERED
26
L03: Physical Database Design -- 26 H.Lu/HKUST Index Classification (Contd.) Dense vs. Sparse: If there is at least one data entry per search key value (in some data record), then dense. Alternative 1 always leads to dense index. Every sparse index is clustered! Sparse indexes are smaller; however, some useful optimizations are based on dense indexes. Ashby, 25, 3000 Smith, 44, 3000 Ashby Cass Smith 22 25 30 40 44 50 Sparse Index on Name Data File Dense Index on Age 33 Bristow, 30, 2007 Basu, 33, 4003 Cass, 50, 5004 Tracy, 44, 5004 Daniels, 22, 6003 Jones, 40, 6003
27
L03: Physical Database Design -- 27 H.Lu/HKUST Index Classification (Contd.) Composite Search Keys: Search on a combination of fields. Equality query: Every field value is equal to a constant value. E.g. wrt index: age=20 and sal =75 Range query: Some field value is not a constant. E.g.: age =20; or age=20 and sal > 10 Data entries in index sorted by search key to support range queries. Lexicographic order, or Spatial order. sue1375 bob cal joe12 10 20 8011 12 nameagesal 12,20 12,10 11,80 13,75 20,12 10,12 75,13 80,11 11 12 13 10 20 75 80 Data records sorted by name Data entries in index sorted by Data entries sorted by Examples of composite key indexes using lexicographic order.
28
L03: Physical Database Design -- 28 H.Lu/HKUST Clustered Index and Queries Clustered index can be sparse: save space and number of disk access Good for multipoint queries, i.e., equality accesses to nonunique fields – the results will be in consecutive pages. Good for equality join on an attribute with few distinct values; If both relations have clustering index on join attributes, can use merge join
29
L03: Physical Database Design -- 29 H.Lu/HKUST Non-clustered index Non-clustered index are best if they cover the query; Query can be answered using the index only good if each query retrieves significantly fewer records than there are pages in the file Significant: sequential scan is much faster than random scan Useful for point queries For multipoint queries, may or may not help
30
L03: Physical Database Design -- 30 H.Lu/HKUST Choice of Indexes One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it. Before creating an index, must also consider the impact on updates in the workload! Trade-off: indexes can make queries go faster, updates slower. Require disk space, too.
31
L03: Physical Database Design -- 31 H.Lu/HKUST Issues to Consider in Index Selection Attributes mentioned in a WHERE clause are candidates for index search keys. Exact match condition suggests hash index. Range query suggests tree index. Clustering is especially useful for range queries, although it can help on equality queries as well in the presence of duplicates. Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.
32
L03: Physical Database Design -- 32 H.Lu/HKUST Issues in Index Selection (Contd.) Multi-attribute search keys should be considered when a WHERE clause contains several conditions. If range selections are involved, order of attributes should be carefully chosen to match the range ordering. Such indexes can sometimes enable index-only strategies for important queries. For index-only strategies, clustering is not important! When considering a join condition: Hash index on inner is very good for Index Nested Loops. Should be clustered if join column is not key for inner, and inner tuples need to be retrieved. Clustered B+ tree on join column(s) good for Sort-Merge.
33
L03: Physical Database Design -- 33 H.Lu/HKUST Example 1 Hash index on D.dname supports ‘Toy’ selection. Given this, index on D.dno is not needed. Hash index on E.dno allows us to get matching (inner) Emp tuples for each selected (outer) Dept tuple. What if WHERE included: ``... AND E.age=25’’ ? Could retrieve Emp tuples using index on E.age, then join with Dept tuples satisfying dname selection. Comparable to strategy that used E.dno index. So, if E.age index is already created, this query provides much less motivation for adding an E.dno index. SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname= ‘ Toy ’ AND E.dno=D.dno
34
L03: Physical Database Design -- 34 H.Lu/HKUST Example 2 Clearly, Emp should be the outer relation. Suggests that we build a hash index on D.dno. What index should we build on Emp? B+ tree on E.sal could be used, OR an index on E.hobby could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions. As a rule of thumb, equality selections more selective than range selections. As both examples indicate, our choice of indexes is guided by the plan(s) that we expect an optimizer to consider for a query. Have to understand optimizers! SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE E.sal BETWEEN 10000 AND 20000 AND E.hobby= ‘ Stamps ’ AND E.dno=D.dno
35
L03: Physical Database Design -- 35 H.Lu/HKUST Examples of Clustering B+ tree index on E.age can be used to get qualifying tuples. How selective is the condition? Is the index clustered? Consider the GROUP BY query. If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly. Clustered E.dno index may be better! Equality queries and duplicates: Clustering on E.hobby helps! SELECT E.dno FROM Emp E WHERE E.age>40 SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUP BY E.dno SELECT E.dno FROM Emp E WHERE E.hobby=Stamps
36
L03: Physical Database Design -- 36 H.Lu/HKUST Clustering and Joins Clustering is especially important when accessing inner tuples in INL. Should make index on E.dno clustered. Suppose that the WHERE clause is instead: WHERE E.hobby=‘Stamps AND E.dno=D.dno If many employees collect stamps, Sort-Merge join may be worth considering. A clustered index on D.dno would help. Summary: Clustering is useful whenever many tuples are to be retrieved. SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname= ‘ Toy ’ AND E.dno=D.dno
37
L03: Physical Database Design -- 37 H.Lu/HKUST Multi-Attribute Index Keys To retrieve Emp records with age=30 AND sal=4000, an index on would be better than an index on age or an index on sal. Such indexes also called composite or concatenated indexes. Choice of index key orthogonal to clustering etc. If condition is: 20<age<30 AND 3000<sal<5000: Clustered tree index on or is best. If condition is: age=30 AND 3000<sal<5000: Clustered index much better than index! Composite indexes are larger, updated more often.
38
L03: Physical Database Design -- 38 H.Lu/HKUST Covering Index A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available. SELECT D.mgr FROM Dept D, Emp E WHERE D.dno=E.dno SELECT D.mgr, E.eid FROM Dept D, Emp E WHERE D.dno=E.dno SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000 Tree index! Tree index! or Tree!
39
L03: Physical Database Design -- 39 H.Lu/HKUST Cost Based Index Selection SQL Server 7 and above Input: schema (with existing index), workload Output: recommendations adding/removing indexes Basic approach: First step: pick the best index for each SQL statement in the workload by enumerating relevant indexes on attributes, and computing the cost of execution plans that would use the index Second step: computes the cost of combining the indexes picked in the first step. The subset of indexes with the lowest combined cost is used for the final recommendation. Details: VLDB 02 paper
40
L03: Physical Database Design -- 40 H.Lu/HKUST Summary Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design. What are the important queries and updates? What attributes/relations are involved? Indexes support efficient retrieval of records based on the values in some fields. Index is a collection of data entries plus a way to quickly find entries with given key values. Data entries can be actual data records, pairs, or pairs. Choice orthogonal to indexing technique used to locate data entries with a given key value. Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance.
41
L03: Physical Database Design -- 41 H.Lu/HKUST Summary (Contd.) Several indexes can be built on a given file of data records, each with a different search key. Indexes must be chosen to speed up important queries (and perhaps some updates!). Index maintenance overhead on updates to key fields. Choose indexes that can help many queries, if possible. Build indexes to support index-only strategies. Clustering is an important decision; only one index on a given relation can be clustered! Order of fields in composite index key can be important. Static indexes may have to be periodically re-built. Statistics have to be periodically updated.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.