Indexes CSE2132 Database Systems Week 11 Lecture Indexes
Indexes SELECT *FROM EMPLOYEE WHERE EMP_ID = 'E9' Assuming EMP_ID is unique we expect to retrieve 1 row. How many records did we have to access in order to retrieve that 1 row? P1 E1 - Jones E2 - Smith E3 - Wong P2 E4 - White E5 - Bloggs P3 E7 - Chen E9 - Green Indexes Consider
Indexes Indexes an Overview The minimum amount of data transfer between Secondary Storage and Main Memory is 1 page. Therefore the cost of accessing Emp_Id = 'E9' is measured using the number of pages we had to access to retrieve the record we required. In the case of an unordered file the number of accesses using a linear search will average n/2 - where n = the number of pages. n = 3 BA (Blocks Accessed) = n/2 = 3/2 = 2 accesses (1.5) The minimum possible number of data base accesses will be equal to the number of rows retrieved (or less if the rows are in the same block/page). i.e 1 row = 1 page (block) accessed
Indexes The use of index may aid in this BUT indexes have their own overheads Indexes may use up to 50% of the allocated file space of the data base How can we reduce the cost of index use and the space used by indexes ? 1. Use efficient file access methods. e.g. If we have an ordered file we can use a binary search rather than just a linear search. 2. Be wise when choosing to create an index. Indexes an Overview
Indexes r=30,000 records blocksize=1,024 bytes (or page size) rec length=100 bytes BF = Blocking Factor = 1,024/100 = 10 Number of Blocks = r/BF = 30,000/10 = 3,000 blocks (or pages) Using a Linear Search BA = n/2 = 3,000/2 = 1,500 (on average) Using a Binary Search (if the file is ordered) BA=log n BA=log 3,000=12 1 record still need 12 accesses (This is a maximum) An Example 2 2
Indexes BA = 1 - depending on the blocking factor and the fill factor 1 record retrieved = 1 access - optimal 10 records retrieved = 10 accesses But what if they are 10 consecutive rows? BF = 10 BA = 1 or 2 if the records are stored consecutively rather than 10 required for a hashed file organization Hashing - The Number of Accesses
Indexes Indexes rely on a key value for access. If we do not have a key we must sequentially search the file. An index is an auxiliary file that makes it more efficient to search for a record in the data file. An Index is called an ACCESS PATH on a field. An example of a simple index structure is ordered by the field value. The Situation When Using Indexes
Indexes The pages can be moved around and the index address will still be correct providing the address contained in the directory is updated The pointers provide direct access to the data records Optimally there should be very few accesses to traverse the index file as all of the accesses are overheads BA = BA (index) + 1 To retrieve the data record Index Operation
Indexes Example 1 from Elmasri & Navathe (p108) Data file 3000 blocks Binary Search BA=log =12 Using an Index to retrieve a data record the number of accesses is :- BAtotal = BAi+BAd = (log n)+1 Size of index record - assume key = 9 bytes eg. SSN V+P V - length of key P - length of pointer to data value = = 15 bytes Blocking Factor BF i =Blocksize / Indexrecordsize = 1024/15 = 68 records per block Number of Index Blocks = 30,000/68 = 442 One Way Indexes Reduce Accesses 2
Indexes Total Accesses are Then BA i+d = (log 2 442) + 1 = = 10 accesses Which is less than the 12 accesses using the ordered file alone. Nb: A non-ordered data file is assumed and a dense index.
Indexes The data file may be ordered and therefore our index may be sparsely populated, i.e. only 1 index entry per block of data records - therefore less index records Another Way Indexes Reduce Accesses Our index would only need to contain 3 entries instead of 9 for the dense index - the number of index entries depends on the blocking factor of the data file. An unordered data file requires 30,000 index entries with one index entry per data record. The ordered file requires only one index entry per block i.e index entries. P1P2P3 E1 Jones E2 Smith E3 Wong E4 White E5 Bloggs E6 West E7 Chen E8 Brown E9 Green
Indexes index records/68 = 45 index blocks : ordered data file Total Accesses are Then NOTE: Index files may be independent of the data file The index can be created and dropped independently of the data file The index file can be of any file organization(in Oracle it is a B+Tree) There can be any number of index files for a data file BA = (log 2 BA = = 6 accesses for an indexed ordered file 45) + 1
Indexes The attributes on which the index is/are built is/are called the INDEXED FIELD/S - if these attributes are built on the PRIMARY KEY then this index is called the PRIMARY INDEX - if an index is built on any other attributes it is called a SECONDARY INDEX and the attribute values may be non- unique Index Terminology A SIMPLE CLUSTERING INDEX (This is not an option in Oracle) - the index is built on a non-unique attribute and includes one index entry for each distinct value of the attribute. The index entry points to the first data block that contains records with that attribute value. The underlying file must be ordered on the chosen non-unique attribute.
Indexes INDEX FILE DATA FILE ENO ENAME SALARY BIRTDATE PRIMARY KEY VALUE BLOCK POINTER A Primary Index on an Ordered File This assumes ENO is unique and is being used as a primary key. It is a non dense index as the data file is ordered on the index key. (ISAM)
Indexes INDEX FILE DATA FILE ENAME DNO EMPNO SALARY INDEX FIELD VALUE BLOCK POINTER..... A Dense Secondary Index on Non Data File Ordering Column This assumes ENAME has been nominated as an alternative key so an index entry is required for every record.The data file has been ordered on empno Aaron Abbot Adams Akers Alexander Alfred Allen Anderson Aaron Adams Akers Allen Abbot
Indexes An Clustering Index on a Non_Key Column INDEX FILE DATA FILE DNO NAME EMPNO SALARY CLUSTERING FIELD VALUE BLOCK POINTER..... It has been decided to order the data file on the department number. This is sometimes termed a clustering index but it is different to a cluster in Oracle
Indexes Consider Which Type of Index ? SELECT * FROM EMP WHERE E# = ‘E1’ SELECT * FROM EMP WHERE DEPT = ‘D1’ SELECT * FROM EMP WHERE ENAME = ‘ABLE’ SELECT * FROM EMP WHERE ENAME LIKE ‘A%’ SELECT * FROM EMP WHERE DEPT = ‘D2’ AND ENAME > ‘CAIN’ E# ENAME DEPT E1 E2 E3 E4 ABLE ADAMS BLAKE BROCK D1 D2
Indexes because a single level index is an ordered file we can create a non-dense index to an index i.e A Second Level Index We can repeat this process of leveling until the highest level of our index can fit into main memory - probably 1 page in size - this will reduce the number of I/O's in traversing the index by 1 This concept of a number of index levels decreasing in breadth is a tree structure Multi Level Indexes
Indexes Tree structures have certain properties we can take advantage of :- 1. The number of pointers at one level will determine the number of values stored at the next level We can predict the number of levels and therefore the number of accesses required for our data file. If a node has p data fields it has p + 1 pointers to p + 1 nodes each of which has p values i.e A node which can fit 2 data values per index node has 3 pointers to three nodes each with two data values and 3 pointers which can point to 18 data values. Indexes as Tree structures
Indexes <=<= >> <=<= >> <=<= >> <=<= >> A 3 level index with 2 data values per node could point to 18 data values. Indexes as Tree structures
Indexes If a tree structure is balanced and self-maintaining then the number of accesses to the data file is constant The number of accesses using a multi-level index is :- Indexes as Tree structures - where b is the branching factor and n is the number of pages NOTE : The number of accesses will decrease as n becomes smaller (i.e use a non-dense index) and as the branching factor b increases BA = (log b n) + 1
Indexes (Example 3. Elmasri and Navathe Ch 5) V = 9 bytes + a 6 byte pointer thus Index entry Ri = 15 bytes The blocking factor = 1024/15 = 68 This is known as the fanout for a multi-level index The number of first level blocks = 30,000 / 68 = 442 blocks The number of second level blocks = 442 / 68 = 7 blocks The number of third level blocks = 7 / 68 = 1 block Indexes as Tree structures BA = T(number of levels) + 1 = = 4 or BA = (log ) + 1 = = 4 Index size = b1 + b2 + b3 = = 450 blocks
Indexes SELECT * FROM EMP WHERE E# = 'E4' Total Number of Accesses = Index Accesses+1 Data Access = = 2 Average Number of Serial Accesses for a Table with 3 pages is n/2 = 2 Index Usage An index is used to optimize data retrieval KeyPage # E1 E20 E P1 P2 P3 E1 E2 E10 E20 E30 E40 E45 E56
Indexes p. 210, Hoffer et.al. 6th Edn. Ch 18 Date The Query Optimizer may decide against using an index In some cases the optimizer will use only the index to answer a query and will not access any data pages SELECT COUNT(*) FROM EMP Some Relational Products force the use of an index even though it may be inefficient to do so 1. To support a PK (in Oracle if a column is made a primary key an index is created) 2. To implement clustering The Query Optimizer
Indexes Create indexes on columns used in predicates :-. Read only and frequently accessed tables if > 3 pages. The columns of a predicate in frequently executed transactions. High update tables can also use indexes if > 6 pages. columns used in joins are candidates for indexes. columns in which aggregates are frequently calculated. use indexes on FK if using RI - will work out integrity violations or cascade quickly Avoid creating indexes on :-. attributes with a small number of unique values i.e. Gender M,F although the Oracle BITMAP index is suitable in this situation. keep indexes down to a reasonable number on high update attributes(tables) 2 or 3 if possible Good and Bad Candidates for Indexes
Indexes The above criteria are not always mutually exclusive - therefore must decide on index usage based on the most important requirements. There are tricks which can be used to enhance the use of indexes :-. place index or index level in memory. place indexes on a fast device. place indexes on a separate device from the data they reference. use multiple column indexes where possible i.e INDEX C1, C2, C4 SELECT * FROM EMP WHERE C1 = 'A' - will use both AND C2 = 'B' treats this as AND C5 = 'E' a substring SELECT * FROM EMP WHERE C1 = 'A' -will use only one AND C4 = 'D' AND C5 = 'E' Other Issues for Indexes
Indexes Oracle block maybe 2048 bytes (they vary with operating system) Number of EMPS = 30, bytes overhead per index block Blocking Factor of the index or keys per index page = 2000/(key length + 6)= 2000/(9+6)= in theory the branching factor will be less than 133 as we will want to include free space b1 = number of records / 133 = 30,000/ 133 = 225 b2 = b1 / 133 = 2 b3 = b2 / 133 < 1 t = 3 levels of index data or t = log = 2.1 = 3 (= log / log ) Maximum Number of Records with a three level index with Blocking Factor as indicated = 133 x 133 x 133 = 2.5 million An Example