LEARNING OBJECTIVES Index files. Operations Required to Maintain an Index File. Primary keys. Secondary keys. CPSC 231 Indexing (D.H.)

Index
An index is a tool for finding records in a file. It consists of a key field, on which the index is searched, and a reference field (an address or RRN, the relative record number) that tells where to find the data file record associated with a particular key.

Examples of an Index
- The index to a book (usually at the end of the book) provides a way to find a topic quickly. Imagine a book without an index!
- The index in a library (an on-line catalog) allows you to locate items by author, by title, or by call number.

Index in Databases: Example
A musical recording store uses an index file to keep track of its inventory. Each record in the data file consists of the following fields:
- Id number
- Title
- Composer or composers
- Artist or artists
- Label (publisher)

recording.h

    #include <iostream>

    class IOBuffer;  // buffer class used by Pack and Unpack, defined elsewhere

    class Recording  // a recording with a composite key
    {
    public:
        Recording();
        Recording(char* label, char* idNum, char* title,
                  char* composer, char* artist);
        char IdNum[7];
        char Title[30];
        char Composer[30];
        char Artist[30];
        char Label[7];
        char* Key() const;       // returns the key of the object
        int Unpack(IOBuffer&);   // fills the object from a buffer
        int Pack(IOBuffer&) const;  // writes the object into a buffer
        void Print(std::ostream&, char* label = 0) const;
    };

Primary Key: Example
The primary key in our example consists of the initials of the company label combined with the product ID. The canonical form of this key consists of the uppercase form of the Label field followed by the ASCII representation of the ID number, e.g. DG241.
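
The canonical form described above could be produced along the following lines. This is only a sketch: makeKey is a hypothetical helper that is not part of recording.h, and it uses std::string rather than the fixed-size character arrays of the Recording class.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of recording.h): builds the canonical
// primary key as the uppercase label followed by the ID number.
std::string makeKey(const std::string& label, const std::string& idNum) {
    std::string key;
    key.reserve(label.size() + idNum.size());
    for (unsigned char c : label) {
        key.push_back(static_cast<char>(std::toupper(c)));
    }
    key += idNum;  // the ID number is already stored as ASCII text
    return key;
}
```

With this sketch, makeKey("dg", "241") yields DG241, the example key given on the slide.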

Index File
An index file is used to provide rapid keyed access to individual records in the data file. Each entry in the index file consists of the following fields:
- key (e.g. ANG3795)
- reference: the address of the corresponding record in the data file
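
As an illustration, an index entry and a keyed lookup might look like the sketch below. The names IndexEntry and findAddress are assumptions made for this example. The index is assumed to be held in memory as a vector sorted by key, which is what makes binary search possible.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One entry of a simple index: a key plus the address of the data record.
struct IndexEntry {
    std::string key;  // canonical primary key, e.g. "ANG3795"
    long address;     // byte offset (or RRN) of the record in the data file
};

// Returns the record address for `key`, or -1 if the key is absent.
// Precondition: `index` is sorted by key in ascending order.
long findAddress(const std::vector<IndexEntry>& index, const std::string& key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    if (it != index.end() && it->key == key) {
        return it->address;
    }
    return -1;
}
```

The address returned by findAddress is what a program would then pass to a seek on the data file to read the full record.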

Operations Required to Maintain an Indexed File
- Create the original empty index file and data file
- Load the index file into memory before using it (if possible, load the whole file)
- Rewrite the index file from memory to permanent storage after modifying it
- Add data records to the data file
- Delete data records from the data file
- Update records in the data file
- Update the index to reflect changes in the data file

Creating Files
Create two empty files: the index file and the data record file.

Loading Index into Main Memory
This can be supported with an I/O buffer or with an array.

Rewriting the Index File from Memory
This can be supported as part of the close operation for the index file (i.e. write the buffer or the array to disk).

Dangers of Losing the Index File
If the index file is outdated, corrupted, or lost, then there must be some means of reconstructing the index file from the data file!
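
One way to reconstruct an index is to scan the data file sequentially, note where each record starts, and sort the resulting entries. The sketch below assumes a record layout that is not taken from the course material: one record per line, fields separated by '|', with the label first and the ID number second. The names IndexEntry and rebuildIndex are likewise illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;  // canonical primary key, e.g. "ANG3795"
    long address;     // byte offset of the record in the data file
};

// Rebuilds the index by scanning the data file from start to end.
// Assumed layout: one record per line, '|'-separated fields,
// label first, ID number second.
std::vector<IndexEntry> rebuildIndex(std::istream& data) {
    std::vector<IndexEntry> index;
    std::string line;
    while (true) {
        const std::streamoff offset = data.tellg();  // where this record starts
        if (!std::getline(data, line)) break;         // end of file
        const std::size_t labelEnd = line.find('|');
        if (labelEnd == std::string::npos) continue;  // skip malformed records
        const std::size_t idEnd = line.find('|', labelEnd + 1);
        if (idEnd == std::string::npos) continue;

        std::string key = line.substr(0, labelEnd);
        for (char& c : key) {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        }
        key += line.substr(labelEnd + 1, idEnd - labelEnd - 1);
        index.push_back({key, static_cast<long>(offset)});
    }
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; });
    return index;
}
```

Any std::istream works as input, so the function can be tried on a std::istringstream before pointing it at a real file opened in binary mode.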

Record Addition
Adding a new data record to the data file requires that we add a new entry to the index file too. Since the index file is usually kept sorted, adding a new entry requires rearranging the entries in this file. (This is easily done if the index is kept in main memory.)
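
For an index held in memory, the rearrangement amounts to finding the insertion point and shifting the later entries. A minimal sketch, assuming the index is a sorted std::vector and using the illustrative names IndexEntry and addEntry:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;  // canonical primary key
    long address;     // address of the record in the data file
};

// Inserts `entry` so that `index` stays sorted by key.
// Returns false (and changes nothing) if the key is already present,
// since a primary key must be unique.
bool addEntry(std::vector<IndexEntry>& index, const IndexEntry& entry) {
    auto pos = std::lower_bound(
        index.begin(), index.end(), entry.key,
        [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    if (pos != index.end() && pos->key == entry.key) {
        return false;
    }
    index.insert(pos, entry);  // shifts the later entries up by one slot
    return true;
}
```

The search costs O(log n) comparisons, while the shift inside insert is O(n) in the worst case. That is cheap in memory but expensive if every shifted entry had to be rewritten on disk.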

Record Deletion
Deleting a data record requires deleting the corresponding index entry. Note that in an indexed file organization all data records are pinned. (WHY?) What are the consequences of this fact?

Record Updating
There are two categories of updates:
- the update modifies the value of the key
- the update does not modify the value of the key
If the update changes the primary key, then reordering of the index file might be required. If the update does not change the primary key, it might still require reordering of records in the data file. (WHY?)

Indexes That Are Too Large to Hold in Memory
If the index file is too large to be kept in main memory, then it has to be kept on secondary storage. Keeping an index file on disk has a number of disadvantages:
- searching the index file can be very time consuming
- rearranging the index can be time consuming too

Possible Alternatives to Storing Index Files
If the index file is too large to be kept in main memory, then the following alternative organizations should be considered:
- a hashed organization (if access speed is very important)
- a tree-structured organization, or multilevel index, such as a B-tree

Pros of a Simple Index File
Even if a simple index file has to be stored on disk, in some cases it can still prove a useful method of organizing data. Advantages of the simple index file:
- it allows the use of binary search to obtain keyed access to a record
- if index entries are much smaller than data records, then sorting and maintaining the index is much easier than sorting and maintaining the data file
- if the data records are pinned, then the index file allows the keys to be rearranged without moving the data records

Indexing with Multiple Key Access
Since the primary key is unique, it is often used as a search key. The primary key of the Recording class, for example, is Label + Id (e.g. ANG3795). But most of the time, when one searches for a music CD, one would rather provide a title, a composer, or an artist.

Secondary Key
A secondary key is a key for which multiple records may exist in the data file. Examples:
- The composer’s name in the Recording class (a store can have a number of CDs with Beethoven’s work).
- The artist’s name in the Recording class.

Secondary Index File
A secondary index file might be created for each of the possible secondary keys. Each entry in the secondary index file should consist of the following two fields:
- secondary key field (e.g. Beethoven)
- the corresponding primary key (e.g. ANG3795)
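
A secondary index entry and a lookup that returns every matching primary key could be sketched as follows. The names SecondaryEntry and findPrimaryKeys are assumptions for this example, and the index is assumed to be an in-memory vector sorted by secondary key.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct SecondaryEntry {
    std::string secondaryKey;  // e.g. the composer's name
    std::string primaryKey;    // e.g. "ANG3795"
};

// Returns the primary keys of all entries whose secondary key equals `key`.
// Precondition: `index` is sorted by secondaryKey.
std::vector<std::string> findPrimaryKeys(const std::vector<SecondaryEntry>& index,
                                         const std::string& key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const SecondaryEntry& e, const std::string& k) {
            return e.secondaryKey < k;
        });
    std::vector<std::string> result;
    for (; it != index.end() && it->secondaryKey == key; ++it) {
        result.push_back(it->primaryKey);
    }
    return result;
}
```

Because the entries hold primary keys rather than record addresses, each result still has to be looked up in the primary index to reach the data record. The benefit of this extra step is that the secondary index does not change when a data record moves.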

Record Addition
Adding a record to the data file implies adding an entry to the secondary index file. The cost is similar to the cost of adding an entry to the primary index file (e.g. entries might have to be shifted).

Record Deletion
Deleting a record implies removing all references to that record in the file system. After searching on the secondary key, we search the matching entries for the primary key of the record to be deleted and remove that entry from the secondary index file.
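
The two-step search described above might be sketched like this, again assuming an in-memory secondary index sorted by secondary key. SecondaryEntry and removeReference are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct SecondaryEntry {
    std::string secondaryKey;  // e.g. the composer's name
    std::string primaryKey;    // e.g. "ANG3795"
};

// Removes the entry (secondaryKey, primaryKey) from the secondary index.
// Returns true if an entry was removed.
// Precondition: `index` is sorted by secondaryKey.
bool removeReference(std::vector<SecondaryEntry>& index,
                     const std::string& secondaryKey,
                     const std::string& primaryKey) {
    // Step 1: binary search for the first entry with this secondary key.
    auto it = std::lower_bound(
        index.begin(), index.end(), secondaryKey,
        [](const SecondaryEntry& e, const std::string& k) {
            return e.secondaryKey < k;
        });
    // Step 2: scan the run of equal secondary keys for the primary key.
    for (; it != index.end() && it->secondaryKey == secondaryKey; ++it) {
        if (it->primaryKey == primaryKey) {
            index.erase(it);  // shifts the later entries down by one slot
            return true;
        }
    }
    return false;
}
```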

Record Updating
There are three possible situations:
- The update changes the secondary key: we may have to rearrange the secondary key index so that it stays in sorted order.
- The update changes the primary key: this has a big impact on the primary key index, but in the secondary key index we only need to update the affected primary key field.

Record Updating (continued)
- The update is confined to other fields: updates that affect neither the primary nor the secondary key fields do not affect the secondary key index, even if the update is substantial.

Retrieving Data with Multiple Secondary Keys
Example: If we want to find all CDs in a music store that contain Beethoven’s Symphony No. 9, then we should search the secondary indexes using the following secondary keys: composer AND title. Each of those searches produces a list of CDs, identified by their primary keys.

Boolean AND in Searches
E.g. the search by composer could produce the following list of CDs:
    ANG3795, DG139201, DG18807, RCA2626
and the search by title could produce the following list of CDs:
    ANG3795, COL31809, DG18807
The CDs that we are interested in have to belong to both of the above lists. (In other words, we are taking the intersection of two sets.) WHY?
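
If each search returns its primary keys in sorted order, the intersection can be computed in a single pass over both lists. A minimal sketch using std::set_intersection from the standard library; the function name matchAll is an assumption for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Returns the primary keys present in both lists (Boolean AND).
// Precondition: both lists are sorted in ascending order.
std::vector<std::string> matchAll(const std::vector<std::string>& a,
                                  const std::vector<std::string>& b) {
    std::vector<std::string> result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(result));
    return result;
}
```

Applied to the composer list and the title list above, matchAll returns ANG3795 and DG18807.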

Boolean OR Searches
If we want to find all CDs by either Beethoven or Chopin, then we use the OR operation in our secondary key searches. To obtain the list of CDs that we are interested in, we have to combine the outcomes of both searches (that is, take the union of two sets). WHY?
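
Combining two sorted result lists without duplicates can be done with std::set_union from the standard library. This is a sketch under the assumption that each search returns its primary keys in ascending order; the function name matchAny is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Returns the primary keys present in either list (Boolean OR).
// A key that appears in both lists is reported only once.
// Precondition: both lists are sorted in ascending order.
std::vector<std::string> matchAny(const std::vector<std::string>& a,
                                  const std::vector<std::string>& b) {
    std::vector<std::string> result;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(result));
    return result;
}
```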

Cons of the Current Secondary Index Structure
- The index file has to be rearranged every time a new record is added to the file.
- If there are duplicate secondary keys, the secondary key field is repeated for each entry.

Improvements to the Secondary Index Structure: Solution 1
Allow multiple primary keys to be associated with a single secondary key by allocating an array of primary keys for each secondary key entry.
- Solves the problem of rearranging the index each time a new entry is added for an existing secondary key.
- Suffers from internal fragmentation (WHY?), and the number of allocated entries in the array may prove too small.
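
Both drawbacks show up directly in a sketch of such an entry. The capacity kMaxRefs = 4 and the names FixedSecondaryEntry, addReference and unusedSlots are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

constexpr std::size_t kMaxRefs = 4;  // assumed capacity per secondary key

// One secondary index entry with a fixed-size array of primary keys.
struct FixedSecondaryEntry {
    std::string secondaryKey;
    std::array<std::string, kMaxRefs> primaryKeys;  // unused slots stay empty
    std::size_t used = 0;
};

// Adds a primary key to the entry. Returns false when the array is full,
// which is the "too small" case mentioned above.
bool addReference(FixedSecondaryEntry& entry, const std::string& primaryKey) {
    if (entry.used == kMaxRefs) {
        return false;
    }
    entry.primaryKeys[entry.used++] = primaryKey;
    return true;
}

// Slots that are allocated but hold no key: the internal fragmentation.
std::size_t unusedSlots(const FixedSecondaryEntry& entry) {
    return kMaxRefs - entry.used;
}
```

A composer with one recording wastes three slots in this sketch, while a composer with five recordings does not fit at all.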

Improvements to the Secondary Index Structure: Solution 2
Create an inverted list. Have each secondary key point to a list of the primary key references associated with it. This method eliminates most of the problems associated with maintaining a secondary index file. WHY?
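
One possible in-memory model of an inverted list is sketched below. It assumes that each secondary key stores the position of the first node in a separate list of references and that each node links to the next one. The names ListNode and InvertedIndex are illustrative. A std::map stands in for the sorted secondary index and a std::vector for the list file.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct ListNode {
    std::string primaryKey;
    int next;  // position of the next node for the same key, -1 ends the list
};

struct InvertedIndex {
    std::map<std::string, int> heads;  // secondary key -> first node
    std::vector<ListNode> nodes;       // stands in for the list file

    // Appends a node and makes it the new head of the key's list.
    // Existing nodes never move, so nothing has to be re-sorted.
    void add(const std::string& secondaryKey, const std::string& primaryKey) {
        auto it = heads.find(secondaryKey);
        const int oldHead = (it == heads.end()) ? -1 : it->second;
        nodes.push_back(ListNode{primaryKey, oldHead});
        heads[secondaryKey] = static_cast<int>(nodes.size()) - 1;
    }

    // Follows the links and returns the primary keys, newest first.
    std::vector<std::string> lookup(const std::string& secondaryKey) const {
        std::vector<std::string> result;
        auto it = heads.find(secondaryKey);
        int i = (it == heads.end()) ? -1 : it->second;
        while (i != -1) {
            const ListNode& node = nodes[static_cast<std::size_t>(i)];
            result.push_back(node.primaryKey);
            i = node.next;
        }
        return result;
    }
};
```

In this sketch each secondary key is stored once, the lists grow without a preset limit, and adding a reference for an existing key touches only the list and one head pointer. New references are linked at the front, so a list is not kept in primary key order. Keeping it sorted would need an ordered insertion instead.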

Selective Index
A selective index contains keys for only a portion of the records in the data file. Such an index provides the user with a view of a specific subset of the file’s records (e.g. all CDs of Beethoven’s work produced in 1998).
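
Building a selective index amounts to applying a selection condition while the index entries are generated. In the sketch below, RecordInfo is a hypothetical, simplified record: its year field is assumed for the sake of the example and is not part of the Recording class. IndexEntry and buildSelectiveIndex are also illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;  // canonical primary key
    long address;     // address of the record in the data file
};

// Hypothetical simplified record; `year` is assumed for this example.
struct RecordInfo {
    std::string key;
    std::string composer;
    int year;
    long address;
};

// Builds an index that covers only the records accepted by `selected`.
std::vector<IndexEntry> buildSelectiveIndex(
        const std::vector<RecordInfo>& records,
        const std::function<bool(const RecordInfo&)>& selected) {
    std::vector<IndexEntry> index;
    for (const RecordInfo& r : records) {
        if (selected(r)) {
            index.push_back({r.key, r.address});
        }
    }
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; });
    return index;
}
```

The condition is passed in as a function object, so the same routine can build any number of views, for example one selecting records whose composer is Beethoven and whose year is 1998.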

Binding
Binding takes place when a key is associated with a particular physical record in the data file. This can happen either during the preparation of the data file and indexes, or later, during program execution.