File Processing - Indexing MVNC1 Indexing Jim Skon
Indexing
• Index structures can greatly speed access
• Consider a library card catalog
  » Allows quick access to books
  » Why not just order books by author name?
• Actually three indexes:
  » Author
  » Topic
  » Title

Indexing
• Simple Index
  » Provides a shortcut, based on a key value, to the desired record
  » Each index is based on the value of one or more key fields
  » Can have indexes for any key field
[Figure: Index → File]

Indexing
• Multiple Indexes
  » May have indexes for more than one field
[Figure: Index → File ← Index]

Indexing
• Example: Record Albums
  » Record label
  » Record ID
  » Title
  » Composer(s)
  » Artist(s)
• Primary key: Record label + Record ID

Indexing
• Consider an index file whose records contain:
  » Primary Key (Record label + Record ID)
  » Byte Offset
• Index sorted in primary key order

Operations in indexed file
• Retrieving record
  » Search index file (perhaps using binary search)
  » Seek in main file to the byte offset specified in index
  » Read record from main file

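The three retrieval steps can be sketched in Python as below. This is only an illustration under stated assumptions: the newline-terminated record layout, the in-memory `io.BytesIO` stand-in for the main file, the function name `retrieve`, and the sample keys and titles are all made up for the example.

```python
import bisect
import io

# Hypothetical main file: one "key|title" record per line, in arrival order.
data_file = io.BytesIO()
index = []  # (primary_key, byte_offset) pairs
for key, title in [("COL3345", "Symphony No. 9"),
                   ("ANG3795", "Violin Concerto"),
                   ("DG18807", "Symphony No. 9")]:
    index.append((key, data_file.tell()))
    data_file.write(f"{key}|{title}\n".encode("ascii"))
index.sort()  # the index is kept in primary key order

def retrieve(key):
    """Binary-search the index, seek to the stored offset, read the record."""
    pos = bisect.bisect_left(index, (key,))
    if pos == len(index) or index[pos][0] != key:
        return None  # key not in the index
    data_file.seek(index[pos][1])
    return data_file.readline().decode("ascii").rstrip("\n")
```

Note that the main file itself is never sorted; only the small index is.
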
Operations in indexed file
• Create the empty index and data files
• Load the index file into memory
• Rewrite the index file after index change
• Add records to the file and index
• Delete records from data file
• Update records in data file

Operations in indexed file
• Create the empty index and data files
  » Create new files
  » Write header records indicating number of records

Operations in indexed file
• Load the index file into memory
  » Simply read the index in sequential order, placing the entries into an array of (key, offset) structures
  » Since the records are small, could read several records at once

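A minimal sketch of loading the index, assuming a binary layout that the slides do not specify: a 4-byte record count header followed by fixed-size entries of a 12-byte key and an 8-byte offset. The names `HEADER`, `ENTRY`, and `load_index` are hypothetical.

```python
import io
import struct

# Assumed layout (not from the slides): record count, then fixed-size entries.
HEADER = struct.Struct("<I")    # number of index entries
ENTRY = struct.Struct("<12sQ")  # 12-byte key, 8-byte byte offset

def load_index(f):
    """Read the whole index file into a list of (key, offset) tuples."""
    f.seek(0)
    (count,) = HEADER.unpack(f.read(HEADER.size))
    block = f.read(count * ENTRY.size)  # one read covers every entry
    return [(key.rstrip(b"\0").decode("ascii"), offset)
            for key, offset in ENTRY.iter_unpack(block)]

# Build a small index file in memory to exercise the loader.
index_file = io.BytesIO()
index_file.write(HEADER.pack(2))
index_file.write(ENTRY.pack(b"ANG3795", 0))
index_file.write(ENTRY.pack(b"COL3345", 64))
index = load_index(index_file)
```

Reading all entries with a single `read` call is one way to act on the remark that several small records can be read at once.
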
Operations in indexed file
• Rewrite the index file after index change
  » Need only be done after index changes
  » Simply iterate through array, writing to index file
  » Can be done after EVERY change
  » Could wait until files are ready to be closed
    – Need to keep track of whether file version is out of date

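One possible way to track whether the file version is out of date is a "dirty" flag, sketched below. The class `MemoryIndex`, its `writes` counter, and the binary layout (4-byte count, then 12-byte key plus 8-byte offset) are assumptions for illustration only.

```python
import io
import struct

HEADER = struct.Struct("<I")    # assumed header: number of entries
ENTRY = struct.Struct("<12sQ")  # assumed entry: 12-byte key, 8-byte offset

class MemoryIndex:
    """In-memory index that rewrites its file only when it has changed."""

    def __init__(self, index_file):
        self.index_file = index_file
        self.entries = []   # sorted (key, offset) pairs
        self.dirty = False  # True when the file version is out of date
        self.writes = 0     # counts actual rewrites, for demonstration

    def add(self, key, offset):
        self.entries.append((key, offset))
        self.entries.sort()
        self.dirty = True

    def flush(self):
        """Rewrite the index file, but only if something changed."""
        if not self.dirty:
            return
        self.index_file.seek(0)
        self.index_file.truncate()
        self.index_file.write(HEADER.pack(len(self.entries)))
        for key, offset in self.entries:
            self.index_file.write(ENTRY.pack(key.encode("ascii"), offset))
        self.dirty = False
        self.writes += 1
```

Calling `flush()` after every change gives the safest file; calling it once before closing gives the fewest writes.
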
Operations in indexed file
• Add records to the file and index
  » Add record to main file
    – Next free record
    – Maybe a linked list of “unused” records could be used to keep track of available records.
    – Record order of main file unimportant
  » Add record to index
    – Requires moving down later records to keep file sorted
    – Could put at end, sorting occasionally.

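A minimal sketch of adding a record, assuming the simplest case where the new record is appended at the end of the main file (no free list) and the index is a sorted in-memory list. The names `add_record`, `data_file`, and `index`, and the "key|field" record format, are hypothetical.

```python
import bisect
import io

data_file = io.BytesIO()  # stand-in for the main file
index = []                # sorted (primary_key, byte_offset) pairs

def add_record(key, fields):
    """Append the record to the main file and insert its index entry in order."""
    pos = bisect.bisect_left(index, (key,))
    if pos < len(index) and index[pos][0] == key:
        raise KeyError(f"duplicate primary key: {key}")
    offset = data_file.seek(0, io.SEEK_END)  # main file order is unimportant
    data_file.write(("|".join([key, *fields]) + "\n").encode("ascii"))
    index.insert(pos, (key, offset))  # later entries shift down by one
    return offset

add_record("COL3345", ["Symphony No. 9"])
add_record("ANG3795", ["Violin Concerto"])
```

`list.insert` performs the "moving down later records" step in memory; on a disk-resident index the same shift would mean rewriting every later entry.
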
Operations in indexed file
• Delete records from data file
  » Delete in main file
    – Mark record
    – Perhaps link into list of free records
  » Delete in index
    – Perhaps move every later record down one
    – Perhaps just mark as deleted
      • Could still search if key field still intact

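The sketch below illustrates one combination of these choices: the record is marked in the main file, its slot is remembered for reuse, and the index entry is removed outright. The fixed 32-byte record size, the one-byte status markers, and the Python list `free_slots` (standing in for a linked list of free records) are all assumptions.

```python
import bisect
import io

RECORD_SIZE = 32           # assumed fixed record length, status byte included
LIVE, DELETED = b" ", b"*"  # assumed status markers

data_file = io.BytesIO()
index = []       # sorted (primary_key, byte_offset) pairs
free_slots = []  # offsets of deleted records, available for reuse

def add_record(key, text):
    body = f"{key}|{text}".encode("ascii")[:RECORD_SIZE - 1]
    if free_slots:
        offset = free_slots.pop()                # reuse a deleted slot
    else:
        offset = data_file.seek(0, io.SEEK_END)  # otherwise append
    data_file.seek(offset)
    data_file.write(LIVE + body.ljust(RECORD_SIZE - 1))
    bisect.insort(index, (key, offset))
    return offset

def delete_record(key):
    pos = bisect.bisect_left(index, (key,))
    if pos == len(index) or index[pos][0] != key:
        return False
    _, offset = index.pop(pos)  # later index entries move down one
    data_file.seek(offset)
    data_file.write(DELETED)    # mark the record in the main file
    free_slots.append(offset)
    return True

add_record("ANG3795", "Violin Concerto")
add_record("COL3345", "Symphony No. 9")
delete_record("ANG3795")
add_record("DG18807", "Symphony No. 9")  # lands in the freed slot at offset 0
```
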
Operations in indexed file
• Update records in data file
  » If change involves key field
    – Will need to move entry in index
    – Can be thought of as a delete followed by an insert
  » If change does not change key field
    – Case one - record does not move
      • Just rewrite record
      • Index unchanged
    – Case two - record changes position
      • Perhaps the record is variable size, and it grows
      • Index will have to be changed to reflect new position
      • Position of reference in index unchanged

Indexes too large to keep in memory
• Searching
  » Binary searching requires several reads
  » Not much better than searching a sorted complete file
• Updating
  » Index updates can require rewriting much of the file
  » Orders of magnitude more expensive than in-memory index management

Indexes too large to keep in memory
• In such cases consider
  » A hash file system
  » A tree-structured index (e.g. a B-tree)
• However, a file-based index still has benefits
  » Allows binary searching on unordered file
  » Allows binary searching on variable length records
  » Indexes are smaller than main files, so somewhat cheaper to manipulate
  » Allows file “rearrangement” without moving actual records. (Consider when pinned)

Indexing with multiple keys
• Consider an additional index for access to album file by composer
• Secondary index: fields
  » Composer
  » Offset into main file
• Problem
  » Every time a record is moved in main file, ALL indexes must change
  » The indexes pin the records!

Indexing with multiple keys
• Secondary index pinning - solution
  » Refer to primary key rather than offset to actual record
  » Now the secondary key index doesn’t reference actual records, so records are not pinned.
  » Main file can be reorganized without changing secondary index

Indexing with multiple keys
• Searching by secondary index
  » Search secondary index (binary search?)
  » If found, use associated primary key to look up record in primary index
  » Use offset in primary index to look up actual record
• Remember - the secondary key may have multiple matches (e.g. Beethoven)
  » A secondary key can be thought of as referring to a subset of records

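The two-level lookup can be sketched as follows. The sample index contents (keys, offsets, and which composer goes with which album) are invented for the example, and the function names are hypothetical.

```python
import bisect

# Primary index: (primary_key, byte_offset), sorted by primary key.
primary_index = [("ANG3795", 0), ("COL3345", 40), ("DG18807", 80)]
# Secondary index: (composer, primary_key), sorted by composer, then primary key.
composer_index = [("Beethoven", "ANG3795"),
                  ("Beethoven", "DG18807"),
                  ("Dvorak", "COL3345")]

def primary_keys_for(secondary_index, value):
    """Binary-search to the first match, then collect the run of duplicates."""
    pos = bisect.bisect_left(secondary_index, (value,))
    keys = []
    while pos < len(secondary_index) and secondary_index[pos][0] == value:
        keys.append(secondary_index[pos][1])
        pos += 1
    return keys

def offset_for(primary_key):
    """Look up a primary key in the primary index; None if absent."""
    pos = bisect.bisect_left(primary_index, (primary_key,))
    if pos < len(primary_index) and primary_index[pos][0] == primary_key:
        return primary_index[pos][1]
    return None

def offsets_by_composer(composer):
    """Secondary index -> primary keys -> byte offsets in the main file."""
    offsets = []
    for pk in primary_keys_for(composer_index, composer):
        offset = offset_for(pk)
        if offset is not None:
            offsets.append(offset)
    return offsets
```

Because the secondary index stores primary keys rather than offsets, moving a record only requires updating `primary_index`.
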
Indexing with multiple keys
• Adding new records
  » Add record in main file and primary index as before
  » Add entry in primary index
  » Add entry in secondary file
    – As before, shift data as needed.
    – Index entries with duplicate keys are stored together.
    – Duplicates should be stored in primary key order

Indexing with multiple keys
• Deleting records
  » Remove entry from all secondary indexes
    – Costly if many secondary indexes
  » Or simply leave in secondary indexes
    – Search in primary index will fail, indicating record not available
    – Failed searches take longer, but file management is simpler (faster)

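A small sketch of the second option, where stale entries stay in the secondary index and are filtered out when the primary lookup fails. The sample data is invented, a `dict` stands in for the sorted primary index, and the linear scan is a simplification of the binary search a real index would use.

```python
# Primary index after COL3345 was deleted (dict used for brevity).
primary_index = {"ANG3795": 0, "DG18807": 80}

# The composer index was NOT touched by the delete, so it still
# holds a stale reference to COL3345.
composer_index = [("Beethoven", "ANG3795"),
                  ("Beethoven", "COL3345"),
                  ("Beethoven", "DG18807")]

def live_offsets(composer):
    """Return offsets of matching records, skipping deleted ones."""
    offsets = []
    for name, primary_key in composer_index:
        if name != composer:
            continue
        offset = primary_index.get(primary_key)
        if offset is not None:  # a failed primary lookup means "deleted"
            offsets.append(offset)
    return offsets
```
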
Indexing with multiple keys
• Updating records
  » The fact that secondary indexes refer to the primary key insulates secondary indexes from most updates
    – Records can move in main file without affecting secondary index
  » Change in secondary key
    – If a secondary key value changes, then we must change the key value in secondary index, requiring secondary index reordering
    – Other secondary indexes unchanged

Indexing with multiple keys
• Updating records
  » Change of primary key value
    – All secondary indexes must be updated to refer to the new key value
    – Since the secondary key is unchanged, no reorganization required in secondary indexes - just rewrite index entries in same spot
    – Usually one index entry needs updating per secondary index.
    – The main record itself will simplify looking up the associated reference in each secondary index!

Retrieval using combinations of secondary keys
• Consider:
  » Find all records with ID COL3345
  » Find all records of Beethoven’s work
  » Find all records of “Violin Concerto”
• Each requires only a single index!

Retrieval using combinations of secondary keys
• Now consider:
  » Find all records with composer = “Beethoven” and title = “Symphony No. 9”.
• Method one:
  » Search composer index for those matching Beethoven. This yields a list of primary keys.
  » Next search title index for those matching “Symphony No. 9”. This also yields a list of primary keys.
  » Now intersect the two primary key lists. The result is a list of primary keys for the records that match the query.

Retrieval using combinations of secondary keys
• General Strategies
  » “and” queries: Intersect primary key lists
  » “or” queries: Union primary key lists
• Point: Complex queries can be performed accessing only the matching records!

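Both strategies reduce to set operations on primary key lists, as in this sketch. The dictionaries standing in for the composer and title indexes, the sample keys, and the names `match_all` and `match_any` are made up for illustration.

```python
# Hypothetical secondary indexes, each mapping a key value to primary keys.
composer_index = {"Beethoven": ["ANG3795", "DG18807", "RCA2626"]}
title_index = {"Symphony No. 9": ["ANG3795", "COL31809", "DG18807"]}

def match_all(*key_lists):
    """'and' query: primary keys present in every list."""
    return sorted(set.intersection(*map(set, key_lists)))

def match_any(*key_lists):
    """'or' query: primary keys present in at least one list."""
    return sorted(set().union(*key_lists))

both = match_all(composer_index["Beethoven"], title_index["Symphony No. 9"])
either = match_any(composer_index["Beethoven"], title_index["Symphony No. 9"])
```

Only the keys in `both` need to be looked up in the primary index, so the main file is touched only for records that satisfy the whole query.
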
Secondary index problems
• Consider problems with this secondary index structure:
  » We have to rearrange the index file every time a new record is added!
    – If we add a new version of Beethoven’s Symphony No. 9, we would have to add a new element to both the composer and the title indexes
  » If there are duplicate secondary keys, the secondary key value is stored in the secondary index once for every record with the secondary key!
    – Beethoven is stored in secondary index once for every Beethoven record in the main file.
    – Waste of space!

Inverted lists
• Solution one:
  » Increase secondary index record size to include a list of all primary keys with matching values.
  » Solves the two problems
  » Introduces problems:
    – Records must be large enough for maximum size list
    – Wastes space!
• This is an Inverted List

Inverted lists
• Solution Two:
  » The Bible Index is a type of Inverted List
    – Works OK since never updated
    – If updates needed, MANY records would have to be moved

Inverted lists
• Solution Three:
  » Secondary index has:
    – A list of secondary keys (all unique)
    – Each entry contains a pointer to a list of primary key references
  » Now each key value stored exactly once
  » But how do we maintain the lists of primary key references?
• Solution - linked lists!

Inverted lists
• Inverted lists with linked lists of references
• Two data structures
  » A list of secondary keys, with pointers into a list of references
  » A list of references, each with a (next) pointer, which refers to another reference in list, or null

Inverted lists
• The secondary key list has no more entries than the number of distinct secondary key values
  » Can often be stored in RAM
  » Lookups - binary search
• The reference list can be stored in a file
  » Unused records are maintained as a linked list of free records
  » Records are added by unlinking them from the free list and linking them into the appropriate secondary key’s list.
  » A record can be deleted by removing it from the key’s linked list and linking it into the free list.

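The two structures and the free list can be sketched in memory as below. This is an illustration only: a `dict` stands in for the sorted secondary key list, a Python list of `[primary_key, next]` pairs stands in for the reference file, and the class and method names are hypothetical.

```python
NIL = -1  # "null" next pointer

class InvertedList:
    def __init__(self):
        self.heads = {}       # secondary key -> slot of first reference
        self.refs = []        # reference records: [primary_key, next_slot]
        self.free_head = NIL  # head of the linked list of free slots

    def _allocate(self, primary_key, next_slot):
        if self.free_head != NIL:  # unlink a slot from the free list
            slot = self.free_head
            self.free_head = self.refs[slot][1]
            self.refs[slot] = [primary_key, next_slot]
        else:                      # no free slot: extend the reference list
            slot = len(self.refs)
            self.refs.append([primary_key, next_slot])
        return slot

    def add(self, secondary_key, primary_key):
        """Link a new reference into the key's list, in primary key order."""
        prev, cur = NIL, self.heads.get(secondary_key, NIL)
        while cur != NIL and self.refs[cur][0] < primary_key:
            prev, cur = cur, self.refs[cur][1]
        slot = self._allocate(primary_key, cur)
        if prev == NIL:
            self.heads[secondary_key] = slot
        else:
            self.refs[prev][1] = slot

    def remove(self, secondary_key, primary_key):
        """Unlink a reference and push its slot onto the free list."""
        prev, cur = NIL, self.heads.get(secondary_key, NIL)
        while cur != NIL and self.refs[cur][0] != primary_key:
            prev, cur = cur, self.refs[cur][1]
        if cur == NIL:
            return False
        nxt = self.refs[cur][1]
        if prev != NIL:
            self.refs[prev][1] = nxt
        elif nxt != NIL:
            self.heads[secondary_key] = nxt
        else:
            del self.heads[secondary_key]  # last reference for this key
        self.refs[cur] = [None, self.free_head]
        self.free_head = cur
        return True

    def lookup(self, secondary_key):
        keys, cur = [], self.heads.get(secondary_key, NIL)
        while cur != NIL:
            keys.append(self.refs[cur][0])
            cur = self.refs[cur][1]
        return keys
```

Adding or removing a reference changes only a few pointers; no other reference record moves, which is the property the fixed-size inverted list lacked.
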
Selective indexes
• Consider a “special” index for Christian music
• The index(es) would only contain references to albums which are considered Christian.

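A selective index can be built by applying a filter while scanning the main file, as in this sketch. The `genre` field is an assumption (the album record described earlier does not list one), and the sample albums and the name `build_selective_index` are invented.

```python
# Hypothetical scan of the main file: (primary_key, byte_offset, fields).
albums = [
    ("ANG3795", 0, {"title": "Album A", "genre": "classical"}),
    ("COL3345", 64, {"title": "Album B", "genre": "christian"}),
    ("DG18807", 128, {"title": "Album C", "genre": "christian"}),
]

def build_selective_index(albums, predicate):
    """Index only the records for which predicate(fields) is true."""
    return sorted((key, offset)
                  for key, offset, fields in albums
                  if predicate(fields))

christian_index = build_selective_index(
    albums, lambda fields: fields["genre"] == "christian")
```
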