Incremental Indexing
Dr. Susan Gauch
Indexing
- Current indexing algorithms are essentially batch processes: they start from scratch every time.
- What happens if we have already indexed a million documents and then add 1 document to the collection? We do not want to index 1,000,001 documents from scratch.
- Web search engines have spiders/crawlers/robots continually collecting new content.
- We need a way to add a new document to existing inverted files.
Adding a Document
- This can cause two types of changes:
  - Adding a new word
  - Adding an occurrence of an existing word
Adding a New Word
- This is the easiest type of change:
  - Fill in a new entry in the dict file for the word.
  - Append its postings to the end of the post file.
- If the dict file is 1/3 full after the indexing phase, we can add many words before the blank dict records are used up.
- Over time, the probability of a collision increases, slowing down retrieval.
- When the dict file is > 2/3 full, rehash on disk:
  - Essentially, create a new dict file twice as big.
  - Rehash all dict file records to their new locations.
  - Lots of I/O, but it can be done in the background or on a separate computer.
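A minimal sketch of the on-disk rehash in C++, assuming an open-addressed dict file of fixed-size records with linear probing; the record layout, field sizes, and the djb2 hash are illustrative choices, not details from the lecture:

    #include <cstdio>
    #include <string>
    #include <vector>

    struct DictRec {            // one fixed-size dict file record
        char token[41];         // token[0] == '\0' marks a blank slot
        int  numdocs;           // document frequency for the token
        long start;             // offset of first posting in the post file
    };

    // djb2 string hash, reduced to the number of dict slots.
    static size_t hashToken(const std::string& t, size_t slots) {
        size_t h = 5381;
        for (char c : t) h = h * 33 + (unsigned char)c;
        return h % slots;
    }

    // Rehash every non-blank record of `old` into a table twice as large.
    std::vector<DictRec> rehash(const std::vector<DictRec>& old) {
        std::vector<DictRec> fresh(old.size() * 2);   // all-blank records
        for (const DictRec& r : old) {
            if (r.token[0] == '\0') continue;         // skip blank slots
            size_t i = hashToken(r.token, fresh.size());
            while (fresh[i].token[0] != '\0')         // linear probing
                i = (i + 1) % fresh.size();
            fresh[i] = r;                             // record's new home
        }
        return fresh;   // in practice, streamed out as the new dict file
    }

    int main() {
        std::vector<DictRec> dict(1000);
        size_t used = 700;                  // suppose 700 of 1000 slots taken
        if (used * 3 > dict.size() * 2)     // more than 2/3 full: rehash
            dict = rehash(dict);
        std::printf("dict now has %zu slots\n", dict.size());
    }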
Adding a New Occurrence
- The change to the dict file is trivial: just increment numdocs.
- The change to the post file is catastrophic:
  - We need to add a new posting record, but we cannot insert a record in the middle of a file.
  - The idf for the word is different (idf = log(N / numdocs), and numdocs just changed).
  - All postings for that word now have the wrong term weights.
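A quick worked example with made-up counts: suppose N = 1,000,000 and "dog" occurs in 1,000 documents, so idf = log(1,000,000 / 1,000) = 3 (base 10). Adding one new document containing "dog" gives idf = log(1,000,001 / 1,001) ≈ 2.9996, so every stored wt = tf * idf for "dog" is now slightly stale.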
Adding Posting Records, Option 1: Blank Records
- Write blank records after the existing postings.
- The number of blank records should be proportional to the number of existing postings:
  - E.g., if "dog" has 3 postings, write scale_factor * 3 blank records after the 3 real postings; if "many" has 100 postings, write scale_factor * 100 blank records after the 100 real ones.
- This allows for scale_factor expansion.
- The first word to accumulate more than numdocs * scale_factor new postings causes the entire post file to be rewritten with new blanks inserted.
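A sketch of Option 1's write path, assuming fixed-size (docid, wt) post records; writePostings, SCALE_FACTOR, and the sentinel docid of -1 are hypothetical names for illustration:

    #include <fstream>
    #include <vector>

    struct Posting { int docid; float wt; };

    const int SCALE_FACTOR = 2;          // each term gets 2x its postings as blanks
    const Posting BLANK = { -1, 0.0f };  // docid -1 marks an unused slot

    // Write one term's postings followed by proportional blank space;
    // returns the offset of the first record (the dict "start" field).
    long writePostings(std::ofstream& post, const std::vector<Posting>& p) {
        long start = (long)post.tellp();
        for (const Posting& rec : p)                          // real postings
            post.write((const char*)&rec, sizeof rec);
        for (size_t i = 0; i < p.size() * SCALE_FACTOR; ++i)  // blanks
            post.write((const char*)&BLANK, sizeof BLANK);
        return start;
    }

    int main() {
        std::ofstream post("post.bin", std::ios::binary);
        std::vector<Posting> dog = { {1, 0.5f}, {4, 0.3f}, {9, 0.7f} };
        writePostings(post, dog);   // writes 3 real + 6 blank records
    }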
Adding Posting Records, Option 2: Move Postings
- Copy the existing postings for the word to the end of the post file.
- Append the new posting there.
- Update the dict record's "start" index to the new location.
- This causes a lot of data movement, and the post file becomes fragmented.
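A sketch of Option 2 under the same assumed record layout; moveToEnd is a hypothetical helper, not a routine from the lecture:

    #include <fstream>
    #include <vector>

    struct Posting { int docid; float wt; };

    // Copy a term's postings to the end of the post file, append the
    // new posting, and return the new "start" offset for the dict record.
    long moveToEnd(std::fstream& post, long start, int numdocs,
                   const Posting& fresh) {
        std::vector<Posting> recs(numdocs);
        post.seekg(start);
        post.read((char*)recs.data(), numdocs * (long)sizeof(Posting));

        post.seekp(0, std::ios::end);             // append at end of file
        long newStart = (long)post.tellp();
        post.write((const char*)recs.data(), numdocs * (long)sizeof(Posting));
        post.write((const char*)&fresh, sizeof fresh);
        return newStart;   // the old records are now dead (fragmented) space
    }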
Adding Posting Records, Option 3: Overflow Pointer
- Change the post record format to include an overflow pointer (a record number / block address).
- Add the new posting at the end of the post file, or in a separate overflow file.
- While processing post records, loop over the numdocs records for the term:
  - If the overflow pointer is null, the next record is the following one (next = i + 1).
  - Else, the next record is the overflow location (next = overflow_location).
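A sketch of the Option 3 read loop, assuming each post record carries an overflow field holding a record number (-1 when unused):

    #include <fstream>
    #include <vector>

    struct PostRec { int docid; float wt; long overflow; };  // -1 = no overflow

    // Collect a term's numdocs postings: records are read sequentially
    // until one carries an overflow pointer, which redirects the scan.
    std::vector<PostRec> readPostings(std::fstream& post, long start,
                                      int numdocs) {
        std::vector<PostRec> out;
        long rec = start;                        // current record number
        for (int i = 0; i < numdocs; ++i) {
            PostRec r;
            post.seekg(rec * (long)sizeof(PostRec));
            post.read((char*)&r, sizeof r);
            out.push_back(r);
            rec = (r.overflow == -1) ? rec + 1       // next sequential record
                                     : r.overflow;   // jump to overflow area
        }
        return out;
    }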
Adding Posting Records, Option 4: Next Pointers
- A variation of Option 3: every post record carries its own next pointer.
- While processing post records:
  - Seek to start; read docid, wt, next.
  - While next != -1: seek to next; read docid, wt, next.
- Allows infinite expandability.
- Can degenerate into the equivalent of a linked list on disk, with one seek per post record.
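The same traversal for Option 4, following the slide's read loop; the record layout is again an assumption:

    #include <fstream>
    #include <vector>

    struct PostRec { int docid; float wt; long next; };  // -1 ends the chain

    // Follow the per-record next pointers from "start" to the end of
    // the chain: worst case, one disk seek per posting.
    std::vector<PostRec> readChain(std::fstream& post, long start) {
        std::vector<PostRec> out;
        for (long rec = start; rec != -1; ) {
            PostRec r;
            post.seekg(rec * (long)sizeof(PostRec));
            post.read((char*)&r, sizeof r);
            out.push_back(r);
            rec = r.next;
        }
        return out;
    }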
Handling idf
- Updating numdocs changes idf, which in turn changes wt for all postings for the term:
  - Read all postings for the term, change each wt, and rewrite the postings.
- If doing proper document length normalization, every document containing this term now has a new length:
  - We must recalculate the norm factor and rewrite the postings for all terms in each affected document.
  - Infeasible: we have no way to find all postings for a document without reading the whole post file, or without adding a new file that maps docid -> postings (roughly doubling the size of the inverted index).
A Better Idea
- Calculate term weights on the fly: store rtf in the posting record, prenormalized by document length.
- The query-time accumulation loop
      Loop over postings
          Acc[docid] += wt
  becomes
      Calc idf from current value of numdocs
      Loop over postings
          Acc[docid] += rtf * idf
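A sketch of that query-time loop, assuming postings store rtf (term frequency prenormalized by document length) and that numdocs equals the number of postings read for the term:

    #include <cmath>
    #include <unordered_map>
    #include <vector>

    struct Posting { int docid; float rtf; };  // rtf: tf / document length

    // Accumulate scores for one query term. idf is recomputed from the
    // current numdocs (= postings.size()), so stored postings never go
    // stale when documents are added.
    void accumulate(std::unordered_map<int, float>& acc,
                    const std::vector<Posting>& postings, long N) {
        float idf = std::log((float)N / (float)postings.size());
        for (const Posting& p : postings)
            acc[p.docid] += p.rtf * idf;
    }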
Scalability (or How Google Does It)
- Create overflow areas that are larger than 1 record, and make them variable sizes.
- Store a few postings in the dict file itself:
  - The dict record becomes: token, numdocs, idf, P postings, next.
  - Pick P so that the dict record is 0.5 or 1 block in size (e.g., P = 100).
- Create Small, Medium, and Large overflow files.
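A sketch of such a dict record as a struct; the token width and field order are assumptions beyond the slide's (token, numdocs, idf, P postings, next) list:

    // P = 100 follows the slide's example, so the record fills roughly one block.
    const int P = 100;

    struct Posting { int docid; float rtf; };

    struct DictRec {
        char    token[40];     // the word itself
        int     numdocs;       // total postings for the token
        float   idf;           // cached; can also be recomputed on the fly
        Posting postings[P];   // first P postings live right in the dict
        long    next;          // record number in the Small overflow file,
                               // or -1 if the inline postings suffice
    };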
Variable Overflows
- If a term has > P postings, allocate a record in the Small overflow file:
  - Record format: S postings, next.
  - Pick S so that the record fits in 1 block, or pick S so that 50% of all tokens can be processed without going to the Medium overflow file.
- If a term has > P + S postings, allocate a record in the Medium overflow file:
  - Record format: M postings, next.
  - Pick M so that 90% of all tokens can be processed without going to the Large overflow file.
Variable Overflows (continued)
- If a term has > P + S + M postings, allocate a record in the Large overflow file:
  - Record format: L postings, next.
  - Pick L so that 99% of all tokens can be processed without going to a second Large record.
- If a term has > P + S + M + L postings, allocate another record at the end of the Large file:
  - The next pointer just points to the next Large record.
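A sketch of gathering a term's postings across the overflow chain, assuming fixed-capacity records that differ only in their capacity between the Small, Medium, and Large files:

    #include <algorithm>
    #include <fstream>
    #include <vector>

    struct Posting { int docid; float rtf; };

    // One overflow record: CAP postings plus a pointer to the next
    // record in the chain (Small -> Medium -> Large -> Large -> ...).
    template <int CAP>
    struct Overflow { Posting postings[CAP]; long next; };

    // Read one overflow record, take as many postings as are still
    // owed to this term, and return the next pointer in the chain.
    template <int CAP>
    long readOverflow(std::fstream& file, long rec, int& remaining,
                      std::vector<Posting>& out) {
        Overflow<CAP> o;
        file.seekg(rec * (long)sizeof o);
        file.read((char*)&o, sizeof o);
        int take = std::min(remaining, CAP);
        out.insert(out.end(), o.postings, o.postings + take);
        remaining -= take;
        return o.next;
    }

With S, M, and L tuned so that 50%, 90%, and 99% of tokens stop at each level, almost every query term is resolved with at most one extra seek beyond its dict record.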