WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING
INTRODUCTION
Searching for a basic query can be done in 2 ways:
- Sequential (online) searching: scan the text sequentially to find the occurrences of a pattern when the text is not preprocessed
  Good when the text is small, the text collection is volatile (modified frequently), or no indexing space is available
- Indexed searching: build data structures over the text (indexes) to speed up the search
  Good to build and maintain an index when the text collection is large and semi-static (updated at reasonably regular intervals)
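As a rough illustration of sequential searching, the sketch below scans an unpreprocessed text for every occurrence of a pattern. The function name `sequential_search` and the sample sentence are assumptions made for this example, not part of the course material.

```python
def sequential_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start offset of every occurrence of pattern in text."""
    if not pattern:
        return []
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are reported too.
        start = text.find(pattern, start + 1)
    return positions


hits = sequential_search("a text has many words. words are made from letters.", "words")
```

No index is built, so every query rescans the whole text. That is acceptable for small or frequently changing collections but becomes costly as the text grows.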
INDEXING
- Key weight: frequency dependent; determines the ranking (best match)
- tf*idf weighting
  tf: frequency of the key in a document
  idf: inverse document frequency, based on the inverse of the number of documents containing the key
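A minimal sketch of tf*idf weighting follows. It assumes the common logarithmic variant idf = log(N / df), where N is the number of documents and df the number of documents containing the key; the slide does not fix a particular formula, so this choice, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {key: weight} mapping per document."""
    n_docs = len(docs)
    # df[key] = number of documents that contain the key at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf[key] = frequency of the key in this document
        # Assumed variant: weight = tf * log(N / df).
        weights.append({key: tf[key] * math.log(n_docs / df[key]) for key in tf})
    return weights


docs = [["index", "search", "index"], ["search", "query"], ["index", "trie"]]
weights = tf_idf(docs)
```

A key that is frequent in one document but rare in the collection gets a high weight, while a key that appears in every document gets weight 0 under this variant.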
AUTOMATIC INDEXING PROCESS
Text representation:
- Recognize strings
- Delete stopwords
- Identify stems
- Replace stems by identifiers
- Count postings
- Weight
- Use thesaurus and phrases
AUTOMATIC INDEXING PROCESS
In the process:
- Stem identification: word normalization, NLP
- Short codes are used as identifiers
- Thesaurus: rare stems are clustered
- Phrases: frequent stems are combined into less frequent phrases
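The first steps of the process (recognize strings, delete stopwords, identify stems, replace stems by identifiers, count postings) can be sketched as below. The stopword list and the toy suffix stripper `crude_stem` are assumptions that stand in for a real stopword list and stemmer; weighting, thesaurus and phrase handling are left out.

```python
import re
from collections import Counter

# Illustrative stopword list (assumption, not taken from the slides).
STOPWORDS = {"a", "the", "is", "are", "of", "and", "in"}


def crude_stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_document(text: str, stem_ids: dict[str, int]) -> Counter:
    words = re.findall(r"[a-z]+", text.lower())       # recognize strings
    kept = [w for w in words if w not in STOPWORDS]    # delete stopwords
    stems = [crude_stem(w) for w in kept]              # identify stems
    # Replace stems by short integer identifiers, assigning new ones on first sight.
    ids = [stem_ids.setdefault(s, len(stem_ids)) for s in stems]
    return Counter(ids)                                # count postings per identifier


stem_ids: dict[str, int] = {}
counts = index_document("Indexing the indexes and searching is indexing", stem_ids)
```

Here "indexing" and "indexes" collapse to the same stem and therefore share one identifier, which is the point of normalization before counting.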
Nowadays, medium-size databases (200 MB) combine online and indexed searching
3 main indexing techniques:
- Inverted files: best choice for most applications
- Suffix trees and arrays: faster for phrase searching but harder to build and maintain
- Signature files: popular in the 1980s but outperformed by inverted files
We will concentrate on inverted files only
INVERTED FILE
- Inverted file = inverted index = word-oriented mechanism for indexing a text collection in order to speed up the searching task
- Composed of 2 elements: vocabulary and occurrences
- Vocabulary = set of all different words in the text
- For each word, a list of all the text positions where the word appears is stored
- Occurrences = the set of all those lists
Example
A sample text and an inverted index built on it:
- the words are converted to lower case and some are not indexed
- the occurrences point to character positions in the text
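The figure for this example is not reproduced here, so the sketch below uses an illustrative sample sentence and an assumed list of non-indexed words. It lowercases each word, skips the non-indexed ones, and records 1-based character positions as the occurrences.

```python
import re
from collections import defaultdict

# Assumed set of words that are not indexed (illustrative only).
STOPWORDS = {"this", "is", "a", "has", "are", "from"}


def build_inverted_index(text: str) -> dict[str, list[int]]:
    """Map each indexed word to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()               # convert to lower case
        if word not in STOPWORDS:                  # some words are not indexed
            index[word].append(match.start() + 1)  # 1-based character position
    return dict(index)


text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)
```

The keys of `index` form the vocabulary and the lists form the occurrences; for instance, "text" maps to positions 11 and 19 in this sentence.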
INVERTED FILE
- Positions can refer to words or characters
- Word positions (e.g. position i refers to the i-th word) simplify phrase and proximity queries
- Character positions (e.g. position i refers to the i-th character) facilitate direct access to matching text positions
- Space required for the vocabulary is small, e.g. the vocabulary for 1 GB of the TREC-2 collection occupies only 5 MB; this can be further reduced by stemming and other techniques
INVERTED FILE
- Occurrences require more space: each word in the text is referenced once in the structure
- Building an inverted index from the sample text: refer to the attached Word document
Searching on an inverted file
Done via 3 basic steps:
- Vocabulary search: the words and patterns present in the query are isolated and searched in the vocabulary
- Retrieval of occurrences: the lists of the occurrences of all the words found are retrieved
- Manipulation of occurrences: the occurrences are processed to solve phrase, proximity or Boolean operations
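The three steps can be sketched for a phrase query as follows. The sketch assumes an index that stores 1-based word positions; the names `build_word_index` and `phrase_search` and the sample word list are hypothetical.

```python
from collections import defaultdict


def build_word_index(words: list[str]) -> dict[str, list[int]]:
    """Map each word to the 1-based word positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    return dict(index)


def phrase_search(index: dict[str, list[int]], phrase: list[str]) -> list[int]:
    """Return the word positions where the whole phrase starts."""
    # Step 1, vocabulary search: every query word must be in the vocabulary.
    if not phrase or any(word not in index for word in phrase):
        return []
    # Step 2, retrieval of occurrences: fetch the position list of each word.
    lists = [index[word] for word in phrase]
    # Step 3, manipulation of occurrences: the k-th following word must sit
    # exactly k positions after the first word.
    later = [set(positions) for positions in lists[1:]]
    return [p for p in lists[0]
            if all(p + k + 1 in s for k, s in enumerate(later))]


words = "many words are made from letters and many words are short".split()
index = build_word_index(words)
starts = phrase_search(index, ["words", "are", "made"])
```

With word positions, adjacency is a simple arithmetic check. With character positions, the same step would also have to account for word lengths and separators.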
TRIES
- Tries, or digital search trees, are multiway trees that store a set of strings
- Every edge of the tree is labelled with a letter
- To search for a string in a trie, one starts at the root and scans the string character by character, descending by the appropriate edge of the trie
- This continues until a leaf is found, or until no edge matches the next character, in which case the string is not in the trie
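A minimal trie sketch, under one assumption that differs slightly from the description above: instead of stopping only at leaves, each node carries an end-of-string flag, so that a stored string may also be a prefix of another stored string. The class and method names are illustrative.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}  # one outgoing edge per letter
        self.is_end = False                        # True if a stored string ends here


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for letter in word:
            if letter not in node.children:
                node.children[letter] = TrieNode()
            node = node.children[letter]
        node.is_end = True

    def contains(self, word: str) -> bool:
        # Start at the root and descend by the edge labelled with each letter.
        node = self.root
        for letter in word:
            node = node.children.get(letter)
            if node is None:       # no matching edge: the string is not stored
                return False
        return node.is_end


trie = Trie()
for w in ("text", "many", "made"):
    trie.insert(w)
```

A search visits at most one node per character of the query, regardless of how many strings the trie stores. This is why tries are a natural structure for the vocabulary search step.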