Special Topics in Computer Science
The Art of Information Retrieval
Chapter 8: Indexing and Searching
Alexander Gelbukh
2 Previous Chapter: Conclusions
Text transformation: meaning instead of strings
  o Lexical analysis
  o Stopwords
  o Stemming
POS, WSD, syntax, semantics
Ontologies to collate similar stems
Text compression
  o Searchable (compress the query, then search)
  o Random access
  o Word-based statistical methods (Huffman)
Index compression
3 Previous Chapter: Research topics
All computational linguistics
  o Improved POS tagging
  o Improved WSD
Uses of thesaurus
  o for user navigation
  o for collating similar terms
Better compression methods
  o Searchable compression
  o Random access
5 Types of searching
Sequential
  o Small texts
  o Volatile, or space limited
Indexed
  o Semi-static
  o Space overhead
First, we discuss indexed searching, then sequential.
6 Inverted files
Vocabulary: grows as about sqrt(n) (Heaps' law); 1 GB of text gives about 5 MB of vocabulary
Occurrences: about 40% of n (stopwords excluded)
  o positions (word, char), files, sections...
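The two parts of an inverted file, vocabulary and occurrences, can be illustrated with a minimal in-memory sketch. The stopword list, the tokenizer regex, and the function name `build_inverted_index` are assumptions made for this example only; a real index would live on disk.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is"}

def build_inverted_index(text):
    """Map each non-stopword (the vocabulary) to its word positions (the occurrences)."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        word = match.group()
        if word not in STOPWORDS:
            # Positions still count stopwords, so phrase distances stay meaningful.
            index[word].append(position)
    return dict(index)

sample = "This is a text. A text has many words. Words are made from letters."
sample_index = build_inverted_index(sample)
print(sorted(sample_index))    # the vocabulary
print(sample_index["text"])    # word positions of "text"
```

Here positions are word numbers; storing character offsets, file numbers, or section numbers instead only changes what is appended to each list.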
7 Compression: Block addressing
Block addressing: 5% overhead
  o 256, 64K, ... blocks (pointers of 1, 2, ... bytes)
  o Equal size (faster search) or logical sections (retrieval units)
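A rough sketch of block addressing follows. As a simplifying assumption, blocks here are runs of `block_size` words in an already tokenized text rather than byte ranges or logical sections, and both function names are hypothetical. The index stores only block numbers, so exact positions have to be recovered by scanning the candidate blocks sequentially.

```python
def build_block_index(words, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    index = {}
    for i, word in enumerate(words):
        block = i // block_size
        blocks = index.setdefault(word, [])
        # Store each block number once, however often the word occurs in it.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return index

def find_positions(words, index, word, block_size):
    """Recover exact word positions by scanning only the candidate blocks."""
    positions = []
    for block in index.get(word, []):
        start = block * block_size
        for i in range(start, min(start + block_size, len(words))):
            if words[i] == word:
                positions.append(i)
    return positions

words = "to be or not to be that is the question".split()
block_index = build_block_index(words, 4)
print(block_index["to"], find_positions(words, block_index, "to", 4))
```

The trade-off is visible in the code: the index shrinks because pointers are short and repeated occurrences inside one block collapse, but every query pays for a sequential scan of the matching blocks.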
8 Searching in inverted files
Vocabulary search
  o Separate file
  o Many searching techniques
  o Lexicographic: log V (V = vocabulary size) = ½ log n (Heaps)
  o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation with occurrences: ~sqrt(n) (Heaps, Zipf)
  o Boolean operations. Context search
  o Merging
  o One list is shorter (Zipf's law)
Only inverted files allow both space and time to be sublinear
  o Suffix trees and signature files don't
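The point about lexicographic search versus hashing can be shown with a small sketch. It assumes the vocabulary is kept as a sorted Python list; the function name `prefix_search` is made up for this example. Binary search finds the first candidate in O(log V) steps, and all words sharing a prefix are contiguous in sorted order, which a hash table cannot offer.

```python
from bisect import bisect_left

def prefix_search(vocabulary, prefix):
    """Return all words of a sorted vocabulary that start with `prefix`."""
    start = bisect_left(vocabulary, prefix)   # O(log V) binary search
    end = start
    # Words with a common prefix form one contiguous run in sorted order.
    while end < len(vocabulary) and vocabulary[end].startswith(prefix):
        end += 1
    return vocabulary[start:end]

vocabulary = sorted(["letters", "made", "many", "text", "word", "words"])
print(prefix_search(vocabulary, "ma"))
print(prefix_search(vocabulary, "wor"))
```

An exact word lookup is the special case where the returned run is checked for equality with the query.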
9 Building inverted file: 1
Infinite memory?
Use a trie to store the vocabulary
  o append positions
O(n)
10 Building inverted file: 2
Finite memory?
Fill the memory
Write partial index; n/M pieces
Merge partial indices (hierarchically): n log(n/M)
Insertion: index the new text (of size n'), then merge: n + n' log(n'/M)
Deleting: eliminate every occurrence: n
Very fast creation and maintenance
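The finite-memory construction can be sketched as follows. This is only an in-memory illustration under stated assumptions: `memory_limit` stands in for M and counts words, the partial indices are plain dictionaries instead of files on disk, and all function names are hypothetical.

```python
def build_partial_indices(words, memory_limit):
    """Index the text in chunks of at most `memory_limit` words (the n/M pieces)."""
    partials = []
    for start in range(0, len(words), memory_limit):
        partial = {}
        for offset, word in enumerate(words[start:start + memory_limit]):
            partial.setdefault(word, []).append(start + offset)
        partials.append(partial)
    return partials

def merge_two(left, right):
    """Merge two partial indices; `left` must cover the earlier part of the text."""
    merged = {word: list(positions) for word, positions in left.items()}
    for word, positions in right.items():
        # Every position in `right` is larger, so concatenation keeps lists sorted.
        merged.setdefault(word, []).extend(positions)
    return merged

def merge_hierarchically(partials):
    """Merge neighbouring partial indices pairwise until a single index remains."""
    while len(partials) > 1:
        partials = [
            merge_two(partials[i], partials[i + 1]) if i + 1 < len(partials) else partials[i]
            for i in range(0, len(partials), 2)
        ]
    return partials[0] if partials else {}

words = "a b a c b a d a".split()
print(merge_hierarchically(build_partial_indices(words, 3)))
```

Each merging level touches all occurrences once and there are about log(n/M) levels, which is where the n log(n/M) cost comes from. Adding new text follows the same pattern: index the new part, then merge it with the existing index.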
11 Suffix trees
Text as one long string. No words.
  o Genetic databases
  o Complex queries
  o Compacted trie structure
  o Problem: space
For text retrieval, inverted files are better
14 Suffix array
All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% of n
Supra-index: sampling, for better disk access
15 Searching. Construction
Searching
  o Patterns, prefixes, phrases. Not only words
  o Suffix tree: O(m), but: space (m = query size)
  o Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
  o Large text: n^2 log(M)/M, more than for inverted files
  o Skip details
Addition: n n' log(M)/M
Deletion: n
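A minimal sketch of a suffix array, built by sorting and searched by binary search, is given below. It is a naive in-memory version for short strings, not the large-text construction whose cost is quoted above; the function names are assumptions of this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Find all start positions of `pattern` with two binary searches."""
    m = len(pattern)

    def prefix_at(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "abra"))
```

The search performs O(log n) string comparisons of up to m characters each and works for any substring, not only whole words.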
16 Signature files
Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of the word's pattern are set in the block signature
Sequential search for blocks
False drops!
  o Design of the hash function
  o Have to traverse the block
Good to search ANDs or proximity queries
  o bit patterns are ORed
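The signature idea can be sketched as below. The signature width, the number of bits per word, and the use of SHA-256 as the hash are arbitrary assumptions for illustration, and the function names are hypothetical. Note the verification step: a block whose signature contains all bits of the query may still not contain the word (a false drop), so the block itself has to be traversed.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def block_signature(block):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in block:
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, word):
    """Scan the signatures sequentially; verify candidates to discard false drops."""
    pattern = word_signature(word)
    hits = []
    for number, (block, signature) in enumerate(zip(blocks, signatures)):
        if (signature & pattern) == pattern and word in block:
            hits.append(number)
    return hits

blocks = [["this", "is", "a", "text"],
          ["a", "text", "has", "many", "words"],
          ["words", "are", "made", "from", "letters"]]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "text"))
```

For an AND query the patterns of the query words can be ORed into one pattern before the scan, since a block containing all the words must have all their bits set.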
18 Boolean operations
Merging file (occurrences) lists
  o AND: find the entries present in both lists
According to query syntax tree
Complexity linear in intermediate results
  o Can be slow if they are huge
There are optimization techniques
  o E.g.: merge small list with a big one by searching
  o This is a usual case (Zipf)
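Both ways of evaluating an AND can be sketched briefly. The example assumes each list is sorted and free of duplicates (for instance, document numbers); the function names are made up for illustration. The first function is the plain linear merge, the second looks up each element of the short list in the long one by binary search.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """AND by linear merging: cost proportional to len(a) + len(b)."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_by_search(short, long):
    """AND by binary-searching each element of the short list in the long one."""
    result = []
    lo = 0
    for item in short:
        lo = bisect_left(long, item, lo)   # never look left of the previous hit
        if lo == len(long):
            break
        if long[lo] == item:
            result.append(item)
    return result

rare = [2, 5, 9]
frequent = [1, 2, 3, 5, 8, 13, 21]
print(intersect_merge(rare, frequent), intersect_by_search(rare, frequent))
```

The second variant costs about len(short) * log(len(long)) comparisons, which pays off when one list is much shorter than the other.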
19 Sequential search
Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst case, O(n) on average
Knuth-Morris-Pratt: linear in the worst case, but the same on average
Boyer-Moore: n log(m)/m. Not all characters are examined!
  o If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once
Shift-Or: uses logical operations on all 32 bits in parallel
BDM (Backward DAWG Matching): automaton. Complexity same as Boyer-Moore
Combination of BDM with bit parallelism
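As an illustration of bit parallelism, here is a small sketch of Shift-Or. Python integers are unbounded, so the state is masked to m bits explicitly; in C the state would be one machine word, which is why the pattern length is normally limited to the word size. The function name is an assumption of this example.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of `pattern` in `text`."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones               # bit i == 0 means pattern[:i + 1] ends here
    match_bit = 1 << (m - 1)
    matches = []
    for pos, ch in enumerate(text):
        # One shift and one OR update all m prefix states at once.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if (state & match_bit) == 0:
            matches.append(pos - m + 1)
    return matches

print(shift_or_search("abracadabra", "abra"))
```

Each text character costs a constant number of word operations regardless of the pattern, as long as the pattern fits in one word.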
20 Approximate string matching
Match with k errors
Levenshtein distance
Dynamic programming: O(mn), O(kn)
Automaton: non-deterministic
  o Convert to deterministic: O(n), but huge structure
  o Bit-parallel: O(n), the fastest known
Filtering: sublinear!
  o k errors cannot alter all of k+1 segments, so at least one segment appears unchanged
  o multipattern exact search; detect suspicious places
  o uses approximate algorithm only when needed
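The O(mn) dynamic programming approach can be sketched as follows. This is the classical column-by-column formulation under the Levenshtein distance, written as an illustration rather than an optimized implementation; the function name is hypothetical. It reports the text positions where a match with at most k errors ends.

```python
def approximate_search(text, pattern, k):
    """Return the end positions in `text` of matches of `pattern` with at most k errors."""
    m = len(pattern)
    # column[i] = fewest errors matching pattern[:i] against a text substring
    # that ends at the current position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        diagonal = column[0]   # column[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            saved = column[i]
            if pattern[i - 1] == ch:
                column[i] = diagonal
            else:
                # substitution, insertion into the text, or deletion from it
                column[i] = 1 + min(diagonal, column[i], column[i - 1])
            diagonal = saved
        if column[m] <= k:
            ends.append(j)
    return ends

print(approximate_search("surgery", "survey", 2))
```

Only one column of m + 1 values is kept, so the extra space is O(m) while the time remains O(mn).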
21 Regular expressions
  o Automaton: O(m 2^m) + O(n), bad for long patterns
  o Bit-parallel (simulates the non-deterministic automaton)
Using indices to search for words with errors
  o Inverted files: search in vocabulary, then each word
  o Suffix trees and suffix arrays: the same algorithms!
22 Structural queries
Ad-hoc index for structure
Indexing tags as words
  o Inverted files are good since they store occurrences in order
23 Search over compression
Improves both space AND time (fewer disk operations)
Compress query and search
  o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
  o Look up each query word in the vocabulary to get its code
  o More sophisticated algorithms
Compressed inverted files: less disk, less time
Text and index compression can be combined
24 ...compression
Suffix trees can be compressed almost to the size of suffix arrays
Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
  o instead of Huffman, use a code that respects alphabetic order
  o almost the same compression
Signature files are sparse, so can be compressed
  o ratios up to 70%
26 Research topics
Perhaps, new details in integration of compression and search
Linguistic indexing: allowing linguistic variations
  o Search in plural or only singular
  o Search with or without synonyms
27 Conclusions
Inverted files seem to be the best option
Other structures are good for specific cases
  o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
  o Many methods to improve sequential searching
Compression can be integrated with search
28 Thank you! Till compensation lecture?