Information Storage & Retrieval Department of Information Management School of Information Engineering Nanjing University of Finance & Economics
II 课程内容
2 Basic information retrieval
2.3 Tolerant retrieval Search structures for dictionaries Wildcard queries Spelling correction Phonetic correction
2.3.1 Search structures for dictionaries The dictionary and has two broad classes of solutions: Hashing Search trees
The keys of dictionary The entries in the dictionary are often referred to as keys The pair of key and values often exists in data structure
Hashing Often used for dictionary lookup in some search engines Each vocabulary term ( key ) is hashed into an integer over a large enough space that hash collisions are unlikely java.security.MessageDigest
Advantage of hashing dictionary At query time, we hash each query term separately and following a pointer to the corresponding postings There is no easy way to find minor variants of a query term, which is guaranteed by Hash arithmetic The speed is so fast and easy
Disadvantage of hashing dictionary We cannot seek all terms beginning with specified prefix In Web setting where the size of the dictionary keeps growing, a hash function designed for current needs may not suffice in a few years’ time
Search trees Search trees overcome many issues of hash dictionary It should be noted that unlike hashing, search trees demand that the characters used in the document collection have a prescribed ordering
Binary tree The best-known search tree is the binary tree, each internal node represents a binary test
Binary tree
Advantage of binary tree dictionary Permit us to enumerate all vocabulary terms beginning with given prefix Efficient search (with a number of comparisons that is O(logM)) hinges on the tree being balanced
Disadvantage of binary tree dictionary The principal issue here is that of rebalancing As terms are inserted into or deleted from the binary search tree, it needs to be rebalanced
B-tree To mitigate rebalancing, one approach is to allow the number of sub-trees under an internal node to vary in a fixed interval B-tree is a search tree in which every internal node has a number of children in the interval [a, b] (a and b are positive integers)
B-tree with a = 2 and b = 4 A B-tree may be viewed as collapsing multiple levels of the binary tree into one
Advantage of B-tree dictionary Especially advantageous when some of the dictionary is disk-resident, in which case this collapsing serves the function of pre-fetching imminent binary tests In such cases, the integers a and b are determined by the sizes of disk blocks
2.3.2 Wildcard queries Trailing wildcard query Leading wildcard query General wildcard query
Used in following situations1-2 The user is uncertain of the spelling of a query term e.g., Sydney vs. Sidney, which leads to the wildcard query S*dney The user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants e.g., color vs. colour
Used in following situations2-2 The user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming e.g., judicial vs. judiciary, leading to the wildcard query judicia* The user is uncertain of the correct rendition of a foreign word or phrase e.g., the query Universit* Stuttgart
Trailing wildcard query A query such as mon* is known as it, because * is at the end of the search string A search tree on the dictionary is a convenient way of handling it Walk down the tree following the symbols m, o and n in turn Use |W| lookups on the standard inverted index to retrieve all documents containing any term in W |W| means the number of terms in W
Leading wildcard query For example queries of the form *mon A reverse B-tree on the dictionary can be used For example: The term lemon would be represented by the path root-n-o-m-e-l
General wildcard query For example queries of the form se*mon How to handle it Using regular B-tree with reverse B-tree Permuterm index K-gram index
Using regular B-tree with reverse B-tree For example : query se*mon Use the regular B-tree to enumerate the set W of dictionary terms beginning with the prefix se Use the reverse B-tree to enumerate the set R of terms ending with the suffix mon Take the intersection W ∩ Use the standard inverted index to retrieve all documents containing any terms in this intersection
Permuterm index1-3 First, we introduce a special symbol $ to mark the end of a term The term hello is shown as the augmented term hello$ Then construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term The set of rotated terms in the permuterm index is called the permuterm vocabulary
Permuterm index2-3 Its dictionary may be quite large, almost ten-fold space increasing
Permuterm index3-3 How to handle general wildcard query The key is to rotate such a wildcard query so that the * symbol appears at the end of the string So the query m*n can be rotated as n$m* But what about a query such as fi*mo*er First enumerate the terms which are in the permuterm index of er$fi* Then filter these out, checking each candidate to see if it contains mo, which is exhaustive enumeration
K-gram index A k-gram is a sequence of k characters If consider $, the full set of 3- grams generated for castle is: $ca, cas, ast, stl, tle, le$
K-gram index How does such an index help us with wildcard queries For example: query re*ve Run the Boolean query $re AND ve$ Look up in the 3-gram index and yield a list of matching terms Each of these matching terms is then looked up in the standard inverted index to yield documents matching the query
K-gram index There is however a difficulty with the use of k-gram index Consider using the 3-gram index for the query red* First issue the Boolean query $re AND red to the 3- gram index Lead to a match on terms such as retired, which contain the conjunction of the two 3-grams $re and red, yet do not match the original wildcard query red* To cope with this, we need to introduce a post-filtering step, in which the terms enumerated are checked individually against the original query red*
K-gram index What is the appropriate semantics for such a query: re*d AND fe*ri All documents that contain any term matching re*d and any term matching fe*ri So the processing of a wildcard query can be quite expensive
K-gram index A search engine may support such rich functionality, but most commonly, the capabilityis hidden behind an interface (say an “Advanced Query” interface) that most users never use Exposing such functionality in the search interface often encourages users to invoke it even when they do not require it (say, by typing a prefix of their query followed by a *), increasing the processing load on the search engine
K-gram index 注意: Google 中的通配符检索( * )只能匹配词语,而 不能匹配词语的组成部分
2.3.3 Spelling correction We may wish to retrieve documents containing the term carrot when the user types the query carot
misspelling reported on Google y.html Based on utility of this misspelling list, we can recommend the correct spelling if users input wrong query words
britney spears 1338 britiny spears brittany spears 1211 britnet spears brittney spears 1096 britiney spears britany spears 991 britaney spears 7331 britny spears 991 britnay spears 6633 briteny spears 811 brithney spears 2696 britteny spears 811 brtiney spears 1807 briney spears 664 birtney spears
Google V.S. Baidu Google prefers retrieval more relevant Web pages Baidu prefers retrieval precise relevant Web pages Google can correct English misspelling better than Baidu Baidu can correct Chinese misspelling better than Google
Basic principles of spelling correction1-2 Choose the “nearest” one This demands that we have a notion of nearness or proximity between a pair of queries For grnt, we choose not great but grant
Basic principles of spelling correction2-2 Choose the one that is more common Especially when two correctly spelled queries are tied For grnt, we choose not grunt but grant In order to judge which one is more common, we can consider the number of occurrences of the term in the collection consider the number of occurrences of the term in the queries typed in by other users, especially for Web search engine
How to expose the correct words to users1-2 On the query carot always retrieve documents containing carot as well as any “spell-corrected” version of carot, including carrot and tarot As in first above, but only when the query term carot is not in the dictionary
As in first above, but only when the original query returned fewer than a threshold (say fewer than five) When the original query returns fewer than a threshold, the search interface presents a spelling suggestion to the end user. this suggestion consists of the spell-corrected query term(s) Thus, the search engine might respond to the user: “Did you mean carrot?” How to expose the correct words to users2-2
Solution to misspelling Isolated-term correction Methods based on edit distance Methods based on k-gram overlap context-sensitive correction
Isolated-term correction In isolated-term correction, we attempt to correct a single query term at a time But such isolated-term correction would fail to detect, for instance, that the query flew form Heathrow contains a misspelling of the term from
Edit distance1-3 Sometimes known as Levenshtein distance Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2
Edit distance2-3 Most commonly, the edit operations allowed for this purpose are Insert a character into a string Delete a character from a string Replace a character of a string by another character For example, the edit distance between cat and dog is 3 Ref:
Edit distance3-3 In fact, the notion of edit distance can be generalized to allowing different weights for different kinds of edit operations For instance a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a (the latter being closer to s on the keyboard) But popular edit operations have the same weight
The compute of edit distance1-5 The typical cell [i, j] has four entries formatted as a 2 × 2 cell The lower right entry in each cell is the min of the other three The other three entries are the three entries m[i−1,j−1] + 0 or 1 depending on whether s1[i]=s2[j],m[i−1,j]+1 and m[i,j−1]+1
The compute of edit distance2-5
The compute of edit distance3-5 Given a set S of terms and a query string q, we seek the string having the least edit distance from q in order to get the most similar term However, this exhaustive search is inordinately expensive
The compute of edit distance4-5 The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string The hope would be that spelling errors do not occur in the first character of the query A more sophisticated variant of this heuristic is to use a version of the permuterm index, in which we omit the end-of-word symbol $
The compute of edit distance5-5 For instance, if q is mase and we consider the rotation r = sema, we would retrieve terms such as semantic and semaphore Do not have a small edit distance to q Would miss more pertinent dictionary terms such as mare and mane For each rotation, we omit a suffix of L characters before performing the B- tree traversal ( L can be set to 2) This ensures that each term includes a long substring in common with q
Further limit the set of vocabulary terms for which we compute edit distances to the query term In fact, we will use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query K-gram indexes for spelling correction1-4
K-gram indexes for spelling correction2-4 For example: The bigram index shows the postings for the three bigrams in the query bord We wanted to retrieve terms that contained at least two of these three bigrams
However, terms like boardroom are not an implausible “correction” of bord We need Jaccard coefficient for measuring the overlap between the candidate term and query term K-gram indexes for spelling correction3-4
K-gram indexes for spelling correction4-4 For example: For query q = bord reaches term t = boardroom The Jaccard coefficient to be 2/(8+3−2)
Context-sensitive correction Isolated-term correction would fail to correct typographical errors such as flew form Heathrow where all three query terms are correctly spelled and the all is wrong
Context-sensitive correction The simplest way is to enumerate corrections of each of the query terms, then try substitutions of each correction in the phrase fled form Heathrow flew fore Heathrow … The search returns the result having the maximal number of matching results But can be expensive if find many corrections
Context-sensitive correction A heuristics method retains only the most frequent combinations in the collection or in the query logs In this example, the biword fled fore is likely to be rare compared to the biword flew from Only attempt to extend the list of top biwords
2.3.4 Phonetic correction Misspellings arise because the user types a query that sounds like the target term
Phonetic hash Similar-sounding terms hash to the same value. Especially applicable to searches on the names of people The idea owes its origins to work in international police departments from the early 20th century Now is mainly used to correct phonetic misspellings in proper nouns
Soundex algorithm Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index Do the same with query terms When the query calls for a soundex match, search this soundex index
The approach of soundex algorithm1-4 4-character code With the first character being a letter of the alphabet The other three being digits between 0 and 9 For example: Hermann maps to H655 The assumption of this algorithm Vowels are viewed as interchangeable in transcribing names Consonants with similar sounds
The approach of soundex algorithm2-4 Retain the first letter of the term Change all occurrences of the following letters to ’0’: ’A’, E’, ’I’, ’O’, ’U’, ’H’, ’W’, ’Y’ Change letters to digits as follows: B, F, P, V to 1 C, G, J, K, Q, S, X, Z to 2 D,T to 3 L to 4 M, N to 5 R to 6
The approach of soundex algorithm3-4 Repeatedly remove one out of each pair of consecutive identical digits Remove all zeros from the resulting string Pad the resulting string with trailing zeros or return the first four positions
The approach of soundex algorithm4-4 Such rules tend to be writing system dependent Chinese names can be written in Wade-Giles or Pinyin transcription while soundex works for some of the differences in the two transcriptions 苏州: Suzhou/Soochow ( Soochou ) 北京: Beijing/Peking 青岛: Qingdao/Tsingtao 清华: Qinghua/Tsinghwa 中华: Zhonghua/Chonghwa 长江: Changjiang River/Yangtze River