The Principle of Information Retrieval

Department of Information Management School of Information Engineering Nanjing University of Finance & Economics 2011

II Course content

2 Basic information retrieval

2.3 Tolerant retrieval
Search structures for dictionaries
Wildcard queries
Spelling correction
Phonetic correction

2.3.1 Wildcard queries
Trailing wildcard query
Leading wildcard query
General wildcard query

Used in the following situations (1/2)
The user is uncertain of the spelling of a query term
  e.g., Sydney vs. Sidney, which leads to the wildcard query S*dney
The user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants
  e.g., color vs. colour
Note: the symbol * is called an asterisk

Used in the following situations (2/2)
The user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming
  e.g., judicial vs. judiciary, leading to the wildcard query judicia*
The user is uncertain of the correct rendition of a foreign word or phrase
  e.g., the query Universit* Stuttgart
Note: rendition here means translation

Trailing wildcard query
A query such as mon* is known as a trailing wildcard query, because the * is at the end of the search string
A search tree on the dictionary is a convenient way of handling it
Walk down the tree following the symbols m, o and n in turn; at that point we can enumerate the set W of dictionary terms with the prefix mon
Use |W| lookups on the standard inverted index to retrieve all documents containing any term in W
|W| means the number of terms in W
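The prefix walk can be sketched roughly as follows. This is a minimal illustration, not the lecture's implementation: Java's TreeSet (a red-black tree) stands in for the B-tree, and the class name, method name and sample dictionary are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class TrailingWildcardSketch {
    // Enumerate the set W of dictionary terms that begin with the given prefix.
    // Assumes no term contains the character '\uffff'.
    static SortedSet<String> termsWithPrefix(TreeSet<String> dictionary, String prefix) {
        // Every term with this prefix sorts inside [prefix, prefix + '\uffff').
        return dictionary.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Hypothetical dictionary, for illustration only.
        TreeSet<String> dictionary = new TreeSet<>(
                Arrays.asList("lemon", "money", "monk", "month", "moon", "salmon"));
        System.out.println(termsWithPrefix(dictionary, "mon")); // [money, monk, month]
    }
}
```

Each term in the returned set would then be looked up in the standard inverted index, giving the |W| lookups mentioned above.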

Leading wildcard query
For example, queries of the form *mon
A reverse B-tree on the dictionary can be used, in which each root-to-leaf path corresponds to a term written backwards
For example: the term lemon would be represented by the path root-n-o-m-e-l

General wildcard query
For example, queries of the form se*mon
How to handle it:
  Using a regular B-tree together with a reverse B-tree
  Permuterm index
  K-gram index

Using a regular B-tree with a reverse B-tree
For example: the query se*mon
Use the regular B-tree to enumerate the set W of dictionary terms beginning with the prefix se
Use the reverse B-tree to enumerate the set R of terms ending with the suffix mon
Take the intersection W ∩ R
Use the standard inverted index to retrieve all documents containing any term in this intersection
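A small sketch of the two-tree approach, under stated assumptions: two TreeSet instances stand in for the regular and reverse B-trees, and all names and sample terms are illustrative. The length check at the end is an added guard, not part of the slide, that rejects terms in which the prefix and suffix would overlap.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TwoTreeWildcardSketch {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Terms matching prefix*suffix, computed as W intersect R.
    static Set<String> match(List<String> terms, String prefix, String suffix) {
        TreeSet<String> forward = new TreeSet<>(terms);   // stands in for the regular B-tree
        TreeSet<String> backward = new TreeSet<>();       // stands in for the reverse B-tree
        for (String t : terms) {
            backward.add(reverse(t));
        }
        // W: terms beginning with the prefix.
        Set<String> w = new TreeSet<>(forward.subSet(prefix, prefix + Character.MAX_VALUE));
        // R: terms ending with the suffix, found as a prefix range over reversed terms.
        String rev = reverse(suffix);
        Set<String> r = new TreeSet<>();
        for (String t : backward.subSet(rev, rev + Character.MAX_VALUE)) {
            r.add(reverse(t));
        }
        w.retainAll(r);
        // Added guard (assumption): the prefix and suffix must not overlap in the term.
        w.removeIf(t -> t.length() < prefix.length() + suffix.length());
        return w;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("sermon", "salmon", "seaman", "lemon", "season");
        System.out.println(match(terms, "se", "mon")); // [sermon]
    }
}
```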

Permuterm index (1/3)
First, we introduce a special symbol $ to mark the end of a term
The term hello is shown as the augmented term hello$
Then construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term
The set of rotated terms in the permuterm index is called the permuterm vocabulary
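To make the rotations concrete, the following sketch lists every rotation of an augmented term. The class and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class PermutermRotationsSketch {
    // Return all rotations of term + "$"; each rotation would point back to the term.
    static List<String> rotations(String term) {
        String augmented = term + "$";
        List<String> result = new ArrayList<>();
        for (int i = 0; i < augmented.length(); i++) {
            result.add(augmented.substring(i) + augmented.substring(0, i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [hello$, ello$h, llo$he, lo$hel, o$hell, $hello]
        System.out.println(rotations("hello"));
    }
}
```

A term of n characters contributes n + 1 entries to the permuterm vocabulary, which is why the permuterm dictionary grows so much larger than the original one.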

Permuterm index (2/3)
The permuterm dictionary may be quite large, since it holds every rotation of every term: almost a ten-fold increase in space

Permuterm index (3/3)
How to handle a general wildcard query
The key is to rotate the wildcard query so that the * symbol appears at the end of the string
So the query m*n can be rotated to n$m*
But what about a query such as fi*mo*er?
First enumerate the terms in the permuterm index that match er$fi*
Then filter these by exhaustive enumeration, checking each candidate to see whether it contains mo
For example, fishmonger (a fish seller) passes the filter
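The rotate-then-filter procedure might look like the sketch below. It assumes a TreeMap from rotation to term as the permuterm index, a query containing at least one *, and a regular-expression check as the filtering step; all names and sample terms are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

class PermutermQuerySketch {
    // Map every rotation of term + "$" back to the term.
    static TreeMap<String, String> buildIndex(List<String> terms) {
        TreeMap<String, String> index = new TreeMap<>();
        for (String term : terms) {
            String augmented = term + "$";
            for (int i = 0; i < augmented.length(); i++) {
                index.put(augmented.substring(i) + augmented.substring(0, i), term);
            }
        }
        return index;
    }

    // Translate a wildcard query into a regular expression (* matches any string).
    static Pattern toPattern(String query) {
        StringBuilder regex = new StringBuilder();
        for (char c : query.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    // Assumes the query contains at least one '*'.
    static Set<String> lookup(TreeMap<String, String> index, String query) {
        int first = query.indexOf('*');
        int last = query.lastIndexOf('*');
        // Rotate: text after the last *, then $, then text before the first *.
        String key = query.substring(last + 1) + "$" + query.substring(0, first);
        Set<String> candidates =
                new TreeSet<>(index.subMap(key, key + Character.MAX_VALUE).values());
        // Filtering step: keep only candidates that match the whole query.
        Pattern full = toPattern(query);
        candidates.removeIf(t -> !full.matcher(t).matches());
        return candidates;
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = buildIndex(
                Arrays.asList("fishmonger", "filibuster", "man", "moon", "mane"));
        System.out.println(lookup(index, "m*n"));      // [man, moon]
        System.out.println(lookup(index, "fi*mo*er")); // [fishmonger]
    }
}
```

In this sample, filibuster is retrieved by the er$fi* lookup but dropped by the filter because it does not contain mo.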

K-gram index
A k-gram is a sequence of k characters
Using $ to mark the beginning and end of a term, the full set of 3-grams generated for castle is: $ca, cas, ast, stl, tle, le$
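A short sketch of k-gram generation, with $ added at both ends of the term. The class and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class KGramSketch {
    // Return the k-grams of "$" + term + "$", in order of position.
    static List<String> kGrams(String term, int k) {
        String padded = "$" + term + "$";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + k <= padded.length(); i++) {
            grams.add(padded.substring(i, i + k));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(kGrams("castle", 3)); // [$ca, cas, ast, stl, tle, le$]
    }
}
```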

How to use the k-gram index
For example: the query re*ve
Run the Boolean query $re AND ve$
This is looked up in the 3-gram index and yields a list of matching terms
Each of these matching terms is then looked up in the standard inverted index to yield documents matching the query

Difficulties of the k-gram index (1/2)
There is, however, a difficulty with the use of the k-gram index
Consider using the 3-gram index for the query red*
First issue the Boolean query $re AND red to the 3-gram index
This leads to a match on terms such as retired, which contain the conjunction of the two 3-grams $re and red, yet do not match the original wildcard query red*
To cope with this, we need to introduce a post-filtering step, in which the terms enumerated are checked individually against the original query red*
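The lookup and the post-filtering step can be sketched together. This is an illustration under assumptions: the 3-gram index is an in-memory map from k-gram to terms, the post-filter is a regular-expression check, and the names and sample terms are invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

class KGramWildcardSketch {
    static final int K = 3;

    // Map each 3-gram of "$" + term + "$" to the set of terms containing it.
    static Map<String, Set<String>> buildIndex(List<String> terms) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String term : terms) {
            String padded = "$" + term + "$";
            for (int i = 0; i + K <= padded.length(); i++) {
                index.computeIfAbsent(padded.substring(i, i + K), g -> new TreeSet<>()).add(term);
            }
        }
        return index;
    }

    // AND together the postings of every 3-gram found in the query pieces.
    // Returns an empty set if the query yields no 3-gram at all.
    static Set<String> candidates(Map<String, Set<String>> index, String query) {
        Set<String> result = null;
        for (String piece : ("$" + query + "$").split("\\*")) {
            for (int i = 0; i + K <= piece.length(); i++) {
                Set<String> postings =
                        index.getOrDefault(piece.substring(i, i + K), Collections.emptySet());
                if (result == null) {
                    result = new TreeSet<>(postings);
                } else {
                    result.retainAll(postings);
                }
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Post-filter: check each candidate against the original wildcard query.
    static Set<String> postFilter(Set<String> candidates, String query) {
        StringBuilder regex = new StringBuilder();
        for (char c : query.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        Pattern pattern = Pattern.compile(regex.toString());
        Set<String> kept = new TreeSet<>();
        for (String t : candidates) {
            if (pattern.matcher(t).matches()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index =
                buildIndex(Arrays.asList("red", "redo", "reduce", "retired", "bred"));
        Set<String> c = candidates(index, "red*");
        System.out.println(c);                     // [red, redo, reduce, retired]
        System.out.println(postFilter(c, "red*")); // [red, redo, reduce]
    }
}
```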

Difficulties of the k-gram index (2/2)
What is the appropriate semantics for a query such as re*d AND fe*ri?
All documents that contain any term matching re*d and any term matching fe*ri
Each wildcard can expand to many dictionary terms, and each term needs its own lookup in the inverted index, so processing a wildcard query can be quite expensive

Why search engines hide this functionality
A search engine may support such rich functionality, but most commonly the capability is hidden behind an interface (say an “Advanced Query” interface) that most users never use
Exposing such functionality in the search interface often encourages users to invoke it even when they do not require it (say, by typing a prefix of their query followed by a *), increasing the processing load on the search engine

Google's wildcard search
The wildcard in a Google search matches only whole words, not parts of words

2.3.2 Spelling correction
We may wish to retrieve documents containing the term carrot when the user types the query carot

Misspellings reported by Google
Using this list of misspellings, we can recommend the correct spelling when users enter misspelled query words

Source: http://www.google.com/jobs/britney.html
Spellings of the query Britney Spears, with the number of times each was typed:
Count    Query
488941   britney spears
40134    brittany spears
36315    brittney spears
24342    britany spears
7331     britny spears
6633     briteny spears
2696     britteny spears
1807     briney spears
1338     britiny spears
1211     britnet spears
1096     britiney spears
991      britaney spears
991      britnay spears
811      brithney spears
811      brtiney spears
664      birtney spears

Google vs. Baidu
Google prefers to retrieve more relevant Web pages
Baidu prefers to retrieve precisely relevant Web pages
Google can correct English misspellings better than Baidu
Baidu can correct Chinese misspellings better than Google

Basic principles of spelling correction (1/2)
Of the alternative correct spellings for a misspelled query, choose the “nearest” one
This demands that we have a notion of nearness or proximity between a pair of queries
For grnt, we choose not great but grant

Basic principles of spelling correction (2/2)
Choose the one that is more common, especially when two correctly spelled queries are tied
For grnt, we choose not grunt but grant
To judge which one is more common, we can
  consider the number of occurrences of the term in the collection
  consider the number of occurrences of the term in the queries typed in by other users, especially for a Web search engine

How to expose the correct words to users (1/2)
On the query carot, always retrieve documents containing carot as well as any “spell-corrected” version of carot, including carrot and tarot
Alternatively, do so only when the query term carot is not in the dictionary
Note: tarot is a deck of cards used for fortune-telling (twenty-two cards in all)

How to expose the correct words to users (2/2)
Retrieve spell-corrected versions only when the original query returned fewer than a threshold number of documents (say, fewer than five)
Or, when the original query returns fewer than a threshold number of documents, the search interface presents a spelling suggestion to the end user
  This suggestion consists of the spell-corrected query term(s)
  Thus, the search engine might respond to the user: “Did you mean carrot?”

Solutions to misspelling
Isolated-term correction
  Methods based on edit distance
  Methods based on k-gram overlap
Context-sensitive correction

Isolated-term correction
In isolated-term correction, we attempt to correct a single query term at a time
But such isolated-term correction would fail to detect, for instance, that the query flew form Heathrow contains a misspelling of the term from

Edit distance (1/3)
Sometimes known as Levenshtein distance
Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2

    public class exec {

        // Return the minimum of three values
        private static int Minimum(int a, int b, int c) {
            int mi = a;
            if (b < mi) {
                mi = b;
            }
            if (c < mi) {
                mi = c;
            }
            return mi;
        }

        // Compute the Levenshtein distance between s and t
        public static int LD(String s, String t) {
            int n = s.length(); // length of s
            int m = t.length(); // length of t

            // Step 1: if one string is empty, the distance is the length of the other
            if (n == 0) {
                return m;
            }
            if (m == 0) {
                return n;
            }
            // d[i][j] holds the distance between the first i characters of s
            // and the first j characters of t
            int[][] d = new int[n + 1][m + 1];

            // Step 2: initialise the first column and the first row
            for (int i = 0; i <= n; i++) {
                d[i][0] = i;
            }
            for (int j = 0; j <= m; j++) {
                d[0][j] = j;
            }

            // Step 3: iterate through s
            for (int i = 1; i <= n; i++) {
                char s_i = s.charAt(i - 1); // ith character of s
                // Step 4: iterate through t
                for (int j = 1; j <= m; j++) {
                    char t_j = t.charAt(j - 1); // jth character of t
                    // Step 5: a replacement costs 0 if the characters already match
                    int cost = (s_i == t_j) ? 0 : 1;
                    // Step 6: cheapest of delete, insert and replace
                    d[i][j] = Minimum(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
                }
            }
            // Step 7: the distance between the full strings
            return d[n][m];
        }

        public static void main(String[] args) {
            System.out.println(LD("cat", "dog")); // prints 3
        }
    }

Edit distance (2/3)
Most commonly, the edit operations allowed for this purpose are
  Insert a character into a string
  Delete a character from a string
  Replace a character of a string by another character
For example, the edit distance between cat and dog is 3
Ref:

Edit distance (3/3)
In fact, the notion of edit distance can be generalized to allow different weights for different kinds of edit operations
For instance, a higher weight may be placed on replacing the character s by the character p than on replacing it by the character a (the latter being closer to s on the keyboard)
In practice, however, the edit operations are usually given the same weight
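A weighted variant could be sketched as below. The cost values are assumptions chosen only to mirror the s/a versus s/p example (0.5 for a pair of keys assumed to be adjacent, 1.0 otherwise); insertions and deletions keep weight 1, and all names are illustrative.

```java
class WeightedEditDistanceSketch {
    // Hypothetical replacement cost: cheaper when the two keys are assumed adjacent.
    static double substitutionCost(char a, char b) {
        if (a == b) {
            return 0.0;
        }
        boolean adjacent = (a == 's' && b == 'a') || (a == 'a' && b == 's');
        return adjacent ? 0.5 : 1.0;
    }

    // Same dynamic program as plain edit distance, with a weighted replace operation.
    static double distance(String s, String t) {
        int n = s.length();
        int m = t.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete i characters
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert j characters
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double replace = d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                double delete = d[i - 1][j] + 1.0;
                double insert = d[i][j - 1] + 1.0;
                d[i][j] = Math.min(replace, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("sat", "aat")); // 0.5
        System.out.println(distance("sat", "pat")); // 1.0
    }
}
```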

Computing edit distance (1/5)
The typical cell [i, j] has four entries formatted as a 2 × 2 cell
The lower right entry in each cell is the min of the other three
The other three entries are m[i−1, j−1] + 0 or 1 (depending on whether s1[i] = s2[j]), m[i−1, j] + 1 and m[i, j−1] + 1

Given a set S of terms and a query string q, we seek the string in S having the least edit distance from q, in order to get the most similar term
However, this exhaustive search is inordinately expensive, so heuristics are used to restrict the set of terms examined

The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string
The hope would be that spelling errors do not occur in the first character of the query

A more sophisticated variant of this heuristic is to use a version of the permuterm index, in which we omit the end-of-word symbol $
For instance, if q is mase and we consider the rotation r = sema, we would retrieve terms such as semantic and semaphore
  These do not have a small edit distance to q
  We would also miss more pertinent dictionary terms such as mare and mane
To address this, for each rotation we omit a suffix of L characters before performing the B-tree traversal (L can be set to 2)
This ensures that each retrieved term includes a long substring in common with q

K-gram indexes for spelling correction (1/4)
To further limit the set of vocabulary terms for which we compute edit distances to the query term, we use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query

For example: the bigram index shows the postings for the three bigrams in the query bord (bo, or and rd)
Suppose we want to retrieve terms that contain at least two of these three bigrams

However, terms like boardroom, an implausible “correction” of bord, would also be retrieved
We need the Jaccard coefficient to measure the overlap between the candidate term and the query term

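For example: for the query q = bord, suppose the postings scan reaches the term t = boardroom
The two share 2 bigrams, boardroom has 8 bigrams and bord has 3, so the Jaccard coefficient is 2/(8+3−2) = 2/9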
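The arithmetic can be checked with a short sketch. It assumes bigrams are taken without the $ boundary symbol, as in the bord and boardroom example, and the names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardBigramSketch {
    // Set of distinct bigrams of a term (no $ padding, matching the example).
    static Set<String> bigrams(String term) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= term.length(); i++) {
            grams.add(term.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard coefficient |A intersect B| / |A union B| over the two bigram sets.
    static double jaccard(String q, String t) {
        Set<String> a = bigrams(q);
        Set<String> b = bigrams(t);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    public static void main(String[] args) {
        // bord: bo, or, rd; boardroom: bo, oa, ar, rd, dr, ro, oo, om; shared: bo, rd
        System.out.println(jaccard("bord", "boardroom")); // 2/9, about 0.222
    }
}
```

A candidate would be kept only if its coefficient exceeds a chosen threshold; the threshold value is a tuning decision not given in the slides.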

Context-sensitive correction
Isolated-term correction would fail to correct typographical errors such as flew form Heathrow, where all three query terms are correctly spelled but the query as a whole is wrong

The simplest way is to enumerate corrections of each of the query terms, then try substitutions of each correction in the phrase
  fled form Heathrow
  flew fore Heathrow
  …
The search engine runs each substituted phrase and returns the one with the largest number of matching results
But this can be expensive if many corrections are found

A heuristic method retains only the most frequent combinations in the collection or in the query logs
In this example, the biword fled fore is likely to be rare compared to the biword flew from
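One way to picture the enumeration is the sketch below. It is a simplification under stated assumptions: the alternatives for each term and the biword counts are invented sample data, every combination is enumerated (rather than pruned early), and a phrase is scored by the sum of its biword counts instead of by running a search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContextCorrectionSketch {
    // All phrases formed by choosing one alternative per query position.
    static List<List<String>> product(List<List<String>> alternatives) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (List<String> options : alternatives) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : result) {
                for (String option : options) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(option);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    // Pick the phrase whose adjacent word pairs are most frequent in the counts.
    static List<String> best(List<List<String>> alternatives, Map<String, Integer> biwordCounts) {
        List<String> bestPhrase = null;
        int bestScore = -1;
        for (List<String> phrase : product(alternatives)) {
            int score = 0;
            for (int i = 0; i + 1 < phrase.size(); i++) {
                score += biwordCounts.getOrDefault(phrase.get(i) + " " + phrase.get(i + 1), 0);
            }
            if (score > bestScore) {
                bestScore = score;
                bestPhrase = phrase;
            }
        }
        return bestPhrase;
    }

    public static void main(String[] args) {
        // Hypothetical corrections per term and hypothetical biword counts.
        List<List<String>> alternatives = Arrays.asList(
                Arrays.asList("flew", "fled"),
                Arrays.asList("form", "from", "fore"),
                Arrays.asList("heathrow"));
        Map<String, Integer> counts = new HashMap<>();
        counts.put("flew from", 120);
        counts.put("from heathrow", 300);
        counts.put("fled from", 40);
        counts.put("flew form", 2);
        counts.put("form heathrow", 1);
        System.out.println(best(alternatives, counts)); // [flew, from, heathrow]
    }
}
```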

2.3.3 Phonetic correction
Misspellings arise because the user types a query that sounds like the target term

Phonetic hash
Similar-sounding terms hash to the same value
Especially applicable to searches on the names of people
The idea owes its origins to work in international police departments from the early 20th century
It is now mainly used to correct phonetic misspellings in proper nouns

Soundex algorithm
Turn every term to be indexed into a 4-character reduced form
Build an inverted index from these reduced forms to the original terms; call this the soundex index
Do the same with query terms
When the query calls for a soundex match, search this soundex index

The approach of the soundex algorithm (1/4)
4-character code
  The first character is a letter of the alphabet
  The other three are digits between 0 and 9
  For example: Hermann maps to H655
The assumptions of this algorithm
  Vowels are viewed as interchangeable in transcribing names
  Consonants with similar sounds are viewed as equivalent

Retain the first letter of the term
Change all occurrences of the following letters to '0': 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'
Change letters to digits as follows:
  B, F, P, V to 1
  C, G, J, K, Q, S, X, Z to 2
  D, T to 3
  L to 4
  M, N to 5
  R to 6

Repeatedly remove one out of each pair of consecutive identical digits
Remove all zeros from the resulting string
Pad the resulting string with trailing zeros and return the first four positions
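The steps can be followed in a short sketch. It implements the rules exactly as listed on these slides, which may differ in detail from other Soundex variants; the class name, the treatment of non-letters as '0' and the empty-string result are assumptions of the example.

```java
class SoundexSketch {
    // Digit assigned to a letter by the slide's table; anything else becomes '0'.
    static char code(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V':
                return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z':
                return '2';
            case 'D': case 'T':
                return '3';
            case 'L':
                return '4';
            case 'M': case 'N':
                return '5';
            case 'R':
                return '6';
            default:
                return '0'; // A, E, I, O, U, H, W, Y (and, by assumption, non-letters)
        }
    }

    static String soundex(String term) {
        if (term.isEmpty()) {
            return ""; // assumption: no code for an empty term
        }
        String upper = term.toUpperCase();
        // Retain the first letter, then map every later letter to a digit,
        // skipping a digit that repeats the one just before it.
        StringBuilder digits = new StringBuilder();
        for (int i = 1; i < upper.length(); i++) {
            char d = code(upper.charAt(i));
            if (digits.length() == 0 || digits.charAt(digits.length() - 1) != d) {
                digits.append(d);
            }
        }
        // Remove all zeros.
        StringBuilder result = new StringBuilder().append(upper.charAt(0));
        for (int i = 0; i < digits.length(); i++) {
            if (digits.charAt(i) != '0') {
                result.append(digits.charAt(i));
            }
        }
        // Pad with trailing zeros and keep the first four positions.
        while (result.length() < 4) {
            result.append('0');
        }
        return result.substring(0, 4);
    }

    public static void main(String[] args) {
        System.out.println(soundex("Hermann")); // H655
        System.out.println(soundex("Herman"));  // H655
    }
}
```

Hermann and Herman receive the same code, so a soundex index would return both for either spelling.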

Such rules tend to be writing-system dependent
Chinese names can be written in Wade-Giles or Pinyin transcription; soundex works for some of the differences between the two transcriptions
  苏州: Suzhou / Soochow (Soochou)
  北京: Beijing / Peking
  青岛: Qingdao / Tsingtao
  清华: Qinghua / Tsinghwa
  中华: Zhonghua / Chonghwa
  长江: Changjiang River / Yangtze River
Wade-Giles system: 威妥玛–翟理斯式拼音法 (妥 is read tuǒ, 翟 is read zhái)

Soundex with SimMetrics
SimMetrics is an open source Java library of similarity or distance metrics, e.g. Levenshtein distance, that provides float-based similarity measures between string data
This open source library is hosted at
