The Principle of Information Retrieval

Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics, 2017

Chapter 4: Index

1 Inverted Index

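2 Equivalence Classing Index
Tokens that should match each other are mapped to the same term: U.S.A. → USA, USA → USA

Asymmetric scheme (query term: terms in documents that should be matched)
Windows: Windows
windows: Windows, windows, window
window: window, windows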

3 Positional postings and phrase queries
Biword indexes
Phrase indexes
Positional indexes
Combination schemes

Introduction
As many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names) entered without double quotes.
To support such queries, traditional postings lists that record only which documents contain a term are no longer sufficient.

3.1 Biword indexes

Example
The text “Nanjing Normal University” would generate the biwords:
Nanjing Normal
Normal University

Example
Longer phrases can be processed by breaking them down into biwords.
For example, the query “stanford university palo alto” would become the Boolean query on biwords: “stanford university” AND “university palo” AND “palo alto”
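The decomposition above can be sketched in a few lines of Java. This is only an illustration: the class name BiwordSketch and the whitespace tokenization are assumptions, not part of the original slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text on whitespace and pairs each token with its successor.
final class BiwordSketch {
    static List<String> biwords(String text) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            result.add(tokens[i] + " " + tokens[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [stanford university, university palo, palo alto]
        System.out.println(biwords("stanford university palo alto"));
    }
}
```

A phrase query is then answered by looking up each biword as a vocabulary term and intersecting the postings lists.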

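Challenge
This query could be expected to work fairly well in practice, but there can and will be occasional false positives: a document can contain all three biwords without containing the full phrase.
(Terminology: a false positive is a wrongly reported match; a false negative is a missed match.)
In addition, related nouns can often be divided from each other by various function words.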

Biword indexes with part-of-speech tagging
Tokenize the text and perform part-of-speech tagging.
Group terms into nouns, including proper nouns (N), and function words, including articles and prepositions (X), among other classes.
Deem any string of terms of the form NX*N to be an extended biword.
Each such extended biword is made a term in the vocabulary.

For example:
renegotiation  of  the  constitution
N              X   X    N
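A minimal sketch of extracting NX*N patterns from already-tagged text follows. It assumes the tags are supplied by some external part-of-speech tagger and that the function words are dropped from the resulting vocabulary term; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds every N X* N span and emits the two nouns as one extended biword.
final class ExtendedBiwordSketch {
    static List<String> extendedBiwords(String[] tokens, char[] tags) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i] != 'N') {
                continue;
            }
            int j = i + 1;
            while (j < tokens.length && tags[j] == 'X') {
                j++; // skip any run of function words
            }
            if (j < tokens.length && tags[j] == 'N') {
                out.add(tokens[i] + " " + tokens[j]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"renegotiation", "of", "the", "constitution"};
        char[] tags = {'N', 'X', 'X', 'N'};
        System.out.println(extendedBiwords(tokens, tags)); // prints [renegotiation constitution]
    }
}
```

Any tag other than N or X (a verb, for instance) ends the current span, so only nouns separated purely by function words are paired.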

Many fairly accurate (c. 96% per-tag accuracy) part-of-speech taggers now exist, usually trained by machine learning methods on hand-tagged text (Manning and Schütze, 1999).

Phrase index
The concept of a biword index can be extended to longer sequences of words.
If the index includes variable-length word sequences, it is generally referred to as a phrase index.
While there is always a chance of false positive matches, the chance of a false positive match on indexed phrases of length 3 or more becomes very small indeed.

The disadvantage of a phrase index
Storing longer phrases has the potential to greatly expand the vocabulary size.
Using a partial phrase index in a compound indexing scheme is a better idea.

3.2 Positional indexes
For each term in the vocabulary, we store postings of the form docID: position1, position2, …, where each position is a token index of the term within that document.

Example Suppose the query is “to be or not to be”

Example
In the above lists, the pattern of occurrences that is a possible match is:
to: { …; 4: { …, 429, 433}; …}
be: { …; 4: { …, 430, 434}; …}
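The position check behind this match can be sketched as follows. This is an illustrative version for a single document: the class name, the array layout, and the use of only the positions visible in the example (429, 433 for to; 430, 434 for be) are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: positions[k] holds the sorted positions of the k-th query term
// in one document; offsets[k] is that term's offset from the start of the phrase.
final class PhraseMatchSketch {
    static List<Integer> phraseStarts(int[][] positions, int[] offsets) {
        List<Integer> starts = new ArrayList<>();
        for (int p : positions[0]) {
            int start = p - offsets[0];
            boolean allPresent = true;
            for (int k = 1; k < positions.length && allPresent; k++) {
                // the k-th term must occur exactly offsets[k] tokens after the start
                allPresent = Arrays.binarySearch(positions[k], start + offsets[k]) >= 0;
            }
            if (allPresent) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] to = {429, 433};
        int[] be = {430, 434};
        // "to be or not to be": to at +0 and +4, be at +1 and +5 ("or" and "not" left out for brevity)
        System.out.println(phraseStarts(new int[][] {to, be, to, be}, new int[] {0, 1, 4, 5})); // prints [429]
    }
}
```

A full implementation would first intersect the docID lists of all query terms and run this check only on the documents that contain every term.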

k word proximity search
The same general method can be applied to within-k-word proximity searches.
For example: employment /3 place
Here /k means “within k words of (on either side)”.
Clearly, positional indexes can be used for such queries; biword indexes cannot.
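A proximity test on two sorted position lists from the same document might look like the sketch below. The class name and the sample positions are hypothetical; the two-pointer walk is one common way to do it, not necessarily the one used in the lecture.

```java
// Hypothetical helper: do two terms occur within k words of each other in one document?
final class ProximitySketch {
    // a and b are the sorted positions of the two terms in the same document.
    static boolean withinK(int[] a, int[] b, int k) {
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (Math.abs(a[i] - b[j]) <= k) {
                return true;
            }
            // advance the pointer at the smaller position; it cannot get closer to anything later
            if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] employment = {3, 40};
        int[] place = {10, 42};
        System.out.println(withinK(employment, place, 3)); // prints true (40 and 42)
    }
}
```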

The disadvantage of positional indexes
Adopting a positional index expands required postings storage significantly.
A positional index can be 2 to 4 times as large as a non-positional index.
Even a compressed positional index can be about one third to one half the size of the raw text.

3.3 Combination schemes
If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional postings lists.
A combination strategy uses a phrase index, or just a biword index, for certain queries and uses a positional index for other phrase queries.

Combination strategy
Good phrases to include in the phrase index are ones known to be common based on recent querying behavior.
The most valuable phrases are ones where the individual words are common but the desired phrase is comparatively rare.
“The Who” should be indexed, whereas “information system” should not.

4 Index for Tolerant retrieval

4.1 Wildcard queries
Trailing wildcard query
Leading wildcard query
General wildcard query

Used in the following situations
The user is uncertain of the spelling of a query term, e.g., Sydney vs. Sidney, which leads to the wildcard query S*dney.
The user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants, e.g., color vs. colour.
The wildcard symbol * is the asterisk.

Trailing wildcard query
A query such as mon* is known as a trailing wildcard query, because the * is at the end of the search string.
A search tree on the dictionary is a convenient way of handling it.
Walk down the tree following the symbols m, o and n in turn to enumerate the set W of dictionary terms with the prefix mon.
Use lookups on the standard inverted index to retrieve all documents containing any term in W.

Leading wildcard query
For example, queries of the form *mon.
A reverse B-tree on the dictionary can be used.
For example, the term lemon would be represented by the path root-n-o-m-e-l.

General wildcard query
For example, queries of the form se*mon.
How to handle it:
Using a regular B-tree together with a reverse B-tree
Permuterm index
K-gram index

Using a regular B-tree with a reverse B-tree
For example: query se*mon
Use the regular B-tree to enumerate the set W of dictionary terms beginning with the prefix se.
Use the reverse B-tree to enumerate the set R of terms ending with the suffix mon.
Take the intersection W ∩ R.
Use the standard inverted index to retrieve all documents containing any term in this intersection.
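The W ∩ R procedure can be sketched with java.util.TreeSet standing in for the B-trees (an assumption: a sorted set supports the same prefix range scan). The class name and the sample dictionary are hypothetical, and a real system would build both trees once rather than per query.

```java
import java.util.Collection;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper answering queries of the form prefix*suffix.
final class WildcardTreeSketch {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // All entries of a sorted set that start with the given prefix.
    static SortedSet<String> withPrefix(TreeSet<String> tree, String prefix) {
        return tree.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    static Set<String> match(Collection<String> dictionary, String prefix, String suffix) {
        TreeSet<String> forward = new TreeSet<>(dictionary);   // regular tree
        TreeSet<String> backward = new TreeSet<>();            // reverse tree
        for (String term : dictionary) {
            backward.add(reverse(term));
        }
        Set<String> w = new TreeSet<>(withPrefix(forward, prefix));
        Set<String> r = new TreeSet<>();
        for (String reversed : withPrefix(backward, reverse(suffix))) {
            r.add(reverse(reversed));
        }
        w.retainAll(r); // W ∩ R
        // Guard against terms whose prefix and suffix overlap (e.g. "aba" for ab*ba).
        w.removeIf(term -> term.length() < prefix.length() + suffix.length());
        return w;
    }

    public static void main(String[] args) {
        System.out.println(match(java.util.List.of("sermon", "salmon", "season", "lemon"), "se", "mon"));
    }
}
```

The length check at the end is an addition to the steps listed above: without it, a short term in which the prefix and suffix overlap would be reported even though it cannot match the wildcard pattern.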

Permuterm index
First, we introduce a special symbol $ to mark the end of a term.
The term hello is shown as the augmented term hello$.
Then construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term.
The set of rotated terms in the permuterm index is called the permuterm vocabulary.

Permuterm index
Its dictionary may be quite large: the rotations can cause an almost tenfold increase in space.

Permuterm index: how to handle a general wildcard query
The key is to rotate such a wildcard query so that the * symbol appears at the end of the string.
So the query h*o can be rotated to o$h*.
But what about a query such as fi*mo*er?
First enumerate the terms in the permuterm index that match er$fi*.
Then filter these by exhaustive enumeration, checking each candidate to see whether it contains mo (fishmonger, for example, passes this check).
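The rotations and the query rewriting can be sketched as follows. The class and method names are hypothetical; the query rotation keeps only the part before the first * and the part after the last *, so anything between two stars still has to be checked in the filtering step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for building and querying a permuterm vocabulary.
final class PermutermSketch {
    // All rotations of the term augmented with the end marker $.
    static List<String> rotations(String term) {
        String augmented = term + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i < augmented.length(); i++) {
            out.add(augmented.substring(i) + augmented.substring(0, i));
        }
        return out;
    }

    // Rewrites X*Y (or X*...*Y) as Y$X*, so that the lookup becomes a prefix search.
    static String rotateQuery(String query) {
        int first = query.indexOf('*');
        int last = query.lastIndexOf('*');
        return query.substring(last + 1) + "$" + query.substring(0, first) + "*";
    }

    public static void main(String[] args) {
        System.out.println(rotations("hello"));      // hello$, ello$h, llo$he, lo$hel, o$hell, $hello
        System.out.println(rotateQuery("h*o"));      // prints o$h*
        System.out.println(rotateQuery("fi*mo*er")); // prints er$fi*
    }
}
```

The rotation o$hell of hello begins with o$h, which is why the rotated query o$h* finds it with an ordinary prefix lookup.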

K-gram index
A k-gram is a sequence of k characters.
Using $ to mark the beginning and end of a term, the full set of 3-grams generated for castle is: $ca, cas, ast, stl, tle, le$

How to use a k-gram index
For example: query re*ve
Run the Boolean query $re AND ve$.
Look this up in the 3-gram index to yield a list of matching terms.
Each of these matching terms is then looked up in the standard inverted index to yield documents matching the query.

Difficulties of the k-gram index
Consider using the 3-gram index for the query red*.
First issue the Boolean query $re AND red to the 3-gram index.
This leads to a match on terms such as retired, which contain the conjunction of the two 3-grams $re and red, yet do not match the original wildcard query red*.
To cope with this, we need to introduce a post-filtering step, in which the terms enumerated are checked individually against the original query red*.
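Both the k-gram generation and the post-filtering step can be sketched briefly. The names are hypothetical, and the filter simply translates the wildcard query into a regular expression, which is one possible way to check a candidate against the original query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for k-gram generation and wildcard post-filtering.
final class KGramSketch {
    // k-grams of the term after adding $ at both ends.
    static List<String> kgrams(String term, int k) {
        String marked = "$" + term + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + k <= marked.length(); i++) {
            out.add(marked.substring(i, i + k));
        }
        return out;
    }

    // Post-filter: does the whole term match the wildcard query (only * is special)?
    static boolean matchesWildcard(String term, String query) {
        String[] parts = query.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            if (!parts[i].isEmpty()) {
                regex.append(Pattern.quote(parts[i]));
            }
        }
        return term.matches(regex.toString());
    }

    public static void main(String[] args) {
        System.out.println(kgrams("castle", 3));                // prints [$ca, cas, ast, stl, tle, le$]
        System.out.println(matchesWildcard("retired", "red*")); // prints false
        System.out.println(matchesWildcard("redeem", "red*"));  // prints true
    }
}
```

The term retired contains both $re and red among its 3-grams, so it survives the Boolean query and is only rejected by the filter.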

Why search engines hide this functionality
A search engine may support such rich functionality, but most commonly the capability is hidden behind an interface (say an “Advanced Query” interface) that most users never use.
Exposing such functionality in the search interface often encourages users to invoke it even when they do not require it (say, by typing a prefix of their query followed by a *), increasing the processing load on the search engine.

Google's wildcard search
Wildcard search in Google matches only whole words, not parts of words.

4.2 Spelling correction
We may wish to retrieve documents containing the term carrot when the user types the query carot.

Misspellings reported on Google
Using a list of observed misspellings such as this one, we can recommend the correct spelling when users enter a misspelled query.

http://www.google.com/jobs/britney.html
Count    Query
488941   britney spears
40134    brittany spears
36315    brittney spears
24342    britany spears
7331     britny spears
6633     briteny spears
2696     britteny spears
1807     briney spears
1338     britiny spears
1211     britnet spears
1096     britiney spears
991      britaney spears
991      britnay spears
811      brithney spears
811      brtiney spears
664      birtney spears

Google vs. Baidu
Google favors retrieving more relevant Web pages.
Baidu favors retrieving precisely relevant Web pages.
Google corrects English misspellings better than Baidu.
Baidu corrects Chinese misspellings better than Google.
Of course, the correction doesn't always look right!

Basic principles of spelling correction
Choose the “nearest” one.
This demands that we have a notion of nearness or proximity between a pair of queries.
For grnt, we choose not great but grant.

Basic principles of spelling correction
Choose the one that is more common, especially when two correctly spelled queries are tied.
For grnt, we choose not grunt but grant.
To judge which one is more common, we can consider:
the number of occurrences of the term in the collection;
the number of occurrences of the term in the queries typed in by other users, especially for a Web search engine.
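The two principles can be combined in a small sketch: pick the nearest candidate, and break ties by frequency. It assumes edit distance as the notion of nearness, and the term counts in the example are invented purely for illustration; the class name is hypothetical.

```java
import java.util.Map;

// Hypothetical helper: nearest candidate first, most common candidate on ties.
final class CorrectionChoiceSketch {
    // Standard Levenshtein distance with unit costs.
    static int editDistance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= t.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[s.length()][t.length()];
    }

    // termCounts maps each candidate term to how often it occurs (in the collection or query log).
    static String choose(String query, Map<String, Integer> termCounts) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : termCounts.entrySet()) {
            int distance = editDistance(query, entry.getKey());
            boolean nearer = distance < bestDistance;
            boolean tiedButMoreCommon = distance == bestDistance && entry.getValue() > bestCount;
            if (nearer || tiedButMoreCommon) {
                best = entry.getKey();
                bestDistance = distance;
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented counts: great is common but two edits away; grant and grunt are one edit away.
        System.out.println(choose("grnt", Map.of("great", 900, "grant", 300, "grunt", 40))); // prints grant
    }
}
```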

How to expose the correct words to users
1. On the query carot, always retrieve documents containing carot as well as any “spell-corrected” version of carot, including carrot and tarot.
2. As in (1), but only when the query term carot is not in the dictionary.
3. As in (1), but only when the original query returned fewer documents than a threshold (say fewer than five).
4. When the original query returns fewer documents than a threshold, the search interface presents a spelling suggestion to the end user. This suggestion consists of the spell-corrected query term(s). Thus, the search engine might respond to the user: “Did you mean carrot?”

Solutions to misspelling
Isolated-term correction
  Methods based on edit distance
  Methods based on k-gram overlap
Context-sensitive correction

Isolated-term correction
In isolated-term correction, we attempt to correct a single query term at a time.
But such isolated-term correction would fail to detect, for instance, that the query flew form Heathrow contains a misspelling of the term from.

Edit distance
Sometimes known as Levenshtein distance.
Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2.

    public class exec {

        // Get the minimum of three values
        private static int Minimum(int a, int b, int c) {
            int mi = a;
            if (b < mi) {
                mi = b;
            }
            if (c < mi) {
                mi = c;
            }
            return mi;
        }

        // Compute the Levenshtein distance between s and t
        public static int LD(String s, String t) {
            int[][] d;  // matrix
            int n;      // length of s
            int m;      // length of t
            int cost;   // cost of a substitution

            // Step 1: handle empty strings and allocate the matrix
            n = s.length();
            m = t.length();
            if (n == 0) {
                return m;
            }
            if (m == 0) {
                return n;
            }
            d = new int[n + 1][m + 1];

            // Step 2: initialize the first column and the first row
            for (int i = 0; i <= n; i++) {
                d[i][0] = i;
            }
            for (int j = 0; j <= m; j++) {
                d[0][j] = j;
            }

            // Step 3: iterate through the characters of s
            for (int i = 1; i <= n; i++) {
                char s_i = s.charAt(i - 1);
                // Step 4: iterate through the characters of t
                for (int j = 1; j <= m; j++) {
                    char t_j = t.charAt(j - 1);
                    // Step 5: matching characters cost nothing
                    cost = (s_i == t_j) ? 0 : 1;
                    // Step 6: cheapest of deletion, insertion and substitution
                    d[i][j] = Minimum(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
                }
            }

            // Step 7: the distance is in the bottom-right cell
            return d[n][m];
        }

        public static void main(String[] args) {
            System.out.println(LD("cat", "dog"));  // prints 3
        }
    }

Edit distance
Most commonly, the edit operations allowed for this purpose are:
Insert a character into a string
Delete a character from a string
Replace a character of a string by another character
For example, the edit distance between cat and dog is 3.

Edit distance
In fact, the notion of edit distance can be generalized to allow different weights for different kinds of edit operations.
For instance, a higher weight may be placed on replacing the character s by the character p than on replacing it by the character a (the latter being closer to s on the keyboard).
In the most common setting, however, all edit operations have the same weight.

Computing edit distance
The typical cell [i, j] has four entries formatted as a 2 × 2 cell.
The lower right entry in each cell is the minimum of the other three.
The other three entries are m[i−1, j−1] + 0 or 1 (0 if s1[i] = s2[j], 1 otherwise), m[i−1, j] + 1, and m[i, j−1] + 1.

Given a set S of terms and a query string q, we seek the string in S having the least edit distance from q, in order to get the most similar term.
However, this exhaustive search is inordinately expensive, so heuristics are used to narrow it.

The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string.
The hope is that spelling errors do not occur in the first character of the query.

A more sophisticated variant of this heuristic is to use a version of the permuterm index, in which we omit the end-of-word symbol $.
For instance, if q is mase and we consider the rotation r = sema, we would retrieve terms such as semantic and semaphore, which do not have a small edit distance to q.
We would also miss more pertinent dictionary terms such as mare and mane.
To address this, for each rotation we omit a suffix of L characters before performing the B-tree traversal (L can be set to 2).
This ensures that each term retrieved includes a long substring in common with q.

K-gram indexes for spelling correction
The goal is to further limit the set of vocabulary terms for which we compute edit distances to the query term.
We use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query.

For example:
The bigram index shows the postings for the three bigrams in the query bord (bo, or, rd).
Suppose we want to retrieve terms that contain at least two of these three bigrams.

However, terms like boardroom, an implausible “correction” of bord, are also enumerated by this scan.
We need the Jaccard coefficient, |A ∩ B| / |A ∪ B|, to measure the overlap between the k-gram sets of the candidate term and the query term.

For example:
When the scan for the query q = bord reaches the term t = boardroom, the Jaccard coefficient is 2/(8+3−2) = 2/9: boardroom has 8 bigrams, bord has 3, and they share 2.
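The calculation can be reproduced with a short sketch. As in the example, bigrams are taken without boundary markers; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: Jaccard overlap between the bigram sets of two terms.
final class JaccardSketch {
    static Set<String> bigrams(String term) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 2 <= term.length(); i++) {
            out.add(term.substring(i, i + 2));
        }
        return out;
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return unionSize == 0 ? 0.0 : (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // bord: {bo, or, rd}; boardroom: {bo, oa, ar, rd, dr, ro, oo, om}; shared: {bo, rd}
        System.out.println(jaccard(bigrams("bord"), bigrams("boardroom"))); // 2/9, about 0.22
    }
}
```

A candidate is kept only if its coefficient exceeds some preset threshold, which is how a low-overlap term such as boardroom gets discarded.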

Context-sensitive correction
Isolated-term correction would fail to correct typographical errors such as flew form Heathrow, where all three query terms are correctly spelled yet the query as a whole is wrong.

The simplest way is to enumerate corrections of each of the query terms, then try substitutions of each correction in the phrase:
fled form Heathrow
flew fore Heathrow
…
The search engine runs each substituted query and returns the one with the largest number of matching results.
But this can be expensive if many corrections are found for the individual terms.

A heuristic method retains only the most frequent combinations in the collection or in the query logs.
In this example, the biword fled fore is likely to be rare compared to the biword flew from.
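One way to read this heuristic is as a pruned enumeration: extend the phrase one term at a time and keep only the combinations whose biwords are most frequent. The sketch below follows that reading; the class name, the keep parameter and the biword counts are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: enumerate corrections left to right, pruning rare combinations.
final class ContextCorrectionSketch {
    // Sum of the counts of all adjacent word pairs in the phrase.
    static long score(List<String> phrase, Map<String, Integer> biwordCounts) {
        long total = 0;
        for (int i = 0; i + 1 < phrase.size(); i++) {
            total += biwordCounts.getOrDefault(phrase.get(i) + " " + phrase.get(i + 1), 0);
        }
        return total;
    }

    // candidates.get(p) lists the possible corrections for the p-th query term.
    static List<String> correct(List<List<String>> candidates, Map<String, Integer> biwordCounts, int keep) {
        List<List<String>> retained = new ArrayList<>();
        for (String c : candidates.get(0)) {
            retained.add(List.of(c));
        }
        for (int pos = 1; pos < candidates.size(); pos++) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> prefix : retained) {
                for (String c : candidates.get(pos)) {
                    List<String> next = new ArrayList<>(prefix);
                    next.add(c);
                    extended.add(next);
                }
            }
            // Keep only the most frequent combinations before moving on to the next term.
            extended.sort(Comparator.comparingLong((List<String> p) -> score(p, biwordCounts)).reversed());
            retained = extended.subList(0, Math.min(keep, extended.size()));
        }
        return retained.get(0);
    }

    public static void main(String[] args) {
        List<List<String>> candidates = List.of(
                List.of("flew", "fled"), List.of("form", "from", "fore"), List.of("heathrow"));
        Map<String, Integer> counts = Map.of("flew from", 50, "from heathrow", 30, "fled fore", 1); // invented
        System.out.println(correct(candidates, counts, 2)); // prints [flew, from, heathrow]
    }
}
```

With keep set to 2, only flew from and fled fore survive the first extension, so the third term is tried against two prefixes instead of six.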

4.3 Phonetic correction
Some misspellings arise because the user types a query that sounds like the target term.

Phonetic hash
Similar-sounding terms hash to the same value.
This is especially applicable to searches on the names of people.
The idea owes its origins to work in international police departments from the early 20th century.
It is now mainly used to correct phonetic misspellings in proper nouns.

Soundex algorithm
Turn every term to be indexed into a 4-character reduced form.
Build an inverted index from these reduced forms to the original terms; call this the soundex index.
Do the same with query terms.
When the query calls for a soundex match, search this soundex index.

The approach of the Soundex algorithm
The result is a 4-character code, with the first character being a letter of the alphabet and the other three being digits between 0 and 9.
For example: Hermann maps to H655.
The assumptions of this algorithm:
Vowels are viewed as interchangeable in transcribing names.
Consonants with similar sounds are viewed as equivalent.

1. Retain the first letter of the term.
2. Change all occurrences of the following letters to '0': 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows: B, F, P, V to 1; C, G, J, K, Q, S, X, Z to 2; D, T to 3; L to 4; M, N to 5; R to 6.
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions.
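The steps listed above can be sketched directly. This version follows those steps literally, assumes a non-empty term, and skips any character outside A to Z; other published Soundex variants differ in details such as how the first letter's own code is treated. The class name is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper implementing the listed Soundex steps.
final class SoundexSketch {
    // Digit for each letter A..Z: vowels and H, W, Y map to 0.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String term) {
        String upper = term.toUpperCase(Locale.ROOT);
        StringBuilder digits = new StringBuilder();
        for (int i = 1; i < upper.length(); i++) {
            char c = upper.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // assumption: ignore anything that is not a plain letter
            }
            char d = CODES.charAt(c - 'A');
            // Collapse runs of identical consecutive digits as they are produced.
            if (digits.length() == 0 || digits.charAt(digits.length() - 1) != d) {
                digits.append(d);
            }
        }
        StringBuilder code = new StringBuilder().append(upper.charAt(0)); // retain the first letter
        for (int i = 0; i < digits.length() && code.length() < 4; i++) {
            if (digits.charAt(i) != '0') {
                code.append(digits.charAt(i)); // drop zeros
            }
        }
        while (code.length() < 4) {
            code.append('0'); // pad with trailing zeros
        }
        return code.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Hermann")); // prints H655
    }
}
```

For Hermann the letters after H become 0, 6, 5, 0, 5, 5; collapsing the final pair and removing zeros leaves 6, 5, 5, which gives H655.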

Such rules tend to be writing-system dependent.
Chinese names can be written in Wade-Giles or Pinyin transcription, and Soundex handles only some of the differences between the two transcriptions.
Examples of variant transcriptions: Suzhou/Soochow (Soochou), Beijing/Peking, Qingdao/Tsingtao, Qinghua/Tsinghwa, Zhonghua/Chonghwa, Changjiang River/Yangtze River.

Soundex with SimMetrics
SimMetrics is an open source Java library of similarity and distance metrics (e.g., Levenshtein distance) that provides float-based similarity measures between strings.
