Gertjan van Noord, 2014. Zoekmachines (Search Engines), Lecture 3: tolerant retrieval.


1 Gertjan van Noord, 2014. Zoekmachines (Search Engines), Lecture 3: tolerant retrieval

2 Tolerant retrieval: overview
Methods to handle imprecise queries: wildcard queries, typos, alternative spellings.
Building alternative indexes.
Finding the most similar terms.

3 Wild-card queries: *
mon*: find docs containing any word beginning with “mon”.
*mon: find words ending in “mon”: harder.
mo*n: find words that start with “mo” and end with “n”.
m*o*n: find words that start with “m”, end with “n”, and have an “o” somewhere in between.
Sec. 3.2

4 Wildcard queries
Two steps in retrieval for wildcard queries:
1. Find all terms that fall within the wildcard definition.
2. Find all docs containing any of these terms.
Three ways to do this: B-trees, permuterm index, k-gram index.

5 Dictionary structures
Hash: very efficient (lookup and construction), but cannot be used to find terms that are “close” to the key.
Binary tree and B-tree (and tries): data structures that keep the data sorted (and balanced). Search is efficient, but construction is more costly. Words with the same prefix are close together in the sorted order, so these structures can be used for robust retrieval.

6 Wild-card queries: *
mon*: easy with a binary tree (or B-tree) lexicon: retrieve all terms in the range mon ≤ w < moo.
*mon: maintain an additional B-tree for terms written backwards; retrieve all words in the range nom ≤ w < non.
m*n: combine the B-tree and the reverse B-tree. Expensive!
m*o*n: ??
Solution: the permuterm index.
Sec. 3.2
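
The range lookups on this slide can be sketched with a sorted list standing in for the B-tree. This is a minimal illustration, not the lecture's implementation: the function names and the toy vocabulary are made up for the example, and the prefix is assumed to be non-empty.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < upper, e.g. mon <= w < moo."""
    # Upper bound: increment the last character of the prefix ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

def suffix_range(sorted_reversed_terms, suffix):
    """Terms ending in `suffix`, found via a lexicon of reversed terms."""
    hits = prefix_range(sorted_reversed_terms, suffix[::-1])
    return sorted(t[::-1] for t in hits)

# Hypothetical toy vocabulary.
terms = sorted(["lemon", "moan", "money", "monkey", "month", "moon", "salmon"])
reversed_terms = sorted(t[::-1] for t in terms)

print(prefix_range(terms, "mon"))           # ['money', 'monkey', 'month']
print(suffix_range(reversed_terms, "mon"))  # ['lemon', 'salmon']
```

A real B-tree gives the same range behaviour while also supporting efficient insertion; the sorted list is only the simplest structure that shows the idea.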

7 Permuterm index and queries
Permuterm index:
add an end symbol: cat$
index all permuterms (in a structure like a B-tree): cat$, at$c, t$ca, $cat
Wildcard query processing: add $, then rotate (if needed) until the * is at the end.
Examples: queries that can find (among others) cat: c*t, c*at, ca*, ca*t, *t, *at. Permuterm form?
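
Generating the permuterms of a term is a matter of rotating the string. A small sketch (the function name `permuterms` is chosen for this example only):

```python
def permuterms(term, end="$"):
    """Return every rotation of term + end symbol."""
    s = term + end
    return [s[i:] + s[:i] for i in range(len(s))]

print(permuterms("cat"))  # ['cat$', 'at$c', 't$ca', '$cat']
```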

8 Permuterm index
For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol.
Queries:
X: lookup on X$
X*: lookup on $X*
*X: lookup on X$*
*X*: lookup on X*
X*Y: lookup on Y$X*
X*Y*Z: ???? Exercise!
Example: query = hel*o, so X=hel, Y=o; lookup o$hel*.
Sec. 3.2.1
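
The lookup rules for queries with a single * can be sketched as follows. This is an illustrative sketch under stated assumptions: a sorted list of (rotation, term) pairs stands in for the B-tree, all function names are invented for the example, and queries with more than one * (the exercise on the slide) are not handled.

```python
import bisect

def rotations(term):
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, t) for t in terms for rot in rotations(t))

def rotate_query(query):
    """Append $ and rotate until the single * is at the end: hel*o -> o$hel*."""
    q = query + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star] + "*"

def lookup(index, query):
    """Prefix search on the rotated query (exactly one * assumed)."""
    prefix = rotate_query(query)[:-1]          # drop the trailing *
    lo = bisect.bisect_left(index, (prefix, ""))
    matches = set()
    for rot, term in index[lo:]:
        if not rot.startswith(prefix):
            break
        matches.add(term)
    return sorted(matches)

# Hypothetical toy vocabulary.
index = build_permuterm(["halo", "hello", "help", "hero"])
print(rotate_query("hel*o"))   # o$hel*
print(lookup(index, "hel*o"))  # ['hello']
print(lookup(index, "*lo"))    # ['halo', 'hello']
```

The price of this simplicity is index size: every term is stored once per rotation.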

9 K-gram index
k-gram index (example: k=3)
To each dictionary term, add a start and an end symbol: $kitten$
From this string, list all trigrams. kitten: $ki, kit, itt, tte, ten, en$
Make an inverted index of trigrams: $ki -> (kinkiten, kitchen, kitten, ...)
How can we find kitten?
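
Listing the k-grams of a padded term takes one sliding window. A minimal sketch (the function name is an assumption of this example):

```python
def kgrams(term, k=3):
    """List the k-grams of $term$ in order of occurrence."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams("kitten"))  # ['$ki', 'kit', 'itt', 'tte', 'ten', 'en$']
```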

10 An alternative: K-gram indexes
Index for dictionary lookup, not for document retrieval!
Posting lists point from k-gram to vocabulary terms.
k-gram: group of k consecutive items (context-dependent: characters, syllables, words, ...).
bigram (digram), trigram, …

11 K-gram index and queries
Part of the 3-gram inverted index:
$ki -> kinkiten kitchen kitten
en$ -> kinkiten kitchen kitten
che -> kitchen
ink -> kinkiten
itt -> kitten
kit -> kinkiten kitchen kitten
Wildcard query processing: kit*en, i.e. $kit*en$ -> $ki AND kit AND en$
kinkiten??? It contains all three trigrams but does not match kit*en, so postprocessing is needed!
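
The whole pipeline, including the postprocessing step, might look like the sketch below. It is an illustration under assumptions: function names are invented, posting lists are Python sets, and the post-filter uses `fnmatch.fnmatchcase` from the standard library, which is adequate only while terms contain no `?` or `[` characters.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def kgrams(s, k=3):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_index(terms, k=3):
    """Inverted index from k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for t in terms:
        for g in kgrams("$" + t + "$", k):
            index[g].add(t)
    return index

def wildcard_candidates(index, query, k=3):
    """AND together the postings of every k-gram in the query pieces."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= kgrams(piece, k)
    if not grams:  # every piece is shorter than k: nothing to filter on
        return set().union(*index.values())
    return set.intersection(*(index.get(g, set()) for g in grams))

def wildcard_lookup(index, query, k=3):
    # Post-filter: k-gram matching alone admits false positives.
    return sorted(t for t in wildcard_candidates(index, query, k)
                  if fnmatchcase(t, query))

index = build_index(["kinkiten", "kitchen", "kitten"])
print(sorted(wildcard_candidates(index, "kit*en")))  # ['kinkiten', 'kitchen', 'kitten']
print(wildcard_lookup(index, "kit*en"))              # ['kitchen', 'kitten']
```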

12 Query processing At this point, we have an enumeration of all terms in the dictionary that match the wild-card query. We still have to look up the postings for each enumerated term. E.g., consider the query: se*ate AND fil*er This may result in the execution of many Boolean AND queries. Sec. 3.2
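
One way to picture this step: each wildcard expands to a set of terms, the postings of those terms are combined with OR, and the results for the two wildcards are combined with AND. The sketch below uses made-up toy data (terms, doc IDs, and the expansions are hypothetical).

```python
def postings_for_wildcard(matching_terms, inverted_index):
    """Union (OR) the postings of every term the wildcard expanded to."""
    docs = set()
    for term in matching_terms:
        docs |= inverted_index.get(term, set())
    return docs

# Hypothetical toy data: term -> set of doc IDs.
inverted_index = {
    "senate": {1, 2}, "sedate": {3}, "separate": {2, 4},
    "filter": {2, 3}, "filler": {4, 5},
}
se_ate = ["senate", "sedate", "separate"]   # assumed expansion of se*ate
fil_er = ["filter", "filler"]               # assumed expansion of fil*er

result = (postings_for_wildcard(se_ate, inverted_index)
          & postings_for_wildcard(fil_er, inverted_index))
print(sorted(result))  # [2, 3, 4]
```

Taking the unions first is logically equivalent to running one AND query per pair of expanded terms and merging the answers, but it avoids enumerating all pairs.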

13 Spell correction
When? If a query word (or combination of words) is quite rare or not available at all in the dictionary.
Approach:
1. Find similar term(s).
2. Calculate their similarity to the query term.
3. Choose the most frequent ones.

14 Finding similar words and calculating their similarity
Use the k-gram index of words and calculate the Jaccard coefficient to find the terms most similar to the query term:
|A ∩ B| / |A ∪ B|
Relative similarity: the size of the set of elements (k-grams) in common, divided by the size of the set of all elements.
SET: no duplicates!
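
A small sketch of ranking vocabulary terms by Jaccard coefficient over their k-gram sets. The choice of bigrams with $ padding, the function names, and the toy vocabulary are assumptions of this example; a real system would first use the k-gram index to restrict the candidates rather than score the whole vocabulary.

```python
def kgram_set(term, k=2):
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def most_similar(query, vocabulary, k=2, top=3):
    q = kgram_set(query, k)
    scored = [(jaccard(q, kgram_set(t, k)), t) for t in vocabulary]
    # Highest similarity first; ties broken alphabetically.
    return sorted(scored, key=lambda p: (-p[0], p[1]))[:top]

for score, term in most_similar("kiten", ["kitchen", "kitten", "mitten"]):
    print(f"{term}: {score:.3f}")
# kitten: 0.857
# kitchen: 0.556
# mitten: 0.444
```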

15 Even more precise
Then use the Levenshtein distance to select, more precisely, the terms with the least edit distance to the query term.
Demo: http://www.miislita.com/searchito/levenshtein-edit-distance.html

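16 Levenshtein distance (26-01-12)
[Figure: edit-distance matrix. The minimal edit distance in cell m(i, j) is computed from the neighbouring cells m(i-1, j-1), m(i, j-1), and m(i-1, j).]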
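
The matrix can be filled in with a short dynamic program. The sketch below assumes unit cost for insertion, deletion, and substitution, which is the usual definition but is not stated on the slide.

```python
def levenshtein(a, b):
    """Minimal number of single-character edits turning a into b."""
    # m[i][j] = minimal edit distance between a[:i] and b[:j]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        m[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        m[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            m[i][j] = min(m[i - 1][j - 1] + cost,  # copy or substitute
                          m[i][j - 1] + 1,          # insert
                          m[i - 1][j] + 1)          # delete
    return m[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```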

17 Phonetic similarity
To calculate which (English) written words are most similar in pronunciation, the SOUNDEX algorithm gives a (rather rough) measure.
Demo: http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#SoundExConverter
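
For illustration, here is a sketch of one common textbook formulation of Soundex: keep the first letter, map the remaining letters to digits (vowels and h, w, y to 0), collapse runs of identical digits, drop the zeros, and pad to four characters. This is an assumption about which variant is meant; other Soundex implementations treat h, w and the first letter's own code differently and can return different codes for some names.

```python
CODES = {}
for letters, digit in [("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Four-character code: first letter plus up to three digits."""
    word = "".join(ch for ch in word.lower() if ch in CODES)  # keep a-z only
    if not word:
        return ""
    digits = [CODES[ch] for ch in word[1:]]
    collapsed = []
    for d in digits:                      # collapse runs of identical digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = "".join(d for d in collapsed if d != "0")
    return (word[0].upper() + code + "000")[:4]

print(soundex("Herman"), soundex("Hermann"))  # H655 H655
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```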

