Gertjan van Noord, 2014. Zoekmachines, Lecture 3: tolerant retrieval.

Tolerant retrieval: overview
Methods to handle imprecise queries:
- wildcard queries
- typos
- alternative spellings
Building alternative indexes
Finding the most similar terms

Wild-card queries: *
- mon*: find docs containing any word beginning with “mon”.
- *mon: find words ending in “mon”: harder.
- mo*n: find words that start with “mo” and end with “n”.
- m*o*n: find words that start with “m”, end with “n”, and have an “o” somewhere in between.
(Sec. 3.2)

Wildcard queries
Two steps in retrieval for wildcard queries:
1. Find all terms that fall within the wildcard definition
2. Find all docs containing any of these terms
Three ways to find the matching terms: B-trees, permuterm index, k-gram index

Dictionary structures:
- Hash: very efficient (lookup and construction), but cannot be used to find terms that are “close” to the key.
- Binary tree and B-tree (and tries): data structures which keep the data sorted (and balanced). Efficient search, but construction is more costly. Words with the same prefix are close together in the sorted order → can be used for robust retrieval.

Wild-card queries: *
- mon*: easy with a binary tree (or B-tree) lexicon: retrieve all terms w in the range mon ≤ w < moo.
- *mon: maintain an additional B-tree for the terms written backwards, and retrieve all words in the range nom ≤ w < non.
- m*n: combine the B-tree and the reverse B-tree. Expensive!
- m*o*n: ??
Solution: the permuterm index. (Sec. 3.2)
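The range lookups on this slide can be sketched as follows. This is a minimal illustration, not the lecture's implementation: a sorted Python list searched with `bisect` stands in for the B-tree, and the function names (`prefix_range`, `suffix_range`, `prefix_suffix`) and the toy vocabulary are assumptions made for the example.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < upper, e.g. mon <= w < moo."""
    if not prefix:
        return list(sorted_terms)
    # Upper bound: bump the last character of the prefix ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


def build_reverse(terms):
    """Sorted list of reversed terms (stand-in for the reverse B-tree)."""
    return sorted(t[::-1] for t in terms)


def suffix_range(sorted_reversed, suffix):
    """*mon: look up the reversed suffix ("nom") in the reversed lexicon."""
    hits = prefix_range(sorted_reversed, suffix[::-1])
    return sorted(t[::-1] for t in hits)


def prefix_suffix(sorted_terms, sorted_reversed, prefix, suffix):
    """m*n: intersect the prefix hits with the suffix hits."""
    starts = set(prefix_range(sorted_terms, prefix))
    ends = set(suffix_range(sorted_reversed, suffix))
    # The length check rejects terms where prefix and suffix would overlap.
    return sorted(t for t in starts & ends
                  if len(t) >= len(prefix) + len(suffix))


# Toy vocabulary, invented for this example.
terms = sorted(["lemon", "man", "monday", "money", "month", "moon", "salmon"])
reversed_terms = build_reverse(terms)
print(prefix_range(terms, "mon"))
print(suffix_range(reversed_terms, "mon"))
print(prefix_suffix(terms, reversed_terms, "m", "n"))
```

The intersection step is what makes m*n expensive: both range scans may return many terms even when few of them survive the intersection.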

Permuterm index and queries
Permuterm index:
- add an end symbol: cat$
- index all permuterms (in a structure like a B-tree): cat$ at$c t$ca $cat
Wildcard query processing:
- add $, rotate (if needed) until * is at the end
- examples: queries that can find (among others) cat: c*t c*at ca* ca*t *t *at. What is their permuterm form?

Permuterm index
For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol.
Queries:
- X: lookup on X$
- X*: lookup on $X*
- *X: lookup on X$*
- *X*: lookup on X*
- X*Y: lookup on Y$X*
- X*Y*Z: ???? Exercise!
Example: query = hel*o, so X = hel, Y = o; lookup on o$hel*
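A small sketch of the permuterm idea, under some assumptions: the index is a sorted Python list of (rotation, term) pairs rather than a B-tree, only queries with exactly one * are handled, and the names `rotations`, `build_permuterm`, `rotate_query` and `permuterm_lookup` are made up for this illustration.

```python
from bisect import bisect_left


def rotations(term):
    """All rotations of term + '$', e.g. cat -> cat$, at$c, t$ca, $cat."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


def build_permuterm(vocab):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in vocab for rot in rotations(term))


def rotate_query(query):
    """Add $, then rotate until * is at the end; return the part before *."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one * per query")
    s = query + "$"
    star = s.index("*")
    return s[star + 1:] + s[:star]  # hel*o -> o$hel


def permuterm_lookup(index, query):
    """Prefix scan over the sorted rotations."""
    prefix = rotate_query(query)
    pos = bisect_left(index, (prefix,))
    matches = set()
    while pos < len(index) and index[pos][0].startswith(prefix):
        matches.add(index[pos][1])
        pos += 1
    return matches


# Toy vocabulary, invented for this example.
index = build_permuterm(["hello", "help", "halo", "hero", "cat", "cot", "coat"])
print(rotate_query("hel*o"))
print(permuterm_lookup(index, "hel*o"))
print(permuterm_lookup(index, "c*t"))
```

Every term of length n contributes n + 1 rotations, so the permuterm index is considerably larger than the plain dictionary.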

K-gram index
k-gram index (example k=3):
- to each dictionary term add a start and an end symbol: $kitten$
- from this string, list all trigrams; kitten: $ki kit itt tte ten en$
- make an inverted index of trigrams: $ki -> (kinkiten, kitchen, kitten, ...)
- how can we find kitten?

An alternative: K-gram indexes
- Index for dictionary lookup, not for document retrieval!
- Posting lists point from k-gram to vocabulary terms
- k-gram: group of k consecutive items (context-dependent: characters, syllables, words, ...)
- bigram (digram), trigram, …

K-gram index and queries
Part of a 3-gram inverted index:
$ki -> kinkiten kitchen kitten
en$ -> kinkiten kitchen kitten
che -> kitchen
ink -> kinkiten
itt -> kitten
kit -> kinkiten kitchen kitten
Wildcard query processing:
$kit*en$ -> $ki AND kit AND en$
kinkiten??? Postprocessing needed!
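The lookup and the postprocessing step can be sketched as below. This is an illustration only: the function names are invented, the index is a plain dictionary of sets, and `fnmatch.fnmatchcase` from the standard library is used as a convenient postfilter on the assumption that terms and queries contain only letters and `*`.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(s, k=3):
    """Set of all k-grams of s (a set, so no duplicates)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_kgram_index(vocab, k=3):
    """Map each k-gram of $term$ to the set of terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index


def wildcard_candidates(index, query, k=3):
    """AND together the postings of every k-gram in the padded query."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= kgrams(piece, k)
    if not grams:
        raise ValueError("query too short to produce any k-gram")
    return set.intersection(*(index.get(g, set()) for g in grams))


def wildcard_lookup(index, query, k=3):
    """Postprocessing: keep only candidates that really match the query."""
    return {t for t in wildcard_candidates(index, query, k)
            if fnmatchcase(t, query)}


index = build_kgram_index(["kinkiten", "kitchen", "kitten", "mitten"])
print(wildcard_candidates(index, "kit*en"))  # still contains kinkiten
print(wildcard_lookup(index, "kit*en"))      # false positive filtered out
```

kinkiten survives the AND because it contains $ki, kit and en$, just not in the positions the query requires; only the final string match removes it.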

Query processing At this point, we have an enumeration of all terms in the dictionary that match the wild-card query. We still have to look up the postings for each enumerated term. E.g., consider the query: se*ate AND fil*er This may result in the execution of many Boolean AND queries. Sec. 3.2
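As a rough sketch of this last step: once each wildcard has been expanded into a list of terms, the postings of the expansions can be unioned per wildcard and the unions intersected, which by distributivity gives the same documents as running every pairwise AND query. The postings, document IDs and the function name `wildcard_and` below are invented for illustration.

```python
def wildcard_and(postings, expansions_per_wildcard):
    """Union the postings of each wildcard's expansions, then intersect."""
    doc_sets = []
    for terms in expansions_per_wildcard:
        docs = set()
        for term in terms:
            docs |= postings.get(term, set())
        doc_sets.append(docs)
    return set.intersection(*doc_sets) if doc_sets else set()


# Toy postings (term -> document IDs), invented for this example.
postings = {
    "senate": {1, 2},
    "sedate": {3},
    "separate": {2, 4},
    "filter": {2, 3},
    "filler": {4, 5},
}

# Expansions assumed to come from the wildcard step for se*ate AND fil*er.
se_ate = ["senate", "sedate", "separate"]
fil_er = ["filter", "filler"]
print(wildcard_and(postings, [se_ate, fil_er]))
```

With three expansions on one side and two on the other, the naive approach would already run six separate AND queries.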

Spell correction
When? If a query word (or word combination) is quite rare or not available at all in the dictionary.
Approach:
1. Find similar term(s)
2. Calculate their similarity to the query term
3. Choose the most frequent ones

Finding similar words and calculating their similarity
- use the k-gram index of words and calculate the Jaccard coefficient to find the terms most similar to the query term
- Jaccard coefficient: |A ∩ B| / |A ∪ B| (relative similarity)
- size of the set of elements (k-grams) in common, divided by the size of the set of all elements
- SET: no duplicates!
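A worked example of the coefficient, as a minimal sketch. It assumes bigrams (k=2) over terms padded with $ on both sides; the padding convention is borrowed from the k-gram index slides, and the function names are made up here.

```python
def kgram_set(term, k=2):
    """Set of k-grams of $term$; a set, so duplicates count once."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def jaccard(a, b, k=2):
    """|A ∩ B| / |A ∪ B| over the k-gram sets of the two terms."""
    grams_a, grams_b = kgram_set(a, k), kgram_set(b, k)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# cat  -> {$c, ca, at, t$}
# cart -> {$c, ca, ar, rt, t$}
# 3 bigrams in common, 6 bigrams in total
print(jaccard("cat", "cart"))
```

For cat and cart the shared bigrams are $c, ca and t$, and the union has six elements, so the coefficient is 3/6 = 0.5.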

Even more precise
Then use the Levenshtein distance to select, more precisely, the terms with the least edit distance to the query term.
Demo: distance.html

Levenshtein distance
The minimal edit distance m(i, j) is computed from three neighbouring cells of the matrix: m(i-1, j-1), m(i-1, j) and m(i, j-1).
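The matrix computation can be sketched with the standard dynamic-programming recurrence. Unit cost for insertion, deletion and substitution is an assumption here, since the slide does not state the costs, and the function name is chosen for the example.

```python
def levenshtein(s, t):
    """Minimal number of insertions, deletions and substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i  # delete the first i characters of s
    for j in range(cols):
        m[0][j] = j  # insert the first j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s[i - 1] == t[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j - 1] + substitution,  # diagonal: copy or substitute
                m[i - 1][j] + 1,                 # above: delete from s
                m[i][j - 1] + 1,                 # left: insert into s
            )
    return m[len(s)][len(t)]


print(levenshtein("kiten", "kitten"))
print(levenshtein("kitten", "sitting"))
```

Each cell needs only its three neighbours, so the whole table is filled in time proportional to len(s) × len(t).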

Phonetic similarity To calculate which (English) written words are most similar in pronunciation, the SOUNDEX algorithm gives a (rather rough) measure. Demo: xConverter
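Soundex exists in several variants, and the slide does not say which one is meant. The sketch below assumes the simple textbook variant: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop the zeros, and pad to four positions. Dropping non-letters and the function name are further assumptions.

```python
# Letter groups of the assumed variant; "0" marks vowels and h, w, y.
GROUPS = [
    ("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
    ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6"),
]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}


def soundex(term):
    """Four-character code: first letter plus up to three digits."""
    letters = [c for c in term.lower() if c in CODES]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters[1:]]
    # Collapse each run of identical digits to a single digit.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Remove zeros, then pad with trailing zeros to four positions.
    code = "".join(d for d in collapsed if d != "0")
    return (letters[0].upper() + code + "000")[:4]


print(soundex("Hermann"), soundex("Herman"))
print(soundex("Robert"), soundex("Rupert"))
```

Words with the same code are treated as sounding alike, which is why the measure is rough: it cannot rank candidates within one code.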