INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Slides:



Advertisements
Similar presentations
Results Summaries Spelling Correction David Kauchak cs160 Fall 2009 adapted from:
Advertisements

Lecture 4: Dictionaries and tolerant retrieval
Inverted Index Hongning Wang
CS276A Information Retrieval
Introduction to Information Retrieval Introduction to Information Retrieval Adapted from Christopher Manning and Prabhakar Raghavan Tolerant Retrieval.
An Introduction to IR Lecture 3 Dictionaries and Tolerant retrieval 1.
1 ITCS 6265 Lecture 3 Dictionaries and Tolerant retrieval.
Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
CES 514 Data Mining Feb 17, 2010 Lecture 3: The dictionary and tolerant retrieval.
Near Duplicate Detection
1 INF 2914 Information Retrieval and Web Search Lecture 9: Query Processing These slides are adapted from Stanford’s class CS276 / LING 286 Information.
Modern Information Retrieval Chapter 4 Query Languages.
Summaries and Spelling Corection David Kauchak cs458 Fall 2012 adapted from:
PrasadL05TolerantIR1 Tolerant IR Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introduction to Information Retrieval Introduction to Information Retrieval COMP4210: Information Retrieval and Search Engines Lecture 3: Dictionaries.
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval Related to Chapter 3:
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval Related to Chapter 3:
Spelling correction. Spell correction Two principal uses Correcting document(s) being indexed Correcting user queries to retrieve “right” answers Two.
Spelling correction. Spell correction Two principal uses Correcting document(s) being indexed Correcting user queries to retrieve “right” answers Two.
Information Retrieval
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 3: Dictionaries and tolerant retrieval.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Autumn Web Information retrieval (Web IR) Handout #3:Dictionaries and tolerant retrieval Mohammad Sadegh Taherzadeh ECE Department, Yazd University.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Skip Pointers, Dictionaries and tolerant retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lectures 4-6: Skip Pointers, Dictionaries and tolerant retrieval.
Advanced Indexing Issues 1. Additional Indexing Issues Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing.
Spelling correction. Spell correction Two principal uses Correcting document(s) being indexed Retrieve matching documents when query contains a spelling.
An Introduction to IR Lecture 3 Dictionaries and Tolerant retrieval 1.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
Lectures 5: Dictionaries and tolerant retrieval
Information Retrieval Christopher Manning and Prabhakar Raghavan
Query processing: optimizations
Tolerant Retrieval Review Questions
Advanced Indexing Issues
Text Indexing and Search
Dictionary data structures for the Inverted Index
Near Duplicate Detection
Modified from Stanford CS276 slides
Lecture 3: Dictionaries and tolerant retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Dictionary data structures for the Inverted Index
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture 3: Dictionaries and tolerant retrieval
Tolerant IR Adapted from Lectures by Prabhakar Raghavan (Google) and
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 21 Spelling Correction

ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the underline sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎  “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

Outline Edit distance Using edit distances Weighted edit distance n-gram overlap One option – Jaccard coefficient

Edit distance Given two strings S1 and S2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) D(i,j) = score of best alignment from sl..si to t/..tj D(i-1,j-1), if si=tj //copy = min D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //insert D(i,j-1)+1 //Delete 00:00:50  00:01:10 00:01:40  00:04:00 00:04:45  00:05:03 00:05:35  00:09:15 00:11:25  00:12:50 00:13:20  00:17:25 00:17:30  00:19:25

Using edit distances Given query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2) Intersect this set with list of “correct” words Show terms you found to user as suggestions Alternatively, We can look up all possible corrections in our inverted index and return all docs … slow We can run with a single most likely correction The alternatives disempower the user, but save a round of interaction with the user 00:22:30  00:23:50 00:24:20  00:24:55(Screen search in previous lectures (google sear “tahayyur”))

Weighted edit distance As before, but the weight of an operation depends on the character(s) involved Meant to capture OCR or keyboard errors Example: m more likely to be mis-typed as n than as q Therefore, replacing m by n is a smaller edit distance than by q This may be formulated as a probability model Requires weight matrix as input Modify dynamic programming to handle weights 00:28:15  00:29:00

Weighted edit distance Case 1 : E(m, n-1) + 1 Case 2 : E(m-1, n) + 1 Case 3: E(m-1, n-1) + S ={1 if Xm ≠ Ym 0 if Xm = Ym A B C . Z Xm,Ym 00:30:00  00:31:24

Edit distance Given two strings S1 and S2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) E.g., the edit distance from dof to dog is 1 From cat to act is 2 (Just 1 with transpose.) from cat to dog is 3. Generally found by dynamic programming. See http://www.merriampark.com/ld.htm for a nice example plus an applet. 00:32:40  00:33:20

n-gram overlap Enumerate all the n-grams in the query string as well as in the lexicon Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams Threshold by number of matching n-grams Variants – weight by keyboard layout, etc. 00:34:00  00:34:40

Example with trigrams Suppose the text is november Trigrams are nov, ove, vem, emb, mbe, ber. The query is december Trigrams are dec, ece, cem, emb, mbe, ber. So 3 trigrams overlap (of 6 in each term) How can we turn this into a normalized measure of overlap? 00:36:25  00:36:48 (nov, ove) 00:36:59  00:37:07 (dec, ece) 00:37:13  00:37:30 (so 3, & how can)

One option – Jaccard coefficient A commonly-used measure of overlap Let X and Y be two sets; then the J.C. is Equals 1 when X and Y have the same elements and zero when they are disjoint X and Y don’t have to be of the same size Always assigns a number between 0 and 1 Now threshold to decide if you have a match E.g., if J.C. > 0.8, declare a match November: emb, mbe, ber December:emb, mbe, ber 00:39:13  00:39:35 00:40:55  00:41:34 00:44:45  00:45:40

Standard postings “merge” will enumerate … Matching trigrams Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd) lo alone lore sloth or border lore morbid rd 00:51:00  00:53:10 ardent border card Standard postings “merge” will enumerate … Adapt this to using Jaccard (or another) measure.

Resources Peter Norvig: How to write a spelling corrector IIR 3, MG 4.2 Efficient spell retrieval: K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992. J. Zobel and P. Dart.  Finding approximate matches in large lexicons.  Software - practice and experience 25(3), March 1995. http://citeseer.ist.psu.edu/zobel95finding.html Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master’s thesis at Sweden’s Royal Institute of Technology. http://citeseer.ist.psu.edu/179155.html Nice, easy reading on spell correction: Peter Norvig: How to write a spelling corrector http://norvig.com/spell-correct.html