INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Document ingestion.
Advertisements

CES 514 Data Mining Feb 17, 2010 Lecture 2: The term vocabulary and postings lists.
CSE 538 MRS BOOK – CHAPTER II The term vocabulary and postings lists
Introduction to Information Retrieval Introduction to Information Retrieval Modified from Stanford CS276 slides Chap. 2: The term vocabulary and postings.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
Inverted Index Construction Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend romancountryman Indexer Inverted.
Preparing the Text for Indexing 1. Collecting the Documents Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.
CS276A Information Retrieval
Vocabulary size and term distribution: tokenization, text normalization and stemming Lecture 2.
CS276 Information Retrieval and Web Search Lecture 2: The term vocabulary and postings lists.
Information Retrieval Document Parsing. Basic indexing pipeline Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.
PrasadL4DictonaryAndQP1 Dictionary and Postings; Query Processing Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 2: The term vocabulary and postings lists Related to Chapter 2:
Recap Preprocessing to form the term vocabulary Documents Tokenization token and term Normalization Case-folding Lemmatization Stemming Thesauri Stop words.
Introduction to Information Retrieval Introduction to Information Retrieval Adapted from Christopher Manning and Prabhakar Raghavan The term vocabulary,
Information Retrieval Lecture 2: The term vocabulary and postings lists.
1 Documents  Last lecture: Simple Boolean retrieval system  Our assumptions were:  We know what a document is.  We can “machine-read” each document.
Information Retrieval and Web Search Lecture 2: Dictionary and Postings.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
Introduction to Information Retrieval Introduction to Information Retrieval Term vocabulary and postings lists – preprocessing steps 1.
Document Parsing Paolo Ferragina Dipartimento di Informatica Università di Pisa.
Information Retrieval and Web Search Lecture 2: The term vocabulary and postings lists Minqi Zhou 1.
Information Retrieval in Practice
CSE 538 MRS BOOK – CHAPTER II The term vocabulary and postings lists
Lecture 2: The term vocabulary and postings lists
Ch 2 Term Vocabulary & Postings List
CS122B: Projects in Databases and Web Applications Winter 2017
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Lecture 2: The term vocabulary and postings lists
Modified from Stanford CS276 slides
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Text Processing.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Document ingestion.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
CS276: Information Retrieval and Web Search
CS276: Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Recap of the previous lecture
信息检索与搜索引擎 Introduction to Information Retrieval GESC1007
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Information Retrieval Systems
PageRank GROUP 4.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture 2: The term vocabulary and postings lists
Lecture 2: The term vocabulary and postings lists
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Basic Text Processing Word tokenization.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Information Retrieval and Web Design
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Information Retrieval and Web Design
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 9 Terms Normalization

ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the following sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎  “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

Outline Normalization Case folding Normalization to terms Thesauri and soundex

Basic indexing pipeline Documents to be indexed Friends, Romans, countrymen. Tokenizer Token stream Friends Romans Countrymen Linguistic modules Modified tokens friend roman countryman 00:02:45  00:03:10 Indexer Inverted index friend roman countryman 2 4 13 16 1

Normalization to terms We may need to “normalize” words in indexed text as well as query words into the same form We want to match U.S.A. and USA Tokens are transformed to terms which are then entered into the index A term is a (normalized) word type, which is an entry in our IR system dictionary We most commonly implicitly define equivalence classes of terms by, e.g., deleting periods to form a term U.S.A., USA  USA deleting hyphens to form a term anti-discriminatory, antidiscriminatory  antidiscriminatory 00:03:40  00:05:00 00:05:20  00:06:20

1. Normalization: other languages Accents: e.g., French résumé vs. resume. Simple remedy remove accent but not good in case of Resume with and without accent. Cliché = Cliche (with and without accent same meaning) Important consideration: Are the users going to use accents while writing queries? Umlauts: e.g., German: Tuebingen vs. Tübingen Should be equivalent Even in languages that standard have accents, users often may not type them Often best to normalize to a de-accented term Tuebingen, Tübingen, Tubingen  Tubingen 00:08:37  00:09:00 00:09:30  00:11:00 00:13:40  00:14:30 00:14:35  00:15:50

Normalization: other languages Normalization of things like date forms 7月30日 vs. 7/30 (date or mathematical expression)  Diversification: 7/30 = 7/30, July 30, 7-30 Japanese use of kana vs. Chinese characters In Japanese there are several different character sets and normalization needs to take care of this fact, and it should be able to resolve the query entered using any char-set In German MIT is a word, so how to differentiate the usage if it is University of the word MIT? Tokenization and normalization may depend on the language and so is intertwined with language detection Same method of normalization should be used while indexing as well as while query processing 00:17:39  00:19:00 00:19:50  00:21:00 00:21:50  00:23:45 Is this German “mit”? Morgen will ich in MIT …

2. Case folding Reduce all letters to lower case exception: upper case in mid-sentence? e.g., General Motors Fed vs. fed SAIL vs. sail Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization… A word starting with a capital letter in the middle of sentence is for nouns, so case folding may be given importance in this case. However, if users are not going to use capital letters then there is no point in improving index. Longstanding Google example: [fixed in 2011…] Query C.A.T. #1 result is for “cats” not Caterpillar Inc. 00:27:10  00:29:25 00:

3. Normalization to terms An alternative to equivalence classing is to do asymmetric expansion An example of where this may be useful Enter: window Search: window, windows Enter: windows Search: Windows, windows, window Enter: Windows Search: Windows Potentially more powerful, but less efficient Increases the size of the postings list, but give more control in query processing. 00:32:30  00:35:00 00:35:20  00:36:50 00:38:30  00:39:00 Why not the reverse?

4. Thesauri and soundex Do we handle synonyms and homonyms? Synonym: Diff words same meanings. (Automobile / Car) Homonyms: Same words different meanings (Jaguar) (Blackberry) For homonyms postings for all variants against same index word (term). For Synonym E.g., by hand-constructed equivalence classes car = automobile color = colour We can rewrite to form equivalence-class terms When the document contains automobile, index it under car-automobile (and vice-versa) Or we can expand a query When the query contains automobile, look under car as well What about spelling mistakes? Chebichev One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics Groups the words that sound similar into same equivalence class. 00:39:15  00:41:00 00:44:00  00:45:24

Resources MG 3.6, 4.3; MIR 7.2 Porter’s stemmer: http//www.sims.berkeley.edu/~hearst/irbook/porter.html H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4