INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture # 9 Terms Normalization

ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the following sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎ “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

Outline Normalization Case folding Normalization to terms
Thesauri and soundex

Basic indexing pipeline
Documents to be indexed Friends, Romans, countrymen. Tokenizer Token stream Friends Romans Countrymen Linguistic modules Modified tokens friend roman countryman 00:02:45  00:03:10 Indexer Inverted index friend roman countryman 2 4 13 16 1

Normalization to terms
We may need to “normalize” words in indexed text as well as query words into the same form We want to match U.S.A. and USA Tokens are transformed to terms which are then entered into the index A term is a (normalized) word type, which is an entry in our IR system dictionary We most commonly implicitly define equivalence classes of terms by, e.g., deleting periods to form a term U.S.A., USA  USA deleting hyphens to form a term anti-discriminatory, antidiscriminatory  antidiscriminatory 00:03:40  00:05:00 00:05:20  00:06:20

1. Normalization: other languages
Accents: e.g., French résumé vs. resume. Simple remedy remove accent but not good in case of Resume with and without accent. Cliché = Cliche (with and without accent same meaning) Important consideration: Are the users going to use accents while writing queries? Umlauts: e.g., German: Tuebingen vs. Tübingen Should be equivalent Even in languages that standard have accents, users often may not type them Often best to normalize to a de-accented term Tuebingen, Tübingen, Tubingen  Tubingen 00:08:37  00:09:00 00:09:30  00:11:00 00:13:40  00:14:30 00:14:35  00:15:50

Normalization: other languages
Normalization of things like date forms 7月30日 vs. 7/30 (date or mathematical expression)  Diversification: 7/30 = 7/30, July 30, 7-30 Japanese use of kana vs. Chinese characters In Japanese there are several different character sets and normalization needs to take care of this fact, and it should be able to resolve the query entered using any char-set In German MIT is a word, so how to differentiate the usage if it is University of the word MIT? Tokenization and normalization may depend on the language and so is intertwined with language detection Same method of normalization should be used while indexing as well as while query processing 00:17:39  00:19:00 00:19:50  00:21:00 00:21:50  00:23:45 Is this German “mit”? Morgen will ich in MIT …

2. Case folding Reduce all letters to lower case
exception: upper case in mid-sentence? e.g., General Motors Fed vs. fed SAIL vs. sail Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization… A word starting with a capital letter in the middle of sentence is for nouns, so case folding may be given importance in this case. However, if users are not going to use capital letters then there is no point in improving index. Longstanding Google example: [fixed in 2011…] Query C.A.T. #1 result is for “cats” not Caterpillar Inc. 00:27:10  00:29:25 00:

3. Normalization to terms
An alternative to equivalence classing is to do asymmetric expansion An example of where this may be useful Enter: window Search: window, windows Enter: windows Search: Windows, windows, window Enter: Windows Search: Windows Potentially more powerful, but less efficient Increases the size of the postings list, but give more control in query processing. 00:32:30  00:35:00 00:35:20  00:36:50 00:38:30  00:39:00 Why not the reverse?

4. Thesauri and soundex Do we handle synonyms and homonyms?
Synonym: Diff words same meanings. (Automobile / Car) Homonyms: Same words different meanings (Jaguar) (Blackberry) For homonyms postings for all variants against same index word (term). For Synonym E.g., by hand-constructed equivalence classes car = automobile color = colour We can rewrite to form equivalence-class terms When the document contains automobile, index it under car-automobile (and vice-versa) Or we can expand a query When the query contains automobile, look under car as well What about spelling mistakes? Chebichev One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics Groups the words that sound similar into same equivalence class. 00:39:15  00:41:00 00:44:00  00:45:24

Resources MG 3.6, 4.3; MIR 7.2 Porter’s stemmer: http// H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Similar presentations

Presentation on theme: "INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Similar presentations

Presentation on theme: "INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID"— Presentation transcript:

Similar presentations

About project

Feedback